Capstone Challenge — An analysis of the Starbucks app data

Matheus Such
May 25, 2021

Project overview

Starbucks has made its promotional data available for analysis. These data simulate the behavior of customers in the company’s mobile application. The following article is about the Capstone Challenge project for Udacity’s Data Scientist Nanodegree.

The main objective is to identify which offers and customers respond best to the company’s campaigns and, subsequently, to create a model that predicts the success of an offer based on the demographic and categorical information contained in the available databases.

To achieve this goal, the project was divided into three steps:

  1. Investigate, clean and unify the databases
  2. Identify successful offers and analyze the data
  3. Build classification models to predict whether an offer will be a success

What will be seen?

There are three available databases:

Portfolio — The dataset with all available offers and their characteristics, including information such as offer type, difficulty and duration. There are three types of offers: BOGO (buy one get one), discount and informational, and these vary according to duration, difficulty, channels and reward.

FIG 1 — Portfolio data set
  • id (string) — offer id
  • offer_type (string) — type of offer ie BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

Profile — This database contains information about the customers who used the app during the test period, including user information: age, gender, income and membership date.

FIG 2 — Profile data set
  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

Transcript — Records of user activity in the app: transactions, offers received, offers viewed and offers completed. Each row of this database answers the question:
What action did that user take at a certain time during the study?

  1. Offer received — when a user receives an offer
  2. Offer viewed — when a user sees an offer
  3. Transaction — when a user buys something in the app
  4. Offer completed — when a user completes that offer
FIG 3 — Transcript data set
  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

Problem Understanding

The main objective was to segment customers and offers to understand which groups were most successful. The first challenge was to clean up each database so that they could be consolidated into a single dataset.

- Data processing

To start the analysis, it was necessary to understand each database and clean it properly so that it was possible to merge them. The following steps were performed in each database:

Portfolio

  • Created a new column identifying each offer by type + difficulty + reward; the old offer id is dropped later
  • Changed duration from days to hours
  • Renamed the column ‘id’ to ‘offer_id’
  • One-hot encoded the channels and offer_type columns
FIG 4 — Portfolio data set after cleaning
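The Portfolio cleaning steps above can be sketched in pandas. The rows below are toy stand-ins for the real file (the column names follow the dataset description; `offer_name` is an illustrative name for the new type + difficulty + reward column):

```python
import pandas as pd

# Toy stand-in for the real portfolio file
portfolio = pd.DataFrame({
    "id": ["ae264e", "4d5c57"],
    "offer_type": ["bogo", "discount"],
    "difficulty": [10, 7],
    "reward": [10, 3],
    "duration": [7, 7],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# New human-readable offer id: type + difficulty + reward
portfolio["offer_name"] = (
    portfolio["offer_type"] + "-"
    + portfolio["difficulty"].astype(str) + "-"
    + portfolio["reward"].astype(str)
)

# Duration from days to hours (transcript time is measured in hours)
portfolio["duration"] = portfolio["duration"] * 24

# Rename 'id' to 'offer_id' for the later merge
portfolio = portfolio.rename(columns={"id": "offer_id"})

# One-hot encode the channels (a list column) and offer_type
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies()
type_dummies = pd.get_dummies(portfolio["offer_type"])
portfolio = pd.concat(
    [portfolio.drop(columns=["channels"]), channel_dummies, type_dummies], axis=1
)
```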

Profile

  • Converted the “became_member_on” column to a Y/M/D date format
  • One-hot encoded the gender column
  • Renamed the column ‘id’ to ‘user_id’
FIG 5 — Profile data set after cleaning
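A minimal sketch of the Profile cleaning, again with toy rows standing in for the real file:

```python
import pandas as pd

# Toy stand-in for the real profile file (note the age-118 / null-income pattern)
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "gender": ["F", "M", None],
    "age": [55, 118, 40],
    "became_member_on": [20170715, 20180101, 20160530],
    "income": [72000.0, None, 55000.0],
})

# Parse the integer YYYYMMDD date into a real datetime
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"], format="%Y%m%d"
)

# One-hot encode gender and rename 'id' for the later merge
profile = pd.concat([profile, pd.get_dummies(profile["gender"])], axis=1)
profile = profile.rename(columns={"id": "user_id"})
```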

Transcript

  • Normalized the ‘value’ column, splitting the dict into two columns
  • Renamed the ‘person’ column to ‘user_id’
  • One-hot encoded the event column
FIG 6 — Transcript data set after cleaning
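The Transcript cleaning can be sketched the same way. The `value` dict holds either an offer id or a transaction amount, so flattening it yields two sparse columns (toy rows below; the real file also mixes the keys “offer id” and “offer_id” across event types):

```python
import pandas as pd

# Toy stand-in for the real transcript file
transcript = pd.DataFrame({
    "person": ["u1", "u1", "u2"],
    "event": ["offer received", "transaction", "offer viewed"],
    "time": [0, 60, 6],
    "value": [{"offer id": "ae264e"}, {"amount": 12.5}, {"offer id": "ae264e"}],
})

# Flatten the 'value' dict into separate offer_id / amount columns
expanded = pd.json_normalize(transcript["value"])
expanded = expanded.rename(columns={"offer id": "offer_id"})
transcript = pd.concat([transcript.drop(columns=["value"]), expanded], axis=1)

# Rename 'person' and one-hot encode the event column
transcript = transcript.rename(columns={"person": "user_id"})
transcript = pd.concat([transcript, pd.get_dummies(transcript["event"])], axis=1)
```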

- Putting all together

After organizing the three previous databases, it was time to put them all together by:

  • Merged transcript and profile on the ‘user_id’ column
  • Merged the portfolio on the ‘offer_id’ column
  • Removed the rows where ‘income’ was null — all users with a null income also had age equal to 118 and gender equal to ‘None’, so those rows were dropped
  • Removed the old ‘offer_id’ column and replaced it with the new one
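The merge steps above can be sketched as follows, assuming the cleaned frames from the previous steps (toy rows again):

```python
import pandas as pd

# Toy stand-ins for the three cleaned frames
transcript = pd.DataFrame({
    "user_id": ["u1", "u2"], "offer_id": ["o1", "o2"], "time": [0, 6],
})
profile = pd.DataFrame({
    "user_id": ["u1", "u2"], "age": [55, 118], "income": [72000.0, None],
})
portfolio = pd.DataFrame({
    "offer_id": ["o1", "o2"], "offer_name": ["bogo-10-10", "discount-7-3"],
})

# Transcript + profile on user_id, then portfolio on offer_id
df = transcript.merge(profile, on="user_id", how="left")
df = df.merge(portfolio, on="offer_id", how="left")

# Drop rows with null income (the age-118 / gender-None placeholders)
df = df[df["income"].notnull()].reset_index(drop=True)
```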

- Identifying a successful offer

The biggest challenge was to understand how to structure all offers that are considered successful or not. From a business perspective, if the customer completed an offer without being influenced by the campaign, it was not beneficial to have sent the coupon. Therefore, the following premises were assumed in this study:

  1. An offer is considered a success if the following sequence occurs:
    - Offer Received -> View Offer -> Complete Offer
  2. An offer is considered unsuccessful in the following situations:
    - Offer Received
    - Offer Received -> View Offer
    - Offer Received -> Complete Offer -> View Offer
    - Offer Received -> Complete Offer
FIG 7 — User behavior
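The success rule above can be sketched as a check on event ordering per (user, offer) pair. This is a simplified illustration, not the project's exact implementation (the real data also needs the offer duration and repeated receipts handled):

```python
import pandas as pd

# Toy events: u1 follows received -> viewed -> completed; u2 completes
# without viewing, which counts as unsuccessful under the rule above.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "offer_id": ["o1", "o1", "o1", "o1", "o1"],
    "event": ["offer received", "offer viewed", "offer completed",
              "offer received", "offer completed"],
    "time": [0, 6, 60, 0, 24],
})

def is_success(group: pd.DataFrame) -> bool:
    """True only if received -> viewed -> completed happened in time order."""
    t = {e: group.loc[group["event"] == e, "time"].min()
         for e in group["event"].unique()}
    return ("offer received" in t and "offer viewed" in t
            and "offer completed" in t
            and t["offer received"] <= t["offer viewed"] <= t["offer completed"])

success = events.groupby(["user_id", "offer_id"]).apply(is_success)
```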

Analyzing FIG 7, we can identify the user’s behavior. In the image above, the user receives an offer and immediately views it; 60 hours later he makes a transaction and the offer is completed. In other offers, however, the user does not complete the expected journey.

Another observation is that discount offers are cumulative until the expiration date: the customer receives, say, a $10 offer, makes several purchases in the app, and on the transaction that reaches this minimum value he earns the reward.

As for the BOGO (buy one get one) offers, if the customer makes a transaction equal to the minimum amount required (difficulty), he activates the coupon at the time of the purchase.

Finally, informational type offers were disregarded in this analysis as they do not generate a direct benefit to the customer and may or may not have a direct influence on the purchase in the app.

Data analysis

Users general distribution — Age and Income

From the Age Distribution chart (FIG 8), we can see very similar behavior between genders, and also that there is a greater volume of male users. The vast majority of male users are in the 40–70 age group, while female users are more concentrated between 50 and 60 years old. The average for both genders is approximately 60 years old.

As for the Income Distribution, the genders also behave similarly. There is a higher volume of men earning up to 80k; above that, women become the majority.

FIG 8 — Income per gender

General distribution of offers by gender — Type and Event

As previously mentioned, there are three types of offers: discount, BOGO (buy one get one) and informational. On the left chart (FIG 9), we can see that these were distributed proportionally between genders. The right chart shows all events by gender; comparing offers received vs. offers completed, it is clear that women have a higher conversion rate than men.

FIG 9—Offer type by Gender and Event by Gender

Success rate by offer — Discount and BOGO

General Analysis
Considering only the discount and BOGO offers and separating them according to their difficulty, we were able to identify which offers were successful (Received -> View -> Complete). The offers with the highest success rates are: Discount-10–10, Discount-7–7 and BOGO-5–5.

FIG 10 — Success rate by offer

Analyzing by gender
As expected due to the higher conversion rate of received offers to completed offers, women also have a higher success rate than men in all offers.

FIG 10 — Success rate by offer and gender

Analyzing by age
Below (FIG 11) we can see which age groups have the best success rate by gender:
- Women: 36–45 and 86–95 years old
- Men: 76–85, 76–55 and 66–75 years old

FIG 11 — Success rate by age

Analyzing the return on investment
In order to analyze a partial return on investment (successful offers only), offers whose total on the “amount” column differed by more than ±1.5 standard deviations were removed as outliers.

Note in FIG 12 that, although they do not have the highest success rate, the offer types BOGO-5–7 and BOGO-5–5 are the ones that brought the most financial return.

FIG 12 — Return on investment for each offer

Correlation Matrix

The general feature correlation matrix (FIG 13) did not reveal relevant information at first glance. This highlights the importance of obtaining more information about customers in order to achieve better segmentation and measure the impact on offers and success rate.

FIG 13 — Correlation matrix

On the other hand, analyzing the matrix more closely (FIG 14), with the offers viewed and completed vs. the communication channels, we can see a positive trend for visualization in the social and mobile channels.

FIG 14 — Channels correlation matrix

Metrics

Metrics are used to evaluate a classification model — basically, they measure how good the model is. For this project I used the confusion matrix, accuracy, precision, recall and F1-score as performance metrics, along with cross-validation using stratified k-fold.

Confusion Matrix — The confusion matrix provides a more insightful picture of not only the performance of a predictive model, but also which classes are being predicted correctly and incorrectly, and what types of errors are being made.

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

FIG 15 — Confusion Matrix

Accuracy — one of the most common evaluation metrics in classification problems: the total number of correct predictions divided by the total number of predictions made for a dataset, i.e. accuracy = (tp + tn) / (tp + tn + fp + fn).

Where:

  • tp = true positive
  • tn = true negative
  • fp = false positive
  • fn = false negative

Precision — Precision is the ratio of correctly predicted positive observations to the total predicted positive observations, i.e. precision = tp / (tp + fp). The question this metric answers is: of all the offers labeled as a success, how many actually succeeded?

Recall — Recall is the ratio of correctly predicted positive observations to all observations in the actual class, i.e. recall = tp / (tp + fn). The question recall answers is: of all the offers that truly succeeded, how many did we label?

F1-Score — The F1 score is the harmonic mean of precision and recall. A good F1 score means low false positives and low false negatives, so successful offers are correctly identified without too many false alarms. It is a good metric for imbalanced classifications.
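All four metrics can be computed directly with scikit-learn; the labels below are illustrative toy values, not the project's real predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy true labels and predictions (1 = successful offer)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# The confusion matrix gives tn, fp, fn, tp when flattened
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy :", accuracy_score(y_true, y_pred))   # (tp + tn) / total
print("precision:", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("recall   :", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```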

Classification Models

To find a model that will tell us whether an offer is successful or not, the following steps are needed:

  • Select the features for analysis (create dummies)
  • Define the target variable
  • Split the dataset into training and testing sets
  • Select, fit and predict with the model

There are several models that can be used in binary classification problems like this one; here we evaluated the following algorithms:

  1. Random Forest Classifier
  2. Gradient Boosting Classifier
  3. Ada Boost Classifier
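The classification_model helper mentioned in the text appears in the original only as a notebook screenshot; this is a hedged sketch of what such a helper might look like (the function name matches the article, but the body and the synthetic data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def classification_model(model, X, y, test_size=0.3, seed=42):
    """Split, fit, predict and report accuracy / F1 for one estimator."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {"accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred)}

# Synthetic stand-in for the cleaned offers dataframe
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
scores = classification_model(RandomForestClassifier(random_state=0), X, y)
print(scores)
```

The same call can then be repeated with GradientBoostingClassifier and AdaBoostClassifier to compare the three algorithms on identical splits.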

Using the function (classification_model) described above, the following results were found for the algorithms:

Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Below (FIG 16) are the results for the Random Forest Classifier; this model presented the worst result among the three:

FIG 16 — RFC Results

Gradient Boosting Classifier

The idea behind Gradient boosting is to take a weak hypothesis or weak learning algorithm and make a series of adjustments that will improve the strength of the hypothesis. Its purpose is to minimize the loss or difference between the actual value of the class in the training example and the expected value of the class.

GBC Results:

FIG 17 — GBC Results

Ada Boost Classifier

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

AdaBoost Results

FIG 18 — Ada Results

This model, like the Gradient Boosting Classifier, presented the best results. Therefore, an analysis was later made to evaluate some parameters and identify possible improvements.

Model Evaluation and Validation

Finding best parameters

Initially, I had selected different parameters and ranges to evaluate the best options. However, due to the processing time I opted for those options shown in the code below:

Ada Boost Classifier — Best Params:

Among the previously defined parameters, for Ada boost the best results are: Algorithm: SAMME.R, learning_rate: 0.2, n_estimators: 1500

FIG 19 — Ada Best Results

n_estimators is the number of models to iteratively train.

Learning rate is the contribution of each model to the weights; it defaults to 1. Reducing the learning rate means the weights will be increased or decreased by a smaller amount, forcing the model to train more slowly (but sometimes resulting in better performance scores).
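A parameter search like the one described can be sketched with GridSearchCV; the grid and dataset below are deliberately small so the example runs quickly, whereas the article's real search used larger values (e.g. n_estimators up to 1500):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the cleaned offers dataframe
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid (the real one covered many more values)
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.2, 1.0]}

search = GridSearchCV(AdaBoostClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_)
```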

Evaluation — Gradient Boosting Classifier

Among the previously defined parameters, we can see in FIG 20 that for the Gradient Boosting Classifier the best results are: min_samples_split: 2, n_estimators: 100, learning_rate: 0.045, max_depth: 4, min_samples_leaf: 1

min_samples_split — Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.

min_samples_leaf — The minimum number of samples required to be at a leaf node.

max_depth — The maximum depth of the individual regression estimators.

FIG 20 — GBC Best Params

Using the best parameters slightly increased the overall model accuracy to 0.6670 and the F1-score to 0.5406 (k-fold accuracy 0.6664 with SD 0.0266).

FIG 20 — GBC Best Results

Below, in FIG 21, we have the confusion matrix for the Gradient Boosting Classifier.

FIG 21 — GBC Confusion Matrix

Validation — Gradient Boosting Classifier

To validate this model, I used stratified k-fold cross-validation to check for variance in the model. I chose the stratified version because of the class imbalance in the dataframe, with more than a 60% difference between successful and unsuccessful offers (FIG 22).

FIG 22 — Success proportion

Stratified cross-validation enforces the class distribution in each fold, which matters when there is a large number of classes or an imbalance in the instances of each class.

FIG 23 — Cross validation
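The stratified k-fold check can be sketched as follows; the imbalanced synthetic dataset is a stand-in for the real offers data (roughly the skew described in FIG 22):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic stand-in (about 80/20 between the two classes)
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)

# Each fold keeps the success / non-success proportions of the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X, y, cv=cv, scoring="accuracy")

# Mean accuracy across folds, plus the spread between folds
print(f"accuracy {scores.mean():.4f} +/- {scores.std():.4f}")
```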

To check your model’s bias, compute the mean of all the fold scores. If the average error is low, it indirectly ensures that your model’s notions about the data are accurate enough.

As we can see below (FIG 24), cross-validation gives an accuracy of 66.64% with a standard deviation of 2.66%. For the purposes of this project this is considered stable performance, and the model seems robust enough against small perturbations.

FIG 24 — Cross validation results

Justification

In order to achieve this project’s goals — identifying which offers and customers respond best to the company’s campaigns — it was necessary to follow a very careful data cleaning process and define a good strategy for identifying truly successful offers. This was a very important step, since a completed offer does not necessarily imply that it was successful. The result of this data cleaning made it possible to find the best customer types by gender and age.

By following the steps above, it was possible to find the best-performing model to classify successful offers by comparing the results of the three distinct algorithms shown in this project. GBC seems to be a valid model and a starting point for business purposes and decision making.

FIG 25 — The Gradient Boosting Classifier has the best performance

Conclusion

The biggest challenge of this project was doing the data cleaning correctly and understanding the best way to define a successful offer. Separating the offers to identify 100% of the cases that actually followed the previously defined success pattern is quite complex, especially given the way the data are arranged.

Analyzing the user, income and offer distributions, we can see that although the female audience is smaller than the male one, women tend to be more attentive to offers and have a higher average conversion rate than men. In addition, the offers with the highest success rate did not necessarily bring the greatest return for the company. It is important to know the campaign strategy — whether the focus is on increasing revenue, increasing app traffic, etc.

Also, by analyzing the correlation matrix more deeply, we can see that the social and mobile channels were more related to offer visualization than the other channels. Therefore, it is recommended that the company focus more on them in future campaigns.

Moving to the classification models, the focus was on evaluating which of the three proposed models would give the best solution. Among them, the one that obtained the best result was the Gradient Boosting Classifier, with an overall accuracy of 67% and an F1-score of 54%.

Some improvements!

I believe that with more demographic information about clients — such as city, profession or feedback — it would be possible to propose better strategies and find better results. In addition, the classification models could have benefited from more user transactions in the database.

Also, it would be good to check and improve the success/failure rule used for BOGO and discount offers with a more robust method, and to include the informational offers by setting clear metrics for them.

Another approach would be to optimize the model using other gradient boosting algorithms such as XGBoost or CatBoost, which are improved versions of gradient boosting. One of XGBoost’s most important points is that it implements parallel processing (at the node level), which makes it faster than GBM; it also includes a variety of regularization techniques that reduce overfitting and improve overall performance. CatBoost works well with its default set of hyperparameters and can internally handle categorical variables in the data.

You can access the full code on my GitHub page:


Matheus Such

Automotive engineer and data science enthusiast.