Capstone Challenge — An analysis of the Starbucks app data
Project overview
Starbucks has made its promotional data available for analysis. The data simulate customer behavior in the company's mobile application. This article covers the Capstone Challenge project for Udacity's Data Scientist Nanodegree.
The main objective is to identify which offers and customers respond best to the company's campaigns and, subsequently, to build a model that predicts the success of an offer from the demographic and categorical information contained in the available databases.
To achieve this goal, the project was divided into three steps:
- Investigating, cleaning and unifying the databases
- Identifying successful offers and analyzing the data
- Building classification models to predict whether an offer will be a success
What will be seen?
There are three available databases:
Portfolio — The dataset with all available offers and their characteristics, including information such as offer type, difficulty and duration. There are three types of offers: BOGO (buy one get one), discount and informational, and they vary by duration, difficulty, channels and reward.
- id (string) — offer id
- offer_type (string) — type of offer ie BOGO, discount, informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings) — channels through which the offer is sent (web, email, mobile, social)
Profile — In this database we have information about the customers who used the app during the period, including user information: age, gender, income and membership date.
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income
Transcript — Transactions performed by users: transactions, offers received, offers viewed and offers completed. From this database we can answer the question:
What action occurred with a given user at a certain time during the study?
- Offer received — when a user receives an offer
- Offer viewed — when a user sees an offer
- Transaction — when a user buys something in the app
- Offer completed — when a user completes an offer
- event (str) — record description (ie transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test. The data begins at time t=0
- value — (dict of strings) — either an offer id or transaction amount depending on the record
Problem Understanding
The main objective was to segment customers and offers to understand which groups were most successful. The first challenge was to clean up each database so that they could be consolidated into a single dataset.
- Data processing
To start the analysis, it was necessary to understand each database and clean it properly so that it was possible to merge them. The following steps were performed in each database:
Portfolio
- Created a new column identifying each offer by type + difficulty + reward; afterwards, the old offer id is removed
- Changed duration from days to hours
- Renamed the column 'id' to 'offer_id'
- One-hot-encoded the channels and offer_type columns
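The portfolio steps above can be sketched with pandas. The two sample rows below are made up for illustration (the real dataset has ten offers), but the column names follow the article:

```python
import pandas as pd

# Toy sample of the Portfolio data (hypothetical rows, real column names)
portfolio = pd.DataFrame({
    "id": ["ae264e", "4d5c57"],
    "offer_type": ["bogo", "discount"],
    "difficulty": [10, 7],
    "reward": [10, 3],
    "duration": [7, 10],  # in days
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# Rename 'id' to 'offer_id' and convert duration from days to hours
portfolio = portfolio.rename(columns={"id": "offer_id"})
portfolio["duration"] = portfolio["duration"] * 24

# One-hot-encode the list-valued channels column and the offer_type column
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies()
type_dummies = pd.get_dummies(portfolio["offer_type"], prefix="offer")
portfolio = pd.concat(
    [portfolio.drop(columns=["channels", "offer_type"]),
     channel_dummies, type_dummies],
    axis=1,
)
```

A combined identifier such as "bogo-10-10" can be built by concatenating offer_type, difficulty and reward as strings before offer_type is dropped.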
Profile
- Changed the 'became_member_on' column to a Y/M/D date format
- One-hot-encoded the gender column
- Renamed the column 'id' to 'user_id'
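A sketch of the profile steps, again with toy rows but the real column names:

```python
import pandas as pd

# Toy sample of the Profile data (hypothetical rows, real column names)
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "age": [55, 118, 33],
    "gender": ["F", None, "M"],
    "became_member_on": [20170715, 20180201, 20160412],
    "income": [72000.0, None, 51000.0],
})

profile = profile.rename(columns={"id": "user_id"})

# Parse the integer date (YYYYMMDD) into a proper Y/M/D datetime
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"], format="%Y%m%d"
)

# One-hot-encode gender (F / M / O in the real data)
profile = pd.concat(
    [profile.drop(columns=["gender"]),
     pd.get_dummies(profile["gender"], prefix="gender")],
    axis=1,
)
```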
Transcript
- Normalized the 'value' column to split the dict into separate columns
- Renamed the 'person' column to 'user_id'
- One-hot-encoded the event column
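The transcript steps can be sketched as follows; the rows are made up, and note that in the real data the dict keys vary by event type ('offer id' vs 'offer_id', plus 'amount' and 'reward'), which this toy version only hints at:

```python
import pandas as pd

# Toy sample of the Transcript data (hypothetical rows)
transcript = pd.DataFrame({
    "person": ["u1", "u1", "u2"],
    "event": ["offer received", "offer viewed", "transaction"],
    "time": [0, 6, 12],
    "value": [{"offer id": "ae264e"}, {"offer id": "ae264e"},
              {"amount": 9.5}],
})

transcript = transcript.rename(columns={"person": "user_id"})

# Normalize the 'value' dict into separate offer_id / amount columns
value_df = pd.json_normalize(transcript["value"].tolist()).rename(
    columns={"offer id": "offer_id"}
)
transcript = pd.concat([transcript.drop(columns=["value"]), value_df], axis=1)

# One-hot-encode the event column
transcript = pd.concat(
    [transcript, pd.get_dummies(transcript["event"])], axis=1
)
```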
- Putting it all together
After organizing the three previous databases, it was time to put them all together by:
- Merging transcript and profile on the 'user_id' column
- Merging the portfolio on the 'offer_id' column
- Removing the rows where 'income' was null — it was identified that all users with a null income also had an age of 118 and a gender of 'None', so those rows were removed
- Removing the old 'offer_id' column and replacing it with the new one
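A minimal sketch of the consolidation, using hypothetical rows just to illustrate the merge keys and the null-income filter:

```python
import pandas as pd

# Toy versions of the three cleaned frames (hypothetical rows)
transcript = pd.DataFrame({"user_id": ["u1", "u2"],
                           "offer_id": ["o1", "o1"],
                           "time": [0, 6]})
profile = pd.DataFrame({"user_id": ["u1", "u2"],
                        "age": [55, 118],
                        "income": [72000.0, None]})
portfolio = pd.DataFrame({"offer_id": ["o1"],
                          "difficulty": [10],
                          "reward": [10]})

# Merge transcript + profile on user_id, then bring in the portfolio
df = transcript.merge(profile, on="user_id", how="left")
df = df.merge(portfolio, on="offer_id", how="left")

# Rows with a null income are the age-118 / gender-None placeholders
df = df[df["income"].notnull()].reset_index(drop=True)
```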
- Identifying a successful offer
The biggest challenge was to define which offers should be considered successful. From a business perspective, if the customer completed an offer without being influenced by the campaign, sending the coupon brought no benefit. Therefore, the following assumptions were made in this study:
- An offer is considered a success if the following steps occur:
- Offer Received -> View Offer -> Complete Offer
- An offer is considered unsuccessful in the following situations:
- Offer Received
- Offer Received -> View Offer
- Offer Received -> Complete Offer -> View Offer
- Offer Received -> Complete Offer
Analyzing FIG 7, we can identify the user's behavior. In the image, the user receives an offer and immediately views it; 60 hours later he makes a transaction and the offer is completed. In other offers, however, the user ends up not completing the expected journey.
Another observation is that discount offers are cumulative until the final validation date: the customer receives an offer of, say, $10 in difficulty, makes several purchases in the app, and with the transaction that reaches this minimum value he receives the reward.
As for BOGO (buy one get one) offers, if the customer makes a transaction equal to the minimum amount required (difficulty), the coupon is activated at the time of purchase.
Finally, informational type offers were disregarded in this analysis as they do not generate a direct benefit to the customer and may or may not have a direct influence on the purchase in the app.
Data analysis
Users general distribution — Age and Income
From the Age Distribution chart (FIG 8), we can see a very similar behavior between genders, as well as a greater volume of male users. The vast majority of male users are in the 40–70 age group, while female users are more concentrated between 50 and 60 years old. The average for both genders is approximately 60 years old.
As for the Income Distribution, we also have similar behavior between genders. There is a higher volume of men earning up to 80k; above that, women become the majority.
General distribution of offers by gender — Type and Event
As previously mentioned, there are three types of offers: discount, BOGO (buy one get one) and informational. In the left chart (FIG 9), we can see that they were distributed proportionally among the genders. The right chart shows all events by gender: comparing the number of offers received against offers completed, it is clear that women have a higher conversion rate than men.
Success rate by offer — Discount and BOGO
General Analysis
Considering only the discount and BOGO offers and separating them by difficulty, we were able to identify which offers were successful (Received -> View -> Complete). The offers with the highest success rate are Discount-10–10, Discount-7–7 and BOGO-5–5.
Analyzing by gender
As expected due to the higher conversion rate of received offers to completed offers, women also have a higher success rate than men in all offers.
Analyzing by age
Below (FIG 11) we can see which age group has the best success rate by gender:
- Women: 36–45 and 86–95 years old
- Men: 76–85, 76–55 and 66–75 years old
Analyzing the return on investment
To analyze a partial return on investment (with successful offers only), offers whose total value on the 'amount' column differed by more than ±1.5 standard deviations were removed as outliers.
Note in FIG 12 that, although they do not have the highest success rate, the offer types BOGO-5–7 and BOGO-5–5 are the ones that brought the most financial return.
Correlation Matrix
From the general feature correlation matrix (FIG 13), it was not possible to extract relevant information at first, which highlights the importance of obtaining more customer information to achieve better segmentation and measure the impact on offers and the success rate.
On the other hand, looking more closely at the matrix (FIG 14) with the offers viewed and completed vs. the communication channels, we can see a positive trend for visualization in the social and mobile channels.
Metrics
Metrics are used to evaluate a classification model; basically, they measure how good the model is. For this project I used the confusion matrix, accuracy, precision, recall and F1-score as performance metrics, along with cross-validation using stratified k-fold.
Confusion Matrix — The confusion matrix provides a more insightful picture: not only the performance of a predictive model, but also which classes are being predicted correctly and incorrectly, and what types of errors are being made.
It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.
Accuracy — one of the most common evaluation metrics in classification problems: the total number of correct predictions divided by the total number of predictions made for a dataset.
accuracy = (tp + tn) / (tp + tn + fp + fn)
Where:
- tp = true positive
- tn = true negative
- fp = false positive
- fn = false negative
Precision — the ratio of correctly predicted positive observations to the total predicted positive observations: precision = tp / (tp + fp). The question this metric answers is: of all offers labeled as a success, how many actually succeeded?
Recall — the ratio of correctly predicted positive observations to all observations in the actual class: recall = tp / (tp + fn). The question recall answers is: of all the offers that truly succeeded, how many did we label as such?
F1-Score — the weighted average of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). A good F1 score means you have low false positives and low false negatives, so you are correctly identifying the real positives without being disturbed by false alarms. It is a good metric for imbalanced classifications.
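All four metrics can be computed with scikit-learn; a toy example with made-up labels (1 = successful offer) shows how they relate to the confusion-matrix counts:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Made-up labels just to illustrate the metrics (1 = successful offer)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# sklearn's ravel() order is tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                    # 3 3 1 1
print(accuracy_score(y_true, y_pred))    # (tp+tn)/(tp+tn+fp+fn) = 0.75
print(precision_score(y_true, y_pred))   # tp/(tp+fp) = 0.75
print(recall_score(y_true, y_pred))      # tp/(tp+fn) = 0.75
print(f1_score(y_true, y_pred))          # 2PR/(P+R) = 0.75
```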
Classification Models
To find a model that tells us whether an offer is successful or not, the following steps are necessary:
- Select the features for the analysis (create dummies)
- Define the target variable
- Split the database into training and testing sets
- Select, fit and predict with the model
There are several models that can be used in binary classifications such as this one; here we evaluated the following algorithms:
- Random Forest Classifier
- Gradient Boosting Classifier
- Ada Boost Classifier
Using the helper function classification_model (shown in the project notebook), the following results were found for the algorithms:
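The classification_model helper appeared only as a notebook screenshot; a hypothetical reconstruction of its core (the signature, defaults and synthetic data below are my assumptions, not the author's exact code) might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def classification_model(model, X, y, test_size=0.3, random_state=42):
    """Split the data, fit the model and report accuracy / F1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

# Synthetic stand-in for the offers dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
scores = classification_model(RandomForestClassifier(random_state=0), X, y)
print(scores)
```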
Random Forest Classifier
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Below (FIG 16) are the results for the Random Forest Classifier; this model presented the worst result among the three:
Gradient Boosting Classifier
The idea behind gradient boosting is to take a weak hypothesis or weak learning algorithm and make a series of adjustments that improve the strength of the hypothesis. Its purpose is to minimize the loss, i.e., the difference between the actual class value in the training example and the predicted class value.
GBC Results:
Ada Boost Classifier
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
AdaBoost Results
This model, along with the Gradient Boosting Classifier, presented the best results. Therefore, an analysis of some parameters follows to identify possible improvements.
Model Evaluation and Validation
Finding best parameters
Initially, I had selected different parameters and ranges to evaluate the best options. However, due to the processing time, I opted for the options shown in the code below:
Ada Boost Classifier — Best Params:
Among the previously defined parameters, the best results for AdaBoost are: algorithm: SAMME.R, learning_rate: 0.2, n_estimators: 1500.
n_estimators is the number of models to train iteratively.
learning_rate is the contribution of each model to the weights and defaults to 1. Reducing the learning rate means the weights are increased or decreased by a smaller degree, forcing the model to train more slowly (but sometimes resulting in better performance scores).
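A parameter search of this kind can be sketched with GridSearchCV. The grid below is deliberately tiny so it runs quickly on synthetic data; the article's actual search used larger ranges (e.g. n_estimators up to 1500):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the offers dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Deliberately small grid; the real search explored wider ranges
param_grid = {
    "learning_rate": [0.2, 1.0],
    "n_estimators": [50, 100],
}
search = GridSearchCV(
    AdaBoostClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X, y)
print(search.best_params_)
```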
Evaluation — Gradient Boosting Classifier
Among the previously defined parameters, we can see in FIG 20 that for the Gradient Boosting Classifier the best results are: min_samples_split: 2, n_estimators: 100, learning_rate: 0.045, max_depth: 4, min_samples_leaf: 1.
min_samples_split — Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
min_samples_leaf — The minimum number of samples required to be at a leaf node.
max_depth — The maximum depth of the individual regression estimators.
Using the best parameters slightly increased the overall model accuracy to 0.6670 and the F1-score to 0.5406, with a k-fold accuracy of 0.6664 and an SD of 0.0266.
Below (FIG 21) is the confusion matrix for the Gradient Boosting Classifier.
Validation — Gradient Boosting Classifier
To validate this model, I used stratified k-fold cross-validation to check for any variance in the model. I chose the stratified variant because of the class imbalance in the dataframe, with more than a 60% difference between successful and unsuccessful offers (FIG 22).
Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.
To check your model's bias, compute the mean of all the fold scores. If this value is low, it basically means that your model gives a low error on average, indirectly ensuring that the model's notions about the data are accurate enough.
As we can see below (FIG 24), cross-validation yielded an accuracy of 66.64% with a standard deviation of 2.66%; for the purposes of this project this is considered a stable performance, and the model seems robust enough against small perturbations.
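The validation procedure can be sketched as follows on synthetic, imbalanced data (the reported 66.64% mean and 2.66% SD come from the real dataset, not from this toy example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the offers dataset
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.6, 0.4], random_state=0)

# Stratified folds preserve the success/failure proportions in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```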
Justification
In order to achieve this project's goal of identifying which offers and customers respond best to the company's campaigns, it was necessary to follow a very careful data cleaning process and define a good strategy to identify genuinely successful offers. This was a very important step, since a completed offer does not necessarily imply a successful one. The result of this data cleaning made it possible to find the best customer types by gender and age.
By following the steps above, it was possible to find the best-performing model to classify successful offers by comparing the results of the three distinct algorithms shown in this project. GBC seems to be a valid model and a starting point for business purposes and decision making.
Conclusion
The biggest challenge of this project was to perform the correct data cleaning and to find the best way to define a successful offer. Separating the offers to identify 100% of the cases that actually followed the success pattern defined earlier is quite complex, especially given the way the data are arranged.
Analyzing the user, income and offer distributions, we can see that although the female audience is smaller than the male one, women tend to be more attentive to offers and have a higher average conversion rate than men. In addition, the offers with the highest success rate did not necessarily bring the greatest return for the company. It is important to know the campaign strategy: focusing on increasing revenue, increasing app traffic, etc.
Also, by analyzing the correlation matrix more deeply, we can identify that the social and mobile channels were more related to offer visualization than the other channels. Therefore, it is recommended that the company focus more on them in future campaigns.
Moving to the classification models, the focus was on evaluating which of the three proposed models would present the best solution. Among them, the one that obtained the best result was the Gradient Boosting Classifier, with an overall accuracy of 67% and an F1-score of 54%.
Some improvements!
I believe that with more demographic information about clients, such as city, profession or feedback, it would be possible to propose better strategies and find better results. In addition, the classification models could have benefited from more user transactions in the database.
It would also be good to review and improve the success/failure definition used for BOGO and discount offers with a more robust method, and to include the informational offers by setting clear metrics for them.
Another approach would be to optimize the model using other gradient boosting algorithms such as XGBoost or CatBoost, which are improved versions of the Gradient Boosting Classifier. One of XGBoost's most important points is that it implements parallel processing (at the node level), which makes it faster than GBM; it also includes a variety of regularization techniques that reduce overfitting and improve overall performance. CatBoost works well with its default set of hyperparameters and can handle categorical variables internally.
You can access the full code on my GitHub page: