Sparkify Capstone Project

Motivation

Imagine you are working for a music streaming company called Sparkify, similar to Spotify or Pandora. Millions of users stream their favorite songs every day. Each user is either on the free tier, with advertisements between songs, or on the premium subscription plan. Users can upgrade, downgrade, or cancel their service at any time, so it's crucial to make sure that users love the service Sparkify provides. Every time a user interacts with the Sparkify app, data is generated: events such as playing a song, logging out, or liking a song are all recorded. All this data contains key insights that can help the business thrive. The goal of this project is to analyze this data and predict which users are expected to churn, either by downgrading from premium to free or by cancelling their subscriptions altogether.

Project Definition

Project Overview

The Sparkify Capstone Project is the final project of Udacity's Data Scientist Nanodegree; it explores the user churn rate for the Sparkify company. The provided dataset is a 128 MB activity log covering a number of users and the different activities they can perform via the (web) app.

Problem Statement

The problem is to identify features that correlate with whether a user will churn or not.

Definition of Churn: The user completing the cancellation process and reaching the “Cancellation Completed” page.

What I will do in this project

1. Load the dataset

2. Clean up any null and space values

3. Explore the dataset and learn some basic insights about it

4. Create a new dataset containing only the features we want

5. Engineer new features to help improve prediction

6. Compare five algorithms and select the best one

This project uses the following software and Python libraries:

  • Python 3.6
  • Pyspark-2.4.1
  • NumPy
  • Pandas
  • scikit-learn
  • matplotlib
  • Jupyter Notebook

Metrics

I will use the accuracy score, a metric that anyone can understand, alongside the area under the ROC curve (AUC) reported during model evaluation.

Analysis

Data Exploration and Questions

Dataset

Each row represents an activity that one user undertook at a particular time from a particular device. If the activity was listening to a song (most of them are), then we also see the artist and song they listened to. We have 278,154 rows across 24 columns.

The dataset contains log files whose entries are generated whenever a user takes an action on the site, like picking the next song, giving a song a 'thumbs down', or landing on a new page. These files also record information about the user, such as their location, the user agent string used to access the site, and their account level: free or paid. It's important to note that those who use the free service receive advertisements. Let's take a look at the schema of this log file:

Schema of our dataset

Check for missing data

After cleaning: the first version of the data

At this point, the song, length, and artist columns still had null values, so I made a second version of the data with those rows dropped.

After cleaning

The problem was that this second pass removed a large amount of data, so I decided instead to keep the nulls in the length, artist, and song columns.

In terms of null values present, we have two cases:

Case 1: userId, userAgent, firstName, lastName, gender, location, registration

About 8,000 rows are sessions recorded before the user logged in; they are all redundant for our purposes and are cleaned away.

Case 2: artist, length, song

About 58,000 rows have nulls that occur when the user adds a friend or gives a song a thumbs up; these rows are all useful for our purposes and are definitely kept.

Feature Engineering

Some features that can plausibly affect churn can be derived from the dataset. I engineered the new features listed below to improve model accuracy; the ones I expect to be good predictors are:

  • The total number of songs played in the session
  • Number of thumbs up (liked songs)
  • Number of thumbs down (disliked songs)
  • Number of songs added to the playlist
  • Average session length
  • The total number of songs played

Number of songs played

Session length

Songs added by users to playlists

Number of thumbs up (liked songs) and thumbs down (disliked songs)

Songs played per session

Then I joined them all together.

Finally, I scaled the features, and the dataset was ready.

Data Visualization

Distributions of the engineered features

Many of them are right-skewed, while the average session length follows a roughly symmetrical distribution. We know the impact that skewed data can have on our model, so we will need to normalize the dataset in order to nullify the impact of larger values over smaller ones.

I combined these features with the churn labels and passed them to our models; this is the dataset used for the machine learning problem of predicting churn.

Account level: churned vs. non-churned

Modeling:

I split the full dataset into train and test sets, then tested baselines of five machine learning methods: Logistic Regression, Decision Tree Classifier, Naive Bayes, Random Forest Classifier, and Gradient Boosted Tree Classifier.

First, I scaled the data.

Then I split the data into train and test sets.

Model Selection & Evaluation

Metrics

  • I can set the required column names via the rawPredictionCol and labelCol params, and the metric via the metricName param.
  • We have a binary target (churned), so I can use BinaryClassificationEvaluator to evaluate our models.
  • The default metric for this evaluator is areaUnderROC.

Accuracy is the ratio of the number of correct predictions to the total number of input samples. It works well only when there is an equal number of samples in each class.

Area Under the ROC Curve (Receiver Operating Characteristic) is the area under the probability curve. It measures the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
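Both metrics are easy to illustrate with scikit-learn, which this project already lists as a dependency; the numbers below are purely illustrative:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]               # ground-truth labels
scores = [0.1, 0.4, 0.35, 0.8]      # model scores for the positive class
preds = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

acc = accuracy_score(y_true, preds)   # fraction of correct predictions: 0.75
auc = roc_auc_score(y_true, scores)   # prob. a random positive outranks
                                      # a random negative: 0.75
print(acc, auc)
```

Note that accuracy depends on the chosen threshold, while AUC is computed over all thresholds at once, which is why it is more robust for imbalanced classes such as churn.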

I will train five algorithms on the dataset and select the best model:

  1. RandomForestClassifier
  2. Naive Bayes classifier
  3. DecisionTreeClassifier
  4. LogisticRegression
  5. GradientBoostedTreeClassifier

1- Random forest

Training time: 95.44 seconds
Area Under Curve: 0.65
Test Error = 0.35

2- Naive Bayes

Training time: 53.70 seconds
Area Under Curve: 0.5
Test Error = 0.5

3- Decision tree

Training time: 84.66 seconds
Area Under Curve: 0.55
Test Error = 0.45

4- Logistic Regression

Training time: 222.22 seconds
Area Under Curve: 0.5
Test Error = 0.5

5- Gradient Boosted Tree

Training time: 303.88 seconds
Area Under Curve: 0.6
Test Error = 0.4

Results

I analyzed the Sparkify dataset and engineered new features to predict churn. The best-performing models are the random forest and the gradient boosted tree.

However, the gradient boosted tree requires more care when used, and setting the random forest hyperparameters maxBins and maxDepth too aggressively makes the model overfit; the same holds for the decision tree.

The random forest gives us the best result in a short time, and its accuracy increased a little when using cross-validation.

So, in the future, we could apply deep learning techniques to see whether they improve the results.

Justification:

The final random forest model achieved higher accuracy once cross-validation was applied. The improvement is small, but it matters in the bigger picture: it will help Sparkify predict churn more accurately and direct offers to the right individuals, which would save the company a lot of money.

Model Tuning and Improve Performance

As we see, the gradient boosted tree is among the strongest performers alongside the random forest, but it took much longer to train than logistic regression and the decision tree. For logistic regression, the training time is significant as well. The decision tree without any parameter tuning outperformed the random forest classifier in training time.

Cross-validation over a grid of parameters is expensive; in my case it took about 15 minutes. The parameter grid has 2 values for numTrees and 4 values for maxDepth of the random forest, and CrossValidator uses 4 folds.

This multiplies out to (2×4)×4 = 32 different models being trained. In other words, using CrossValidator can be very expensive. However, it is a principled way to choose parameters, more reliable than tuning them heuristically.

I used the random forest to perform cross-validation.

There was a small increase in accuracy.

Conclusion

We achieved an accuracy score of 52% with the random forest and 60% with the gradient boosted tree; after applying cross-validation, the accuracy increased by almost 3%.

The gradient boosted tree is costly to apply to a large dataset because it takes much longer to train, while the random forest is faster.

I tried several machine learning models to find the best one, and found the random forest to be the most efficient of them all.

One of the problems I faced was that the accuracy stayed very low; I spent a lot of time trying to increase it.

In the future, we may use deep learning methods to improve our accuracy.

Repo :

https://github.com/AhmadAbdElhameed/Data-Scientist-Nanodegree/tree/main/sparkify%20capstone%20project

Reflection for future improvement

Spark is great for large datasets that would run out of memory on a single machine, but it is very time-consuming when you re-run the processing steps every time.

In the future, we could use deep learning methods; they may give us a better result.

There is potential to improve the model and add a lot more features from the activity log data provided.
