Customer churn is a key metric for understanding how well a company keeps its clients.
Keeping 100% of customers happy is unfortunately unrealistic. That’s why companies try so hard to increase their customer retention rate.
In this tutorial, we are going to build a customer churn prediction project based on the Telco dataset using machine learning and Python.
Table of contents:
- Customer churn definition
- Dataset of Telco
- Traditional ML models: logistic regression, XGBoost, decision tree, and random forest
- Deep learning model: ANN
- Evaluation
- Summary
Customer Churn Definition
Customer churn occurs when customers leave the company. In other words, clients stop using the company’s products and services for a period of time.
Why does a company want to reduce customer churn? Because acquiring new clients is more expensive than retaining existing ones. Therefore, companies try to predict when a customer is about to churn so they can act to keep that customer.
This is a classification problem. We want to classify whether a customer will churn or not.
Let’s analyze the dataset we will use in our example.
Dataset of Telco
The telco dataset is a sample dataset made by IBM.
The dataset includes information about the customers of a company. Each column describes a specific feature, for example:
gender, senior-citizen status, whether the customer has a partner, the number of months the customer has stayed with the company, and so on.
You can download the dataset from this link: https://www.kaggle.com/blastchar/telco-customer-churn
Let’s load the dataset into a pandas DataFrame:
import pandas as pd
# load the Telco dataset
dataset = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
Data visualization
1- The shape of our dataset:
dataset.shape
The dataset has 20 columns and 7043 rows.
(7043, 20)
Let’s explore this data using seaborn.
import seaborn as sns
sns.countplot(x='TechSupport',data=dataset, hue='Churn',palette='viridis')
The plot suggests that having tech support is associated with churn: customers without tech support appear to churn more often. A count plot shows an association only, not a cause.
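To put numbers on what the count plot shows, the churn share within each group can be computed with `pd.crosstab`. The sketch below uses a tiny made-up frame (the values are hypothetical, not taken from the Telco data); on the real data you would pass `dataset["TechSupport"]` and `dataset["Churn"]` instead.

```python
import pandas as pd

# Toy stand-in for the Telco data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "TechSupport": ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No"],
    "Churn":       ["No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No"],
})

# normalize="index" turns raw counts into shares within each TechSupport group,
# so every row of the result sums to 1
churn_by_support = pd.crosstab(toy["TechSupport"], toy["Churn"], normalize="index")
print(churn_by_support)
```

Reading the `Yes` column of the result gives the churn rate per group, which is easier to compare than bar heights when the groups differ in size.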
2- Some columns are useless in our case, like customerID. We will drop it.
dataset.drop(['customerID'],axis=1,inplace=True)
Preprocessing
1- If we check our data, we find that some numerical columns, such as ‘TotalCharges’, have the type ‘object’.
dataset.dtypes
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object
We will convert this column to a numeric type using the ‘to_numeric’ function. With errors='coerce', values that cannot be parsed become NaN.
dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"],errors='coerce')
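As a small illustration of what `errors='coerce'` does, here is a sketch on a hypothetical series in which a blank string stands in for a missing charge. The sample values are invented for the example.

```python
import pandas as pd

# Hypothetical sample: the blank string stands in for a missing charge
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" replaces unparseable values with NaN instead of raising an error
charges = pd.to_numeric(raw, errors="coerce")

print(charges.dtype)          # numeric dtype after conversion
print(charges.isna().sum())   # number of values that could not be parsed
```

This is why a missing-value check is needed after the conversion: the unparseable entries do not disappear, they turn into NaN.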
2- Next, we need to transform the categorical columns into numerical representations using ‘get_dummies’.
df=pd.get_dummies(dataset,drop_first=True)
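To see what `drop_first=True` does to the column names, here is a minimal sketch on a toy frame (the rows are made up; only the column names mirror the Telco data). For each categorical column, the first category in sorted order is dropped and the remaining ones become indicator columns.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (hypothetical rows)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Churn": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 12, 40, 5],
})

# Numeric columns pass through unchanged; each categorical column with k
# categories becomes k - 1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Note that the binary `Churn` column ends up as a single `Churn_Yes` indicator, which is the natural target column after encoding.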
3- Check for nan values
df.isna().any(axis=1).sum()
There are 11 rows with missing values in this dataset.
Display rows that have missing values.
df[df.isna().any(axis=1)]
Let’s drop those rows from the encoded DataFrame.
df.dropna(axis=0, inplace=True)
4- Split the dataset into training and test sets.
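# Separate the features from the target (get_dummies names the encoded target 'Churn_Yes')
X = df.drop('Churn_Yes', axis=1)
y = df['Churn_Yes']
# Divide the dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)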
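Because churners are usually a minority, one option worth considering is a stratified split, which keeps the class proportions similar in both sets. The sketch below runs on synthetic data; the `stratify` argument is an addition for illustration and is not used in the tutorial's split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, one positive label in four
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 25 + [0] * 75)

# stratify=y_demo asks for similar class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)     # roughly two thirds / one third of the rows
print(y_tr.mean(), y_te.mean())   # positive share in each split
```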
Traditional ML Models
Logistic regression
# logistic regression
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
# ConfusionMatrixDisplay requires scikit-learn >= 1.0
from sklearn.metrics import ConfusionMatrixDisplay
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
The accuracy is:
accuracy: 0.800
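To make the link between the confusion matrix and the accuracy score concrete, here is a small pure-Python sketch. The labels are invented for the example, and `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives, true positives."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Hypothetical labels: 1 = churned, 0 = stayed
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Accuracy is the share of correct predictions: the diagonal of the confusion matrix
accuracy = (tp + tn) / (tn + fp + fn + tp)
print(tn, fp, fn, tp)
print("accuracy: %0.3f" % accuracy)
```

In this toy case the model is right 7 times out of 10 but misses half of the churners, which is something the accuracy number alone does not reveal.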
XGBoost
from sklearn.metrics import ConfusionMatrixDisplay
from xgboost import XGBClassifier
xgb_model = XGBClassifier()
clf = xgb_model.fit(X_train, y_train)
pred = clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
The accuracy is:
accuracy: 0.791
Decision Tree
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
The accuracy is:
accuracy: 0.728
Random forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
The accuracy is:
accuracy: 0.791
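All the scikit-learn models above follow the same fit, predict, score pattern, so they can be compared in one loop. The sketch below is one way to do that; it runs on synthetic data from `make_classification` rather than the Telco data, so its scores will not match the ones reported here, and `max_iter=1000` is an assumption added to help logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Telco features (not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000, random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training set and score it on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy {scores[name]:.3f}")
```

To use it on the tutorial's data, replace `X_tr`, `X_te`, `y_tr`, `y_te` with `X_train`, `X_test`, `y_train`, `y_test`.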
Deep learning model: ANN
We will create an ANN model to predict customer churn using TensorFlow and Keras.
import tensorflow as tf
from tensorflow import keras
model = keras.Sequential([
    keras.layers.Dense(30, input_shape=(30,), activation='relu'),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100)
model.evaluate(X_test, y_test)
The loss and the accuracy of the model are:
loss and accuracy : [0.5680199265480042, 0.7949159741401672]
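To show what the three Dense layers compute, here is a NumPy sketch of the forward pass for a network with the same layer sizes (30 inputs, then 30, 15, and 1 units). The weights are random and untrained, so the outputs are meaningless as predictions; the point is only the shape of the computation. The helper names `relu`, `sigmoid`, and `forward` are hypothetical, not Keras functions.

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and zeroes out negative ones
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Sigmoid squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Random, untrained weights with the same shapes as the Keras model's layers
W1, b1 = rng.normal(scale=0.1, size=(30, 30)), np.zeros(30)
W2, b2 = rng.normal(scale=0.1, size=(30, 15)), np.zeros(15)
W3, b3 = rng.normal(scale=0.1, size=(15, 1)), np.zeros(1)

def forward(x):
    h1 = relu(x @ W1 + b1)         # Dense(30, activation='relu')
    h2 = relu(h1 @ W2 + b2)        # Dense(15, activation='relu')
    return sigmoid(h2 @ W3 + b3)   # Dense(1, activation='sigmoid')

batch = rng.normal(size=(4, 30))        # 4 hypothetical customers, 30 features each
probs = forward(batch)                  # one churn probability per customer
labels = (probs > 0.5).astype(int)      # threshold at 0.5 to get class labels
print(probs.shape, labels.ravel())
```

The sigmoid output is what makes `binary_crossentropy` a suitable loss: the network produces one probability per customer, and accuracy is computed after thresholding it.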
Evaluation of models
Algorithm | Accuracy |
Logistic regression | 0.800 |
XGBoost | 0.791 |
Decision tree | 0.728 |
Random forest | 0.791 |
ANN | 0.795 |
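We can see that the logistic regression model has the best accuracy (0.800), without any hyperparameter tuning. The gap is small, though: every model except the decision tree lands within about one percentage point of it.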
Summary
We learned to create machine learning models and a simple ANN model to predict customer churn using the Telco dataset.
In particular:
- We visualized the Telco dataset to look for relationships between features and churn.
- We created machine learning models: logistic regression, XGBoost, decision tree, and random forest.
- We created a simple deep learning model, an ANN, using TensorFlow and Keras.
- Finally, we compared all those models in terms of accuracy.
If you have any comments or questions, don’t hesitate to leave them in the comments section below.
Hey there! I am the creator of AI Decoder.
I am a data scientist by training and a Ph.D. student in AI. In this blog, I try to explain the knowledge I learn in simple words and help someone somewhere.