Customer Churn Prediction Using Machine Learning and Python

Customer churn is a critical metric for understanding how well a company keeps its clients.

Keeping 100% of customers happy is unfortunately unrealistic. That’s why companies try so hard to increase their customer retention rate.

In this tutorial, we are going to build a customer churn project based on the Telco dataset using machine learning and Python.

Table of contents:

Customer churn definition
Dataset of Telco
Traditional ML models: logistic regression, XGBoost, decision tree, and random forest
Deep learning model: ANN
Evaluation
Summary

Customer Churn Definition

Customer churn occurs when customers leave the company. In other words, clients stop using the company’s products and services for a period of time.

Why does a company want to reduce customer churn? Because acquiring new clients is more expensive than retaining existing ones. Companies therefore try to predict when a customer is about to churn so that they can act to retain that customer.

This is a classification problem. We want to classify whether a customer will churn or not.

Let’s analyze the dataset we will use in our example.

Dataset of Telco

The Telco dataset is a sample dataset published by IBM.

The dataset includes information about the customers of a company. Each column describes a specific feature, such as:

gender, senior-citizen status, whether the customer has a partner, the number of months the customer has stayed with the company…

You can download the dataset from this link: https://www.kaggle.com/blastchar/telco-customer-churn

Let’s load the dataset into a DataFrame:

import pandas as pd

# Load the Telco dataset
dataset = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

Data visualization

1- The shape of our dataset:

dataset.shape

The dataset has 7043 rows. The raw CSV has 21 columns; the 20 shown below is the count once customerID has been dropped (step 2).

(7043, 20)

Let’s explore this data using seaborn.

import seaborn as sns
sns.countplot(x='TechSupport',data=dataset, hue='Churn',palette='viridis')

The plot suggests that tech support is associated with the churn rate of customers. A count plot shows this only visually; it does not measure how strong the relationship is.
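To put a number on what the count plot shows, one option is a row-normalised cross-tabulation. The sketch below uses a small hypothetical DataFrame (the rows are made up for illustration, not taken from the Telco file); on the real data you would pass dataset['TechSupport'] and dataset['Churn'] instead.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Telco data (not real records)
toy = pd.DataFrame({
    "TechSupport": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
    "Churn":       ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# Share of churners within each TechSupport group (each row sums to 1)
churn_share = pd.crosstab(toy["TechSupport"], toy["Churn"], normalize="index")
print(churn_share)
```

Comparing the "Yes" column across rows gives the churn rate per group, which is easier to compare than raw bar heights.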

2- Some columns are useless in our case, like customerID. We will drop it.

dataset.drop(['customerID'],axis=1,inplace=True)

Preprocessing

1- If we check our data, we find that some numerical columns, like ‘TotalCharges’, are stored with type ‘object’.

dataset.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

We will convert this column to a numeric type using the ‘to_numeric’ function. With errors='coerce', values that cannot be parsed as numbers become NaN.

dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"],errors='coerce')
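To see what errors='coerce' does, here is a minimal sketch on a hypothetical Series. The values are invented, and the blank string stands in for an empty TotalCharges entry.

```python
import pandas as pd

# Hypothetical values; the blank string mimics an empty TotalCharges cell
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
print("missing after conversion:", converted.isna().sum())
```

This is why missing values only show up after the conversion: before it, the blanks are ordinary strings.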

2- Next, we need to transform the categorical columns into numerical representations using get_dummies.

df=pd.get_dummies(dataset,drop_first=True)
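As an illustration of how get_dummies with drop_first=True names the resulting columns, consider this sketch on a hypothetical three-row-category frame (the values are invented). For each categorical column, the first category in sorted order is dropped, and the remaining ones become indicator columns.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Churn": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 34, 60, 12],
})

# One indicator column per category, minus the first category of each column
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Note that the target ends up in a column called Churn_Yes, which is the name to use when separating features from the label.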

3- Check for NaN values

df.isna().any(axis=1).sum()

There are 11 rows with missing values in this dataset.

Display the rows that have missing values.

df[df.isna().any(axis=1)]

Let’s drop those rows. Note that we drop them from df, the encoded DataFrame used for training.

df.dropna(axis=0, inplace=True)

4- Split the dataset into training and test sets.

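# Separate the features from the target (get_dummies names the encoded target 'Churn_Yes')
X = df.drop('Churn_Yes', axis=1)
y = df['Churn_Yes']
# Divide the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)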
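Because churners are usually a minority of customers, one optional variation is a stratified split, which keeps the churn proportion the same in the training and test sets. This is not what the tutorial's split does; the sketch below uses hypothetical arrays and the stratify parameter of train_test_split to show the idea.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 30 churners (1) out of 100 customers
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print("churn share in train:", y_tr.mean())
print("churn share in test: ", y_te.mean())
```

Using it in the tutorial would change the split and therefore the reported accuracies, so treat it as an alternative to try rather than a correction.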

Traditional ML Models

Logistic regression

# Logistic regression
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
from sklearn.metrics import ConfusionMatrixDisplay

clf = LogisticRegression(random_state=0).fit(X_train, y_train)

pred = clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

The accuracy is:

accuracy:   0.800

Xgboost

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from xgboost import XGBClassifier

xgb_model = XGBClassifier()
clf = xgb_model.fit(X_train, y_train)
pred = clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

The accuracy is:

accuracy:   0.791

Decision Tree

from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

The accuracy is:

accuracy:   0.728

Random forest

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

The accuracy is:

accuracy:   0.791
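The decision tree and random forest snippets above show only the fitting step; scoring follows the same pattern as for logistic regression. Here is a minimal sketch of that pattern, run on synthetic data from make_classification as a stand-in for the encoded Telco features. The helper name fit_and_score and the Xs_/ys_ variables are hypothetical, and the printed numbers will not match the tutorial's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 30 encoded Telco features (not the real data)
X_syn, y_syn = make_classification(n_samples=500, n_features=30, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=0
)

def fit_and_score(model):
    """Fit on the training split and return accuracy on the test split."""
    model.fit(Xs_train, ys_train)
    return accuracy_score(ys_test, model.predict(Xs_test))

scores = {
    "Decision Tree": fit_and_score(DecisionTreeClassifier(random_state=0)),
    "Random Forest": fit_and_score(RandomForestClassifier(random_state=0)),
}
for name, acc in scores.items():
    print("%s accuracy: %0.3f" % (name, acc))
```

Wrapping the fit and score steps in one helper keeps every model evaluated on exactly the same split.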

Deep learning model: ANN

We will create an ANN model to predict customer churn using TensorFlow and Keras.

import tensorflow as tf
from tensorflow import keras


model = keras.Sequential([
    keras.layers.Dense(30, input_shape=(30,), activation='relu'),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# opt = keras.optimizers.Adam(learning_rate=0.01)

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

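# Cast to float32 so Keras receives a uniform numeric array
# (since pandas 2.0, get_dummies produces boolean columns by default)
model.fit(X_train.astype('float32'), y_train.astype('float32'), epochs=100)
model.evaluate(X_test.astype('float32'), y_test.astype('float32'))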

The loss and the accuracy of the model are:

loss and accuracy : [0.5680199265480042, 0.7949159741401672]

Evaluation of models

Algorithm	Accuracy
Logistic regression	0.800
XGBoost	0.791
Decision tree	0.728
Random forest	0.791
ANN	0.795

Comparing ML models

We can see that the logistic regression model has the best accuracy, even without any hyperparameter tuning.
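Accuracy alone can be misleading when one class is much rarer than the other, so it can help to look at it next to a majority-class baseline and at precision and recall for the churn class. The sketch below uses hypothetical labels (invented for illustration, not model output from this tutorial) and standard scikit-learn metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = churn, 0 = stay (10 customers, 3 churners)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Baseline: always predict the majority class ("stay")
baseline = accuracy_score(y_true, np.zeros_like(y_true))
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted churners, how many churned
recall = recall_score(y_true, y_pred)        # of actual churners, how many were caught

print("baseline accuracy: %0.3f" % baseline)
print("model accuracy:    %0.3f" % accuracy)
print("precision: %0.3f, recall: %0.3f" % (precision, recall))
```

In this toy case the model matches the baseline's accuracy while catching only one of the three churners, which is the kind of gap accuracy hides.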

Summary

We learned to create machine learning models and a simple ANN model to predict customer churn using the Telco dataset.

In particular:

We visualized the Telco dataset to look for relationships between features and churn.
We created machine learning models: logistic regression, XGBoost, decision tree, and random forest.
We created a simple deep learning model, an ANN, using TensorFlow and Keras.
Finally, we compared all those models in terms of accuracy.

If you have any comments or questions, don’t hesitate to leave them in the comments section below.

lamya A.

Hey there! I am the creator of AI Decoder.
I am a data scientist by training and a Ph.D. student in AI. In this blog, I try to explain the knowledge I learn in simple words and help someone somewhere.