The goal of fake news detection is to uncover items that intentionally misinform or deceive readers. Fake news is frequently generated to alter public opinion or for political purposes.
After going through this tutorial, you will be able to understand and implement a fake news detection model on Twitter.
Let’s get started.
What is Fake News?
Fake news is defined as “news pieces that are purposely and verifiably untrue”.
Who generates fake news and why?
Those who run fake news websites want as many people as possible to visit their sites. While some want their visitors to read the information and be influenced by it, others merely want internet users to click on it, which typically leads to a website where users see more content and advertisements, generating more money for the website owner.
The Pipeline of the Project: Fake News Detection using Python
I will walk you through, step by step, how to create a machine learning model that detects fake news in tweets.
You can download the 2 datasets from this link: https://github.com/tech-data/Fake-News-detection-using-SVM/tree/main
First, we will preprocess the data.
Data Preprocessing:
Text preprocessing is used to prepare text data for model building. It is the initial stage in every NLP project. Preprocessing stages include the following:
- Removing all special characters and punctuation.
- Removing stop words: stop words are commonly used terms that are removed from the text because they add no value to the analysis; they carry little or no meaning. The NLTK library contains a list of English stop words. Some of them are: I, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now…
- Lowercasing: this is one of the most common preprocessing steps, in which all text is converted to the same case, ideally lower case. However, this step is not always required, because lowercasing can cause information loss in some cases (for example, “US” the country becomes indistinguishable from “us” the pronoun).
- Stemming is a text standardization step in which words are reduced to their root/base form. Words like “consultant,” “consulting,” and “consultants” will all be stemmed to “consult” (see the short sketch after this list).
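To make these steps concrete, here is a minimal sketch (the toy sentence is my own, not part of the tutorial data) that keeps only letters, lowercases the text, removes stop words, and stems what remains, assuming NLTK and its stop word list are installed:
#toy example of the preprocessing steps described above
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')                     # fetch the stop word list if it is missing
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

sentence = "The consultant was consulting other consultants!"
tokens = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()  # keep letters only, lowercase, tokenize
stems = [ps.stem(w) for w in tokens if w not in stop_words]  # drop stop words, then stem
print(stems)                                   # ['consult', 'consult', 'consult']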
Let’s start by importing the necessary libraries and reading the data.
#import necessary libraries
import pandas as pd
import re
import itertools
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
import nltk
nltk.download('stopwords')  # download the NLTK stop word list if it is not already present
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.metrics as metrics
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
#load training and testing sets
df=pd.read_csv("data/Constraint_Train.csv")
test=pd.read_csv("data/english_test_with_labels.csv")
#print first records of the training set
df.head()
#check the shape of our both datasets
print(df.shape)
print(test.shape)
The training data has 6420 rows and 2 columns. You can check the shape of a DataFrame with its .shape attribute. Let’s check the distribution of the dependent variable between fake and real.
#check if the dataset is balanced
df['label'].value_counts()
Check if there is any missing data:
#count rows with missing values in the training dataset
df.isna().any(axis=1).sum()
Reading some of the real tweets:
#read some real tweets
pd.set_option('display.max_colwidth',None)
df[df.label=="real"].tweet
Read some fake tweets:
#read some fake tweets
pd.set_option('display.max_colwidth',None)
df[df.label=="fake"].tweet
Cleaning the Data
First, we are going to transform categorical labels into numerical labels using LabelEncoder().
# Use LabelEncoder to transform the label categories ('fake', 'real') into numbers
label_encoder = preprocessing.LabelEncoder()
# Encode the 'label' column and create a new 'NB_label' column in both datasets
df['NB_label'] = label_encoder.fit_transform(df['label'])
test['NB_label'] = label_encoder.transform(test['label'])  # reuse the mapping learned on the training set
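LabelEncoder sorts the classes alphabetically, so 'fake' is encoded as 0 and 'real' as 1. A quick optional check (not in the original notebook) confirms the mapping:
#check the mapping produced by LabelEncoder (classes are sorted alphabetically)
print(label_encoder.classes_)            # ['fake' 'real'] -> fake = 0, real = 1
print(df[['label', 'NB_label']].head())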
In this step, we will create a preprocess() function that keeps only the characters a to z, lowercases the string, removes stop words, and stems the remaining words.
#function to preprocess the tweets
ps = PorterStemmer()
def preprocess(line):
    review = re.sub('[^a-zA-Z]', ' ', line)  # keep only characters from a to z
    review = review.lower()                  # lowercase the text
    review = review.split()                  # turn the string into a list of words
    # remove stop words (I, and, or, ...) and apply stemming
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    # turn the list back into a sentence
    return " ".join(review)
Let’s apply this function to the ‘tweet’ column and store the result in a new ‘processed_data’ column.
#apply preprocessing function on both training and test tweets
df['processed_data']=df['tweet'].apply(lambda x: preprocess(x))
test['processed_data']=test['tweet'].apply(lambda x: preprocess(x))
Let’s split our dataset into X and y for both the training and test sets.
#split train and test dataset columns into features and outcome
X=df['processed_data']
print(X)
y=df['NB_label']
Xtest=test['processed_data']
ytest=test['NB_label']
Bag of Words vs. TF-IDF
- Bag of words
Let’s explain bag of words first: it is a technique that transforms text into a numerical representation (word counts) that a machine learning model can understand.
# Creating the Bag of Words model by applying CountVectorizer - convert textual data to numerical data
cv = CountVectorizer(max_features=5000,ngram_range=(1,3))#example: the course was long-> [the,the course,the course was,course, course was, course was long,...]
X_cv = cv.fit_transform(X).toarray()
#apply cv to test vectorizer
test_cv=cv.transform(Xtest).toarray()
To apply bag of words, we use CountVectorizer with two specific parameters: max_features and ngram_range.
max_features=5000 keeps only the 5000 most frequent n-grams as the vocabulary. ngram_range=(1, 3) extracts combinations of 1 to 3 consecutive words. For example, for ‘the course was long’, the unigrams are ‘the’, ‘course’, ‘was’, and ‘long’; the bigrams are ‘the course’, ‘course was’, and ‘was long’; and the trigrams are ‘the course was’ and ‘course was long’.
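As a quick illustration (a toy example with my own variable names, not part of the tutorial data), this is the vocabulary CountVectorizer builds from that sentence with ngram_range=(1, 3):
#toy example: n-grams extracted by CountVectorizer(ngram_range=(1, 3))
from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer(ngram_range=(1, 3))
toy_cv.fit(["the course was long"])
print(toy_cv.get_feature_names_out())   # use get_feature_names() on scikit-learn < 1.0
# ['course' 'course was' 'course was long' 'long' 'the' 'the course'
#  'the course was' 'was' 'was long']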
Splitting the training set into train and validation sets.
# Split the training data into train and validation subsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_cv, y, test_size=0.33, random_state=0)
We will define a function to plot the confusion matrix:
#create function to plot confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    See full source and example:
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    # normalize first so the plot and the printed values agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
Classification Algorithms:
Now, we will train three models using the following algorithms: Logistic Regression, Linear SVC, and Passive Aggressive Classifier. We chose them because they gave the best results in our experiments, but you can try other algorithms too.
#train different algorithms and compare them using accuracy
models = []
models.append(('LogisticRegression', LogisticRegression()))
models.append(('LinearSVC', LinearSVC()))
models.append(('PassiveAggressiveClassifier', PassiveAggressiveClassifier()))
# evaluate each model in turn
for name, model in models:
    model = model.fit(X_train, y_train)
    pred = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred)
    msg = "%s: %f" % (name, score)
    print(msg)
Let’s plot the confusion matrix of LinearSVC as an example!
# linearSVC
svc = LinearSVC().fit(X_train, y_train)
pred = svc.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred)
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
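Accuracy and the confusion matrix are what we report in this tutorial; if you also want per-class precision and recall for the LinearSVC predictions, scikit-learn can print them from the same variables (an optional check, not part of the original pipeline):
#optional: per-class precision and recall for the LinearSVC predictions
print(metrics.classification_report(y_test, pred, target_names=['fake', 'real']))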
Now, we will test our algorithms on the test dataset.
#bag of words test
# evaluate each model in turn
for name, model in models:
    pred = model.predict(test_cv)
    score = metrics.accuracy_score(ytest, pred)
    msg = "%s: %f" % (name, score)
    print(msg)
- TF-IDF
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations in a sentence amongst a collection of sentences.
In our example, a word’s score is its frequency within a tweet multiplied by the inverse of its frequency across all the tweets, so words that appear in almost every tweet get a lower weight than words that are specific to a few tweets.
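As a toy illustration (my own three mini-documents, not part of the tutorial data), TfidfVectorizer gives a word that appears in every document a lower weight than words that are specific to one document:
#toy example: 'covid' appears in every document, so it gets a lower TF-IDF weight
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["covid cure found", "covid vaccine works", "covid cases rise"]
toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy).toarray()
for word, column in zip(toy_tfidf.get_feature_names_out(), toy_weights.T):
    print(word, column.round(2))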
Now, instead of bag of words, we will use the TfidfVectorizer() class and apply it to the training and test datasets.
# Creating the TF-IDF model by applying TfidfVectorizer - convert textual data to numerical data
vectorizer = TfidfVectorizer()
X_idf = vectorizer.fit_transform(X).toarray()
test_idf=vectorizer.transform(Xtest).toarray()
The rest of the steps are the same as before: split the dataset into training and validation sets and train the same algorithms, as sketched below.
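For completeness, here is a sketch of those repeated steps on the TF-IDF features (same split settings and the same three models as above; variable names such as X_train_idf are mine):
#repeat the split/train/evaluate steps on the TF-IDF features
X_train_idf, X_val_idf, y_train_idf, y_val_idf = train_test_split(
    X_idf, y, test_size=0.33, random_state=0)

for name, model in [('LogisticRegression', LogisticRegression()),
                    ('LinearSVC', LinearSVC()),
                    ('PassiveAggressiveClassifier', PassiveAggressiveClassifier())]:
    model.fit(X_train_idf, y_train_idf)
    val_score = metrics.accuracy_score(y_val_idf, model.predict(X_val_idf))
    test_score = metrics.accuracy_score(ytest, model.predict(test_idf))
    print("%s: validation %.3f, test %.3f" % (name, val_score, test_score))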
Evaluation using Accuracy
By comparing the accuracy of all the methods, we conclude that TF-IDF and LinearSVC gave the best performance on both the validation and the test datasets.
You can find the whole code in this repository: https://github.com/tech-data/Fake-News-detection-using-SVM
Summary
In this tutorial, you discovered how to create a fake news detector using Linear SVC and TF-IDF.
Specifically, we tested different text representations and classification algorithms:
- Bag of words vs TF-IDF
- Logistic Regression, Linear SVC, and Passive Aggressive Classifier.
Do you have any questions?
Ask your questions in the comments section below and I’ll try my best to respond.