FAKE NEWS DETECTION
A PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
MARCH 2021
COIMBATORE INSTITUTE OF TECHNOLOGY
(A Govt. Aided Autonomous Institution Affiliated to Anna University)
COIMBATORE – 641014
BONAFIDE CERTIFICATE

Prof. N. K. Karthikeyan,
HEAD OF THE DEPARTMENT,
Department of Information Technology,
Coimbatore Institute of Technology,
Coimbatore - 641014.

Mrs. S. Devi,
SUPERVISOR,
Department of Information Technology,
Coimbatore Institute of Technology,
Coimbatore - 641014.
Certified that the candidates were examined by us in the project work viva-voce
examination held on …………………
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 MACHINE LEARNING
1.2 SUPPORT VECTOR MACHINE
1.3 ADABOOST
1.4 CONVOLUTIONAL NEURAL NETWORK
1.5 KERAS TUNER
1.6 ERROR LEVEL ANALYSIS
1.7 ANACONDA
1.8 JUPYTER
1.9 SPYDER
ACKNOWLEDGEMENT
During the entire period of study, the staff members of the Department of
Computer Science and Engineering & Information Technology have offered
ungrudging help. It is also a great pleasure to acknowledge the unfailing help we have
received from our friends.
It is a matter of great pleasure to thank our parents and family members for
their constant support and cooperation in the pursuit of this endeavour.
ABSTRACT
Fake news detection is much needed in today's world, as fake news has a large impact on our
social as well as personal lives, but the task involves challenges due to limited resources such as
datasets and research papers. These days a lot of information is shared over social media, and we
are not able to differentiate between which information is fake and which is real. People
immediately start expressing their concern or sharing their opinion as soon as they come across
a post, without verifying its authenticity, which further spreads it. Fake news and rumours are
the most popular forms of false and unauthenticated information and should be detected as soon
as possible to avoid their dramatic consequences. In previous research, many fake news
detection methods were proposed. In this project, the final report is generated by combining the
predictions of both text classification, using SVM or AdaBoost, and image classification, using
CNN.
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
ML Machine Learning
SVM Support Vector Machine
CNN Convolutional Neural Network
ELA Error Level Analysis
KNN K-Nearest Neighbors
SGD Stochastic Gradient Descent
NLP Natural Language Processing
PCA Principal Component Analysis
CHAPTER 1
1 INTRODUCTION
As an increasing amount of our lives is spent interacting online through social media platforms,
more and more people tend to seek out and consume news from social media rather than
traditional news organizations. Statistics show that nearly 6 in 10 people use their mobiles for
their daily news updates. Here comes the problem: there are hundreds of websites and sources
providing news in their own way, which ultimately leads to manipulation of news, losing its
original flavour. The extensive spread of fake news has a serious negative impact on individuals
and society. The authenticity balance of the news ecosystem can be broken by fake news, which
is usually manipulated by propagandists to convey political messages or influence. Fake news
changes the way people interpret and respond to real news. Humans cannot check all the fake
news generated on social media; to overcome this issue, automated techniques are under
research. Machine learning techniques are used in this project.
1.2 SVM:
SVM stands for Support Vector Machine and is a supervised machine learning algorithm. It is
used for both classification and regression challenges, though it is mostly used for solving
classification problems. In this algorithm, each data item is plotted as a point in n-dimensional
space, where the value of each feature is the value of a particular coordinate. Classification is
done by finding the hyperplane that best separates the two classes. Support vectors are the
coordinates of the individual observations closest to the hyperplane; SVM chooses the
hyperplane that maximizes the margin to these points, and deleting the support vectors would
change the position of the hyperplane.
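As a quick illustration, the following is a minimal sketch of SVM-based text classification, assuming scikit-learn; the sample headlines, labels and feature size are invented for illustration and are not from the project dataset.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# toy corpus: 0 = true news, 1 = fake news (the report's labelling convention)
headlines = ["scientists confirm water on mars",
             "celebrity secretly replaced by a clone",
             "new vaccine passes phase three trial",
             "moon landing was filmed in a basement"]
labels = [0, 1, 0, 1]

vectorizer = HashingVectorizer(n_features=2**10)   # stateless; no vocabulary stored
X = vectorizer.transform(headlines)

clf = LinearSVC()                                  # fits the maximum-margin hyperplane
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["aliens endorse presidential candidate"])))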
1.3 ADABOOST:
AdaBoost (Adaptive Boosting) is an ensemble learning algorithm that combines many weak
learners, typically shallow decision trees, into a single strong classifier. In each round it
increases the weights of the training samples the previous learners misclassified, so that
subsequent learners concentrate on the hard cases. In this project, AdaBoost is one of the two
algorithms used for text classification.
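A minimal AdaBoost sketch follows, assuming scikit-learn; the synthetic dataset is illustrative only.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# each boosting round fits a weak learner (a decision stump by default) and
# re-weights the samples it misclassified so later learners focus on them
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))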
1.4 CNN:
A CNN, or convolutional neural network, is a class of deep neural networks mostly used for
analyzing visual imagery. The name indicates that the network employs the mathematical
operation called convolution: convolutional networks are a specialized type of neural network
that uses convolution in place of general matrix multiplication in at least one of its layers. CNNs
are designed for reduced processing requirements and use a system much like a multilayer
perceptron. Compared to other image classification algorithms, CNNs use relatively little
pre-processing; a major advantage of a CNN is this independence from prior knowledge and
human effort in feature design.
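A minimal Keras CNN sketch follows, assuming TensorFlow; the 64x64 input shape is illustrative and smaller than the 192x192 ELA images used later in this project.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(16, kernel_size=3, activation='relu',
                        input_shape=(64, 64, 3)),  # convolution replaces dense matrix multiplication
    keras.layers.MaxPooling2D(pool_size=2),        # downsample the feature maps
    keras.layers.Flatten(),
    keras.layers.Dense(2, activation='softmax')    # two classes: original vs tampered
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()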
1.5 KERAS TUNER:
Hyperparameters are the variables that govern the training process and the topology of an ML
model. These variables directly impact the performance of the ML program and remain constant
over the training process. Hyperparameters are of two types: model hyperparameters and
algorithm hyperparameters. Model hyperparameters influence model selection, such as the
number and width of hidden layers. Algorithm hyperparameters influence the speed and quality
of the learning algorithm, such as the learning rate for stochastic gradient descent (SGD) and the
number of nearest neighbors for a k-nearest neighbors (KNN) classifier. Keras Tuner is a library
that searches over such hyperparameters to select the best-performing Keras model.
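The distinction can be shown in a short sketch, assuming scikit-learn; the parameter values are illustrative.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier

# model hyperparameter: k shapes the model itself (as layer counts/widths do)
knn = KNeighborsClassifier(n_neighbors=5)

# algorithm hyperparameter: the learning rate only affects how training proceeds
sgd = SGDClassifier(learning_rate='constant', eta0=0.01)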
1.7 ANACONDA:
Anaconda is a free and open-source distribution of the Python and R programming languages
for scientific computing. It is used for data science, machine learning applications, predictive
analytics, large-scale data processing, etc. It aims to simplify deployment and package
management. The conda package management system manages package versions.
1.8 JUPYTER NOTEBOOK:
Jupyter Notebook is an open-source web application that allows users to create and share
documents containing narrative text, visualizations, live code and equations. It is used for
data cleaning and transformation, data visualization, statistical modeling, numerical
simulation, machine learning and much more. Over 40 programming languages are
supported by Jupyter. Python (Python 3.3 or greater, or Python 2.7) is a requirement for
installing the Jupyter Notebook itself.
1.9 SPYDER:
Spyder is a scientific environment written in Python, designed by and for scientists, data
analysts and engineers. It provides a unique combination of the analysis, debugging, advanced
editing and profiling functionality of a comprehensive development tool with the deep
inspection, data exploration, interactive execution and beautiful visualization capabilities of a
scientific package. It can be extended further via its API and plugin system. Spyder can also be
used as a PyQt5 extension library, allowing developers to build upon its functionality and
embed its components, such as the interactive console.
CHAPTER 2
LITERATURE SURVEY
[1] TITLE:
A benchmarking study for fake news detection
AUTHORS:
DESCRIPTION:
It is taken from Elsevier. The algorithms used in this paper are K-NN, Decision Tree, Naive
Bayes, SVM, AdaBoost and Bagging. The proposed approach could be the base for a tool helping
publishers quickly decide which articles need further examination concerning their veracity.
ADVANTAGES:
In this paper they proposed an enhanced set of linguistic features with powerful capabilities for
discriminating fake news from real news articles.
LIMITATIONS:
It deals with textual content only. It detects only whether the given news is real or fake.
[2] TITLE:
Detecting Fake Images on Social Media using Machine Learning
AUTHORS:
DESCRIPTION:
CNN is used to train the networks, and three algorithms are compared to find the best one.
ADVANTAGES:
CNN is useful for automatic feature learning. CNN uses weight sharing, in which two layers can
have the same weights.
LIMITATIONS:
Compared with the other algorithms, CNN gives low accuracy on both datasets. Neural
networks have the limitation of being computationally expensive.
[3] TITLE:
AUTHORS:
Muhammad Umer, Saleem Ullah, Arif Mehmood, Gyu Sang Choi, Byung-Won On.
DESCRIPTION:
PCA and chi-square are used to perform component-level analysis and obtain the reduced
feature set. Accuracy: 97.8%.
ADVANTAGES:
PCA reduces time and space complexity. Feature selection reduces processing time. Text
classification is done easily using feature extraction.
LIMITATIONS:
This method is limited to claims in the English language only. It is limited to a particular set of
data.
[4] TITLE:
AUTHORS:
DESCRIPTION:
Decision Tree, K-Nearest Neighbor (KNN), Logistic Regression, Linear Support Vector
Machines, Multinomial Naive Bayes, Bernoulli Naive Bayes, Perceptron, Neural Network,
Ensemble Random Forest and Extreme Gradient Boosting classifiers are used.
ADVANTAGES:
It gives complete results regarding COVID-19 and also covers most websites. It focuses mainly
on the WHO, UNICEF and UN websites.
LIMITATIONS:
Due to the spread of fake news about COVID-19, people may lose confidence in the government,
and some may promote specific products to achieve enormous profit. This technique may
prevent such activities.
[5] TITLE:
AUTHORS:
Edmar R.S. de Rezende, Guilherme C.S. Ruppert, Antonio Theophilo, Eric K. Tokuda, Tiago Carvalho.
DESCRIPTION:
Deals with the problem of detecting computer-generated images through deep convolutional
networks and transfer learning techniques. A CNN is used to classify each image from the
dataset using the raw RGB pixel values as features, without the need for manual feature
extraction. Accuracy: 97%.
ADVANTAGES:
By using a DNN there is no need for human-engineered features; a general-purpose algorithm
learns directly from raw data.
LIMITATIONS:
CG images can be identified easily, so it could be easy to identify fake news.
[6] TITLE:
AUTHORS:
DESCRIPTION:
Random Forest gives higher accuracy than the other methods. Ensemble methods increase the
accuracy, with Gradient Boosting giving 86% accuracy.
ADVANTAGES:
By using ensemble methods we can improve the accuracy. It is found that real news has a higher
number of quoted values than fake news.
LIMITATIONS:
With the help of text mining, it is known that real news mostly includes quoted content. The
dataset used in this prediction contains only limited news, related only to politics. It should
include an image dataset.
[7] TITLE:
AUTHORS:
Julio C. S. Reis, Andre Correia, Fabrício Murai, Adriano Veloso, Fabrício Benevenuto
DESCRIPTION:
ADVANTAGES:
The two primary benefits of the k-Nearest Neighbor algorithm are efficiency and flexibility. The
algorithm is efficient in its simplicity, speed, and scalability. As described above, the mechanics
of the algorithm are readily apparent, and it is simple to understand and implement.
LIMITATIONS:
Sometimes it reports true news as fake if the IP address used by a news channel is the same as
one blacklisted in the dataset. It needs a live algorithm to keep its blacklist current. This model
focuses only on political news.
[8] TITLE:
AUTHORS:
William D. Ferreira, Cristiane B.R. Ferreira, Gelson da Cruz Júnior, Fabrizzio Soares
DESCRIPTION:
There are many methods of image manipulation, of which the two most common are active and
passive methods. These are further divided with respect to the spatial and frequency domains.
ADVANTAGES:
Embedding different domains in a method can give great outcomes. Each proposed method can
be used for different use cases.
LIMITATIONS:
CHAPTER 3
3 SYSTEM ARCHITECTURE
PROPOSED SYSTEM
The proposed system predicts whether given Twitter news is fake or original (text news using
AdaBoost and Support Vector Machine; tampered images by error level analysis using CNN)
using machine learning. The main objective is to improve the accuracy of both models
separately through the data preprocessing methods and to average the results of both models,
which gives the final result as fake or not, as sketched below.
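A hedged sketch of this combining step follows; the function name and the 0.5 threshold are assumptions for illustration, since the report only states that the two predictions are averaged.

def combined_verdict(text_pred, image_pred):
    # text model: 0 = true, 1 = fake; image model: 0 = original, 1 = tampered
    preds = [p for p in (text_pred, image_pred) if p is not None]  # a missing input skips that model
    return "FAKE" if sum(preds) / len(preds) >= 0.5 else "NOT FAKE"

print(combined_verdict(1, 1))     # both models flag the post -> FAKE
print(combined_verdict(0, None))  # no image uploaded; text alone decides -> NOT FAKE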
CHAPTER 4
4 SYSTEM SPECIFICATION
The hardware and software specifications for the system are as follows.

CHAPTER 5
The system is made up of three important modules:
● Input Module
● Processing Module
○ Training and validation of pre-processed data for prediction.
● Output Module
The input module is divided into two parts: the text box module and the add-image module.
The first part is used to upload the text inputs, namely the news title, the author name and the
body of the news.
The second part is used to upload an image, which may be tampered or original, for the
tampered image detection part.
a) Original image
b) Tampered image (copy-move)
Model 1:
In module one, the text inputs such as the title, author and body of the news are preprocessed
by NLP techniques, which include tokenizing the paragraphs into words, removing special
characters, and removing unwanted features called stopwords which may decrease accuracy.
The next part is the hashing vectorizer; it extracts features from the above dataset of words,
with their counts and their importance in the whole corpus. The extracted features, with
floating-point values as their importance, are finally passed to the model trained with the
AdaBoost algorithm, and the prediction gives the result as 0 or 1 for true or fake news
respectively. A condensed sketch of this flow follows.
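The sketch below assumes scikit-learn; the two sample articles and the feature size are invented for illustration (the project's full code is in Appendix II).

import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.ensemble import AdaBoostClassifier

articles = ["Government announces new IT policy http://example.com",
            "#shocking celebrity cloned again"]
labels = [0, 1]  # 0 = true, 1 = fake, as in the report

cleaned = [re.sub('[^a-zA-Z]', ' ', a).lower() for a in articles]  # strip URLs and symbols
X = HashingVectorizer(n_features=2**10, ngram_range=(1, 3)).transform(cleaned)

model = AdaBoostClassifier().fit(X, labels)  # stands in for the trained model
print(model.predict(X))                      # 0 or 1 per article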
Model 2:
In module two, the input is an image which may be tampered or original. The input image
format is .jpg, because nowadays the most widely used image format is .jpg. The image is
converted to its error level analysis (ELA) form, then converted to a 3-D array of values between
0 and 255 and rescaled to a specified size for further processing. The pre-processed data is then
passed to a trained convolutional neural network model (Keras Tuner is used for selecting the
best model) for prediction. Finally, the result is shown as 0 or 1 for original or tampered images
respectively; the ELA step is sketched below.
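The following sketch assumes Pillow; a solid-colour stand-in image is generated so the snippet runs without external files (the project's full function is in Appendix II).

from PIL import Image, ImageChops

# stand-in image so the sketch is self-contained
Image.new('RGB', (256, 256), (120, 90, 60)).save("photo.jpg", 'JPEG', quality=90)

original = Image.open("photo.jpg").convert('RGB')
original.save("resaved.jpg", 'JPEG', quality=95)   # re-compress at a known quality
resaved = Image.open("resaved.jpg")

ela = ImageChops.difference(original, resaved)     # per-pixel compression error
ela = ela.resize((192, 192))                       # rescale for the CNN input
print(ela.getextrema())                            # channel values lie in 0-255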
Note: both models are independent of each other; if an image cannot be uploaded, the model
gives a prediction based only on the text data, and vice versa.
6 CONCLUSION:
The combined model of fake news and tampered image detection was created and tested
successfully. This model focuses on important features like the title and body of the news in
fake news detection, which increases the accuracy, and the error level analysis method is one of
the efficient methods for detecting fake images, especially JPEG images, because JPEG is used
almost everywhere. This model can be used by forensics departments with a little enhancement,
and it can also be used by regular social media users.
Recently, people are heavily present on social media. It has a powerful impact on every person's
life, and it is becoming a trusted platform. Due to some unavoidable causes, it also leads to the
spread of fake news all over the world, which has very negative impacts on our society. Hence
this kind of detection method will be very helpful in defending against fake news.
7 FUTURE ENHANCEMENT:
The accuracy of this model can be increased by training it with much more data in the future.
This model focuses only on the JPEG image format; in the future its effectiveness can be
extended to many more image formats.
8 APPENDIX
APPENDIX – I
Model 1: Fake news detection (text as input)
Importing Libraries
Data preprocessing
Hashing vectorizer (Feature Extraction)
SVM model
Confusion matrix with Accuracy
Adaboost Model
Confusion matrix with Accuracy
Model 2: Tampered image detection (image as input)
Importing Libraries
Original image and its ELA conversion
Tampered image and its ELA conversion
Assigning a label for each image
Creating a dataframe which contains each image path and its label
Dataframe with image paths and labels
Conversion into ELA images
Training the CNN model with image arrays
Classification result of the tested CNN model
APPENDIX – II
import re
import itertools
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve, accuracy_score, roc_auc_score, auc, mean_squared_error, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer
import matplotlib.pyplot as plt
f1 = "/kaggle/input/fake-news-data/train.csv"
f2 = "/kaggle/input/fake-news-data/test.csv"
f3 = "/kaggle/input/mlfake/news_dataset.csv"
df=pd.read_csv(f1)
dftest=pd.read_csv(f2)
test=pd.read_csv(f3)
df=df.dropna()
data=df.copy()
data=data.iloc[0:6000,]
data.shape
data.reset_index(inplace=True)
# Reconstructed wrapper (assumption): the original cell's function header and
# loop appear only as a screenshot; the re.sub calls are chained on `review`
# so that each substitution builds on the previous one.
def preprocess(data, label):
    corpus = []
    for i in range(len(data)):
        try:
            review = re.sub(r'www.*$', '', data[label][i])
            review = re.sub(r'http.*$', '', review)
            review = re.sub(r'@.*$', '', review)
            review = re.sub(r'#.*$', '', review)
            review = re.sub('[^a-zA-Z]', ' ', review)
            review = review.lower()
            review = review.split()
            corpus.append(' '.join(review))
        except:
            continue
    return corpus
cor=preprocess(data,'text')
len(cor)
cor1=preprocess(data,'title')
len(cor1)
co=cor
co1=cor1
def vectorizer(corpuss, feature):
    hs_vectorizer = HashingVectorizer(n_features=feature, ngram_range=(1, 3))
    Xarray = hs_vectorizer.fit_transform(corpuss).toarray()
    return Xarray
X=vectorizer(co,5000)
X1=X
X=vectorizer(co1,1000)
X2=X
X=np.concatenate((X1,X2),axis=1)
y=data['label']
y.shape
# Reconstructed header and plotting calls (assumption): only the body of this
# function survived extraction; the imshow/tick lines follow the standard
# scikit-learn confusion-matrix example.
def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# The train/test split and model-training cells appear only as screenshots in
# Appendix I; a representative split is assumed here, and SVM / classifier are
# the fitted SVM and AdaBoost models from those cells.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pred1 = SVM.predict(X_test)
score = accuracy_score(y_test, pred1)
print("accuracy: %0.3f" % score)
cm = confusion_matrix(y_test, pred1)
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

pred2 = classifier.predict(X_test)
score = accuracy_score(y_test, pred2)
print("accuracy: %0.3f" % score)
cm = confusion_matrix(y_test, pred2)
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
def plot_roc_curves(models, X_test, y_true, save=False):
    # reconstructed wrapper (assumption): the function header and loop are
    # missing; predict_proba assumes SVC(probability=True) or similar
    for name, model in models.items():
        proba = model.predict_proba(X_test)[:, 1]
        prediction = model.predict(X_test)
        fpr, tpr, thresh = roc_curve(y_true, proba)
        accuracy = accuracy_score(y_true, prediction)
        # roc_auc = auc(fpr, tpr)
        label = "%s - Accuracy %0.2f" % (name, accuracy)
        plt.plot(fpr, tpr, label=label)
    # 45 degree line
    xx = np.linspace(0, 1.0, 20)
    plt.plot(xx, xx)
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.title("ROC curves")
    plt.legend()
    if save:
        plt.savefig('ROC_Curve')
models = {
    "AdaBoost": classifier,
    "SVM": SVM
}
plot_roc_curves(models, X_test, y_test)
import numpy as np
import os
# import cv2
from imutils import paths
from keras.utils.np_utils import to_categorical
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
from tqdm import tqdm
from PIL import Image, ImageChops, ImageEnhance
from sklearn.metrics import roc_curve, roc_auc_score, auc, mean_squared_error, classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics          # assumption: needed by the scoring cell below
import seaborn as sns                # assumption: needed by the heatmap cell below
from tensorflow import keras         # assumption: needed by build_model below
from kerastuner.tuners import RandomSearch  # keras-tuner; needed by the tuner cell
def ErrorLevelAnalysis(imgpath):
    original = Image.open(imgpath)
    try:
        original.save("ela1.jpg", 'JPEG', quality=95)
    except:
        original = original.convert('RGB')
        original.save("ela1.jpg", 'JPEG', quality=90)
    temporary1 = Image.open("ela1.jpg")
    # difference and brightness scaling reconstructed (assumption): the lines
    # between the resave and the return survive only as a screenshot
    diff = ImageChops.difference(original.convert('RGB'), temporary1)
    extrema = diff.getextrema()
    max_diff = max([ex[1] for ex in extrema]) or 1
    diff = ImageEnhance.Brightness(diff).enhance(255.0 / max_diff)
    return diff
path_org = "/kaggle/input/newdata/imgt/org"
path_fak = "/kaggle/input/newdata/imgt/fak"
org = os.listdir(path_org)
fak = os.listdir(path_fak)

# Reconstructed loops (assumption): each entry is stored as "<path> <label> "
# so that [0:-3] below recovers the path and [-2] the label; originals are
# labelled 0 and tampered images 1, per the report.
images_names = []
for imgname in tqdm(os.listdir(path_org)):
    try:
        if imgname.endswith('png') or imgname.endswith('jpg'):
            images_names.append(path_org + '/' + imgname + ' 0 ')
    except:
        print(path_org + imgname)
for imgname in tqdm(os.listdir(path_fak)):
    try:
        if imgname.endswith('png') or imgname.endswith('jpg'):
            images_names.append(path_fak + '/' + imgname + ' 1 ')
    except:
        print(path_fak + imgname)
image_name = []
label = []
for i in tqdm(range(len(images_names))):
    image_name.append(images_names[i][0:-3])
    label.append(images_names[i][-2])
dataset = pd.DataFrame({'image': image_name, 'output': label})
dataset['output'].value_counts()
dataset.to_csv('DLdataset.csv', index=False)
dataset = pd.read_csv('DLdataset.csv')
dataset = dataset.sample(frac=1)
x = []
y = []
for data in tqdm(dataset.values):  # reconstructed loop header (assumption)
    x.append(np.array(ErrorLevelAnalysis(data[0]).resize((192, 192))).flatten() / 255.0)
    y.append(data[1])
X = np.array(x)
Y = np.array(y)
X.shape, Y.shape
# Reshape, one-hot encoding and split (assumption): these cells appear only
# as screenshots in Appendix I but are required by the CNN below.
X = X.reshape(-1, 192, 192, 3)
Y = to_categorical(Y, 2)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
def build_model(hp):
    model = keras.Sequential([
        keras.layers.Conv2D(
            filters=hp.Int('conv_1_filter', min_value=32, max_value=96, step=16),
            kernel_size=5,  # the original hp.Choice listed duplicate values [5, 5], i.e. a fixed 5
            activation='relu',
            input_shape=X_train.shape[1:]
        ),
        keras.layers.MaxPooling2D(pool_size=2),  # original hp.Choice values [2, 2], i.e. a fixed 2
        keras.layers.Conv2D(
            filters=hp.Int('conv_2_filter', min_value=32, max_value=64, step=16),
            kernel_size=5,
            activation='relu'
        ),
        keras.layers.MaxPooling2D(pool_size=2),
        keras.layers.Dropout(0.5),
        keras.layers.Flatten(),
        keras.layers.Dense(
            units=hp.Int('dense_1_units', min_value=32, max_value=256, step=16),
            activation='relu'
        ),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(2, activation='softmax')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    # keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0,
    #                          epsilon=1e-07, centered=False, name="RMSprop")
    return model
tuner_search = RandomSearch(build_model,
                            objective='val_accuracy',
                            max_trials=5, directory='output',
                            project_name="fake image")
tuner_search.search(X_train, Y_train, epochs=30,
                    validation_data=(X_test, Y_test))
model = tuner_search.get_best_models(num_models=1)[0]
Y_true = np.argmax(Y_test, axis=1)
y_pred_cnn = np.argmax(model.predict(X_test), axis=1)  # assumption: the prediction cell appears only as a screenshot
score = accuracy_score(Y_true, y_pred_cnn)
print("Accuracy score: {}".format(score))
score = metrics.precision_score(Y_true, y_pred_cnn, average="weighted")
print("Precision score: {}".format(score))
score = metrics.recall_score(Y_true, y_pred_cnn, average="weighted")
print("Recall score: {}".format(score))
score_lr1 = metrics.f1_score(Y_true, y_pred_cnn, average="weighted")
print("F1 score: {}".format(score_lr1))
print(classification_report(Y_true, y_pred_cnn))
cm = confusion_matrix(Y_true, y_pred_cnn)  # assumption: computed in a screenshot cell
df_cm = pd.DataFrame(cm, columns=np.unique(Y_true), index=np.unique(Y_true))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize=(10, 7))
sns.set(font_scale=1.4)  # label size
sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size
cnn_model_y_proba = model.predict(X_test, verbose=1)
# plot_roc_curve refers to a helper defined in a screenshot cell, not the
# sklearn function of the same name
plot_roc_curve(Y_true, y_pred_cnn, cnn_model_y_proba[:, 1])