
Documentation-Fake News Detection

This document is a project report that aims to combine fake news detection using supervised machine learning algorithms like SVM and AdaBoost for text classification, and convolutional neural networks (CNN) for image classification using error level analysis. The project was completed by three students from Coimbatore Institute of Technology to fulfill their Bachelor of Technology degree requirements. Key technologies used include machine learning, SVM, AdaBoost, CNN, Keras Tuner, error level analysis, Anaconda, Jupyter, and Kaggle notebooks.

Uploaded by

18IT030 Manoj G

COMBINED MODEL OF FAKE NEWS DETECTION USING

SVM, ADABOOST AND TAMPERED IMAGE DETECTION BY


ERROR LEVEL ANALYSIS USING CNN.

A PROJECT REPORT

Submitted by

BAGAVATHI @ BAVATHARANI S 1807011


INDHUMADHI K 1807022
KARTHIK V 1807027

In partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY

COIMBATORE INSTITUTE OF TECHNOLOGY, COIMBATORE-641014


(Government Aided Autonomous Institution Affiliated to Anna University)

ANNA UNIVERSITY, CHENNAI 600025

MARCH 2021
COIMBATORE INSTITUTE OF TECHNOLOGY
(A Govt. Aided Autonomous Institution Affiliated to Anna University)
COIMBATORE – 641014

BONAFIDE CERTIFICATE

Certified that this project report titled “COMBINING FAKE NEWS
DETECTION USING SUPERVISED SVM CLASSIFIER AND FAKE
IMAGE DETECTION USING CNN FROM TWITTER” is the bonafide
work of BAGAVATHI @ BAVATHARANI S (1807011),
INDHUMADHI K (1807022), KARTHIK V (1807027) in partial
fulfillment for the award of the Degree of Bachelor of Technology in
Information Technology of Anna University, Chennai during the academic
year 2019-2020 under my supervision.

Prof. N.K. KARTHIKEYAN,
HEAD OF THE DEPARTMENT,
Department of Information Technology,
Coimbatore Institute of Technology,
Coimbatore - 641014.

Mrs. S. Devi,
SUPERVISOR,
Department of Information Technology,
Coimbatore Institute of Technology,
Coimbatore - 641014.

Certified that the candidates were examined by us in the project work viva-voce
examination held on …………………

Internal Examiner External Examiner


Place:
Date:
TABLE OF CONTENTS

CHAPTER NO.   TITLE

ACKNOWLEDGEMENT
ABSTRACT
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 MACHINE LEARNING
1.2 SUPPORT VECTOR MACHINE
1.3 ADABOOST
1.4 CONVOLUTIONAL NEURAL NETWORK
1.5 KERAS TUNER
1.6 ERROR LEVEL ANALYSIS
1.7 ANACONDA
1.8 JUPYTER
1.9 SPYDER
1.10 NOTEBOOK IN KAGGLE
2 LITERATURE SURVEY
3 SYSTEM ARCHITECTURE
4 SYSTEM SPECIFICATION
4.1 HARDWARE SPECIFICATION
4.2 SOFTWARE SPECIFICATION
5 DESIGN & IMPLEMENTATION
5.1 INPUT MODULE
5.2 PROCESSING MODULE
5.3 OUTPUT MODULE
6 CONCLUSION
7 FUTURE ENHANCEMENT
8 APPENDIX
APPENDIX – I
8.1 SNAP SHOTS FOR OUTPUT
APPENDIX – II
8.2 SOURCE CODE
9 REFERENCES
ACKNOWLEDGEMENT

Our project “Combining Fake News Detection Using Supervised SVM
Classifier and Fake Image Detection Using CNN From Twitter” has been the
result of motivation and encouragement from many, whom we would like to thank.

We express our sincere thanks to our Secretary Dr.R.Prabhakar and our
Principal Dr.V.Selladurai for providing us a great opportunity to carry out our
work. Words are rather meagre to express our gratitude to them.
This work is the outcome of their inspiration and the product of their plethora of
knowledge and rich experience.

We record our deep sense of gratefulness to Dr.N.K.Karthikeyan, Head of the
Department of Information Technology, for his encouragement and support during
this tenure.

We equally tender our sincere gratitude to our project guide Mrs.S.Devi,
Department of Information Technology, for her valuable suggestions and guidance
during this course.

During the entire period of study, all the staff members of the Department of
Computer Science and Engineering & Information Technology have offered
ungrudging help. It is also a great pleasure to acknowledge the unfailing help we have
received from our friends.

It is a matter of great pleasure to thank our parents and family members for
their constant support and cooperation in the pursuit of this endeavour.

ABSTRACT
Fake news detection is much needed in today's world because fake news has a large impact on our
social as well as personal lives, but it involves some challenges due to limited resources such as
datasets and research papers. These days a lot of information is shared over social media, and we are
not able to differentiate between information that is fake and information that is real. People
immediately start expressing their concern or sharing their opinion as soon as they come across a post,
without verifying its authenticity, which spreads it further. Fake news and rumors are
the most popular forms of false and unauthenticated information and should be detected as soon
as possible to avoid their dramatic consequences. Many fake news detection methods have been
proposed in previous research. In this project, the final report is generated by combining the
predictions of text classification using SVM or AdaBoost and image classification using
CNN.

LIST OF ABBREVIATIONS

ABBREVIATION EXPANSION

SVM Support Vector Machine

ML Machine Learning

AdaBoost Adaptive Boosting

CNN Convolutional Neural Network

CHAPTER-1

1 INTRODUCTION

As an increasing amount of our lives is spent interacting online through social media platforms,
more and more people tend to seek out and consume news from social media rather than from
traditional news organizations. Statistics suggest that nearly 6 in 10 people use their
mobiles for their daily news updates. Here comes the problem: there are hundreds of
websites and sources providing news in their own way, which ultimately leads to manipulation of
news and the loss of its original flavour. The extensive spread of fake news has had a serious
negative impact on individuals and society. Fake news can break the authenticity balance of the
news ecosystem. It is usually manipulated by propagandists to convey political messages or
influence. Fake news changes the way people interpret and respond to real news. Humans
cannot identify all the fake news generated on social media, so automated techniques to overcome
this issue are under research. Machine learning techniques are used in this project.

1.1 MACHINE LEARNING:

Machine Learning is an application of artificial intelligence that helps systems learn automatically
and improve from experience without being explicitly programmed. The main focus of ML is the
development of computer programs that learn from data. The primary aim is to allow computers to
learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Machine Learning can also be used for security applications, such as analyzing email communication
or internet usage. The implications of other applications of the technology need to be considered, and
action must be taken to ensure uses that will be beneficial to society. A machine learning algorithm
builds a model based on sample data, known as training data. Depending on the nature of the signal
or feedback available to the learning system, machine learning is divided into three categories:
supervised learning, unsupervised learning and reinforcement learning. Machine learning has been
used in multiple fields and industries, for tasks such as medical diagnosis, image processing,
prediction, classification, learning associations and regression.

1.2 SVM:
A Support Vector Machine (SVM) is a supervised machine learning algorithm. It can be used for both
classification and regression, but it is mostly used for classification problems. In this algorithm, each
data item is plotted as a point in n-dimensional space (where n is the number of features), with the
value of each feature being the value of a particular coordinate. Classification is done by finding the
hyperplane that best separates the two classes. Support vectors are the observations that lie closest
to the hyperplane. SVM chooses the hyperplane that maximizes the margin between the classes.
Removing the support vectors would change the position of the hyperplane.
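
The role of the hyperplane and the margin can be illustrated with a minimal sketch. It assumes a hand-picked hyperplane (the weights w and bias b below are hypothetical values, not the result of training) and only shows how a point is classified by the side of the hyperplane it falls on, and how the margin width follows from the weights.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return 1 for points on the positive side of the hyperplane, else 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0


def margin_width(w):
    """Distance between the two supporting hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 (illustrative values, not a trained model)
w, b = [1.0, 1.0], -3.0
points = [[0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 1.5]]
labels = [predict(w, b, p) for p in points]
print(labels, margin_width(w))
```

A trained SVM picks w and b so that this margin is as wide as possible while keeping the training points on the correct side.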

1.3 ADABOOST:

AdaBoost is a boosting algorithm used as an ensemble method in machine learning. It is called
adaptive boosting because the weights are re-assigned to each instance after every round, with
higher weights given to incorrectly classified instances. Boosting is used to reduce bias and
variance in supervised learning. It works on the principle of growing learners sequentially.
AdaBoost with decision trees is often referred to as the best out-of-the-box classifier. When the
algorithm is used with decision trees, information gathered at each stage about the relative
'hardness' of each training sample is fed into the tree-growing algorithm, so that later trees
tend to focus on the harder-to-classify examples.
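
The re-weighting step can be sketched as follows. This is a minimal illustration of one round of the classic discrete AdaBoost update for labels in {-1, +1}; it is an assumption that this matches the idea described above, and it is not necessarily the exact variant implemented by the library used in this project. The sample labels and predictions are made up.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round for labels in {-1, +1}.

    Returns (alpha, new_weights): the weak learner's vote and the
    re-normalized sample weights for the next round.
    """
    # Weighted error of the weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]   # the weak learner gets the second sample wrong
weights = [0.25, 0.25, 0.25, 0.25]
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha, weights)
```

After the round, the misclassified sample carries half of the total weight, so the next learner is pushed to get it right.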

1.4 CNN:
A convolutional neural network (CNN) is a class of deep neural networks. It is
mostly used to analyze visual imagery.

The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution. Convolutional networks are a specialized
type of neural network that use convolution in place of general matrix multiplication in
at least one of their layers. A CNN is designed for reduced processing requirements and
uses a system much like a multilayer perceptron.

Compared to other image classification algorithms, CNNs use relatively little
pre-processing. A major advantage of CNNs is this independence from prior knowledge and
human effort in feature design.
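
The convolution operation can be sketched in a few lines. The example below is an illustration only: it uses plain Python lists, no padding and a stride of 1, and the image and kernel values are made up. As in most deep learning libraries, the kernel is applied without flipping (strictly speaking, a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    sum the element-wise products at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


# Made-up 3x4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1],
          [-1, 1]]   # responds to dark-to-bright vertical edges
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only where the kernel straddles the edge. In a CNN the kernel values are learned during training instead of being designed by hand.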

1.5 KERAS TUNER:

Keras Tuner is a library that helps you pick the optimal set of hyperparameters for
your TensorFlow program. The process of selecting the right set of hyperparameters for a
machine learning (ML) application is called hyperparameter tuning or hypertuning.

Hyperparameters are the variables that govern the training process and the topology of an
ML model. These variables remain constant over the training process and directly impact the
performance of your ML program. Hyperparameters are of two types: model hyperparameters
and algorithm hyperparameters.

Model hyperparameters influence model selection, such as the number and width of
hidden layers. Algorithm hyperparameters influence the speed and quality of the learning
algorithm, such as the learning rate for stochastic gradient descent (SGD) and the number of
nearest neighbors for a k-Nearest Neighbors (KNN) classifier.
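
The idea of hypertuning by random search can be sketched without any library. The search space below reuses the ranges that appear in the source code appendix (filters 32 to 96, dense units 32 to 256, learning rates 1e-2 and 1e-4). The scoring function is a made-up stand-in for "train the model and measure validation accuracy", and all function names are hypothetical. This is not how Keras Tuner is implemented internally.

```python
import random


def sample_hyperparameters(rng):
    """Draw one candidate configuration from the search space."""
    return {
        "filters": rng.choice(range(32, 97, 16)),       # model hyperparameter
        "dense_units": rng.choice(range(32, 257, 16)),  # model hyperparameter
        "learning_rate": rng.choice([1e-2, 1e-4]),      # algorithm hyperparameter
    }


def toy_validation_score(hp):
    """Made-up stand-in for training a model and returning validation accuracy."""
    score = 0.9
    score -= abs(hp["filters"] - 64) / 1000
    score -= abs(hp["dense_units"] - 128) / 10000
    if hp["learning_rate"] == 1e-2:
        score -= 0.05
    return score


def random_search(score_fn, max_trials, seed=0):
    """Try max_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_hp, best_score = None, float("-inf")
    for _ in range(max_trials):
        hp = sample_hyperparameters(rng)
        score = score_fn(hp)
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score


best_hp, best_score = random_search(toy_validation_score, max_trials=5)
print(best_hp, best_score)
```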

1.6 ERROR LEVEL ANALYSIS:

Error level analysis (ELA) is a technique for identifying images that have been manipulated. It works
by re-saving an image at a known quality level and then calculating the difference between the
original and the re-saved version. When a JPEG is first saved, the image is compressed for the
first time; most editing software, such as Adobe Photoshop and Adobe Lightroom, supports JPEG
compression. If the image is re-saved using image editing software, it is compressed again, so
regions that were edited tend to show a different error level from the rest of the image.
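
The difference-and-brighten step of ELA can be sketched on plain lists of pixel values. This is a simplified illustration: the two small grayscale grids below are made-up numbers standing in for an image and its re-saved JPEG version (real JPEG compression is not performed here), and the function name is hypothetical.

```python
def error_level_map(original, resaved):
    """Per-pixel absolute difference, scaled so the largest error becomes 255."""
    diff = [[abs(o - r) for o, r in zip(row_o, row_r)]
            for row_o, row_r in zip(original, resaved)]
    max_diff = max(max(row) for row in diff) or 1   # avoid division by zero
    scale = 255.0 / max_diff
    return [[min(255, round(d * scale)) for d in row] for row in diff]


# Made-up pixel values: the last column changes most after re-saving,
# as an edited region might
original = [[100, 102, 200],
            [101,  99, 180]]
resaved = [[101, 101, 195],
           [100, 100, 178]]
ela = error_level_map(original, resaved)
print(ela)
```

Bright values in the result mark pixels that changed the most on re-saving. These are the regions a classifier can learn to associate with tampering.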

1.7 ANACONDA:
Anaconda is a free and open-source distribution of the Python and R programming languages
for scientific computing. It is used for data science, machine learning applications, predictive
analysis, large-scale data processing, etc. It aims to simplify deployment and package
management. Package versions are managed by the package management system conda.

1.8 JUPYTER NOTEBOOK:
Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain narrative text, visualizations, live code and equations. It is used for
data cleaning and transformation, data visualization, statistical modeling, numerical
simulation, machine learning and much more. Over 40 different programming languages are
supported by Jupyter. Python is a requirement (Python 3.3 or greater, or Python 2.7) for
installing the Jupyter Notebook itself.

1.9 SPYDER:
Spyder is a scientific environment written in Python and designed by and for scientists,
data analysts and engineers. It provides a unique combination of the analysis, debugging,
advanced editing and profiling functionality of a comprehensive development tool with the deep
inspection, data exploration, interactive execution and beautiful visualization capabilities of a
scientific package. Its abilities can be extended further via its API and plugin
system. Spyder can also be used as a PyQt5 extension library, which allows developers to build
upon its functionality and embed its components, such as the interactive console.

1.10 NOTEBOOK IN KAGGLE:


Kaggle Notebook is a computational environment that enables reproducible and collaborative
analysis.

CHAPTER- 2
LITERATURE SURVEY

[1] TITLE:
A benchmarking study for fake news detection

AUTHORS:

Georgios Gravanis, Athena Vakali, Konstantinos Diamantaras, Panagiotis Karadais

DESCRIPTION:

It is taken from Elsevier. The algorithms used in this paper are K-NN, Decision Tree, Naive
Bayes, SVM, AdaBoost and Bagging. The proposed approach could be the base for a tool helping
publishers to quickly decide which articles need further examination concerning their veracity.

ADVANTAGES:

In this paper, the authors propose an enhanced set of linguistic features with powerful capabilities
for discriminating fake news from real news articles.

LIMITATIONS:

It deals with textual content only. It detects only whether the given news is real or fake.

[2]TITLE:
Detecting Fake Images on Social Media using Machine Learning

AUTHORS:

Njood Mohammed AlShariah, Abdul Khader Jilani Saudagar

DESCRIPTION:

The paper uses a CNN to train the networks and compares three algorithms to find the best one.

ADVANTAGES:

CNN is useful for automatic feature learning. A CNN uses weight sharing, in which the same weights
are reused.

LIMITATIONS:

Compared with the other algorithms, the CNN gives lower accuracy on both datasets. Neural networks
are also computationally expensive.

[3]TITLE:

Fake News Stance Detection Using Deep Learning Architecture.

AUTHORS:

Muhammad Umer, Saleem Ullah, Arif Mehmood, Gyu Sang Choi, Byung-Won On.

DESCRIPTION:

PCA and chi-square are used to perform component-level analysis and obtain a reduced feature
set. Accuracy: 97.8%

ADVANTAGES:

PCA reduces time and space complexity. Feature selection reduces processing time. Text
classification is done easily using feature extraction.

LIMITATIONS:

This method is limited to claims in the English language only. It is limited to a particular set of
data.

[4]TITLE:

Detecting Misleading Information on COVID-19

AUTHORS:

Mohamed K. Elhadad, Kin Fun Li, Fayez Gebali.

DESCRIPTION:

The classifiers used are Decision Tree, K-Nearest Neighbor (KNN), Logistic Regression, Linear
Support Vector Machines, Multinomial Naive Bayes, Bernoulli Naive Bayes, Perceptron, Neural
Network, Ensemble Random Forest and Extreme Gradient Boosting Classifier.

ADVANTAGES:

It gives complete results regarding COVID-19 and also covers most websites. It focuses mainly on
the WHO, UNICEF and UN websites.

LIMITATIONS:

Due to the spread of fake news about COVID-19, people may lose confidence in the government, and
some may promote specific products to achieve enormous profit. This technique may help prevent
such activities.

[5]TITLE:

Exposing computer generated images by using deep convolutional neural networks

AUTHORS:

Edmar R.S. de Rezende, Guilherme C.S. Ruppert, Antonio Theophilo, Eric K. Tokuda, Tiago Carvalho.

DESCRIPTION:

Deals with the problem of detecting computer-generated images through deep convolutional networks
and transfer learning techniques. A CNN is used to classify each image from the dataset using the raw
RGB pixel values as features, without the need for manual feature extraction.

Accuracy: 97%

ADVANTAGES:

By using a DNN, there is no need for human-engineered features, with a general-purpose algorithm
learning directly from raw data.

LIMITATIONS:

CG images can be identified easily, which could make it easier to identify fake news.

[6]TITLE:

Text-mining-based Fake News Detection Using Ensemble Methods

AUTHORS:

Harita Reddy, Namratha Raj, Manali Gala, Annappa Basava

DESCRIPTION:

Random forest gives higher accuracy than the others. Ensemble methods increase the accuracy, with
gradient boosting giving 86% accuracy.

ADVANTAGES:

By using ensemble methods we can improve the accuracy. It is found that real news has a higher
number of quotes than fake news.

LIMITATIONS:

Text mining shows that real news mostly includes quoted content. The dataset used in this
prediction contains only a limited amount of news, all of it related to politics. It should also
include an image dataset.

[7]TITLE:

Supervised Learning for Fake News Detection

AUTHORS:

Julio C. S. Reis, Andre Correia, Fabrıcio Murai, Adriano Veloso, Fabrıcio Benevenuto

DESCRIPTION:

The classifiers used are k-Nearest Neighbors, Naive Bayes, Random Forest, Support Vector Machine
and XGBoost.

ADVANTAGES:

The two primary benefits of the k-Nearest Neighbor algorithm are efficiency and flexibility. The
algorithm is efficient in its simplicity, speed, and scalability. The mechanics of the
algorithm are readily apparent, and it is simple to understand and implement.

LIMITATIONS:

Sometimes it classifies true news as fake if the IP address of the news channel is the same as one
blacklisted in the dataset. It needs a live-action algorithm to realize its own needs. This model
focuses only on political news.

[8]TITLE:

A review on digital image forensics

AUTHORS:

William D. Ferreiraa, Cristiane B.R. Ferreiraa, Gelson da Cruz Júnior b, Fabrizzio Soares a,c

DESCRIPTION:

There are many methods for detecting image manipulation, of which the two most common
categories are active and passive methods. These are further divided with respect to the spatial
and frequency domains.

ADVANTAGES:

Embedding different domains in a method can give great outcomes. Each proposed method can be used
for different use cases.

LIMITATIONS:

Detecting many manipulation techniques with one model is quite tedious.

CHAPTER 3

3 SYSTEM ARCHITECTURE
PROPOSED SYSTEM
The proposed system predicts whether given Twitter news is fake or original using machine learning:
text news is classified using AdaBoost and a support vector machine, and tampered images are
detected by error level analysis using a CNN. The main objective is to improve the accuracy of both
models separately through the data preprocessing methods and to average the results of both models,
which gives the final result as fake or not.

The system architecture diagram is shown below:

4 SYSTEM SPECIFICATION
The hardware and software specifications for the system are as follows.

4.1 Hardware Specification

Processor : Intel Core i7
Speed : 2.9 GHz - 4.2 GHz
RAM : 4 GB
Monitor : 14 inch
Keyboard : Windows keyboard
Mouse : Optical mouse

Table 4.1 Hardware Specifications

4.2 Software Specification

Operating system : Windows 10
Web Language : HTML, CSS
Web Framework : Spyder (Anaconda)
IDE : Jupyter Notebook (Anaconda) and Kaggle Notebook

Table 4.2 Software Specifications

CHAPTER 5

5 DESIGN & IMPLEMENTATION

The system is made up of three important modules:

● Input Module
● Processing Module
○ Training and validation of pre-processed data for prediction.
● Output Module

5.1 Input module:

The input module is divided into two parts: a text box module and an add-image module.

The first part takes the text inputs, namely the news title, author name and body of the news.

Sample input dataset for the text news classification part:

The second part uploads an image, which may be tampered or original, for the tampered image
detection part.

Sample of input original and tampered image data.

a) Original image

b) Tampered image(copy-move)

5.2 Processing module:

Model 1:
In model one, the text inputs (title, author and body of the news) are preprocessed using NLP
techniques: tokenizing the paragraphs into words, removing special characters, and removing
stopwords, which are unwanted features that may decrease accuracy. Next, a hashing
vectorizer extracts features from the resulting words, mapping the words and their n-grams to a
fixed-length vector of normalized frequency values. The extracted floating-point features are
finally passed to the model trained with the AdaBoost algorithm, and the prediction gives the
result as 0 or 1 for true or fake news respectively.
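
The preprocessing and feature hashing steps can be sketched with the standard library alone. This is an illustration under stated assumptions: the stopword list is a tiny made-up sample, CRC32 is used as the hash function for reproducibility, and plain counts are returned. The scikit-learn HashingVectorizer used in the source code differs in its hash function and in normalizing the output.

```python
import re
import zlib

# Tiny illustrative stopword list (the project uses NLTK's English list)
STOPWORDS = {"the", "is", "a", "of", "in", "and", "to"}


def tokenize(text):
    """Keep letters only, lowercase, split into words and drop stopwords."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [w for w in words if w not in STOPWORDS]


def hash_vectorize(tokens, n_features=16):
    """Map tokens to a fixed-length count vector using the hashing trick."""
    vec = [0] * n_features
    for tok in tokens:
        # CRC32 is stable across runs, unlike Python's built-in hash() for str
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[idx] += 1
    return vec


tokens = tokenize("The President is in the NEWS, the news!")
vec = hash_vectorize(tokens)
print(tokens, vec)
```

Because the column index comes from a hash of the token, no vocabulary has to be stored, at the cost of occasional collisions between different words.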

Model 2:
In model two, the input is an image, which may be tampered or original. The input image format is
.jpg, because it is the most widely used image format nowadays. The image is converted to its error
level analysis (ELA) form, then converted to a 3D array of values between 0 and 255, resized to a
fixed size and rescaled for further processing. The pre-processed data is then passed to the trained
convolutional neural network model (Keras Tuner is used to select the best model) for prediction.
Finally, the result is shown as 0 or 1 for original or tampered images respectively.

Note: Both models are independent of each other. If an image cannot be uploaded, the system
gives its prediction based only on the text data, and vice versa.

5.3 Output module:


Finally, the probability scores output by the text news classification and tampered image
classification models are averaged to give the final prediction of real or fake Twitter news on
the output webpage.
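
The averaging step, including the case where only one of the two inputs is available, can be sketched as follows. The function name and the 0.5 decision threshold are assumptions made for illustration; the report does not state the threshold used.

```python
def combine_predictions(text_prob=None, image_prob=None, threshold=0.5):
    """Average the available 'fake' probabilities and return (score, label).

    Either argument may be None when that input was not supplied.
    """
    available = [p for p in (text_prob, image_prob) if p is not None]
    if not available:
        raise ValueError("at least one of text_prob or image_prob is required")
    score = sum(available) / len(available)
    return score, ("fake" if score >= threshold else "real")


print(combine_predictions(text_prob=0.9, image_prob=0.3))  # both inputs given
print(combine_predictions(text_prob=0.2))                  # text only
```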

6 CONCLUSION:
The combined model of fake news and tampered image detection was created and tested successfully.
For fake news detection, the model focuses on important features such as the title and body of the
news, which increases accuracy. Error level analysis is an efficient method for detecting
fake images, especially JPEG images, because JPEG is used almost everywhere. With a little
enhancement, this model can be used by forensics departments, and it can also be used by regular
social media users.

Recently, people have come to rely heavily on social media. It has a powerful impact on every
person's life and is increasingly treated as a trusted platform. However, it also leads to the spread
of fake news all over the world, which has very negative impacts on our society. Hence, detection
methods of this kind will be very helpful in defending against fake news.

7 FUTURE ENHANCEMENT:
The accuracy of this model can be increased in future by training it with much more data.

This model focuses only on the JPEG image format; in future, it can be extended to many other
image formats.

In future, the tampered part of the image can also be located.

8 APPENDIX

APPENDIX – I

8.1 SNAP SHOTS FOR OUTPUT

Model 1: Fake news detection (text news as input)

Importing libraries
Read input data
Sample input data
Data preprocessing
Sample data before preprocessing
Sample data after preprocessing
Hashing vectorizer (feature extraction)
Sample data after hashing vectorizer
SVM model
Confusion matrix with accuracy
AdaBoost model
Confusion matrix with accuracy
AUC-ROC curves for SVM and AdaBoost

Model 2: Tampered image detection (image as input)

Importing libraries
Error level analysis conversion function
Original image and its ELA conversion
Tampered image and its ELA conversion
Assigning a label to each image
Creating a dataframe which contains the image path and its label
Dataframe with image paths and labels
Conversion into ELA images
Rescaling images to a fixed size
Creating the CNN model with Keras Tuner
Training the CNN model with image arrays
Classification result of the tested CNN model
Confusion matrix and AUC-ROC curve of the tested CNN model

APPENDIX – II

8.2 SOURCE CODE

Model 1:Fake news detection (text news as input)

import pandas as pd
import numpy as np
from sklearn.metrics import (roc_curve, accuracy_score, roc_auc_score, auc,
                             mean_squared_error, classification_report,
                             confusion_matrix)
from sklearn.feature_extraction.text import HashingVectorizer
import matplotlib.pyplot as plt

# Paths to the Kaggle input datasets
f1 = "/kaggle/input/fake-news-data/train.csv"
f2 = "/kaggle/input/fake-news-data/test.csv"
f3 = "/kaggle/input/mlfake/news_dataset.csv"

df = pd.read_csv(f1)
dftest = pd.read_csv(f2)
test = pd.read_csv(f3)

# Drop rows with missing values and keep the first 6000 rows
df = df.dropna()

data = df.copy()
data = data.iloc[0:6000, ]
data.shape

# Renumber the rows 0..n-1 so that data[label][i] works after dropna()
data.reset_index(inplace=True)

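import re

# The stopword list may need nltk.download('stopwords') on a fresh install
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


def preprocess(data, label):
    """Clean, lowercase, stem and remove stopwords from the column `label`."""
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    corpus = []
    for i in range(0, len(data)):
        try:
            # Chain the substitutions so that each one is applied to the
            # result of the previous one
            review = re.sub(r'www.*$', '', data[label][i])
            review = re.sub(r'http.*$', '', review)
            review = re.sub(r'@.*$', '', review)
            review = re.sub(r'#.*$', '', review)
            review = re.sub('[^a-zA-Z]', ' ', review)
            review = review.lower()
            review = review.split()

            review = [ps.stem(word) for word in review
                      if word not in stop_words]
            review = ' '.join(review)
            corpus.append(review)
        except Exception:
            # Keep one entry per row so that features stay aligned with labels
            corpus.append('')
    return corpus


cor = preprocess(data, 'text')
len(cor)

cor1 = preprocess(data, 'title')
len(cor1)

co = cor
co1 = cor1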

def vectorizer(corpuss, feature):
    """Hash word 1- to 3-grams into a dense array with `feature` columns."""
    hs_vectorizer = HashingVectorizer(n_features=feature, ngram_range=(1, 3))
    Xarray = hs_vectorizer.fit_transform(corpuss).toarray()
    return Xarray


# 5000 hashed features from the body text and 1000 from the title
X1 = vectorizer(co, 5000)
X2 = vectorizer(co1, 1000)

X = np.concatenate((X1, X2), axis=1)

y = data['label']
y.shape

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

import itertools


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    See full source and example:
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize before plotting so that the colours match the printed values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell's value, in white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

import itertools

import sklearn.svm as svm

# Linear-kernel SVM; probability=True enables predict_proba for the ROC curve.
# degree and gamma are ignored by the linear kernel.
SVM = svm.SVC(C=3.0, kernel='linear', degree=3, gamma='auto', probability=True)
# est = svm.LinearSVC()
SVM.fit(X_train, y_train)

pred1 = SVM.predict(X_test)
score = accuracy_score(y_test, pred1)
print("accuracy: %0.3f" % score)
cm = confusion_matrix(y_test, pred1)
# Label 0 is real news and label 1 is fake news, so the classes are
# listed in that order
plot_confusion_matrix(cm, classes=['REAL', 'FAKE'])

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost over 200 shallow decision trees
classifier = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                n_estimators=200)
classifier.fit(X_train, y_train)

pred2 = classifier.predict(X_test)
score = accuracy_score(y_test, pred2)
print("accuracy: %0.3f" % score)
cm = confusion_matrix(y_test, pred2)
plot_confusion_matrix(cm, classes=['REAL', 'FAKE'])

def plot_roc_multi(X_test, y_true, models, save=False):
    """Plot one ROC curve per model on the same axes."""
    plt.figure(figsize=(10, 6))
    for name, model in models.items():
        proba = model.predict_proba(X_test)[:, 1]
        prediction = model.predict(X_test)
        fpr, tpr, thresh = roc_curve(y_true, proba)
        accuracy = accuracy_score(y_true, prediction)
        # roc_auc = auc(fpr, tpr)
        label = "%s - Accuracy %0.2f" % (name, accuracy)
        plt.plot(fpr, tpr, label=label)

    # 45 degree line
    xx = np.linspace(0, 1.0, 20)
    plt.plot(xx, xx)

    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.title("ROC curves")
    plt.legend()
    if save:
        plt.savefig('ROC_Curve')


models = {
    "AdaBoost": classifier,
    "SVM": SVM
}

plot_roc_multi(X_test, y_test, models, save=True)

Model 2: Tampered image detection (image as input)

from tensorflow import keras
import tensorflow as tf

import numpy as np
import os
# import cv2
from imutils import paths
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
from tqdm import tqdm
from PIL import Image, ImageChops, ImageEnhance
from sklearn.metrics import (roc_curve, roc_auc_score, auc, mean_squared_error,
                             classification_report, confusion_matrix)
from sklearn.metrics import accuracy_score

def ErrorLevelAnalysis(imgpath):
    """Return the brightness-scaled difference between an image and a
    JPEG re-save of it (its error level analysis image)."""
    original = Image.open(imgpath)
    originall = Image.open(imgpath)
    try:
        # Re-save as JPEG and compare with the image as loaded
        original.save("ela1.jpg", 'JPEG', quality=95)
        temporary1 = Image.open("ela1.jpg")

        # temporary1.convert('RGB').save("ela2.jpg", 'JPEG', quality=90)
        # temporary2 = Image.open("ela2.jpg")

        diff = ImageChops.difference(originall, temporary1)

        # Scale the difference so that the largest error becomes 255
        extrema = diff.getextrema()
        max_diff = max([ex[1] for ex in extrema])
        if max_diff == 0:
            max_diff = 1
        scale = 255.0 / max_diff
        diff = ImageEnhance.Brightness(diff).enhance(scale)

    except Exception:
        # Fallback for images that are not plain RGB: convert to RGB first
        original.convert('RGB').save("ela1.jpg", 'JPEG', quality=90)
        temporary1 = Image.open("ela1.jpg")

        # temporary1.convert('RGB').save("ela2.jpg", 'JPEG', quality=90)
        # temporary2 = Image.open("ela2.jpg")

        diff = ImageChops.difference(originall.convert('RGB'), temporary1)

        extrema = diff.getextrema()
        max_diff = max([ex[1] for ex in extrema])
        if max_diff == 0:
            max_diff = 1
        scale = 255.0 / max_diff
        diff = ImageEnhance.Brightness(diff).enhance(scale)

    return diff

path_org = "/kaggle/input/newdata/imgt/org"
path_fak = "/kaggle/input/newdata/imgt/fak"

org = os.listdir(path_org)
fak = os.listdir(path_fak)

# Each entry has the form "<image path>,<label>\n" (0 = original, 1 = tampered)
images_names = []

for imgname in tqdm(os.listdir(path_org)):
    try:
        if imgname.endswith('png') or imgname.endswith('jpg'):
            imgnamefinal = path_org + '/' + imgname + ',0\n'
            images_names.append(imgnamefinal)
    except Exception:
        print(path_org + imgname)

for imgname in tqdm(os.listdir(path_fak)):
    try:
        if imgname.endswith('png') or imgname.endswith('jpg'):
            imgnamefinal = path_fak + '/' + imgname + ',1\n'
            images_names.append(imgnamefinal)
    except Exception:
        print(path_fak + imgname)

image_name = []
label = []
for i in tqdm(range(len(images_names))):
    image_name.append(images_names[i][0:-3])   # strip the trailing ",<label>\n"
    label.append(images_names[i][-2])          # the label character

dataset = pd.DataFrame({'image': image_name, 'output': label})
dataset['output'].value_counts()

# Save the table, reload it and shuffle the rows
dataset.to_csv('DLdataset.csv', index=False)
dataset = pd.read_csv('DLdataset.csv')
dataset = dataset.sample(frac=1)

x = []
y = []

for index, row in tqdm(dataset.iterrows()):
    # ELA image resized to 192x192, flattened and scaled to the range [0, 1]
    ela = ErrorLevelAnalysis(row['image']).resize((192, 192))
    x.append(np.array(ela).flatten() / 255.0)
    y.append(row['output'])

X = np.array(x)
Y = np.array(y)

# Restore the image shape and one-hot encode the two classes
X = X.reshape(-1, 192, 192, 3)
Y = to_categorical(Y, 2)

X.shape, Y.shape

## save all the data
np.save('X_.npy', X)
np.save('Y_.npy', Y)
x_ = np.load('X_.npy')
y_ = np.load('Y_.npy')
x_.shape, y_.shape

X_train, X_test, Y_train, Y_test = train_test_split(x_, y_, test_size=0.3,
                                                    shuffle=True,
                                                    random_state=3)

def build_model(hp):
    """Build a CNN whose layer sizes and learning rate are chosen by Keras Tuner."""
    model = keras.Sequential([
        keras.layers.Conv2D(
            filters=hp.Int('conv_1_filter', min_value=32, max_value=96, step=16),
            kernel_size=hp.Choice('conv_1_kernel', values=[5, 5]),
            activation='relu',
            input_shape=X_train.shape[1:]
        ),
        keras.layers.MaxPooling2D(
            pool_size=hp.Choice('2d2', values=[2, 2]), strides=None,
            padding="valid", data_format=None
        ),
        keras.layers.Conv2D(
            filters=hp.Int('conv_2_filter', min_value=32, max_value=64, step=16),
            kernel_size=hp.Choice('conv_2_kernel', values=[5, 5]),
            activation='relu'
        ),
        keras.layers.MaxPooling2D(
            pool_size=hp.Choice('2d2', values=[2, 2]), strides=None,
            padding="valid", data_format=None
        ),
        keras.layers.Dropout(0.5, noise_shape=None, seed=None),
        keras.layers.Flatten(),
        keras.layers.Dense(
            units=hp.Int('dense_1_units', min_value=32, max_value=256, step=16),
            activation='relu'
        ),
        keras.layers.Dropout(0.5, noise_shape=None, seed=None),
        keras.layers.Dense(2, activation='softmax')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate', values=[1e-2, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    # keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0,
    #                          epsilon=1e-07, centered=False, name="RMSprop")

    return model

from kerastuner import RandomSearch
from kerastuner.engine.hyperparameters import HyperParameters

# Random search over the hyperparameters declared in build_model
tuner_search = RandomSearch(build_model,
                            objective='val_accuracy',
                            max_trials=5, directory='output',
                            project_name="fake image")

tuner_search.search(X_train, Y_train, epochs=30,
                    validation_data=(X_test, Y_test))

# Retrieve the best model found and continue training it
model = tuner_search.get_best_models(num_models=1)[0]

model.fit(X_train, Y_train, batch_size=15, epochs=30,
          validation_data=(X_test, Y_test))

from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

# Class probabilities, then the predicted class index for each test image
y_pred_cnn1 = model.predict(X_test)
y_pred_cnn = np.argmax(y_pred_cnn1, axis=1)

Y_true = np.argmax(Y_test, axis=1)
score = accuracy_score(Y_true, y_pred_cnn)
print("Accuracy score: {}".format(score))
score = metrics.precision_score(Y_true, y_pred_cnn, average="weighted")
print("Precision score: {}".format(score))
score = metrics.recall_score(Y_true, y_pred_cnn, average="weighted")
print("Recall score: {}".format(score))
score_lr1 = metrics.f1_score(Y_true, y_pred_cnn, average="weighted")
print("F1 score: {}".format(score_lr1))

import seaborn as sns

cm = confusion_matrix(Y_true, y_pred_cnn)
print('Confusion matrix:\n', cm)

print(classification_report(Y_true, y_pred_cnn))

print('Plot of Confusion Matrix')
df_cm = pd.DataFrame(cm, columns=np.unique(Y_true), index=np.unique(Y_true))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize=(10, 7))
sns.set(font_scale=1.4)  # for label size
sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size

cnn_score = model.evaluate(X_test, Y_test, verbose=1)

print('Test loss:', cnn_score[0])
print('Test accuracy:', cnn_score[1])

def plot_roc_curve(y_true, y_pred, y_proba):
    """Print RMSE and ROC AUC, then plot the ROC curve."""
    rmse = sqrt(mean_squared_error(y_true, y_pred))
    print('RMSE', rmse)
    # Use the function arguments, not the global Y_true / y_pred_cnn
    print('ROC_AUC score:', roc_auc_score(y_true, y_pred))

    FPR, TPR, thresholds = roc_curve(y_true, y_proba)

    roc_auc = auc(FPR, TPR)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.plot(FPR, TPR, label='AUC = %0.1f' % roc_auc)
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.ylabel('True Positive')
    plt.xlabel('False Positive')
    plt.show()


cnn_model_y_proba = model.predict(X_test, verbose=1)
plot_roc_curve(Y_true, y_pred_cnn, cnn_model_y_proba[:, 1])
