Classifying Fake News Articles Using NLP To Identify In-Article Attribution As A Supervised Learning Estimator
CONTENTS
ABSTRACT
1. INTRODUCTION
2. LITERATURE SURVEY
2.1 WHEN FAKE NEWS BECOMES REAL
2.2 THE IMPACT OF REAL NEWS ABOUT "FAKE NEWS"
2.3 SOFTWARE ENVIRONMENT
2.4 WHY CHOOSE PYTHON
3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
4. FEASIBILITY STUDY
4.1 ECONOMICAL FEASIBILITY
4.2 TECHNICAL FEASIBILITY
4.3 SOCIAL FEASIBILITY
5. SYSTEM REQUIREMENTS
6. SYSTEM DESIGN
6.1 SYSTEM ARCHITECTURE
6.2 DATA FLOW DIAGRAM
6.3 UML DIAGRAMS
7. IMPLEMENTATION
7.1 MODULES
7.2 SAMPLE CODE
8. SYSTEM TESTING
8.1 UNIT TESTING
8.2 INTEGRATION TESTING
8.3 ACCEPTANCE TESTING
9. INPUT AND OUTPUT DESIGN
9.1 INPUT DESIGN
9.2 OUTPUT DESIGN
10. SCREENSHOTS
11. FUTURE WORK
12. CONCLUSION
13. BIBLIOGRAPHY
ABSTRACT:
Intentionally deceptive content presented under the guise of legitimate journalism is a worldwide
information accuracy and integrity problem that affects opinion forming, decision making, and
voting patterns. Most so-called ‘fake news’ is initially distributed over social media conduits like
Facebook and Twitter and later finds its way onto mainstream media platforms such as traditional
television and radio news. The fake news stories that are initially seeded over social media
platforms share key linguistic characteristics such as making excessive use of unsubstantiated
hyperbole and non-attributed quoted content. In this paper, the results of a fake news
identification study that documents the performance of a fake news classifier are presented. The
TextBlob, Natural Language Toolkit (NLTK), and SciPy toolkits were used to develop a novel fake news detector
that uses quoted attribution in a Bayesian machine learning system as a key feature to estimate
the likelihood that a news article is fake. The resulting process achieves 63.333% precision when
assessing the likelihood that an article with quotes is fake. This process is called influence mining
and this novel technique is presented as a method that can be used to enable fake news and even
propaganda detection. In this paper, the research process, technical analysis, technical linguistics
work, and classifier performance and results are presented. The paper concludes with a
discussion of how the current system will evolve into an influence mining system.
1. INTRODUCTION
Intentionally deceptive content presented under the guise of legitimate journalism (or ‘fake news,’ as it is
commonly known) is a worldwide information accuracy and integrity problem that affects opinion
forming, decision making, and voting patterns. Most fake news is initially distributed over social media
conduits like Facebook and Twitter and later finds its way onto mainstream media platforms such as
traditional television and radio news. The fake news stories that are initially seeded over social media
platforms share key linguistic characteristics such as excessive use of unsubstantiated hyperbole and non-
attributed quoted content. The results of a fake news identification study that documents the performance
of a fake news classifier are presented and discussed in this paper.
2. LITERATURE SURVEY
2.1 When Fake News Becomes Real: Combined Exposure to Multiple News
Sources and Political Attitudes of Inefficacy, Alienation, and Cynicism
2.2 The Impact of Real News about "Fake News": Intertextual Processes and
Political Satire
Python 2.0 was released in 2000, and the 2.x versions were the prevalent releases
until December 2008. At that time, the development team made the decision to
release version 3.0, which contained a few relatively small but significant changes
that were not backward compatible with the 2.x versions. Python 2 and 3 are very
similar, and some features of Python 3 have been backported to Python 2, but in
general they remain not quite compatible.
Both Python 2 and 3 have continued to be maintained and developed, with periodic
release updates for both. As of this writing, the most recent versions available are
2.7.15 and 3.6.5. However, an official End of Life date of January 1, 2020 has been
established for Python 2, after which time it will no longer be maintained. If you
are a newcomer to Python, it is recommended that you focus on Python 3, as this
tutorial will do.
Python is maintained by a core development team, and Guido van Rossum long led the
project, having been given the title of BDFL (Benevolent Dictator For Life) by the
Python community, a role he stepped down from in 2018. The name Python, by the way,
derives not from the snake, but from the British comedy troupe Monty Python’s Flying
Circus, of which Guido was a fan. It is common to find references to Monty Python
sketches and movies scattered throughout the Python documentation.
If you’re going to write programs, there are literally dozens of commonly used
languages to choose from. Why choose Python? Here are some of the features that
make Python an appealing choice.
Python is Popular
Python has been growing in popularity over the last few years. The 2018 Stack
Overflow Developer Survey ranked Python as the 7th most popular and the number
one most wanted technology of the year. World-class software development
companies around the globe use Python every single day.
According to research by Dice, Python is also one of the hottest skills to have and
one of the most popular programming languages in the world, based on the Popularity
of Programming Language Index.
Python is interpreted
Many languages are compiled, meaning the source code you create needs to be
translated into machine code, the language of your computer’s processor, before it
can be run. Programs written in an interpreted language are passed straight to an
interpreter that runs them directly.
This makes for a quicker development cycle because you just type in your code and
run it, without the intermediate compilation step.
Python is Free
The Python interpreter is developed under an OSI-approved open-source license,
making it free to install, use, and distribute, even for commercial purposes.
A version of the interpreter is available for virtually any platform there is,
including all flavors of Unix, Windows, macOS, smart phones and tablets, and
probably anything else you ever heard of. A version even exists for the half dozen
people remaining who use OS/2.
Python is Portable
Because Python code is interpreted and not compiled into native machine
instructions, code written for one platform will work on any other platform that has
the Python interpreter installed. (This is true of any interpreted language, not just
Python.)
Python is Simple
As programming languages go, Python is relatively uncluttered, and the developers
have deliberately kept it that way.
A rough estimate of the complexity of a language can be gleaned from the number
of keywords or reserved words in the language. These are words that are reserved
for special meaning by the compiler or interpreter because they designate specific
built-in functionality of the language.
Python 3 has 33 keywords, and Python 2 has 31. By contrast, C++ has 62, Java has
53, and Visual Basic has more than 120, though these latter examples probably
vary somewhat by implementation or dialect.
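The keyword count is easy to verify from the standard library's `keyword` module; the exact number varies by interpreter version (33 on early Python 3 releases, 35 on recent ones), so treat the figures above as approximate:

```python
import keyword

# keyword.kwlist holds every reserved word for the running interpreter.
print(len(keyword.kwlist))   # e.g. 35 on recent Python 3 versions
print(keyword.kwlist[:5])    # e.g. ['False', 'None', 'True', 'and', 'as']
```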
Python code has a simple and clean structure that is easy to learn and easy to read.
In fact, as you will see, the language definition enforces code structure that is easy
to read.
Conclusion
This section gave an overview of the Python programming language. Python is a great
option, whether you are a beginning programmer looking to learn the basics, an
experienced programmer designing a large application, or anywhere in between. The
basics of Python are easily grasped, and yet its capabilities are vast.
Python has a very easy-to-read syntax. Some of Python's syntax comes from C,
because that is the language that Python was written in. But Python uses
whitespace to delimit code: spaces or tabs are used to organize code into groups.
This is different from C. In C, there is a semicolon at the end of each line and curly
braces ({}) are used to group code. Using whitespace to delimit code makes Python
a very easy-to-read language.
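To illustrate the point above: the grouping that C expresses with braces and semicolons, Python expresses with indentation alone. The `classify` function here is a made-up example, not code from the project:

```python
def classify(score):
    # The indented lines form the body of the function;
    # no braces or statement terminators are needed.
    if score > 0.5:
        label = "fake"
    else:
        label = "real"
    return label

print(classify(0.8))  # fake
print(classify(0.2))  # real
```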
Python is widely used in application areas such as:
Web development
Scientific programming
Desktop GUIs
Network programming
Game programming
3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM:
Up to now, most of the research on PDS has focused on how to enforce user privacy preferences
and how to secure data stored in the PDS. In contrast, the key issue of helping users
specify their privacy preferences on PDS data has not so far been deeply investigated. This is a
fundamental issue since average PDS users are not skilled enough to understand how to translate
their privacy requirements into a set of privacy preferences. As several studies have shown,
average users might have difficulties in properly setting potentially complex privacy preferences.
5. SYSTEM REQUIREMENTS
• Tool : PyCharm
• Database : MYSQL
• Web framework : Flask
6. SYSTEM DESIGN
(Data flow diagram: the system checks each user; an unauthorized user is logged out and the process ends.)
6.3 UML DIAGRAMS:
UML stands for Unified Modeling Language. UML is a standardized
general-purpose modeling language in the field of object-oriented software
engineering. The standard is managed, and was created by, the Object Management
Group.
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form UML comprises two
major components: a meta-model and a notation. In the future, some form of
method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing, and documenting the artifacts of a software system, as
well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have
proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software
and the software development process. The UML uses mostly graphical notations
to express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that
they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Integrate best practices.
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is
a construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity
diagram shows the overall flow of control.
7. IMPLEMENTATION
7.1 MODULES:
Upload News Articles
MODULES DESCRIPTION:
7.2 SAMPLE CODE
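The report does not reproduce the sample code itself, so the following is a minimal, self-contained sketch of the quoted-attribution idea described in the abstract. The regexes, the attribution verb list, the 60-character window, and the 2.0/0.5 likelihood ratios are all illustrative assumptions, not the study's actual parameters (the study itself used TextBlob, NLTK, and SciPy):

```python
import re

# Verbs that signal in-article attribution of a quote (illustrative list).
ATTRIB_VERBS = re.compile(r"\b(said|says|stated|according to|told|reported)\b", re.I)
# A "quote" here is any double-quoted span of at least 10 characters.
QUOTE = re.compile(r'"([^"]{10,})"')

def quote_attribution_features(text):
    """Return (total_quotes, attributed_quotes) for an article.

    A quote counts as attributed when an attribution verb occurs
    within a 60-character window around it."""
    quotes = list(QUOTE.finditer(text))
    attributed = 0
    for m in quotes:
        window = text[max(0, m.start() - 60):m.end() + 60]
        if ATTRIB_VERBS.search(window):
            attributed += 1
    return len(quotes), attributed

def fake_likelihood(text, prior_fake=0.5):
    """Bayesian-style odds update: each non-attributed quote doubles
    the odds that the article is fake; each attributed quote halves
    them. The 2.0 / 0.5 ratios are placeholders, not learned values."""
    n, attributed = quote_attribution_features(text)
    odds = prior_fake / (1 - prior_fake)
    odds *= 2.0 ** (n - attributed)
    odds *= 0.5 ** attributed
    return odds / (1 + odds)

fake = ('He warned that "the shocking truth they refuse to print" will '
        'come out, and an insider claims "everything about the vote was rigged".')
real = 'The mayor said "we will rebuild the bridge by next spring" at a briefing.'
print(fake_likelihood(fake))  # 0.8  (two quotes, neither attributed)
print(fake_likelihood(real))  # ~0.333 (one quote, attributed via "said")
```

In a full system, the hand-set likelihood ratios would be replaced by values estimated from a labeled training corpus.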
8. SYSTEM TESTING
TYPES OF TESTS
Unit testing:
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is the
testing of individual software units of the application, and is done after the
completion of an individual unit, before integration. This is structural testing that
relies on knowledge of the unit's construction and is invasive. Unit tests perform basic
tests at component level and test a specific business process, application, and/or
system configuration. Unit tests ensure that each unique path of a business process
performs accurately to the documented specifications and contains clearly defined
inputs and expected results.
Integration testing:
Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is more
concerned with the basic outcome of screens or fields. Integration tests demonstrate
that although the components were individually satisfactory, as shown by
successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from
the combination of components.
Functional test:
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements,
key functions, or special test cases. In addition, systematic coverage pertaining to
identified business process flows, data fields, predefined processes, and successive
processes must be considered for testing. Before functional testing is complete,
additional tests are identified and the effective value of current tests is determined.
System Test:
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test.
System testing is based on process descriptions and flows, emphasizing pre-driven
process links and integration points.
White Box Testing:
White Box Testing is testing in which the software tester has knowledge of the
inner workings, structure, and language of the software, or at least its purpose.
It is used to test areas that cannot be reached from a black-box level.
Black Box Testing:
Black Box Testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, as
most other kinds of tests, must be written from a definitive source document, such
as a specification or requirements document. It is testing in which the software
under test is treated as a black box: you cannot “see” into it. The test provides
inputs and responds to outputs without considering how the software works.
8.1 Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
Test strategy and approach:
Field testing will be performed manually and functional tests will be written
in detail.
Test objectives:
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
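As a sketch of the objectives above, a unit test for a hypothetical `validate_entry` helper might check entry format and reject duplicates. The helper and its rules are illustrative, not the project's actual code:

```python
import unittest

def validate_entry(entry, existing):
    """Accept an entry with a non-empty title that is not already present."""
    title = entry.get("title", "").strip()
    if not title:
        return False            # wrong format: empty or blank title
    if title in existing:
        return False            # duplicate entries are not allowed
    return True

class TestValidateEntry(unittest.TestCase):
    def test_correct_format_accepted(self):
        self.assertTrue(validate_entry({"title": "Article A"}, set()))

    def test_empty_title_rejected(self):
        self.assertFalse(validate_entry({"title": "   "}, set()))

    def test_duplicate_rejected(self):
        self.assertFalse(validate_entry({"title": "Article A"}, {"Article A"}))

if __name__ == "__main__":
    # exit=False keeps the interpreter alive after the test run.
    unittest.main(exit=False, argv=["validate_entry_test"])
```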
8.2 Integration Testing
Software integration testing is the incremental integration testing of two or
more integrated software components on a single platform to produce failures
caused by interface defects.
The task of the integration test is to check that components or software
applications, e.g. components in a software system or – one step up – software
applications at the company level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
10. SCREENSHOTS
11. FUTURE WORK
Future planned research efforts involve combining attribution feature extraction with other factors that
emerge from the research to produce tools that not only identify potential false content, but also influence-based
content designed to compel a reader or target audience to make inaccurate or altered decisions.
12. CONCLUSION
This paper presented the results of a study that produced a limited fake news detection system.
The work presented herein is novel in this topic domain in that it demonstrates the results of a
full-spectrum research project that started with qualitative observations and resulted in a working
quantitative model. The work presented in this paper is also promising, because it demonstrates a
relatively effective level of machine learning classification for large fake news documents with
only one extraction feature. Finally, additional research and work to identify and build additional
fake news classification grammars is ongoing and should yield a more refined classification
scheme for both fake news and direct quotes. Future planned research efforts involve combining
attribution feature extraction with other factors that emerge from the research to produce tools
that not only identify potential false content, but also influence-based content designed to compel a
reader or target audience to make inaccurate or altered decisions.
13. BIBLIOGRAPHY
[1] M. Balmas, “When Fake News Becomes Real: Combined Exposure to Multiple
News Sources and Political Attitudes of Inefficacy, Alienation, and Cynicism,”
Communic. Res., vol. 41, no. 3, pp. 430–454, 2014.
[2] C. Silverman and J. Singer-Vine, “Most Americans Who See Fake News
Believe It, New Survey Says,” BuzzFeed News, 06-Dec-2016.
[3] P. R. Brewer, D. G. Young, and M. Morreale, “The Impact of Real News about
‘“Fake News”’: Intertextual Processes and Political Satire,” Int. J. Public Opin.
Res., vol. 25, no. 3, 2013.
[4] D. Berkowitz and D. A. Schwartz, “Miley, CNN and The Onion,” Journal.
Pract., vol. 10, no. 1, pp. 1–17, Jan. 2016.
[5] C. Kang, “Fake News Onslaught Targets Pizzeria as Nest of Child-
Trafficking,” New York Times, 21-Nov-2016.
[6] C. Kang and A. Goldman, “In Washington Pizzeria Attack, Fake News
Brought Real Guns,” New York Times, 05-Dec-2016.
[7] R. Marchi, “With Facebook, Blogs, and Fake News, Teens Reject Journalistic
"Objectivity",” J. Commun. Inq., vol. 36, no. 3, pp. 246–262, 2012.
[8] C. Domonoske, “Students Have ‘Dismaying’ Inability to Tell Fake News From
Real, Study Finds,” Natl. Public Radio: The Two-Way, 2016.
[9] H. Allcott and M. Gentzkow, “Social Media and Fake News in the 2016
Election,” J. Econ. Perspect., vol. 31, no. 2, 2017.
[10] C. Shao, G. L. Ciampaglia, O. Varol, A. Flammini, and F. Menczer, “The
spread of fake news by social bots.”
[11] A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi, “Faking Sandy:
Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy,”
in WWW 2013 Companion, 2013.
[12] E. Mustafaraj and P. T. Metaxas, “The Fake News Spreading Plague: Was it
Preventable?”
[13] M. Farajtabar et al., “Fake News Mitigation via Point Process Based
Intervention.”
[14] M. Haigh, T. Haigh, and N. I. Kozak, “Stopping Fake News,” Journal. Stud.,
vol. 19, no. 14, pp. 2062–2087, Oct. 2018.
[15] O. Batchelor, “Getting out the truth: the role of libraries in the fight against
fake news,” Ref. Serv. Rev., vol. 45, no. 2, pp. 143–148, Jun. 2017.
[16] B. D. Horne and S. Adalı, “This Just In: Fake News Packs a Lot in Title, Uses
Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News,”
in NECO Workshop, 2017.
[17] V. L. Rubin, N. J. Conroy, Y. Chen, and S. Cornwell, “Fake News or Truth?
Using Satirical Cues to Detect Potentially Misleading News,” in Proceedings of
NAACL-HLT 2016, 2016, pp. 7–17.
[18] S. Volkova, K. Shaffer, J. Y. Jang, and N. Hodas, “Separating Facts from
Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on
Twitter,” in Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics, 2017, pp. 647–653.
[19] N. J. Conroy, V. L. Rubin, and Y. Chen, “Automatic Deception Detection:
Methods for Finding Fake News,” in Proceedings of ASIST, 2015.
[20] Y. Chen, N. J. Conroy, and V. L. Rubin, “News in an Online World: The Need
for an "Automatic Crap Detector",” in Proceedings of ASIST 2015, 2015.
[21] J. Kim, B. Tabibian, A. Oh, B. Schölkopf, and M. Gomez-Rodriguez,
“Leveraging the Crowd to Detect and Reduce the Spread of Fake News and
Misinformation.”
[22] B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel, “A simple but
tough-to-beat baseline for the Fake News Challenge stance detection task.”
[23] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, and P. G. Allen, “Truth
of Varying Shades: Analyzing Language in Fake News and Political Fact-
Checking,” in Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, 2017, pp. 2931– 2937.
[24] Z. Jin, J. Cao, Y.-G. Jiang, and Y. Zhang, “News Credibility Evaluation on
Microblog with a Hierarchical Propagation Model,” in Proceedings of the IEEE
International Conference on Data Mining, 2014.
[25] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake News Detection on
Social Media: A Data Mining Perspective.”
[26] D. Saez-Trumper, “Fake Tweet Buster: A Webtool to Identify Users
Promoting Fake News on Twitter,” in Proceedings of HT’14, 2014.
[27] S. Pareti, T. O’keefe, I. Konstas, J. R. Curran, and I. Koprinska,
“Automatically Detecting and Attributing Indirect Quotations,” in Proceedings of
the 2013 Conference on Empirical Methods in Natural Language Processing, 2013,
pp. 18–21.
[28] T. O’keefe, S. Pareti, J. R. Curran, I. Koprinska, and M. Honnibal, “A
Sequence Labelling Approach to Quote Attribution,” in Proceedings of the 2012
Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, 2012, pp. 12– 14.
[29] G. Muzny, M. Fang, A. X. Chang, and D. Jurafsky, “A Two-stage Sieve
Approach for Quote Attribution,” in Proceedings of the 15th Conference of the
European Chapter of the Association for Computational Linguistics: Volume 1,
Long Papers, 2017, vol. 1, pp. 460–470.
[30] B. G. Glaser and A. L. Strauss, The discovery of grounded theory: strategies
for qualitative research. New Brunswick: Aldine, 1967.