Social Media Analysis Using Machine Learning
Social media now plays a vital role in social networking and data sharing.
It is favored by many users because it lets millions of people share their
opinions, educational learning experiences and concerns through their status
updates without restriction. The Twitter API is used to search for tweets
based on geo-location. What people post on social networks offers insight
into how they experience particular systems and processes, but evaluating
such data is a challenging process. The proposed system provides a workflow
for mining the data that integrates qualitative analysis with large-scale
machine learning techniques. Tweets are categorized into different groups
based on their prominent themes. A machine learning classifier is applied to
the mined data to support the qualitative analysis and to gain a deeper
understanding of the data. It uses a multi-label classification technique,
as each tweet can fall into more than one category and all the attributes
are independent of each other. Label-based measures are used to analyze the
results and to compare them with the existing sentiment analysis technique.
LIST OF FIGURES
5.3.2 ER DIAGRAM
11 OUTPUTS
LIST OF TABLES
LIST OF GRAPHS
1.2 SVM
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF GRAPHS
1. INTRODUCTION
2. LITERATURE REVIEW
3.1 PURPOSE
3.2 SCOPE
4. SYSTEM REQUIREMENTS
4.2 SOFTWARE REQUIREMENTS
5. SYSTEM DESIGN
6. MODULES
6.2 PRE-PROCESSING
7. ARCHITECTURE DIAGRAM
8. SYSTEM TESTING
8.2 VERIFICATION
8.3 VALIDATION
9. CONCLUSION
10. REFERENCE
11. APPENDIX
A. OUTPUTS
CHAPTER 1
INTRODUCTION
A social media website is defined as “a website that facilitates meeting
people, finding like minds, communicating and sharing content, and building
community”; this kind of website allows or encourages various types of
activities, such as commercial, social, or a combination of the two. Social
media categories include digital library, e-commerce, entertainment, forum,
geolocation, social bookmark, social review, social game, and social network.
A social network is a subcategory of social media: the social structure of
people who are joined by a common interest. Social media are social channels
of communication using web-based technologies, desktop computers, and mobile
technologies. These technologies create highly interactive platforms through
which individuals, communities, and organizations can share information,
discuss, rate, comment on, and modify user-generated and online content.
These advancements enable communication among businesses, organizations,
communities, and individuals. Social media technologies change the way
individuals and large organizations communicate, and they continue to be
developed.
Sentiment analysis has a wide range of applications in business and public
policy. It is now used for purposes ranging from specific product marketing
to antisocial behaviour recognition. Businesses and organizations have always
been concerned about how they are perceived by the public. This concern
results from a variety of motivations, including marketing and public
relations. Before the era of the Internet, the only ways for an organization
to track its reputation in the media were to hire someone for the specific
task of reading newspapers and manually compiling lists of positive, negative
and neutral references to the organization, or to undertake expensive surveys
of uncertain validity. Today, many newspapers are published online. Some of
them publish dedicated online editions, while others publish the pages of
their print edition in PDF. In addition to newspapers, there is a wide range
of opinionated articles posted online in blogs and other social media. This
opens up the possibility of automatically detecting positive or negative
mentions of an organization in articles published online, thereby
dramatically reducing the effort required to collect this type of
information. To this end, organizations are becoming increasingly interested
in acquiring fine-grained sentiment analysis from news articles. Fine-grained
sentiment analysis is an extremely challenging problem because of the variety
of ways in which opinions can be expressed. News articles present an even
greater challenge, as they usually avoid overt indicators of attitudes.
However, despite their apparent neutrality, news articles can still bear
polarity if they describe events that are objectively positive or negative.
Many techniques used for sentiment analysis involve naïve approaches based on
spotting certain keywords which reveal the author or speaker’s emotions. We
use a Naïve Bayes classifier to perform fine-grained sentiment analysis,
classifying sentences as positive, negative or neutral.
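The keyword-spotting approach mentioned above can be sketched in a few lines
of Python. This is only an illustration of the idea, not the system proposed
in this report: the word lists and the function name classify_by_keywords are
hypothetical, and a real lexicon would be far larger.

```python
# Hypothetical keyword lists, invented for this illustration.
POSITIVE_WORDS = {"good", "great", "happy", "love"}
NEGATIVE_WORDS = {"bad", "poor", "sad", "hate"}


def classify_by_keywords(sentence):
    """Label a sentence by counting positive and negative keywords."""
    # Lower-case each token and strip surrounding punctuation.
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    positives = sum(1 for w in words if w in POSITIVE_WORDS)
    negatives = sum(1 for w in words if w in NEGATIVE_WORDS)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


examples = [
    "I love the new library, great service!",
    "The lecture was bad and the slides were poor.",
    "The exam is on Monday.",
]
labels = [classify_by_keywords(s) for s in examples]
print(labels)
```

A sketch like this ignores context: a sentence such as "not good" would still
be labelled positive, which is one reason a trained classifier is preferred.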
1.2 PROPOSED ALGORITHM
o The Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification, which involves a
high-dimensional training dataset.
o The Naïve Bayes classifier is one of the simplest and most effective
classification algorithms; it helps in building fast machine learning
models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
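The name Naïve Bayes is composed of two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
theorem.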
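Bayes' Theorem:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior Probability: Probability of hypothesis A given the
observed evidence B.
P(B|A) is Likelihood: Probability of the evidence given that hypothesis A is
true.
P(A) is Prior Probability: Probability of the hypothesis before observing
the evidence.
P(B) is Marginal Probability: Probability of Evidence.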
The working of the Naïve Bayes classifier can be understood with the help of
the example below:
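As a minimal sketch of such an example, the snippet below applies Bayes'
theorem to a small, made-up table of observations: whether a tweet contains
the word "great", and its sentiment label. The data and the function name
posterior are hypothetical and serve only to illustrate the calculation.

```python
# Hypothetical observations: (contains the word "great", sentiment label).
observations = (
    [(True, "positive")] * 6
    + [(False, "positive")] * 2
    + [(True, "negative")] * 1
    + [(False, "negative")] * 3
)


def posterior(label, evidence, data):
    """Return P(label | evidence) as P(B|A) * P(A) / P(B)."""
    total = len(data)
    n_label = sum(1 for _, y in data if y == label)
    n_evidence = sum(1 for x, _ in data if x == evidence)
    n_both = sum(1 for x, y in data if x == evidence and y == label)
    prior = n_label / total          # P(A)
    likelihood = n_both / n_label    # P(B|A)
    marginal = n_evidence / total    # P(B)
    return likelihood * prior / marginal


p_pos = posterior("positive", True, observations)
p_neg = posterior("negative", True, observations)
print(round(p_pos, 3), round(p_neg, 3))  # 0.857 0.143
```

With these counts, a tweet containing "great" is assigned to the positive
class, because its posterior probability (6/7) is higher than that of the
negative class (1/7).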
There are three types of Naive Bayes model, which are given below:
o Gaussian: The Gaussian model assumes that continuous features follow a
normal distribution.
o Multinomial: The Multinomial classifier is primarily used for document
classification problems, that is, deciding which category a particular
document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean
variables, such as whether a particular word is present in a document or
not. This model is also well known for document classification tasks.
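The difference between the word-frequency predictors and the Boolean
predictors described above can be seen by turning one document into both
kinds of feature vector. The vocabulary, the sample document and the function
names below are hypothetical.

```python
from collections import Counter

# Hypothetical vocabulary and document, invented for this illustration.
vocabulary = ["good", "bad", "match", "vote"]
document = "good match good vote"


def multinomial_features(text, vocab):
    """Word-frequency predictors: how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def bernoulli_features(text, vocab):
    """Boolean predictors: whether each vocabulary word occurs at all."""
    present = set(text.lower().split())
    return [word in present for word in vocab]


count_vector = multinomial_features(document, vocabulary)
presence_vector = bernoulli_features(document, vocabulary)
print(count_vector)     # [2, 0, 1, 1]
print(presence_vector)  # [True, False, True, True]
```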
Now we will implement the Naive Bayes algorithm using Python. For this, we
will use the "user_data" dataset, which we have used in our other
classification model, so that we can easily compare the Naive Bayes model
with the other models.
Steps to implement:
o Data pre-processing step
o Fitting Naive Bayes to the training set
o Predicting the test result
o Testing the accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result.
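The steps above can be sketched as follows. The "user_data" dataset is not
reproduced here, so the sketch generates a small synthetic stand-in with two
features and two classes. The data, the choice of the Gaussian variant, and
all function names are assumptions made for illustration, and the
visualization step is omitted. Only the Python standard library is used.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)


def make_rows(n, centre, label):
    """Synthetic rows ([feature_1, feature_2], label) around a centre."""
    return [([random.gauss(centre[0], 1.0), random.gauss(centre[1], 1.0)], label)
            for _ in range(n)]


# Step 1: data pre-processing (build the stand-in data, shuffle, split 75/25).
rows = make_rows(50, (0.0, 0.0), 0) + make_rows(50, (4.0, 4.0), 1)
random.shuffle(rows)
split = int(0.75 * len(rows))
train, test = rows[:split], rows[split:]


# Step 2: fit Gaussian Naive Bayes (per-class prior, feature means and stds).
def fit(train_rows):
    model = {}
    for label in {y for _, y in train_rows}:
        columns = list(zip(*[x for x, y in train_rows if y == label]))
        model[label] = {
            "prior": len(columns[0]) / len(train_rows),
            "means": [mean(c) for c in columns],
            "stds": [pstdev(c) or 1e-9 for c in columns],
        }
    return model


def log_gaussian(value, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (value - mu) ** 2 / (2 * sigma ** 2))


# Step 3: predict the class with the highest log posterior score.
def predict(model, x):
    def score(label):
        p = model[label]
        return math.log(p["prior"]) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, p["means"], p["stds"]))
    return max(model, key=score)


# Step 4: confusion matrix (rows: actual class, columns: predicted class).
model = fit(train)
matrix = [[0, 0], [0, 0]]
for x, y in test:
    matrix[y][predict(model, x)] += 1
accuracy = (matrix[0][0] + matrix[1][1]) / len(test)
print(matrix, accuracy)
```

The diagonal of the confusion matrix counts the correct predictions, so the
accuracy is the sum of the diagonal divided by the number of test rows.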
SVM
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes, so that we can easily
put a new data point in the correct category in the future. This best
decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed Support Vector Machine. Consider the diagram below, in which two
different categories are classified using a decision boundary or hyperplane:
Graph 1.2.1
Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of
dogs, and we want a model that can accurately identify whether it is a cat
or a dog. Such a model can be created by using the SVM algorithm. We first
train our model with lots of images of cats and dogs so that it can learn
the different features of cats and dogs, and then we test it with this
strange creature. Because the SVM creates a decision boundary between these
two classes (cat and dog) and chooses the extreme cases (support vectors),
it will look at the extreme cases of cat and dog. On the basis of the
support vectors, it will classify the creature as a cat. Consider the
diagram below:
Fig 1.2.1
The SVM algorithm can be used for face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data.
If a dataset cannot be classified by using a straight line, then such
data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
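The distinction between the two cases can be illustrated with one-dimensional
toy data. In the sketch below, no single threshold separates the classes
using the raw values, but the same data becomes linearly separable once each
value is squared. The data, the squaring feature and the function name are
assumptions chosen for illustration; they only hint at how a non-linear SVM
can still use a straight boundary in a transformed feature space.

```python
# Hypothetical 1-D data: the outer points are class +1, the inner ones -1.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1 if abs(x) >= 2 else -1 for x in points]


def separable_by_threshold(values, classes):
    """True if a single cut point puts each class on its own side."""
    ordered = [y for _, y in sorted(zip(values, classes))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


raw_separable = separable_by_threshold(points, labels)
squared_separable = separable_by_threshold([x * x for x in points], labels)
print(raw_separable)      # False: the classes alternate along the line
print(squared_separable)  # True: a cut between 1 and 4 splits x squared
```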
Support Vectors:
The data points or vectors that are closest to the hyperplane and that
affect the position of the hyperplane are termed support vectors. Since
these vectors support the hyperplane, they are called support vectors.
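A minimal sketch of these ideas is given below. It trains a linear SVM on a
made-up two-dimensional dataset by batch subgradient descent on the
regularized hinge loss, then reports the training point of each class that
lies closest to the hyperplane. The data, the hyperparameters (lam, lr,
epochs) and the function names are assumptions for illustration; a library
solver would normally be used instead.

```python
# Hypothetical toy data: one cluster labelled -1 and one labelled +1.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]


def decision(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b


def train_linear_svm(data, classes, lam=0.01, lr=0.01, epochs=5000):
    """Minimise lam/2 * |w|^2 + mean hinge loss by subgradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [lam * w[0], lam * w[1]], 0.0
        for x, y in zip(data, classes):
            if y * decision(w, b, x) < 1:  # point lies inside the margin
                grad_w[0] -= y * x[0] / n
                grad_w[1] -= y * x[1] / n
                grad_b -= y / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def closest_point(w, b, data, classes, target):
    """Training point of class `target` nearest to the hyperplane."""
    candidates = [x for x, y in zip(data, classes) if y == target]
    return min(candidates, key=lambda x: abs(decision(w, b, x)))


w, b = train_linear_svm(points, labels)
predictions = [1 if decision(w, b, x) > 0 else -1 for x in points]
support_candidates = [closest_point(w, b, points, labels, t) for t in (-1, 1)]

new_point = (2.5, 2.0)  # an unseen point lying nearer the -1 cluster
new_label = 1 if decision(w, b, new_point) > 0 else -1
print(support_candidates, new_label)
```

In this toy data the closest points are (2, 2) and (5, 5): moving either of
them would move the boundary, while the remaining points could be removed
without changing it.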
Linear SVM: