Recap Unsupervised Machine Learning Supervised Machine Learning Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Machine Learning«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
20 March 2017
Big Data and Automated Content Analysis Damian Trilling
Today
1 Recap: Types of Automated Content Analysis
2 Unsupervised Machine Learning
PCA
LDA
3 Supervised Machine Learning
You have done it before!
Applications
An implementation
4 Next meetings
Recap: Types of Automated Content Analysis
Methodological approach: from deductive to inductive

Counting and Dictionary (deductive)
• Typical research interests and content features: visibility analysis, sentiment analysis, subjectivity analysis
• Common statistical procedures: string comparisons, counting

Supervised Machine Learning
• Typical research interests and content features: frames, topics, gender bias
• Common statistical procedures: support vector machines, naive Bayes

Unsupervised Machine Learning (inductive)
• Typical research interests and content features: frames, topics
• Common statistical procedures: principal component analysis, cluster analysis, latent Dirichlet allocation, semantic network analysis
Top-down vs. bottom-up
Some terminology
Supervised machine learning
You have a dataset with both
predictor and outcome
(independent and dependent
variables; features and labels) —
a labeled dataset. Think of
regression: You measured x1, x2, x3
and you want to predict y, which you
also measured
Unsupervised machine learning
You have no labels. (You did not
measure y)
Again, you already know some
techniques from other courses to
find out how x1, x2, ..., x_i
co-occur:
• Principal Component Analysis
(PCA)
• Cluster analysis
• . . .
inductive and bottom-up:
unsupervised machine learning
(something you already did in your Bachelor – no kidding.)
PCA
Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression
PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word
frequencies and run a PCA
A so-called term-document matrix
       w1, w2, w3, w4, w5, w6 ...
text1,  2,  0,  0,  1,  2,  3 ...
text2,  0,  0,  1,  2,  3,  4 ...
text3,  9,  0,  1,  1,  0,  0 ...
...
The cells can be simple counts, but also more advanced metrics, like
tf-idf scores (where you weight the frequency of a word by the inverse of the number of
documents in which it occurs), cosine distances, etc.
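To make the matrix above concrete, here is a minimal sketch of how such a table of simple counts could be built by hand. The three toy texts and all variable names are made up for illustration; real projects would normally rely on a library vectorizer instead.

```python
from collections import Counter

# Hypothetical toy corpus; the three texts are invented for illustration.
documents = {
    "text1": "tax deficit tax budget",
    "text2": "world cup final cup",
    "text3": "budget deficit world",
}

# Tokenize naively on whitespace and count the words of each document.
counts = {name: Counter(text.lower().split()) for name, text in documents.items()}

# The vocabulary (the columns w1, w2, ...) is every word seen anywhere, sorted.
vocabulary = sorted(set(word for c in counts.values() for word in c))

# One row of word counts per document, in vocabulary order.
matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(", ".join(vocabulary))
for name, row in matrix.items():
    print(name, row)
```

Each row is the word-frequency vector of one document; these rows are what a PCA (or any other procedure working on the matrix) would take as input.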
PCA: implications and problems
• given a term-document matrix, easy to do with any tool
• probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA, to find
a solution in which each word loads on one component, match
real life, where a word can belong to several topics or frames?
LDA
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what’s that?
No mathematical details here, but the general idea
• There are k topics, T_1 ... T_k
• Each document D_i consists of a mixture of these topics,
e.g. 80% T_1, 15% T_2, 0% T_3, ..., 5% T_k
• On the next level, each topic consists of a specific probability
distribution of words
• Thus, based on the frequencies of words in D_i, one can infer
its distribution of topics
• Note that LDA (like PCA) is a Bag-of-Words (BOW)
approach
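The two levels described above can be illustrated with a few lines of arithmetic. This is only a sketch of the "forward" direction (from topic shares to word probabilities) with invented numbers; it is not an LDA estimation, which works in the opposite direction and infers the topic shares from observed word frequencies.

```python
# Hypothetical numbers: two topics, each a probability distribution over words.
topic_word = {
    "T1": {"tax": 0.5, "deficit": 0.4, "cup": 0.1},
    "T2": {"tax": 0.1, "deficit": 0.1, "cup": 0.8},
}

# One document described as a mixture of the topics (shares sum to 1).
doc_topics = {"T1": 0.8, "T2": 0.2}

# Probability of each word in the document: sum over topics of
# P(topic | document) * P(word | topic).
vocabulary = sorted(topic_word["T1"])
word_probs = {
    word: sum(doc_topics[t] * topic_word[t][word] for t in doc_topics)
    for word in vocabulary
}

for word, p in word_probs.items():
    print(f"{word}: {p:.2f}")
```

A document dominated by T1 is expected to contain mostly T1's typical words, which is why word frequencies carry information about the topic mixture.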
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.
Let us assume you have a list of lists of words (!) called texts:
articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts = [art.split() for art in articles]
which looks like this:
[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics.
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

# inference() returns a tuple; its first element holds the topic scores per document
scoresperdoc = lda.inference(mm)

# Write one tab-separated row of topic scores per document
with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)
0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
...
Visualization with pyLDAvis
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
# first estimate the gensim model, then:
vis_data = pyLDAvis.gensim.prepare(lda, mm, id2word)
pyLDAvis.display(vis_data)
predefined categories, but no predefined rules:
supervised machine learning
(something you already did in your Bachelor – no kidding.)
Recap: supervised vs. unsupervised
Unsupervised
• No manually coded data
• We want to identify patterns
or to make groups of most
similar cases
Example: We have a dataset of
Facebook messages on an
organization’s page. We use clustering
to group them and later interpret these
clusters (e.g., as complaints, questions,
praise, ...)
Supervised
• We code a small dataset by
hand and use it to “train” a
machine
• The machine codes the rest
Example: We have 2,000 of these
messages grouped into such categories
by human coders. We then use this
data to group all remaining messages
as well.
You have done it before!
Regression
1. Based on your data, you estimate some regression equation
$y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$
2. Even if you have some new unseen data, you can estimate
your expected outcome $\hat{y}$!
3. Example: You estimated a regression equation where y is
newspaper reading in days/week:
$y = -0.8 + 0.4 \times \text{man} + 0.08 \times \text{age}$
4. You could now calculate $\hat{y}$ for a man of 20 years and a woman
of 40 years, even if no such person exists in your dataset:
$\hat{y}_{\text{man20}} = -0.8 + 0.4 \times 1 + 0.08 \times 20 = 1.2$
$\hat{y}_{\text{woman40}} = -0.8 + 0.4 \times 0 + 0.08 \times 40 = 2.4$
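The same calculation can be written as a tiny function, which makes the "predict for unseen cases" idea explicit. The function name is made up for illustration; the coefficients are the ones from the example equation above.

```python
def predict_days_per_week(man: int, age: float) -> float:
    """Apply the example equation: y = -0.8 + 0.4 * man + 0.08 * age."""
    return -0.8 + 0.4 * man + 0.08 * age


# Two hypothetical people who do not need to exist in the dataset.
y_man20 = predict_days_per_week(man=1, age=20)
y_woman40 = predict_days_per_week(man=0, age=40)

print(round(y_man20, 2), round(y_woman40, 2))  # prints 1.2 2.4
```

Once the coefficients are estimated, applying them to new values of the predictors requires no further data collection.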
This is
Supervised Machine Learning!
... but ...
• We will only use half (or another fraction) of our data to estimate the
model, so that we can use the other half to check if our
predictions match the manual coding (“labeled
data”, “annotated data” in SML-lingo)
• e.g., 2000 labeled cases, 1000 for training, 1000 for testing;
if successful, run on 100,000 unlabeled cases
• We use many more independent variables (“features”)
• Typically, IVs are word frequencies (often weighted, e.g.
tf×idf) (⇒ BOW representation)
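The 1000/1000 split mentioned above could be sketched as follows. The labeled cases here are placeholders generated on the fly, and the seed value is arbitrary; in practice a library helper would often be used instead, but the principle is the same.

```python
import random

# Hypothetical labeled data: (text, label) pairs standing in for hand-coded cases.
labeled = [(f"document {i}", i % 2) for i in range(2000)]

# Shuffle a copy with a fixed seed so the split is reproducible, then cut in half.
rng = random.Random(42)
shuffled = labeled[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
train, test = shuffled[:half], shuffled[half:]

print(len(train), len(test))  # prints 1000 1000
```

Shuffling before splitting matters: it is what makes it pure chance whether a given case ends up in the training or the test set.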
Applications
In other fields
A lot of different applications
• from recognizing hand-written characters to recommendation
systems
In our field
It is becoming popular for measuring latent variables
• frames
• topics
SML to code frames and topics
Some work by Burscher and colleagues
• Humans can code generic frames (human-interest, economic,
. . . )
• Humans can code topics from a pre-defined list
• But it is very hard to formulate an explicit rule
(as in: code as ’Human Interest’ if regular expression R is
matched)
⇒ This is where you need supervised machine learning!
Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to
code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication
Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527
Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues:
Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1),
122–131.
http://commons.wikimedia.org/wiki/File:Precisionrecall.svg
Some measures of accuracy
• Recall
• Precision
• $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
• AUC (area under the ROC curve): ranges over [0, 1], where 0.5 = random
guessing
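As a worked example of these measures, the sketch below computes precision, recall and F1 by hand. The two label lists are invented; 1 marks the positive class and -1 the negative class.

```python
# Hypothetical labels: 1 = positive class, -1 = negative class.
actual      = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
predictions = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]

pairs = list(zip(actual, predictions))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)    # true positives
fp = sum(1 for a, p in pairs if a == -1 and p == 1)   # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == -1)   # false negatives

# Precision: which share of the predicted positives is really positive?
precision = tp / (tp + fp)
# Recall: which share of the real positives did we find?
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # prints 0.75 0.75 0.75
```

With 3 true positives, 1 false positive and 1 false negative, both precision and recall are 3/4, and so is their harmonic mean.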
What does this mean for our research?
If we have 2,000 documents with manually coded frames and
topics ...
• we can use them to train an SML classifier
• which can code an unlimited number of new documents
• with an acceptable accuracy
Some easier tasks even need only 500 training documents, see Hopkins, D. J., & King, G. (2010). A method of
automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247.
An implementation
Let’s say we have a list of tuples with movie reviews and their
rating:
reviews = [("This is a great movie", 1), ("Bad movie", -1), ... ...]
And a second list with an identical structure:
test = [("Not that good", -1), ("Nice film", 1), ... ...]
Both are drawn from the same population; it is pure chance
whether a specific review is on the one list or the other.
Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/
Training a Naïve Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# This is just an efficient way of computing word counts
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a naive Bayes model to the training data.
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in reviews])

# Now we can use the model to predict classifications for our test features.
predictions = nb.predict(test_features)
actual = [r[1] for r in test]

# Compute the AUC. Note that predictions holds hard class labels here,
# so the ROC curve is built from a single threshold.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomial naive Bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
And it works!
Using 50,000 IMDB movie reviews that are classified as either negative or
positive,
• I created a list with 25,000 training tuples and another one
with 25,000 test tuples and
• trained a classifier
• that achieved an AUC of .82.
Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment, Maas, A.L., Daly, R.E., Pham, P.T.,
Huang, D., Ng, A.Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of
the Association for Computational Linguistics (ACL 2011)
Playing around with new data
newdata = vectorizer.transform(["What a crappy movie! It sucks!", "This is awsome. I liked this movie a lot, fantastic actors", "I would not recomment it to anyone.", "Enjoyed it a lot"])
predictions = nb.predict(newdata)
print(predictions)
This returns, as you would expect and hope:
[-1 1 -1 1]
But we can do even better
We can use different vectorizers and different classifiers.
Different vectorizers
• CountVectorizer (= simple word counts)
• TfidfVectorizer (word counts (“term frequency”) weighted by the inverse of the
number of documents in which the word occurs at all
(“inverse document frequency”))
• additional options: stopwords, thresholds for minimum
frequencies etc.
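To show what the tf-idf weighting does to simple counts, here is a hand-rolled sketch on an invented three-document corpus. It uses the textbook formula (raw count times log(N / document frequency)); this is an assumption made for illustration, and library implementations such as TfidfVectorizer apply additional smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized naively on whitespace.
docs = [
    "great movie great actors".split(),
    "bad movie".split(),
    "great fun".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur at all?
df = Counter(word for doc in docs for word in set(doc))


def tfidf(doc):
    """Textbook weighting: raw term count times log(N / document frequency)."""
    tf = Counter(doc)
    return {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}


weights = [tfidf(doc) for doc in docs]

for word, weight in sorted(weights[0].items()):
    print(f"{word}: {weight:.3f}")
```

In the first document, "actors" occurs once but only in that document, so it ends up with a higher weight than "great", which occurs twice but also appears elsewhere in the corpus.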
Different classifiers
• Naïve Bayes
• Logistic Regression
• Support Vector Machine (SVM)
• . . .
Typical approach: Find out which setup performs best (see
example source code in the book).
Next (last. . . ) meetings
Wednesday
Playing around with machine learning! Choose what you find
interesting (SML or LDA).
Monday: Guest lectures
• Joanna Strycharz: From research questions to analysis:
Integration of data collection, preparation and statistical tests
in Python
• Marthe Möller: Automated content analyses in entertainment
communication: A YouTube study
Wednesday
Final chance for questions regarding final project
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Mathieu DESPRIEE
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
TJ Stalcup
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
OCTO Technology
 
Big Data
Big DataBig Data
Big Data
Santhosh Shankar
 
Neo4j GraphDay Seattle- Sept19- graphs are ai
Neo4j GraphDay Seattle- Sept19-  graphs are aiNeo4j GraphDay Seattle- Sept19-  graphs are ai
Neo4j GraphDay Seattle- Sept19- graphs are ai
Neo4j
 
Computer Science (CSC 102) Lecture 1.pdf
Computer Science (CSC 102) Lecture 1.pdfComputer Science (CSC 102) Lecture 1.pdf
Computer Science (CSC 102) Lecture 1.pdf
victorabioye124
 
Data Science in E-commerce
Data Science in E-commerceData Science in E-commerce
Data Science in E-commerce
Vincent Michel
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data Scientist
Rohit Dubey
 
From DBA to DE: Becoming a Data Engineer
From DBA to DE:  Becoming a Data Engineer From DBA to DE:  Becoming a Data Engineer
From DBA to DE: Becoming a Data Engineer
Jim Czuprynski
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
RahulTr22
 
Data science programming .ppt
Data science programming .pptData science programming .ppt
Data science programming .ppt
Ganesh E
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
kalai75
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
Aravind Reddy
 
Telecom datascience master_public
Telecom datascience master_publicTelecom datascience master_public
Telecom datascience master_public
Vincent Michel
 
Artificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceArtificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of Intelligence
Abhishek Upadhyay
 
Data Structures and Algorithms (DSA) in C
Data Structures and Algorithms (DSA) in CData Structures and Algorithms (DSA) in C
Data Structures and Algorithms (DSA) in C
Nabajyoti Banik
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Mathieu DESPRIEE
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
TJ Stalcup
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
OCTO Technology
 
Neo4j GraphDay Seattle- Sept19- graphs are ai
Neo4j GraphDay Seattle- Sept19-  graphs are aiNeo4j GraphDay Seattle- Sept19-  graphs are ai
Neo4j GraphDay Seattle- Sept19- graphs are ai
Neo4j
 
Computer Science (CSC 102) Lecture 1.pdf
Computer Science (CSC 102) Lecture 1.pdfComputer Science (CSC 102) Lecture 1.pdf
Computer Science (CSC 102) Lecture 1.pdf
victorabioye124
 
Data Science in E-commerce
Data Science in E-commerceData Science in E-commerce
Data Science in E-commerce
Vincent Michel
 
Ad

BDACA1617s2 - Lecture7

  • 1. Recap Unsupervised Machine Learning Supervised Machine Learning Next meetings Big Data and Automated Content Analysis Week 7 – Monday »Machine Learning« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 20 March 2017 Big Data and Automated Content Analysis Damian Trilling
  • 2. Today
      1 Recap: Types of Automated Content Analysis
      2 Unsupervised Machine Learning: PCA, LDA
      3 Supervised Machine Learning: You have done it before!, Applications, An implementation
      4 Next meetings
  • 3. Recap: Types of Automated Content Analysis
  • 4. Methodological approach (from deductive to inductive), with typical research interests and content features, and common statistical procedures:
      Counting and Dictionary (deductive): visibility analysis, sentiment analysis, subjectivity analysis; procedures: string comparisons, counting
      Supervised Machine Learning: frames, topics, gender bias; procedures: support vector machines, naive Bayes
      Unsupervised Machine Learning (inductive): frames, topics; procedures: principal component analysis, cluster analysis, latent Dirichlet allocation, semantic network analysis
  • 6. Top-down vs. bottom-up: some terminology. Supervised machine learning: You have a dataset with both predictor and outcome (independent and dependent variables; features and labels), that is, a labeled dataset. Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.
  • 8. Top-down vs. bottom-up: some terminology. Supervised machine learning: You have a dataset with both predictor and outcome (independent and dependent variables; features and labels), that is, a labeled dataset. Unsupervised machine learning: You have no labels. (You did not measure y.)
  • 9. Top-down vs. bottom-up: some terminology. Unsupervised machine learning: You have no labels. Again, you already know some techniques from other courses to find out how x1, x2, ..., xi co-occur:
      • Principal Component Analysis (PCA)
      • Cluster analysis
      • ...
  • 11. inductive and bottom-up: unsupervised machine learning (something you already did in your Bachelor – no kidding.)
  • 13. PCA. Principal Component Analysis? How does that fit in here? In fact, PCA is used everywhere, even in image compression.
  • 14. PCA. Principal Component Analysis? How does that fit in here? PCA in ACA:
      • Find out which words co-occur (inductive frame analysis)
      • Basically, transform each document into a vector of word frequencies and do a PCA
  • 16. PCA. A so-called term-document matrix:
        w1,w2,w3,w4,w5,w6 ...
        text1, 2, 0, 0, 1, 2, 3 ...
        text2, 0, 0, 1, 2, 3, 4 ...
        text3, 9, 0, 1, 1, 0, 0 ...
        ...
    These can be simple counts, but also more advanced metrics, like tf-idf scores (where you weigh the frequency of a word by the inverse of the number of documents in which it occurs), cosine distances, etc.
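The difference between raw counts and tf-idf scores can be illustrated with a minimal sketch. The three toy documents and the helper name `tfidf` are assumptions made for illustration, and the weighting shown (count times the log of the inverse document frequency) is only one common variant, not necessarily the one used in the course.

```python
import math
from collections import Counter

# Toy corpus (hypothetical example texts, not from the lecture data).
docs = {
    "text1": "tax deficit higher tax".split(),
    "text2": "germany won world cup".split(),
    "text3": "tax cup".split(),
}

# All distinct words, in a fixed column order.
vocab = sorted({w for words in docs.values() for w in words})

# Raw counts: one row per document, one column per word.
counts = {name: Counter(words) for name, words in docs.items()}

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for c in counts.values() if c[w] > 0) for w in vocab}

def tfidf(name, word):
    """Term frequency times log inverse document frequency (one common variant)."""
    return counts[name][word] * math.log(len(docs) / df[word])

print("columns:", vocab)
for name in docs:
    print(name, [round(tfidf(name, w), 2) for w in vocab])
```

Words that occur in many documents get a weight close to zero, so they contribute little to a later PCA or classifier.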
  • 17. PCA. PCA: implications and problems
      • given a term-document matrix, easy to do with any tool
      • probably extremely skewed distributions
      • some problematic assumptions: does the goal of PCA, to find a solution in which each word loads on one component, match real life, where a word can belong to several topics or frames?
  • 18. LDA. Enter topic modeling with Latent Dirichlet Allocation (LDA)
  • 19. LDA. LDA, what’s that? No mathematical details here, but the general idea:
      • There are k topics, T1 ... Tk
      • Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, ... 5% Tk
      • On the next level, each topic consists of a specific probability distribution of words
      • Thus, based on the frequencies of words in Di, one can infer its distribution of topics
      • Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
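The two levels described above (documents as mixtures of topics, topics as distributions over words) can be sketched in a few lines. All numbers and names here (`topics`, `doc_mixture`, `word_probability`) are made up for illustration. The sketch only shows the forward direction of the model; fitting an LDA means going the other way and inferring the mixtures from observed word counts.

```python
# Hypothetical topic-word distributions p(word | topic); each sums to 1.
topics = {
    "T1": {"bank": 0.5, "euro": 0.3, "goal": 0.1, "match": 0.1},
    "T2": {"bank": 0.1, "euro": 0.1, "goal": 0.4, "match": 0.4},
}

# Hypothetical topic mixture p(topic | document) for a single document.
doc_mixture = {"T1": 0.8, "T2": 0.2}

def word_probability(word):
    """p(word | document) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(doc_mixture[t] * topics[t][word] for t in topics)

for w in ["bank", "euro", "goal", "match"]:
    print(w, round(word_probability(w), 2))
```

A document dominated by T1 is therefore expected to contain "bank" and "euro" far more often than "goal" or "match", which is the regularity that inference exploits.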
  • 20. LDA. Doing an LDA in Python. You can use gensim (Řehůřek & Sojka, 2010) for this. Let us assume you have a list of lists of words (!) called texts:
        articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
        texts = [art.split() for art in articles]
    which looks like this:
        [['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
    Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
  • 21.
        from gensim import corpora, models

        NTOPICS = 100
        LDAOUTPUTFILE = "topicscores.tsv"

        # Create a BOW representation of the texts
        id2word = corpora.Dictionary(texts)
        mm = [id2word.doc2bow(text) for text in texts]

        # Train the LDA model.
        lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

        # Print the topics.
        for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
            print("\n", top)

        print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

        # inference() returns a tuple; its first element holds the per-document topic scores
        scoresperdoc = lda.inference(mm)

        # Write one tab-separated row of topic scores per document
        with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
            for row in scoresperdoc[0]:
                fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
                fo.write("\n")
  • 22. LDA. Output: Topics (below) & topic scores (next slide)
        0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
        0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
        0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
        0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
        0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
        0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
        0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
        0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
        0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
        0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
        ...
  • 24. LDA. Visualization with pyLDAvis
        import pyLDAvis
        import pyLDAvis.gensim  # named pyLDAvis.gensim_models in pyLDAvis 3.x and later
        # first estimate the gensim model, then:
        vis_data = pyLDAvis.gensim.prepare(lda, mm, id2word)
        pyLDAvis.display(vis_data)
  • 26. predefined categories, but no predefined rules: supervised machine learning (something you already did in your Bachelor – no kidding.)
  • 30. Recap: supervised vs. unsupervised
      Unsupervised
      • No manually coded data
      • We want to identify patterns or to make groups of most similar cases
      Example: We have a dataset of Facebook messages on an organization’s page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, ...)
      Supervised
      • We code a small dataset by hand and use it to “train” a machine
      • The machine codes the rest
      Example: We have 2,000 of these messages grouped into such categories by human coders. We then use this data to group all remaining messages as well.
  • 36. You have done it before! Regression
      1 Based on your data, you estimate some regression equation $y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$
      2 Even if you have some new unseen data, you can estimate your expected outcome $\hat{y}$!
      3 Example: You estimated a regression equation where y is newspaper reading in days/week: $y = -.8 + .4 \times \text{man} + .08 \times \text{age}$
      4 You could now calculate $\hat{y}$ for a man of 20 years and a woman of 40 years, even if no such person exists in your dataset:
        $\hat{y}_{\text{man20}} = -.8 + .4 \times 1 + .08 \times 20 = 1.2$
        $\hat{y}_{\text{woman40}} = -.8 + .4 \times 0 + .08 \times 40 = 2.4$
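The prediction step in this example amounts to plugging new values into the estimated equation. A minimal sketch follows; the function name `predict_reading_days` is hypothetical, and the coefficients are the ones from the slide's example.

```python
def predict_reading_days(man, age):
    """Predicted newspaper reading (days/week) from the example equation.

    man is 1 for a man and 0 for a woman; age is in years.
    """
    return -0.8 + 0.4 * man + 0.08 * age

# Two cases that need not occur in the data used to estimate the equation.
print(round(predict_reading_days(man=1, age=20), 2))
print(round(predict_reading_days(man=0, age=40), 2))
```

The point of the sketch is that the estimated coefficients are all that is needed to score unseen cases, which is what a trained classifier does as well.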
  • 37. You have done it before! This is Supervised Machine Learning!
  • 41. ... but ...
      • We will only use half (or another fraction) of our data to estimate the model, so that we can use the other half to check whether our predictions match the manual coding (“labeled data” or “annotated data” in SML lingo)
      • e.g., 2000 labeled cases: 1000 for training, 1000 for testing; if successful, run on 100,000 unlabeled cases
      • We use many more independent variables (“features”)
      • Typically, the IVs are word frequencies (often weighted, e.g. tf×idf), i.e. a bag-of-words (BOW) representation
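The split into training and test data and the bag-of-words idea can be sketched with the standard library alone. The helper names `split_labeled_data` and `bag_of_words`, the seed, and the placeholder documents are assumptions made for this illustration; in practice one would typically use a library routine for both steps.

```python
import random
from collections import Counter


def split_labeled_data(labeled_cases, train_fraction=0.5, seed=42):
    """Shuffle labeled cases and split them into a training and a test set."""
    shuffled = list(labeled_cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def bag_of_words(text):
    """Represent a text as word counts, ignoring word order."""
    return Counter(text.lower().split())


# Hypothetical stand-in for 2000 manually coded documents: (text, label) tuples
labeled = [("document number {}".format(i), i % 2) for i in range(2000)]
train_cases, test_cases = split_labeled_data(labeled)
print(len(train_cases), len(test_cases))
print(bag_of_words("a bad bad movie"))
```

Shuffling before splitting matters: it makes it pure chance whether a case ends up in the training or the test set.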
  • 44. Applications
      In other fields: a lot of different applications
      • from recognizing hand-written characters to recommendation systems
      In our field: it is becoming popular for measuring latent variables
      • frames
      • topics
  • 47. SML to code frames and topics
      Some work by Burscher and colleagues
      • Humans can code generic frames (human-interest, economic, ...)
      • Humans can code topics from a pre-defined list
      • But it is very hard to formulate an explicit rule (as in: code as 'Human Interest' if regular expression R is matched)
      ⇒ This is where you need supervised machine learning!
      Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527
      Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues: Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1), 122–131.
  • 51. Some measures of accuracy (illustration: http://commons.wikimedia.org/wiki/File:Precisionrecall.svg)
      • Recall
      • Precision
      • $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
      • AUC (area under the curve), in $[0, 1]$; 0.5 = random guessing
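To make these measures concrete: precision is the share of cases predicted as positive that really are positive, TP / (TP + FP), and recall is the share of truly positive cases that were found, TP / (TP + FN). The sketch below computes both, plus F1, by hand. The function name `precision_recall_f1` and the labels are made up for illustration.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 1 = positive, -1 = negative
actual = [1, 1, 1, -1, -1, -1, -1, 1]
predicted = [1, 1, -1, 1, -1, -1, -1, 1]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)
```

Here three positives are found, one is missed and one negative is wrongly flagged, so precision and recall both come out at 3/4.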
  • 53. What does this mean for our research?
      If we have 2,000 documents with manually coded frames and topics ...
      • we can use them to train an SML classifier
      • which can code an unlimited number of new documents
      • with an acceptable accuracy
      Some easier tasks even need only 500 training documents, see Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247.
  • 54. An implementation
      Let's say we have a list of tuples with movie reviews and their rating:
          reviews=[("This is a great movie",1),("Bad movie",-1), ... ...]
      And a second list with an identical structure:
          test=[("Not that good",-1),("Nice film",1), ... ...]
      Both are drawn from the same population; it is pure chance whether a specific review ends up on one list or the other.
      Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/
  • 55. Training a Naïve Bayes Classifier
          from sklearn.naive_bayes import MultinomialNB
          from sklearn.feature_extraction.text import CountVectorizer
          from sklearn import metrics

          # This is just an efficient way of computing word counts
          vectorizer = CountVectorizer(stop_words='english')
          train_features = vectorizer.fit_transform([r[0] for r in reviews])
          test_features = vectorizer.transform([r[0] for r in test])

          # Fit a naive Bayes model to the training data.
          nb = MultinomialNB()
          nb.fit(train_features, [r[1] for r in reviews])

          # Now we can use the model to predict classifications for our test features.
          predictions = nb.predict(test_features)
          actual = [r[1] for r in test]

          # Compute the AUC. Note: this uses the predicted labels, not predicted probabilities.
          fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
          print("Multinomial naive Bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
  • 56. And it works!
      Using 50,000 IMDB movie reviews that are classified as either negative or positive,
      • I created a list with 25,000 training tuples and another one with 25,000 test tuples and
      • trained a classifier
      • that achieved an AUC of .82.
      Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment, Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)
  • 57. Playing around with new data
          newdata = vectorizer.transform(["What a crappy movie! It sucks!",
                                          "This is awsome. I liked this movie a lot, fantastic actors",
                                          "I would not recomment it to anyone.",
                                          "Enjoyed it a lot"])
          predictions = nb.predict(newdata)
          print(predictions)
      This returns, as you would expect and hope:
          [-1 1 -1 1]
  • 58. But we can do even better: We can use different vectorizers and different classifiers.
  • 59. Different vectorizers
      • CountVectorizer (= simple word counts)
      • TfidfVectorizer (word counts (“term frequency”) weighted by the inverse of the number of documents in which the word occurs at all (“inverse document frequency”))
      • additional options: stopwords, thresholds for minimum frequencies, etc.
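The idea behind tf×idf weighting can be sketched by hand. Note the assumptions: this uses a plain textbook variant, count × log(N / document frequency), on three made-up documents. scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and normalizes each document vector, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Three made-up documents
docs = ["great movie great actors", "bad movie", "great fun"]
tokenized = [d.split() for d in docs]

# Document frequency: in how many documents does each word occur at all?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(docs)


def tfidf(tokens):
    """Weight raw word counts by log(N / document frequency)."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}


weights = [tfidf(tokens) for tokens in tokenized]
print(weights[0])
```

In the first document, "actors" occurs only once but gets a higher weight than "movie", because "movie" also appears in another document and is therefore less distinctive.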
  • 60. Different classifiers
      • Naïve Bayes
      • Logistic Regression
      • Support Vector Machine (SVM)
      • ...
      Typical approach: Find out which setup performs best (see example source code in the book).
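One possible way to find out which setup performs best is to loop over all combinations of vectorizer and classifier. The sketch below is not the example code from the book. The tiny data set is made up purely to keep the example runnable, and plain accuracy is used as the score for simplicity, so the resulting numbers say nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

# Tiny made-up data, only to make the sketch runnable; use real labeled data in practice
train_data = [("This is a great movie", 1), ("Bad movie", -1),
              ("Great actors, nice film", 1), ("Boring and bad", -1)]
test_data = [("Not that good, rather boring", -1), ("Nice film", 1)]

train_texts = [text for text, label in train_data]
train_labels = [label for text, label in train_data]
test_texts = [text for text, label in test_data]
test_labels = [label for text, label in test_data]

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {"naive Bayes": MultinomialNB,
               "logistic regression": LogisticRegression,
               "SVM": LinearSVC}

results = {}
for v_name, make_vectorizer in vectorizers.items():
    for c_name, make_classifier in classifiers.items():
        # A pipeline first vectorizes the texts and then fits the classifier
        model = make_pipeline(make_vectorizer(), make_classifier())
        model.fit(train_texts, train_labels)
        predicted = model.predict(test_texts)
        results[(v_name, c_name)] = metrics.accuracy_score(test_labels, predicted)

for setup, accuracy in sorted(results.items()):
    print(setup, accuracy)
```

Wrapping vectorizer and classifier in one pipeline ensures that the vocabulary is learned from the training texts only and then reused unchanged for the test texts.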
  • 61. Next (last ...) meetings
      Wednesday: Playing around with machine learning! Choose what you find interesting (SML or LDA).
      Monday: Guest lectures
      • Joanna Strycharz: From research questions to analysis: Integration of data collection, preparation and statistical tests in Python
      • Marthe Möller: Automated content analyses in entertainment communication: A YouTube study
      Wednesday: Final chance for questions regarding final project