1 Intro
Info 159/259
Lecture 1: Introduction (Jan 18, 2022)
Turing 1950
Dave Bowman: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
Animal Crackers
Ambiguity
Processing as representation
• dialogue
• translation
• speech recognition
• text analysis
Information theoretic view
X: "One morning I shot an elephant in my pajamas"
encode(X) → decode(encode(X))
Shannon 1948
Information theoretic view
X: 一天早上我穿着睡衣射了一只大象 ("One morning I shot an elephant in my pajamas," in Chinese)
encode(X) → decode(encode(X))
When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'
Weaver 1955
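Weaver's "decoding" view is commonly formalized as a noisy-channel objective; the equation below is a standard textbook formulation added here for clarity (it does not appear on the slide). Given an observed source-language sentence $f$ (e.g., the Chinese sentence above), we recover the English sentence:

$$\hat{e} = \arg\max_{e} \; P(e)\, P(f \mid e)$$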
Rational speech act view
一天早上我穿着睡衣射了一只大象 → decode(encode(X))
[Diagram: encoding and decoding each pass through levels of linguistic representation: words, morphology, syntax, semantics, discourse]
Words
"Imma let you finish but Beyonce had one of the best videos of all time" (with "Beyonce" labeled as a person)
Syntax
[Dependency parse of the sentence below, with relations such as subj, dobj, and nmod]
"Unfortunately I already had this exact picture tattooed on my chest, but this shirt is very useful in colder weather." [overlook1977]
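A parse like the one on this slide can be produced with any off-the-shelf dependency parser. Below is a minimal sketch assuming spaCy and its en_core_web_sm English model are installed (neither is named on the slide); it prints each token's dependency relation and syntactic head:

```python
# Minimal dependency-parsing sketch (assumes: pip install spacy
# and python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Unfortunately I already had this exact picture tattooed on my chest, "
          "but this shirt is very useful in colder weather.")

# Print each token with its dependency relation (e.g., nsubj, dobj, nmod)
# and the head it attaches to.
for token in doc:
    print(f"{token.text:15s} {token.dep_:10s} head={token.head.text}")
```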
NLP
• Machine translation
• Question answering
• Information extraction
• Conversational agents
• Summarization
NLP + X
Computational Social Science
• Inferring ideal points of
politicians based on voting
behavior, speeches
Ryan Heuser, Franco Moretti, Erik Steiner (2016), The Emotions of London
So et al. (2014), "Cents and Sensibility"
Matt Wilkens (2013), "The Geographic Imagination of Civil War Era American Fiction"
Richard Jean So and Hoyt Long (2015), "Literary Pattern Recognition"
Jockers and Mimno (2013), "Significant Themes in 19th-Century Literature"
Ted Underwood and Jordan Sellers (2012), "The Emergence of Literary Diction," JDH
Ted Underwood, David Bamman and Sabrina Lee (2018), The Transformation of Gender in English-Language Fiction
Franco Moretti (2005), Graphs, Maps, Trees
[Figure: proportion of English-language fiction written by women (and, in a second panel, by men), 1820–2000]
Ted Underwood, David Bamman, and Sabrina Lee (2018), "The Transformation of Gender in English-Language Fiction," Cultural Analytics
Text-driven forecasting
Methods
• Rule-based systems
Methods
• Probabilistic models
$$P(Y=y \mid X=x) = \frac{P(Y=y)\,P(X=x \mid Y=y)}{\sum_{y'} P(Y=y')\,P(X=x \mid Y=y')}$$
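As a concrete toy illustration of the Bayes-rule posterior above (the prior and likelihood values below are invented for illustration; they are not from the lecture):

```python
# Toy Bayes-rule posterior:
# P(Y=y | X=x) = P(Y=y) P(X=x | Y=y) / sum_y' P(Y=y') P(X=x | Y=y')
priors = {"positive": 0.5, "negative": 0.5}          # P(Y=y)
likelihoods = {"positive": 0.02, "negative": 0.001}  # P(X=x | Y=y) for one observed x

evidence = sum(priors[y] * likelihoods[y] for y in priors)
posterior = {y: priors[y] * likelihoods[y] / evidence for y in priors}

print(posterior)  # e.g. {'positive': 0.952..., 'negative': 0.047...}
```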
Methods
• Dynamic programming (combining solutions to subproblems), e.g., the Viterbi algorithm (SLP3 ch. 8) and CKY; a minimal Viterbi sketch appears below
• Basic probability/statistics
• Calculus
$$\frac{d}{dx} x^2 = 2x$$
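A minimal Viterbi sketch for a toy two-state HMM; the states, probabilities, and observation sequence below are invented for illustration (they are not from the lecture or SLP3):

```python
# Viterbi decoding over a toy HMM, in log space.
import math

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"flies": 0.4, "like": 0.1, "bananas": 0.5},
          "VERB": {"flies": 0.3, "like": 0.6, "bananas": 0.1}}

def viterbi(obs):
    # V[t][s] = log-probability of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = best_prev
    # Follow backpointers from the best final state to recover the path.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["flies", "like", "bananas"]))  # ['NOUN', 'VERB', 'NOUN']
```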
Grading
• Info 159:
• Homeworks (25%)
• Midterm (20%)
…
Respect
Dogmatism
…you to tell us how likely that commenter would be to engage you in a constructive conversation about your disagreement, where you each are able to explore the other's beliefs. The options are:
(4): They are deeply rooted in their opinion, but you are able to exchange your views without the conversation degenerating too much.
(3): It's not likely you'll be able to change their mind, but you're easily able to talk and understand each other's point of view.
(2): They may have a clear opinion about the subject, but would likely be open to discussing alternative viewpoints.
(1): They are not set in their opinion, and it's possible you might change their mind. If the comment does not convey an opinion of any kind, you may also select this option.
Fast and Horvitz (2016), “Identifying Dogmatism in Social Media:
Signals and Models”
AP deliverables
• AP1. Design a new task (either document classification or sequence labeling) and
gather data to support it (must be shareable with the public — nothing private or in
copyright).
• AP2. Annotate the data, creating at least 1000 labeled examples + robust set of
annotation guidelines, reporting inter-annotator agreement rates (a minimal agreement computation is sketched after this list).
• AP3. In a separate assignment, a different group will annotate your data only using
your annotation guidelines (are your guidelines comprehensive enough that an
independent third party could reproduce your judgments?).
• AP4. Build a classifier to automatically predict the labels using the data you've
annotated.
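For AP2's agreement reporting, one common measure is Cohen's kappa. A minimal sketch, assuming scikit-learn is installed and using two invented annotators' labels for illustration:

```python
# Cohen's kappa corrects raw percent agreement for agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "pos", "pos", "neg", "neu", "pos", "neg"]
annotator_b = ["pos", "neg", "pos", "neu", "neg", "neu", "pos", "pos"]

print(cohen_kappa_score(annotator_a, annotator_b))
```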
NLP subfield survey
• 4-page survey for a specific NLP subfield of your choice (e.g., coreference
resolution, question answering, interpretability, narrative generation, etc.),
synthesizing at least 25 papers published at ACL, EMNLP, NAACL, EACL,
AACL, Transactions of the ACL or Computational Linguistics.
• Info 259:
• Homeworks (20%)
• Midterm (20%)
• Project (30%)
259 Project
• Semester-long project (involving 1-3 students) involving natural
language processing -- either focusing on core NLP methods or using
NLP in support of an empirical research question
• Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)
Exams
• You have 3 late days total over the semester to use when turning in
homeworks/quizzes (not group annotation project deliverables or 259 project
deliverables); each day extends the deadline by 24 hours. If all late days
have been used up, homeworks/quizzes can be turned in up to 48 hours late
for 50% credit; anything submitted after 48 hours late = 0 credit.
• You may discuss homeworks at a high level with your classmates (if you
do, include their names on the submission), but each homework
deliverable must be completed independently -- all writing and code
must be your own; and all quizzes and exams must be completed
independently.
Academic integrity
• If you mention the work of others, you must be clear in citing the
appropriate source:
http://gsi.berkeley.edu/gsi-guide-contents/academic-misconduct-intro/plagiarism/
• This holds for source code as well: if you use others' code (e.g., from
StackOverflow), you must cite its source.
• We have a zero-tolerance policy for cheating and plagiarism; violations will
be referred to the Center for Student Conduct and will likely result in failing
the class.
Curve
• Come talk to me to discuss concepts from class and NLP more generally
— I’m happy to chat!
Next time: