
Natural Language Processing

Info 159/259
Lecture 1: Introduction (Jan 18, 2022)

David Bamman, UC Berkeley


NLP is interdisciplinary
• Artificial intelligence

• Machine learning (ca. 2000 to today): statistical models, neural networks

• Linguistics (representation of language)

• Social sciences/humanities (models of language in use in culture/society)
NLP = processing* language with computers

*processing as “understanding”
Turing test

Distinguishing human vs. computer only through written language.

Turing 1950
Dave Bowman: Open the pod bay doors, HAL.
HAL: I’m sorry, Dave. I’m afraid I can’t do that.

Complex human emotion mediated through language:

• HAL (2001): mission execution
• Samantha (Her): love
• IG-11 (The Mandalorian): “to nurse and protect”

Where we are now
Q: What is your favorite animal?
A: My favorite animal is a dog.

Q: Why?
A: Because dogs are loyal and friendly.

Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.

Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

Lacker (2020), “Giving GPT-3 a Turing Test” (https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html); Brown et al. (2020), “Language Models are Few-Shot Learners” (NeurIPS)
What makes language hard?

• Language is a complex social process

• Tremendous ambiguity at every level of representation

• Modeling it is AI-complete (requires first solving general AI)
What makes language hard?

• Speech acts (“can you pass the salt?”) [Austin 1962, Searle 1969]

• Conversational implicature (“The opera singer was amazing; she sang all of the notes”) [Grice 1975]

• Shared knowledge (“Warren ran for president”)

• Variation/indexicality (“This homework is wicked hard”) [Labov 1966, Eckert 2008]

[Images: Elizabeth Warren (2020); Warren G. Harding (1920)]
Ambiguity

“One morning I shot an elephant in my pajamas”

(Animal Crackers)

Ambiguity appears at every level here: “shot” is tagged as a verb (though it can also be a noun), “elephant” as a noun, and the phrase “in my pajamas” can attach either to “shot” or to “elephant.”
processing as representation

• NLP generally involves representing language for some end, e.g.:

  • dialogue
  • translation
  • speech recognition
  • text analysis
Information theoretic view

X = “One morning I shot an elephant in my pajamas”

encode(X) → decode(encode(X))

Shannon 1948

Information theoretic view

X = ⼀天早上我穿着睡⾐射了⼀只⼤象 (“One morning I shot an elephant in my pajamas,” in Chinese)

encode(X) → decode(encode(X))

“When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

Weaver 1955
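To make the encode/decode framing concrete, here is a minimal noisy-channel sketch (not from the lecture; the candidate sentences and all probabilities are invented for illustration): the decoder recovers the most probable source X given an observed signal y, scoring candidates by P(X) * P(y | X).

```python
# Minimal noisy-channel decoding sketch; all probabilities are toy values.
# decode(y) = argmax over candidate sources x of P(x) * P(y | x)

source_prior = {                       # P(x): toy language model over sources
    "I shot an elephant": 0.7,
    "eye shot an elephant": 0.3,
}

channel = {                            # P(y | x): toy channel model
    ("I shot an elephant", "eye shot an elefant"): 0.3,
    ("eye shot an elephant", "eye shot an elefant"): 0.4,
}

def decode(y):
    """Return the source x maximizing P(x) * P(y | x)."""
    return max(source_prior, key=lambda x: source_prior[x] * channel.get((x, y), 0.0))

# The language-model prior rescues the intended reading despite the noisy signal:
print(decode("eye shot an elefant"))  # -> "I shot an elephant" (0.7*0.3 > 0.3*0.4)
```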
Rational speech act view

“One morning I shot an elephant in my pajamas”

Communication involves recursive reasoning: how can X choose words to maximize understanding by Y?

Frank and Goodman 2012

Pragmatic view

“One morning I shot an elephant in my pajamas”

Meaning is co-constructed by the interlocutors and the context of the utterance.
Whorfian view

“One morning I shot an elephant in my pajamas”
⼀天早上我穿着睡⾐射了⼀只⼤象 (the same sentence in Chinese)

Weak relativism: the structure of language influences thought.
Decoding

“One morning I shot an elephant in my pajamas”

decode(encode(X)) unpacks into successive levels of linguistic representation:

discourse
semantics
syntax
morphology
words
Words

• One morning I shot an elephant in my pajamas

• I didn’t shoot an elephant

• Imma let you finish but Beyonce had one of the best videos of all time

• ⼀天早上我穿着睡⾐射了⼀只⼤象
Parts of speech

    noun      verb        noun          noun
One morning I shot an elephant in my pajamas

Named entities

                        person
Imma let you finish but Beyonce had one of the best videos of all time
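As a concrete sketch (not part of the slides), off-the-shelf NLTK can produce both layers of annotation; the model package names below are the standard NLTK downloads circa 2022:

```python
# POS tagging and named entity recognition with NLTK (models downloaded once).
import nltk

for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

tokens = nltk.word_tokenize("One morning I shot an elephant in my pajamas")
print(nltk.pos_tag(tokens))   # e.g. [..., ('shot', 'VBD'), ..., ('elephant', 'NN'), ...]

# ne_chunk wraps tagged tokens in entity spans such as (PERSON Beyonce/NNP).
sent = "Imma let you finish but Beyonce had one of the best videos of all time"
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))))
```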
Syntax

One morning I shot an elephant in my pajamas
(dependency arcs: subj, dobj, nmod)
Sentiment analysis

“Unfortunately I already had this exact picture tattooed on my chest, but this shirt is very useful in colder weather.”
[overlook1977]
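A minimal sketch of how a task like this is typically framed (the training texts and labels below are invented): map bag-of-words features to a label with a linear classifier.

```python
# Toy sentiment classifier: bag-of-words counts + logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["very useful in colder weather", "great shirt, love it",
               "terrible quality, do not buy", "complete waste of money"]
train_labels = ["pos", "pos", "neg", "neg"]          # invented labels

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(["this shirt is very useful"]))  # expected: ['pos']
```

Note that the review above is sarcastic, which is exactly the kind of signal a bag-of-words model misses.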
NLP
• Machine translation

• Question answering

• Information extraction

• Conversational agents

• Summarization
NLP + X
Computational Social Science
• Inferring ideal points of politicians based on voting behavior, speeches

• Detecting the triggers of censorship in blogs/social media

• Inferring power differentials in language use

[Figure: link structure in political blogs; Adamic and Glance 2005]
Computational Journalism

• Robust import
• Robust analysis
• Search, not exploration
• Quantitative summaries
• Interactive methods
• Clarity and accuracy
Computational Humanities
• Ted Underwood (2018), “Why Literary Time is Measured in Minutes”
• Ryan Heuser, Franco Moretti, Erik Steiner (2016), The Emotions of London
• Richard Jean So and Hoyt Long (2015), “Literary Pattern Recognition”
• Ted Underwood, David Bamman and Sabrina Lee (2018), The Transformation of Gender in English-Language Fiction
• Franco Moretti (2005), Graphs, Maps, Trees
• Holst Katsma (2014), Loudness in the Novel
• So et al. (2014), “Cents and Sensibility”
• Matt Wilkens (2013), “The Geographic Imagination of Civil War Era American Fiction”
• Jockers and Mimno (2013), “Significant Themes in 19th-Century Literature”
• Ted Underwood and Jordan Sellers (2012), “The Emergence of Literary Diction,” JDH
[Figure: fraction of words about female characters in English-language fiction, 1820-2000, with series for “words about women” and the share written by women vs. written by men]

Ted Underwood, David Bamman, and Sabrina Lee (2018), “The Transformation of Gender in English-Language Fiction,” Cultural Analytics
Text-driven forecasting
Methods

• Finite state automata/transducers (tokenization, morphological analysis)

• Rule-based systems
Methods
• Probabilistic models

• Naive Bayes, Logistic regression, HMM, MEMM, CRF, language models

$$P(Y=y \mid X=x) = \frac{P(Y=y)\,P(X=x \mid Y=y)}{\sum_{y'} P(Y=y')\,P(X=x \mid Y=y')}$$
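A minimal sketch of this decision rule (the class priors and word likelihoods below are invented): since the denominator is the same for every class, classification only needs the numerator.

```python
# Bayes decision rule with toy, made-up probabilities:
# choose y maximizing P(Y=y) * P(X=x | Y=y); the denominator is shared.
import math

prior = {"pos": 0.5, "neg": 0.5}                    # P(Y=y)
likelihood = {                                      # P(word | Y=y), toy values
    "pos": {"useful": 0.10, "terrible": 0.01},
    "neg": {"useful": 0.02, "terrible": 0.12},
}

def classify(words):
    """Naive Bayes: score each class by log P(y) + sum of log P(w|y)."""
    scores = {
        y: math.log(prior[y]) + sum(math.log(likelihood[y].get(w, 1e-6)) for w in words)
        for y in prior
    }
    return max(scores, key=scores.get)

print(classify(["useful"]))     # -> 'pos'
print(classify(["terrible"]))   # -> 'neg'
```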
Methods
• Dynamic programming (combining solutions to subproblems): Viterbi algorithm, CKY

[Figure: Viterbi lattice, SLP3 ch. 9]
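A sketch of the Viterbi dynamic program over a toy HMM (all parameters invented): each lattice cell stores the best score of any tag sequence ending in that tag at that position, built from the previous column's cells.

```python
# Minimal Viterbi sketch over a toy HMM (all probabilities invented).
import math

states = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}                   # P(s_1)
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},         # P(s_t | s_{t-1})
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit = {"NOUN": {"elephant": 0.5, "shot": 0.1},      # P(word | s)
        "VERB": {"elephant": 0.05, "shot": 0.6}}

def viterbi(words):
    # First column: start probability times emission, in log space.
    v = [{s: math.log(start[s]) + math.log(emit[s].get(words[0], 1e-6)) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            col[s] = v[-1][best] + math.log(trans[best][s]) + math.log(emit[s].get(w, 1e-6))
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    # Trace backpointers from the best final state.
    seq = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

print(viterbi(["shot", "elephant"]))  # -> ['VERB', 'NOUN']
```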


Methods
• Dense representations for features/labels (generally: inputs and outputs)
  Srikumar and Manning (2014), “Learning Distributed Representations for Structured Output Prediction” (NIPS)

• Neural networks: multiple, highly parameterized layers of (usually non-linear) interactions mediating the input/output
  Vaswani et al. (2017), “Attention is All You Need” (NeurIPS)
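To make “dense representations” concrete, a tiny sketch (the 3-dimensional vectors are invented; real embeddings are learned and have hundreds of dimensions): words become real-valued vectors, and similarity falls out of geometry.

```python
# Dense word vectors: each word is a point in R^d (toy 3-d values).
import numpy as np

vec = {
    "elephant": np.array([0.9, 0.1, 0.2]),
    "rhino":    np.array([0.8, 0.2, 0.1]),
    "pajamas":  np.array([0.1, 0.8, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["elephant"], vec["rhino"]))    # high: related words close together
print(cosine(vec["elephant"], vec["pajamas"]))  # low: unrelated words far apart
```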
Methods
• Latent variable models (specifying probabilistic structure between variables and inferring likely latent values)

  Nguyen et al. 2015, “Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress”
Info 159/259
• This is a class about models.

  • You’ll learn and implement algorithms to solve NLP tasks efficiently and understand the fundamentals needed to innovate new methods.

• This is a class about the linguistic representation of text.

  • You’ll annotate texts for a variety of representations so you understand the phenomena you’ll be modeling.
Prerequisites

• Strong programming skills

  • Translate pseudocode into code (Python)

• Analysis of algorithms (big-O notation)

• Basic probability/statistics

• Calculus

[Figure: Viterbi algorithm, SLP3 ch. 8]

$\frac{d\,x^2}{dx} = 2x$
Grading
• Info 159:

• Homeworks (25%)

• Annotation project (25%)

• Weekly quizzes (10%)

• Midterm (20%)

• NLP subfield survey (20%)


Annotation project

• This course covers many of the methods and existing tasks in NLP (see the list below).

• But the most exciting applications of NLP have yet to be invented.

• Design a new NLP task and annotate data to support it, working in groups of exactly 3 students.

Existing tasks: question answering, named entity recognition, sentiment analysis, machine translation, syntactic parsing, coreference resolution, text generation, word sense disambiguation.
Respect

• Present one dialogue turn (police/driver) to be rated by people for respect (4-point Likert scale). High IAA.

• Build a predictive model mapping text to respect.

Voigt et al. 2017, “Language from police body camera footage shows racial disparities in officer respect”
Time Passage

Underwood (2018), “Why Literary Time Is Measured in Minutes”: measuring how much time has passed in 250-word chunks of text.

Passage (5.0 mins): “I fear then, Emma, Sewell is a knave, and joined in mean collusion with his brother, to distress your husband, who looks upon him as his friend. You are deceived, Charles, I am sure he is Sir James's friend, and mine, by his perpetually dissuading him from play. It may be so; but tell me, Emma, all you know, and all you think of Lady Juliana's sudden departure, what can it mean? …”

Passage (15.0 mins): “At length we reached the gates of this noble edifice, and had the pleasure to find the family not retired to rest, by perceiving lights in the hall. … In a few minutes all was hushed, and a man, whom I believed to be an upper servant, was sent to reconnoitre my person, and enquire my name and business. I told him I should not reveal either, but to his master. …”


Dogmatism

Given a comment, imagine you hold a well-informed, different opinion from the commenter in question. We’d like you to tell us how likely that commenter would be to engage you in a constructive conversation about your disagreement, where you each are able to explore the other’s beliefs. The options are:

(5): It’s unlikely you’ll be able to engage in any substantive conversation. When you respectfully express your disagreement, they are likely to ignore you or insult you or otherwise lower the level of discourse.

(4): They are deeply rooted in their opinion, but you are able to exchange your views without the conversation degenerating too much.

(3): It’s not likely you’ll be able to change their mind, but you’re easily able to talk and understand each other’s point of view.

(2): They may have a clear opinion about the subject, but would likely be open to discussing alternative viewpoints.

(1): They are not set in their opinion, and it’s possible you might change their mind. If the comment does not convey an opinion of any kind, you may also select this option.

Fast and Horvitz (2016), “Identifying Dogmatism in Social Media: Signals and Models”
AP deliverables
• AP1. Design a new task (either document classification or sequence labeling) and
gather data to support it (must be shareable with the public — nothing private or in
copyright).

• AP2. Annotate the data, creating at least 1000 labeled examples + robust set of
annotation guidelines, reporting inter-annotator agreement rates.

• AP3. In a separate assignment, a different group will annotate your data only using
your annotation guidelines (are your guidelines comprehensive enough that an
independent third party could reproduce your judgments?).

• AP4. Build a classifier to automatically predict the labels using the data you've
annotated.
NLP subfield survey
• 4-page survey for a specific NLP subfield of your choice (e.g., coreference
resolution, question answering, interpretability, narrative generation, etc.),
synthesizing at least 25 papers published at ACL, EMNLP, NAACL, EACL,
AACL, Transactions of the ACL or Computational Linguistics.

• This survey should be able to provide a newcomer (such as yourself at the start of the semester) a sense of the current state of the art in that subfield in 2022, the major historical papers that have defined that area, and the different schools of thought within it.
Grading
• Info 259:

• Homeworks (20%)

• Annotation project (20%)

• Weekly quizzes (10%)

• Midterm (20%)

• Project (30%)
259 Project
• Semester-long project (teams of 1-3 students) involving natural language processing -- either focusing on core NLP methods or using NLP in support of an empirical research question

• Project proposal/literature review


• Midterm report
• 6-page final report, workshop quality
• Poster presentation
ACL 2021 workshops
• *SEM 2021: The 10th Joint Conference on Lexical and Computational Semantics

• 2nd International Workshop on Computational Approaches to Historical Language Change (LChange’21)

• Workshop on Natural Language Processing for Programming

• Third Workshop on Gender Bias in Natural Language Processing

• Workshop on Online Abuse and Harms

• 17th Workshop on Multiword Expressions (MWE 2021)

• 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

• Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)
Exams

• We’ll have one exam:

• Midterm (3/10, 2-3:30pm, remote).

• We will not be offering alternative exam dates, so if you anticipate a conflict, don’t take this class!
Late submissions
• All homeworks and quizzes are due on the date/time specified.

• You have 3 late days total over the semester to use when turning in
homeworks/quizzes (not group annotation project deliverables or 259 project
deliverables); each day extends the deadline by 24 hours. If all late days
have been used up, homeworks/quizzes can be turned in up to 48 hours late
for 50% credit; anything submitted after 48 hours late = 0 credit.

• Late days are assessed immediately once homeworks or quizzes are submitted late and can't be retroactively changed (if you submit 2 homeworks and 2 quizzes late, for example, you can't decide after the fact which ones to apply your 3 slip days to -- they apply to whichever homeworks or quizzes use them up first).
Academic integrity

• We’ll follow the UC Berkeley code of conduct:
  http://sa.berkeley.edu/code-of-conduct

• You may discuss homeworks at a high level with your classmates (if you
do, include their names on the submission), but each homework
deliverable must be completed independently -- all writing and code
must be your own; and all quizzes and exams must be completed
independently.
Academic integrity
• If you mention the work of others, you must be clear in citing the appropriate source:
  http://gsi.berkeley.edu/gsi-guide-contents/academic-misconduct-intro/plagiarism/

• This holds for source code as well: if you use others' code (e.g., from
StackOverflow), you must cite its source.

• We have a zero-tolerance policy for cheating and plagiarism; violations will be referred to the Center for Student Conduct and will likely result in failing the class.
Curve

Grades in this class will not be curved.


Lectures

• Recordings of lectures will be available on bCourses.

• Attendance is not required for lectures.


Piazza
• We'll use Piazza as a platform for asking and answering questions
about the course material, including homeworks.

• Students are encouraged to actively participate on this forum and help others by answering questions that arise (helpful students can see a grade bump across a threshold, e.g., B+ to A-, for this participation).

• When helping with homework questions, keep the discussion to the high-level concepts; don't post answers to homeworks or quiz/exam questions.
TAs

• Gautham Koorma (Mon 2-3:30pm)
• Manav Rathod (Tues 3:30-5pm)
• Jerry Shan (Wed 10:30-12pm)
• Shefali Bhatia (Wed 1:30-3pm)
• Tim Schott (Thurs 9-10:30am)
• Aayushi Sanghi (Fri 10-11:30am)

• Visit TA office hours for help with homeworks/quizzes/exams or just to chat about NLP.

• TA OH will be held through Zoom.
TAs
• Keep academic integrity in mind during TA office hours: you may discuss homework questions at a high level with others present, but don't discuss specific answers or share screens with code solutions. Neither TA office hours nor Piazza should be used for pre-grading (asking if a specific answer to a homework or quiz question is correct before the assignment is due).
DB office hours

• DB office hours Wed + Thurs 10am-11am (Zoom link on bCourses)

• Come talk to me to discuss concepts from class and NLP more generally
— I’m happy to chat!
Next time:

Lexical semantics/static word embeddings
