Lecture of Optimization
https://ecampus.paris-saclay.fr/enrol/index.php?id=115778#section-13
Introduction
Short Story: Automatic Translation
●
1954: first machine translation system (Russian→English), historical context (cold war)
●
1962: first conference on machine translation at MIT, by Joshua Bar-Hillel
– Original (EN): “The spirit is willing but the flesh is weak” EN→RU→EN
– Intermediate (RU): Russian language
– Output (EN): “The whiskey is strong but the meat is rotten”
– “pen”: (i) tool containing ink used to write, (ii) enclosed area, penitentiary: “The box is in the pen”
●
“I now claim that no existing or imaginable program will enable an electronic computer to determine that the
word pen in the given sentence within the given context has the second of the above meanings, whereas every
reader with a sufficient knowledge of English will do this "automatically."” (Bar-Hillel, 1960)
→ problem of commonsense reasoning
https://edwardbetts.com/monograph/Machine_translation_of_%22The_spirit_is_willing%2C_but_the_flesh_is_weak.%22_to_Russian_and_back
https://aclanthology.org/www.mt-archive.info/Bar-Hillel-1960-App3.pdf
Short Story: Languages Structure
●
Zellig Harris (1954), distributional structure of languages: consistency +
distributional relations between words and sentences
●
Paradigmatic relationships: football / tennis / rugby / volleyball – Monday / Tuesday / Wednesday...
●
Syntagmatic relationships: “John is playing football on every Monday” vs. “every football is John
Monday on playing” → is the syntax ok? is the meaning ok?
●
Noam Chomsky (1957), syntactic structure, generative grammar/formal languages
– “Colorless green ideas sleep furiously” → is the syntax ok? is the meaning ok?
●
Artificial Intelligence (1956, Dartmouth summer school): McCarthy, Minsky,
Newell, Simon → computers with language abilities
– first systems: BASEBALL (1961), SIR (1964), STUDENT (1964), ELIZA (1966, Meta-X
doctor)
●
Knowledge representation: Quillian (semantic networks)
– 1968-70: SHRDLU (Terry Winograd @MIT); first system that “understands”
https://hci.stanford.edu/winograd/shrdlu/
Short Story: NLP Methods
●
70’s: systems developed by Schank, Wilks
– Semantics is the most important.
– BUT how is it possible to give all the necessary knowledge?
●
80’s: progress in syntax
– unification grammars
– BUT how will it be possible to give all the rules?
●
2000-: deep learning, transformers, etc.
– What are the limits?
– How to learn?
– Manually annotated data?
●
2019: BERT, RoBERTa, FlauBERT / CamemBERT
– Tokenization into sub-units
●
2022: ChatGPT
Source: Dan Jurafsky (2012), Introduction to NLP
http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
Definitions
Natural Language Processing (NLP)
●
All computer processes for processing content expressed in
natural language
●
This content exists in different forms:
– written (typed or handwritten text)
– speech: audio files (radio, telephone) or audiovisual files (movies,
television)
– visual (audiovisual files, images): optical character recognition (OCR)
– signed (sign languages)
Natural Language Processing (NLP)
●
English: “language” (single term)
●
French, two terms:
– “langue” (system of vocal and graphic signs which makes
consensus in a community: French-speaking countries)
– “langage”
1) faculty to express a thought
2) system of graphic and vocal signs (= langue), programming language
3) system of any symbols, animal communication (dance of bees)
Natural Language Processing
Tasks
Example
●
2024-01-09: he developed dyspnea on exertion
when climbing stairs, assoc w/ 2-pillow
orthopnea and PND.
09-01-2024 : il a développé une dyspnée à l’effort
en montant les escaliers, associée à une orthopnée
mesurée à 2 oreillers et une dyspnée nocturne
paroxystique.
NLP Tasks
●
Tokenization: adding spaces around punctuation marks
2024-01-09: he developed dyspnea on exertion when
climbing stairs, assoc w/ 2-pillow orthopnea and PND.
↓
2024-01-09 : he developed dyspnea on exertion when
climbing stairs , assoc w/ 2 - pillow orthopnea and PND .
→ tokens / lexical units
Question: why not add a space inside the sequence “w/”?
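The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference tokenizer: the pattern and the two protected cases (ISO-style dates and the abbreviation “w/”) are assumptions chosen to reproduce the output shown on the slide.

```python
import re

# Assumed protected patterns: ISO-style dates and the abbreviation "w/".
# Everything else is a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|w/|\w+|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)


note = ("2024-01-09: he developed dyspnea on exertion when "
        "climbing stairs, assoc w/ 2-pillow orthopnea and PND.")
print(" ".join(tokenize(note)))
```

Without the two protected patterns, the date would be cut at each hyphen and “w/” would become “w /”, which hides the abbreviation from later processing steps.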
NLP Tasks
●
Tokenization of languages without spaces: splitting a
sequence of Chinese characters into meaningful words
电梯小
↓
1) 电梯小
2) 电梯 小 (elevator + small)
3) 电梯小
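One classical way to segment text written without spaces is greedy forward maximum matching against a word list. The sketch below assumes a tiny hypothetical lexicon; the function name and the lexicon are illustrative only, and real segmenters rely on much larger resources or statistical models.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching: at each position, take the longest
    lexicon entry that matches; fall back to a single character."""
    longest = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_lexicon = {"电梯", "小"}  # hypothetical lexicon: elevator, small
print(max_match("电梯小", toy_lexicon))
```

With an empty lexicon the function falls back to one token per character, which shows how strongly the result depends on the word list.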
NLP Tasks
●
Tokenization:
1) Until now in NLP: adding spaces around punctuation marks in order to
produce tokens that carry meaning
●
Aujourd’hui, cours d’introduction → Aujourd’hui + , + cours + d’ + introduction
●
Fairly easy: (i) a list of punctuation marks is easy to produce, but (ii) the context must be taken
into account (commas and dots behave differently depending on their usage), as must the word
itself (“aujourd’hui”)
2) Deep learning: splitting words into the statistically most frequent sub-units
in a corpus of texts (the sub-units need not carry meaning)
●
cordialement → _cord + iale + ment / remboursement → _rem + bour + s + ement
●
Several tokenization algorithms are currently used in deep learning approaches:
SentencePiece (used by the CamemBERT model), BPE (byte-pair encoding, originally a
compression algorithm, used by the FlauBERT model)
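To make the idea of "statistically most frequent sub-units" concrete, here is a toy byte-pair-encoding learner. It is a simplified sketch written for this lecture, not the SentencePiece or FlauBERT implementation: the function name, the tie-breaking rule and the toy word counts are all assumptions.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} dictionary."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical word frequencies, for illustration only.
merges, vocab = learn_bpe({"cordialement": 2, "remboursement": 3}, num_merges=5)
print(merges)
print(vocab)
```

The learned merges depend entirely on corpus frequencies, which is why the resulting sub-units (such as “bour” or “iale”) do not have to correspond to meaningful morphemes.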
NLP Tasks
●
Document splitting into sentences:
– M. Martin will defend his PhD entitled “Integration of NLP stuff in unsteady AI
models for low-level tasks” on January 13th, 2023, at 2.30pm The defense will
take place in room A.009, building 507. Link for live broadcast available on
demand
↓
– M. Martin will defend his PhD entitled “Integration of NLP stuff in unsteady AI
models for low-level tasks” on January 13th, 2023, at 2.30pm
The defense will take place in room A.009, building 507.
Link for live broadcast available on demand
Question: what observations can you make?
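A naive splitter helps to see where the difficulties come from. The rule used below (end of sentence = ".", "!" or "?" followed by whitespace and an uppercase letter) is an assumption made for illustration, not the method of any particular tool.

```python
import re

# Naive rule (assumed for illustration): a sentence ends at ".", "!" or "?"
# followed by whitespace and an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at every position matching the naive boundary rule."""
    return BOUNDARY.split(text)


announcement = (
    "M. Martin will defend his PhD entitled “Integration of NLP stuff in "
    "unsteady AI models for low-level tasks” on January 13th, 2023, at 2.30pm "
    "The defense will take place in room A.009, building 507. "
    "Link for live broadcast available on demand"
)
for sentence in split_sentences(announcement):
    print(repr(sentence))
```

On this text the rule cuts wrongly after the abbreviation “M.” and misses the boundary after “2.30pm”, where the final period is absent; only the boundary after “507.” is found.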
NLP Tasks
●
Abbreviations processing & acronyms expansion:
Three days ago , he developed dyspnea on exertion when
climbing stairs , assoc w/ 2 - pillow orthopnea and PND .
↓
Three days ago , he developed dyspnea on exertion when
climbing stairs , associated with 2 - pillow orthopnea and
paroxysmal nocturnal dyspnea .
→ text ready for lexical processing
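A dictionary lookup is the simplest way to sketch this step. The table below is hypothetical and covers only the three abbreviations of the example; a real system would need a curated, domain-specific resource and a strategy for ambiguous abbreviations.

```python
# Hypothetical lookup table, limited to the abbreviations of the example.
EXPANSIONS = {
    "assoc": "associated",
    "w/": "with",
    "PND": "paroxysmal nocturnal dyspnea",
}


def expand(tokens):
    """Replace each known abbreviation by its expansion (possibly several tokens)."""
    expanded = []
    for token in tokens:
        expanded.extend(EXPANSIONS.get(token, token).split())
    return expanded


tokens = "climbing stairs , assoc w/ 2 - pillow orthopnea and PND .".split()
print(" ".join(expand(tokens)))
```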
NLP Tasks
●
Lexical components - Lemmatization:
lemma = canonical form
– adjectives: masculine singular form
●
unfortunates→unfortunate / malheureuses→malheureux
– nouns: singular form
●
eyes→eye – mice→mouse / yeux→œil – souris→souris
– verbs: infinitive form
●
was→be / était→être
– Ambiguity? From surface forms to lemmas
●
Les souris dansent → le souris [N] danser
●
Tu souris souvent → tu sourire [Vb] souvent
●
Elle est naïve [Adj] → elle être naïf
●
Le mouvement naïf [N] → le mouvement naïf
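The ambiguity of “souris” suggests that a lemmatizer needs the part of speech as well as the surface form. The sketch below uses a hypothetical toy lexicon keyed by (form, POS); it only illustrates the lookup principle.

```python
# Hypothetical toy lexicon mapping (surface form, POS) to a lemma.
LEMMAS = {
    ("souris", "N"): "souris",
    ("souris", "Vb"): "sourire",
    ("dansent", "Vb"): "danser",
    ("naïve", "Adj"): "naïf",
    ("yeux", "N"): "œil",
}


def lemmatize(form, pos):
    """Return the lemma for (form, pos); unknown pairs give the lowercased form."""
    key = (form.lower(), pos)
    return LEMMAS.get(key, form.lower())


print(lemmatize("souris", "N"))   # noun reading
print(lemmatize("souris", "Vb"))  # verb reading
```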
NLP Tasks
●
Lexical components - Stemming:
stem = morphological subpart of a word that excludes
inflectional endings
– studying → stud / students → stud
– chantions → chant / chanteuses → chant
●
▲Stems are different from lemmas:
– studying → study / students → student
– chantions → chanter / chanteuses → chanteur
Question: do you have any idea of the usefulness of lemmas and stems?
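Stemming can be approximated by stripping suffixes without any dictionary. The suffix list below is a hypothetical one, chosen only to cover the four examples above; it is not the Porter or Snowball algorithm.

```python
# Hypothetical suffix list, longest first; not the Porter or Snowball algorithm.
SUFFIXES = ["euses", "ions", "ying", "ents"]


def stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


for word in ["studying", "students", "chantions", "chanteuses"]:
    print(word, "→", stem(word))
```

Unlike a lemma, the resulting stem (“stud”, “chant”) is not guaranteed to be an existing word.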
NLP Tasks
●
Lexical components – Part-of-Speech tagging:
morphological labels for each token
– basic POS schema: Adj / Adv / Conj / Name / Noun / Pro / Verb
●
[Verb Studying] NLP [Verb is] nice! They [Verb study] NLP this year.
●
They [Verb studied] NLP last year and it [Verb was] fun!
– detailed POS schema: VBD / VBG / VBP / VBZ
●
[VBG Studying] NLP [VBZ is] nice! They [VBP study] NLP this year.
●
They [VBD studied] NLP last year and it [VBD was] fun!
NLP Tasks
●
Three days ago , he developed dyspnea on exertion when climbing stairs ,
associated with 2 - pillow orthopnea and paroxysmal nocturnal dyspnea .
↓
Three/CD/three days/NN/day ago/RB/ago ,/PCT/, he/PRO/he
developed/VBD/develop dyspnea/NN/dyspnea on/IN/on
exertion/NN/exertion when/WRB/when climbing/VBG/climb
stairs/NN/stair ,/PCT/, associated/VBN/associate with/IN/with 2/CD/2
-/PCT/- pillow/NN/pillow orthopnea/NN/orthopnea and/CC/and
paroxysmal/ADJ/paroxysmal nocturnal/ADJ/nocturnal dyspnea/NN/dyspnea
./PCT/.
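The token/POS/lemma notation above is easy to read back into a program. The helper below is a small sketch (the function names are illustrative); it splits from the right so that a token containing a slash, such as “w/”, does not shift the fields.

```python
def parse_triple(item):
    """Split 'token/POS/lemma' from the right, so that a token containing '/'
    (e.g. 'w/') does not shift the fields."""
    token, pos, lemma = item.rsplit("/", 2)
    return token, pos, lemma


def parse_line(line):
    """Parse a whitespace-separated sequence of token/POS/lemma items."""
    return [parse_triple(item) for item in line.split()]


print(parse_line("Three/CD/three days/NN/day ago/RB/ago ,/PCT/, he/PRO/he"))
```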
NLP Tasks
●
Recognition of higher level components and their relations: syntactic
processing
NLP Tasks
●
Syntactic components: chunking / constituency
parsing, based on a previous POS tagging
– John eats an apple.
↓
– John/NNP eats/VBZ an/DT apple/NN ./.
↓
– [NP John/NNP] [VP eats/VBZ [NP an/DT apple/NN]] ./.
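As a rough sketch of how chunking can build on POS tags, the function below groups maximal runs of determiner, adjective and noun tags into flat noun phrases. The tag set is an assumption, and the output is flat: it does not produce the nested [VP ... [NP ...]] structure of the last line above, which requires a real parser.

```python
# Assumed set of Penn Treebank tags allowed inside a noun phrase.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def chunk_noun_phrases(tagged):
    """Group maximal runs of NP_TAGS into flat [NP ...] chunks."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag in NP_TAGS:
            current.append(f"{token}/{tag}")
            continue
        if current:
            chunks.append("[NP " + " ".join(current) + "]")
            current = []
        chunks.append(f"{token}/{tag}")
    if current:
        chunks.append("[NP " + " ".join(current) + "]")
    return " ".join(chunks)


tagged = [("John", "NNP"), ("eats", "VBZ"), ("an", "DT"), ("apple", "NN"), (".", ".")]
print(chunk_noun_phrases(tagged))
```

Because it only looks at tag runs, this sketch would wrongly merge two adjacent noun phrases; it is meant to show the dependency on POS tagging, not to be a usable chunker.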
NLP Tasks
●
Semantic analysis:
building the meaning representation of the statement
●
Further analysis, e.g. pragmatics:
how this statement may be related to the context in
which it is analyzed?
没有烧水壶 . (méiyǒu shāo shuǐhú). No boiler.
Is it a problem for European people? No.
Is it a problem for Asian people? Yes → impossible to prepare tea at any time!
Difficulties for NLP: Linguistics
●
Ambiguity: “la bonne brise la porte”
(1) the maid breaks the door
= a woman who is a domestic aid is breaking a door
(2) the good breeze carries her
= the wind allows a girl on a sailing boat to move
forward
Difficulties for NLP: Linguistics
●
Implicit discourse: what is not explicitly stated in a
statement and must be inferred by the listener or
reader
– Closed on Monday → open the other days
– This bag is too heavy → can you help me?
– Warn him about Mary → Mary could be a problem for someone
– 没有烧水壶 . (méiyǒu shāo shuǐhú). No boiler.
Difficulties for NLP: Linguistics
●
Lay vs. Domain-specific Terms:
– Three days ago, he developed dyspnea on exertion when
climbing stairs, assoc w/ 2-pillow orthopnea and PND.
●
Transliteration
– 香榭里舍大道 (xiāng xiè lǐ shè dàdào). Champs-Élysées
– 香榭里舍大街 (xiāng xiè lǐ shè dàjiē). Champs-Élysées
– 香榭利舍大街 (xiāng xiè lì shè dàjiē). Champs-Élysées
Question: what kinds of difficulties can you identify?
Difficulties for NLP: Informatics
●
L’arrivée en vainqueur du Maxi Edmond de Rothschild à Fort de France
●
L'arrivée en vainqueur du Maxi Edmond de Rothschild à Fort de France
●
L’arrivée en <rang>vainqueur</rang> du <categorie>Maxi</categorie> <equipage>Edmond de Rothschild</equipage> à <lieu>Fort de France</lieu>
●
0 L’ det O
2 arrivée subst O
10 en prep O
13 vainqueur subst B-rang
23 du prep O
26 Maxi subst B-categorie
31 Edmond nom B-equipage
38 de prep I-equipage
41 Rothschild nom I-equipage
52 à prep O
Questions: which format is used in each of these examples (ann, html, json, sgml, tsv, txt)?
What kinds of difficulties can you identify?
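Moving between these representations is a routine task. The sketch below converts the inline tagged form into token-level B-/I-/O labels. It assumes that tags are well formed, not nested, and separated from the surrounding text by whitespace; it also tokenizes on whitespace only, so “L’arrivée” stays a single token, unlike in the tabular example.

```python
import re

# Either a tagged span <label>...</label> or a plain whitespace-delimited token.
SPAN_OR_TOKEN = re.compile(r"<(\w+)>(.*?)</\1>|(\S+)")


def inline_to_bio(annotated):
    """Convert inline tags to a list of (token, BIO label) pairs."""
    rows = []
    for match in SPAN_OR_TOKEN.finditer(annotated):
        label, span, plain = match.groups()
        if label:
            for i, token in enumerate(span.split()):
                rows.append((token, ("B-" if i == 0 else "I-") + label))
        else:
            rows.append((plain, "O"))
    return rows


sentence = ("L’arrivée en <rang>vainqueur</rang> du <categorie>Maxi</categorie> "
            "<equipage>Edmond de Rothschild</equipage> à <lieu>Fort de France</lieu>")
for token, label in inline_to_bio(sentence):
    print(token, label, sep="\t")
```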
Difficulties for NLP: Usages
●
Exciting news! Our new blog post is all about the importance of using
#inclusivelanguage. We share tips on how to communicate respectfully and
avoid discrimination against #peoplewithdisabilities, and #gender-neutral
language. #diversity #inclusion https://dealersleague.com/inclusive-lang
https://twitter.com/DealersLeague/status/1612808718088904704
●
En ce #12janvier, 4 ans après le drame de la #ruedetrevise , les élus du
@GpeChangerParis adressent toutes leurs pensées aux victimes et aux
familles des victimes qui se reconstruisent jour après jour depuis cette
terrible explosion.
https://twitter.com/GpeChangerParis/status/1613469906543992833
Question: what kinds of difficulties can you identify?
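Social media posts mix ordinary words with platform-specific markers. The sketch below pulls out hashtags, mentions and URLs with simple regular expressions; the patterns are deliberately naive assumptions and do not follow any platform's official rules.

```python
import re

# Deliberately naive patterns, assumed for illustration.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def extract_markers(post):
    """Return the hashtags, mentions and URLs found in a post."""
    without_urls = URL.sub(" ", post)  # avoid matching '#' or '@' inside URLs
    return {
        "hashtags": HASHTAG.findall(without_urls),
        "mentions": MENTION.findall(without_urls),
        "urls": URL.findall(post),
    }


post = ("En ce #12janvier, 4 ans après le drame de la #ruedetrevise , "
        "les élus du @GpeChangerParis adressent toutes leurs pensées aux victimes")
print(extract_markers(post))
```

Applied to the first post, the hashtag pattern would keep only “#gender” from “#gender-neutral”, and hashtags such as “#peoplewithdisabilities” still have to be split into words before any lexical processing.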
Difficulties for NLP: Usages
●
Out-of-Vocabulary (OOV) words, Unknown
words, Typos, New words, Code switching...
– I can’t use my eazly acoount sincea few dayz.
→ phenomena illustrated in this example: trademark, typos, grammar errors (inadvertent errors), intentional error
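Detecting out-of-vocabulary tokens amounts to checking each token against a known vocabulary. The vocabulary below is hypothetical and tiny; in practice it would come from a lexicon or from the training corpus of a model.

```python
# Hypothetical vocabulary, for illustration only.
VOCABULARY = {"i", "can’t", "use", "my", "account", "since", "a", "few", "days", "."}


def out_of_vocabulary(tokens, vocabulary=VOCABULARY):
    """Return the tokens whose lowercased form is not in the vocabulary."""
    return [token for token in tokens if token.lower() not in vocabulary]


tokens = "I can’t use my eazly acoount sincea few dayz .".split()
print(out_of_vocabulary(tokens))
```

The check flags all four unknown forms in the same way; deciding whether each one is a trademark, a typo or an intentional spelling requires further analysis.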
NLP is always evolving
●
Societal evolution:
– morphological creation:
●
new terms (bader, chiller, crush, écoanxiété, flexoffice, gênance, ghoster,
mégenrer, nareux, parkour, wokisme, woke-washing)
●
new expressions (yellow vests, protective measure)
– social networks: forums (UseNet, websites), blogs (Facebook),
micro-blogs (Instagram, LinkedIn, TikTok, Tumblr, Twitter)
→ specific syntax, spelling mistakes, use of emojis, inclusive writing
– mode of consumption: desktop computer (in decline), tablets and
smartphones
NLP is always evolving
●
Changing needs:
– mainstream uses:
●
speech recognition and voice synthesis software
●
spelling correction
●
grammar correction
●
automatic completion (SMS)
●
automatic translation
– professional uses:
●
automatic transcription (subtitling)
●
automatic summarization
●
automatic composition (press releases)
●
writing assistance (clinical reports)
●
knowledge discovery in texts (text mining)
NLP is always evolving
●
All these developments allow:
1) more elaborate systems (deep learning algorithms, word
embeddings)
2) more natural results (speech synthesis)
3) real-time operation (translation)
●
They require:
1) taking into account the creativity of language
2) applications that can run remotely or on mobile devices
Exercise
https://perso.limsi.fr/grouin/m2aic/exercice.pdf
Instructions
●
Work together in small groups (up to 3 students per group)
●
Prepare a final report composed of a header and three sections:
– Header: first and last names and e-mail of all members of the group
– First part: manual analysis of sentences
– Second part: automatic analysis
– Third part: observations
●
Send your report (PDF) to [email protected] (before
January 19th)
Morpho-syntactic analysis
●
Level 1: choose one sentence, either French or English
– Jean mange une pomme. / John eats an apple.
●
Level 2: analyze both French and English sentences
– Vous prétendez que vous connaissez le mobile du crime. / You claim to know the
motive for the crime.
– Elle voit le chat qui saute de la fenêtre. → where is the cat? / She sees the cat
jumping out of the window.
●
Level 3: choose one sentence, either French or English
– Je me suis fait ghoster par mon crush, depuis je bade. / I have been ghosted by
my crush, then I am bading. → do NLP tools succeed in identifying POS labels?
Question 1
Manually tag the previous sentences using:
●
part-of-speech tags:
adj / adv / conj / det / name (proper) / noun (substantive) / prep / pro / vb / punct
●
relations between words:
det / obj (object) / sub (subject) / attribute / N2 (for completive and noun complement) / N2app (apposition)
●
relationships between punctuation marks and words are not really useful
Question 1
●
Expected output: for each sentence
– One output for POS tagging: word/tag
●
The/det maid/noun breaks/vb the/det door/noun ./punct
– One output for relations: first word-(relation)-second word
●
The-(det)-maid
●
the-(det)-door
●
maid-(sub)-breaks
●
breaks-(obj)-door
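If you prefer to type your manual analysis as data and generate the two outputs, a small helper is enough. The sketch below is only a convenience; the function names and the tuple layout (first word, relation, second word) are assumptions based on the expected format above.

```python
def format_pos(tagged):
    """Render [(word, tag), ...] in the word/tag notation."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


def format_relations(relations):
    """Render [(first, relation, second), ...] as first-(relation)-second lines."""
    return [f"{first}-({relation})-{second}" for first, relation, second in relations]


tagged = [("The", "det"), ("maid", "noun"), ("breaks", "vb"),
          ("the", "det"), ("door", "noun"), (".", "punct")]
relations = [("The", "det", "maid"), ("the", "det", "door"),
             ("maid", "sub", "breaks"), ("breaks", "obj", "door")]

print(format_pos(tagged))
print("\n".join(format_relations(relations)))
```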
Question 2
●
Automatically tag the same sentences using
the online Stanford Parser tool:
https://corenlp.run/
●
Output:
– Provide the same kind of outputs as those from the
manual analysis
– Or integrate (nice) screenshots in your report
Discussion and Conclusions
●
Key points (max. 1 page):
– Are the automatic results similar to those you
manually produced?
– What differences do you observe (tags,
relations, inconsistency, other remarks)?
– What would you propose to evaluate the
results?