
Text Mining and Chatbots

Introduction and Syntax

Cyril GROUIN - LISN (CNRS, Université Paris-Saclay)

https://ecampus.paris-saclay.fr/enrol/index.php?id=115778#section-13

January 12th, 2024


Schedule

January 12th: Introduction and Syntax

January 19th: Corpus Annotation (Aurélie Névéol)

January 26th: Text Mining in Open Domain and Medical Domain
(Aurélie Névéol)

February 2nd: Semantics and Word Embeddings (Sahar Ghannay)

February 9th: Chatbots and Evaluation (Thomas Gerald)

February 16th: Chatbots and Evaluation (Thomas Gerald)

March 1st: Defenses (room D101)

2
Introduction
Short Story: Automatic Translation

1954: first machine translation system (Russian→English), historical context (cold war)

1952: first conference on machine translation at MIT, organized by Yehoshua Bar-Hillel
– Original (EN): “The spirit is willing but the flesh is weak” EN→RU→EN
– Intermediate (RU): Russian language
– Output (EN): “The whiskey is strong but the meat is rotten”
– “pen”: (i) enclosed area, penitentiary, (ii) tool containing ink used to write: “The box is in the pen”

I now claim that no existing or imaginable program will enable an electronic computer to determine that the
word pen in the given sentence within the given context has the second of the above meanings, whereas every
reader with a sufficient knowledge of English will do this "automatically." = problem of commonsense reasoning

Conclusion from Bar-Hillel: “translation is not possible”



ALPAC report (1966): translation too expensive, no results → no more funding!

4
https://edwardbetts.com/monograph/Machine_translation_of_%22The_spirit_is_willing%2C_but_the_flesh_is_weak.%22_to_Russian_and_back
https://aclanthology.org/www.mt-archive.info/Bar-Hillel-1960-App3.pdf
Short Story: Languages Structure

Zellig Harris (1954), distribution structure of languages: consistency +
distributional relations between words and sentences

Paradigmatic relationships: football / tennis / rugby / volleyball – Monday / Tuesday / Wednesday...

Syntagmatic relationships: “John is playing football on every Monday” vs. “every football is John
Monday on playing” → is the syntax ok? is the meaning ok?

Noam Chomsky (1957), syntactic structure, generative grammar/formal languages
– “Colorless green ideas sleep furiously” → is the syntax ok? is the meaning ok?

Artificial Intelligence (1956, Dartmouth summer school): McCarthy, Minsky,
Newell, Simon → computers with language abilities
– first systems: BASEBALL (1961), SIR (1964), STUDENT (1964), ELIZA (1966, Meta-X
doctor)

Knowledge representation: Quillian (semantic networks)
– 1968-70: SHRDLU (Terry Winograd @MIT); first system that “understands”
5
https://hci.stanford.edu/winograd/shrdlu/
Short Story: NLP Methods

70’s: systems developed by Schank, Wilks
– Semantics is the most important.
– BUT how is it possible to give all the necessary knowledge?

80’s: progress in syntax
– unification grammars
– BUT how will it be possible to give all the rules?

2000-: deep learning, transformers, etc.
– Which limits? How to learn? Manually annotated data?

2019: BERT, RoBERTa, FlauBERT / CamemBERT
– Tokenization into sub-units

2022: ChatGPT

6
Source: Dan Jurafsky (2012), Introduction to NLP
https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

7
Still really hard? ChatGPT


Natural Language Processing

Definitions
Natural Language Processing (NLP)

All computational processes for handling content expressed in
natural language

This content exists in different forms:
– written (typed or handwritten text)
– speech: audio files (radio, telephone) or audiovisual files (movies,
television)
– visual (audiovisual files, images): optical character recognition (OCR)
– signed (sign languages)

9
Natural Language Processing (NLP)

English: “language” (single term)

French, two terms:
– “langue” (system of vocal and graphic signs which makes
consensus in a community: French-speaking countries)
– “langage”
1) faculty to express a thought
2) system of graphic and vocal signs (= langue), programming language
3) system of any symbols, animal communication (dance of bees)

10
Natural Language Processing

Tasks
Example

2024-01-09: he developed dyspnea on exertion
when climbing stairs, assoc w/ 2-pillow
orthopnea and PND.
09-01-2024 : il a développé une dyspnée à l’effort
en montant les escaliers, associée à une orthopnée
mesurée à 2 oreillers et une dyspnée nocturne
paroxystique.

12
NLP Tasks

Tokenization: adding spaces around punctuation marks
2024-01-09: he developed dyspnea on exertion when
climbing stairs, assoc w/ 2-pillow orthopnea and PND.

2024-01-09 : he developed dyspnea on exertion when
climbing stairs , assoc w/ 2 - pillow orthopnea and PND .
→ tokens / lexical units
Question: why not add a space inside the sequence “w/”?
13
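As a rough illustration of the rule above (add spaces around punctuation marks, with a protected list for sequences whose punctuation carries meaning, like “w/”), here is a minimal sketch. The `PROTECTED` set and the punctuation list are illustrative assumptions, not the course’s actual tool.

```python
import re

# Punctuation-splitting tokenizer sketch. The PROTECTED set is an
# illustrative assumption: sequences whose punctuation carries meaning
# and must not be split apart.
PROTECTED = {"w/"}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in PROTECTED:
            tokens.append(chunk)
        else:
            # add spaces around punctuation marks, then split again
            tokens.extend(re.sub(r"([.,:;!?()\-])", r" \1 ", chunk).split())
    return tokens

print(tokenize("2024-01-09: assoc w/ 2-pillow orthopnea and PND."))
```

Note that “w/” survives as a single lexical unit while “2-pillow” is split into three tokens, reproducing the slide’s output.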
NLP Tasks

Tokenization of languages without spaces: splitting a
sequence of Chinese characters into meaningful words
电梯小

1) 电梯小
2) 电梯 小
3) 电梯小
→ elevator + small
14
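One classical baseline for segmenting a spaceless script is forward maximum matching: at each position, take the longest dictionary word. The tiny vocabulary below is a toy assumption for the slide’s example only.

```python
# Forward maximum-matching segmentation sketch for a language written
# without spaces. The tiny vocabulary is a toy assumption.
VOCAB = {"电梯", "电", "梯", "小"}  # elevator, electricity, ladder, small

def segment(text, max_len=4):
    words, i = [], 0
    while i < len(text):
        # try the longest dictionary word starting at position i
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in VOCAB:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character kept as a token
            i += 1
    return words

print(segment("电梯小"))  # → ['电梯', '小'] (elevator + small)
```

Real segmenters combine dictionaries with statistical or neural models, since greedy matching fails on genuinely ambiguous strings.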
NLP Tasks

Tokenization:
1) Until now in NLP: adding spaces around punctuation marks in order to
produce tokens = meaning

Aujourd’hui, cours d’introduction → Aujourd’hui + , + cours + d’ + introduction

Quite easy: (i) easy to produce a list of punctuation marks, but (ii) implies to take into
account the context (namely, comma and dot depending on their usage) or the word
itself (“aujourd’hui”)
2) Deep Learning: splitting words into the statistically most frequent sub-units
in a corpus of texts (no meaning)

cordialement → _cord + iale + ment / remboursement → _rem + bour + s + ement

Several tokenization algorithms currently used in deep learning approaches:
SentencePiece (used by the CamemBERT model), BPE (compression algorithm,
used by the FlauBERT model)
15
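The BPE idea mentioned above can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent pair of symbols. This is a toy reimplementation for illustration, not the SentencePiece or FlauBERT code, and the training words are an assumption.

```python
from collections import Counter

# Toy byte-pair-encoding (BPE) sketch: learn merges by repeatedly
# fusing the most frequent adjacent symbol pair in the corpus.
def learn_bpe(words, num_merges):
    corpus = [list(w) for w in words]  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym in corpus:
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for sym in corpus:           # apply the merge everywhere
            i = 0
            while i < len(sym) - 1:
                if sym[i] == a and sym[i + 1] == b:
                    sym[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

print(learn_bpe(["low", "lower", "lowest"], 2))
```

The learned sub-units have no linguistic meaning by design; they are just the statistically most frequent fragments, which is exactly the point made on the slide.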
NLP Tasks

Document splitting into sentences:
– M. Martin will defend his PhD entitled “Integration of NLP stuff in unsteady AI
models for low-level tasks” on January 13th, 2023, at 2.30pm The defense will
take place in room A.009, building 507. Link for live broadcast available on
demand

– M. Martin will defend his PhD entitled “Integration of NLP stuff in unsteady AI
models for low-level tasks” on January 13th, 2023, at 2.30pm
The defense will take place in room A.009, building 507.
Link for live broadcast available on demand
Question: what observations can you make?
16
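A naive splitter illustrates why this slide’s example is hard: splitting after “.”, “!” or “?” followed by an uppercase word breaks on abbreviations like “M.” and cannot recover a missing period after “2.30pm”. The abbreviation list below is an illustrative assumption.

```python
import re

# Naive sentence splitter sketch: split after ., ! or ? when followed
# by whitespace and an uppercase letter. Abbreviations such as "M."
# are exactly the cases that break this rule.
ABBREVS = {"M.", "Dr.", "Prof."}  # illustrative list, not exhaustive

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = m.end()
        last_word = text[:end].rsplit(None, 1)[-1]
        if last_word in ABBREVS:
            continue  # sentence-internal period, do not split here
        sentences.append(text[start:end].strip())
        start = end
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("M. Martin will defend his PhD. The defense is in room A.009."))
```

Note that “A.009” is left intact only because no whitespace follows the internal periods; a sentence with a missing final period (as on the slide) defeats this heuristic entirely.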
NLP Tasks

Abbreviation processing & acronym expansion:
Three days ago , he developed dyspnea on exertion when
climbing stairs , assoc w/ 2 - pillow orthopnea and PND .

Three days ago , he developed dyspnea on exertion when
climbing stairs , associated with 2 - pillow orthopnea and
paroxysmal nocturnal dyspnea .
→ text ready for lexical processing

17
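The simplest form of this step is dictionary lookup over tokens. The mapping below is an illustrative assumption; real clinical systems use curated, domain-specific lexicons and must disambiguate (PND can also mean “post-nasal drip”).

```python
# Dictionary-based abbreviation/acronym expansion sketch.
# The EXPANSIONS mapping is an illustrative assumption.
EXPANSIONS = {
    "assoc": "associated",
    "w/": "with",
    "PND": "paroxysmal nocturnal dyspnea",
}

def expand(tokens):
    out = []
    for tok in tokens:
        # an expansion may be several words long, hence the split()
        out.extend(EXPANSIONS.get(tok, tok).split())
    return out

tokens = "he developed dyspnea , assoc w/ 2 - pillow orthopnea and PND .".split()
print(" ".join(expand(tokens)))
```

Running on the tokenized sentence yields the expanded text shown on the slide.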
NLP Tasks

Lexical components - Lemmatization:
lemma = canonical form
– adjectives: masculine singular form

unfortunates→unfortunate / malheureuses→malheureux
– nouns: singular form

eyes→eye – mice→mouse / yeux→œil – souris→souris
– verbs: infinitive form

was→be / était→être
– Ambiguity? From surface forms to lemmas
Les souris dansent → le souris [N] danser
Tu souris souvent → tu sourire [Vb] souvent
Elle est naïve [Adj] → elle être naïf
Le mouvement naïf [N] → le mouvement naïf
18
NLP Tasks

Lexical components - Stemming:
stem = morphological subpart of a word that excludes
inflectional endings
– studying → stud / students → stud
– chantions → chant / chanteuses → chant

▲Stems are different from lemmas:
– studying → study / students → student
– chantions → chanter / chanteuses → chanteur
Question: do you have any idea of the usefulness of lemmas and stems?
19
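The contrast can be sketched with a drastically simplified suffix stripper (in the spirit of Porter’s stemmer, but not that algorithm) next to a lexicon-based lemmatizer. The suffix list and lemma table are toy assumptions chosen to reproduce the slide’s English examples.

```python
# Suffix-stripping stemmer sketch: remove frequent English endings,
# longest first. The SUFFIXES list is a toy assumption.
SUFFIXES = ["ying", "ents", "ing", "ent", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# A lemmatizer, by contrast, needs a lexicon for irregular forms.
LEMMAS = {"mice": "mouse", "was": "be", "eyes": "eye"}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(stem("studying"), stem("students"))   # both map to the stem "stud"
print(lemmatize("mice"), lemmatize("was"))  # lemmas: "mouse", "be"
```

The stemmer conflates “studying” and “students” onto a non-word stem, while the lemmatizer returns real canonical forms; this is why stems suit indexing/retrieval and lemmas suit linguistic analysis.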
NLP Tasks

Lexical components – Part-of-Speech tagging:
morphological labels for each token
– basic POS schema: Adj / Adv / Conj / Name / Noun / Pro / Verb

[Verb Studying] NLP [Verb is] nice! They [Verb study] NLP this year.

They [Verb studied] NLP last year and it [Verb was] fun!
– detailed POS schema: VBD / VBG / VBP / VBZ

[VBG Studying] NLP [VBZ is] nice! They [VBP study] NLP this year.

They [VBD studied] NLP last year and it [VBD was] fun!

20
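A minimal tagger combines a small lexicon with suffix heuristics for the detailed verb tags (VBG for -ing forms, VBD for -ed forms). The lexicon and fallback rules below are toy assumptions; real taggers use sentence context, not isolated word shapes.

```python
# Lookup-plus-heuristics POS tagger sketch. LEXICON and the suffix
# rules are toy assumptions for illustration only.
LEXICON = {"is": "VBZ", "was": "VBD", "they": "PRP", "nlp": "NNP"}

def tag(token):
    low = token.lower()
    if low in LEXICON:
        return LEXICON[low]
    if low.endswith("ing"):
        return "VBG"   # gerund / present participle: "studying"
    if low.endswith("ed"):
        return "VBD"   # past tense: "studied"
    return "NN"        # default: common noun

print([(w, tag(w)) for w in "Studying NLP is nice".split()])
```

The default tag is wrong for “nice” (an adjective), which illustrates why context-free heuristics are only a baseline.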
NLP Tasks

Three days ago , he developed dyspnea on exertion when climbing stairs ,
associated with 2 - pillow orthopnea and paroxysmal nocturnal dyspnea .

Three/CD/three days/NN/day ago/RB/ago ,/PCT/, he/PRO/he
developed/VBD/develop dyspnea/NN/dyspnea on/IN/on
exertion/NN/exertion when/WRB/when climbing/VBG/climb
stairs/NN/stair ,/PCT/, associated/VBN/associate with/IN/with 2/CD/2
-/PCT/- pillow/NN/pillow orthopnea/NN/orthopnea and/CC/and
paroxismal/ADJ/paroxismal nocturnal/ADJ/nocturnal dyspnea/NN/dyspnea
./PCT/.

21
NLP Tasks

Recognition of higher-level components and their relations: syntactic
processing

22
NLP Tasks

Syntactic components: chunking / constituency
parsing, based on a previous POS tagging
– John eats an apple.

– John/NNP eats/VBZ an/DT apple/NN ./.

– [NP John/NNP] [VP eats/VBZ [NP an/DT apple/NN]] ./.

23
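Chunking on top of POS tags can be sketched with a tiny state machine: here an NP is an optional determiner (DT) followed by one or more nouns (NN*). This is a simplification of real chunkers, which use richer patterns or learned models.

```python
# Tiny NP chunker sketch over POS-tagged tokens: an NP is an optional
# determiner (DT) followed by one or more nouns (tags starting with NN).
def chunk_np(tagged):
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag == "DT" and not current:
            current.append(word)
        elif tag.startswith("NN"):
            current.append(word)
            has_noun = True
        else:
            if has_noun:          # flush a completed noun phrase
                chunks.append(current)
            current, has_noun = [], False
    if has_noun:
        chunks.append(current)
    return chunks

tagged = [("John", "NNP"), ("eats", "VBZ"), ("an", "DT"), ("apple", "NN"), (".", ".")]
print(chunk_np(tagged))  # → [['John'], ['an', 'apple']]
```

This recovers the two NPs of the slide’s bracketing, [NP John] and [NP an apple]; the VP structure would need a second, recursive pass.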
NLP Tasks

Semantic analysis:
building the meaning representation of the statement

Further analysis, e.g. pragmatics:
how may this statement be related to the context in
which it is analyzed?
没有烧水壶 . (méiyōu shāo shuīhú). No boiler.
Is it a problem for European people? No.
Is it a problem for Asian people? Yes → impossible to prepare tea at any time!
24
Difficulties for NLP: Linguistics

Ambiguity: “la bonne brise la porte”
(1) the maid breaks the door
= a woman who is a domestic aid is breaking a door
(2) the good breeze carries her
= the wind allows a girl on a sailing boat to move
forward

Question: how can you explain the differences of meaning?


25
Difficulties for NLP: Linguistics

Ambiguity: “la bonne brise la porte”
(1) the maid breaks the door
la/DET bonne/NOUN brise/VB la/DET porte/NOUN
(2) the good breeze carries her
la/DET bonne/ADJ brise/NOUN la/PRO porte/VB

26
Difficulties for NLP: Linguistics

Implicit discourse: what is not clearly stated in a
statement and must be understood by the other
person on their own
– Closed on Monday → open the other days
– This bag is too heavy → can you help me?
– Warn him about Mary → Mary could be a problem for someone
– 没有烧水壶 . (méiyōu shāo shuīhú). No boiler.

27
Difficulties for NLP: Linguistics

Lay vs. Domain-specific Terms:
– Three days ago, he developed dyspnea on exertion when
climbing stairs, assoc w/ 2-pillow orthopnea and PND.

Transliteration
– 香榭里舍大道 (xiāng xiè lī shè dàdào). Champs-Élysées
– 香榭里舍大街 (xiāng xiè lī shè dàjiē). Champs-Élysées
– 香榭利舍大街 (xiāng xiè lì shè dàjiē). Champs-Élysées
Question: which kind of difficulties can you identify?
28
Difficulties for NLP: Informatics

L’arrivée en vainqueur du Maxi Edmond de Rothschild à Fort de France

L’arrivée en <rang>vainqueur</rang> du <categorie>Maxi</categorie>
<equipage>Edmond de Rothschild</equipage> à <lieu>Fort de France</lieu>

0	L’	det	O
2	arrivée	subst	O
10	en	prep	O
13	vainqueur	subst	B-rang
23	du	prep	O
26	Maxi	subst	B-categorie
31	Edmond	nom	B-equipage
38	de	prep	I-equipage
41	Rothschild	nom	I-equipage
52	à	prep	O

Questions: which form is used in those examples (ann, html, json, sgml, tsv, txt)?
Which kind of difficulties can you identify?
29
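The inline and column representations on this slide are interconvertible. A sketch of the inline-to-BIO direction (tag names taken from the slide; character offsets omitted for brevity):

```python
import re

# Convert inline SGML-style annotation into (token, BIO-tag) rows.
# Tag names (rang, categorie, equipage, lieu) come from the slide.
def sgml_to_bio(text):
    rows, label = [], None
    for part in re.split(r"(</?\w+>)", text):
        m = re.fullmatch(r"<(/?)(\w+)>", part)
        if m:
            # an opening tag starts an entity, a closing tag ends it
            label = None if m.group(1) else m.group(2)
            continue
        first = True
        for tok in part.split():
            if label is None:
                rows.append((tok, "O"))
            else:
                rows.append((tok, ("B-" if first else "I-") + label))
                first = False
    return rows

text = "L’arrivée en <rang>vainqueur</rang> du <categorie>Maxi</categorie>"
for tok, tag in sgml_to_bio(text):
    print(tok, tag, sep="\t")
```

Multi-word entities get B- on the first token and I- on the rest, matching the B-equipage / I-equipage rows in the column format.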
Difficulties for NLP: Usages

Exciting news! Our new blog post is all about the importance of using
#inclusivelanguage. We share tips on how to communicate respectfully and
avoid discrimination against #peoplewithdisabilities, and #gender-neutral
language. #diversity #inclusion https://dealersleague.com/inclusive-lang
https://twitter.com/DealersLeague/status/1612808718088904704

En ce #12janvier, 4 ans après le drame de la #ruedetrevise , les élus du
@GpeChangerParis adressent toutes leurs pensées aux victimes et aux
familles des victimes qui se reconstruisent jour après jour depuis cette
terrible explosion.
https://twitter.com/GpeChangerParis/status/1613469906543992833
Question: which kind of difficulties can you identify?
30
Difficulties for NLP: Usages

Out-of-Vocabulary (OOV) words, Unknown
words, Typos, New words, Code switching...
– I can’t use my eazly acoount sincea few dayz.
→ trademark, typos, grammar errors (inadvertent errors), intentional error
– She seen the huges problems: c’est la vie!
→ code switching (intentional language change)
31
NLP is useful

Spelling/Grammar Correction: LibreOffice, OpenOffice, Word

Information Retrieval: basic query (Google) vs. question-answering systems

Machine translation: Google translate, DeepL…

Text-to-Speech and Speech-to-Text systems

Text Simplification/Summarization

Text Generation: generative AI (ChatGPT)

Conversational Agents (aka chatbots)

Recommendations: hashtag proposal, personalized advertising (based on
your history of queries)

Optical Character Recognition (OCR): in order to refine outputs
32
NLP is always evolving

Technological developments:
– computer storage capacities

internal storage: hard drives (5 MB in 1956 up to 10 TB in 2010),
Flash memory

external storage: punch cards, cassettes, floppy disks (256 KB for the
first generation (1971) up to 1.47 MB for the last generation (2011)),
CD-ROM (0.74 GB), USB keys, cloud
– computing capacities
– speed of connection to servers

33
NLP is always evolving

Societal evolution:
– morphological creation:

new terms (bader, chiller, crush, écoanxiété, flexoffice, gênance, ghoster,
mégenrer, nareux, parkour, wokisme, woke-washing)

new expressions (yellow vests, protective measure)
– social networks: forums (Usenet, websites), blogs (Facebook), micro-blogs
(Instagram, LinkedIn, TikTok, Tumblr, Twitter)
→ specific syntax, spelling mistakes, use of emojis, inclusive writing
– mode of consumption: desktop computer (in decline), tablets and
smartphones

34
NLP is always evolving

Changing needs:
– mainstream uses: speech recognition and voice synthesis software,
spelling correction, grammar correction, automatic completion (SMS),
automatic translation
– professional uses: automatic transcription (subtitling), automatic
summarization, automatic composition (press releases), writing assistance
(clinical reports), knowledge discovery in texts (text mining)

35
NLP is always evolving

All these developments allow:
1) more elaborate systems (deep learning algorithms, word embeddings)
2) more natural results (speech synthesis)
3) real-time operation (translation)

They require:
1) taking into account the creativity of language
2) applications that can be deployed remotely or used on mobile devices

36
Exercise

https://perso.limsi.fr/grouin/m2aic/exercice.pdf
Instructions

Work together in small groups (up to 3 students per group)

Prepare a final report composed of a header and three sections:
– Header: first and last names and e-mail of all members of the group
– First part: manual analysis of sentences
– Second part: automatic analysis
– Third part: observations

Send your report (PDF) to [email protected] (before
January 19th)

38
Morpho-syntactic analysis

Level 1: choose one sentence, either French or English
– Jean mange une pomme. / John eats an apple.

Level 2: analyze both French and English sentences
– Vous prétendez que vous connaissez le mobile du crime. / You claim to know the
motive for the crime.
– Elle voit le chat qui saute de la fenêtre. → where is the cat? / She sees the cat
jumping out of the window.

Level 3: choose one sentence, either French or English
– Je me suis fait ghoster par mon crush, depuis je bade. / I have been ghosted by
my crush, then I am bading. → how do NLP tools manage to identify POS labels?

39
Question 1

Manually tag the previous sentences using:

part-of-speech tags:
adj / adv / conj / det / name (proper) / noun (substantive) / prep / pro / vb / punct

relations between words:
obj (object) / sub (subject) / attribute / N2 (for completive and noun complement) /
N2app (apposition)

Relationships between punctuation marks and words are not really useful.
40
Question 1

Expected output: for each sentence
– One output for POS tagging: word/tag

The/det maid/noun breaks/vb the/det door/noun ./punct
– One output for relations: first word-(relation)-second word

The-(det)-maid

the-(det)-door

maid-(sub)-breaks

breaks-(obj)-door

41
Question 2

Automatically tag the same sentences using
the online Stanford Parser tool:
https://corenlp.run/

Output:
– Provide the same kind of outputs as those from the
manual analysis
– Or integrate (nice) screenshots in your report

42
Discussion and Conclusions

Key points (max. 1 page):
– Are the automatic results similar to those you
manually produced?
– Which differences do you observe (tags,
relations, inconsistencies, other remarks)?
– What proposals can you make to evaluate the
results?

43
