
Finding the Structure of Documents

Introduction
• In human language, words and sentences do not appear randomly but
have structure.
• For example, combinations of words form sentences: meaningful grammatical units, such as statements, requests, and commands.
• Automatic extraction of structure of documents helps subsequent NLP
tasks: for example, parsing, machine translation, and semantic role
labelling use sentences as the basic processing unit.
• Sentence boundary annotation (labelling) is also important for aiding human readability of the output of automatic speech recognition (ASR) systems.
• The task of deciding where sentences start and end, given a sequence of characters (made of words and typographical cues), is called sentence boundary detection.
• Topic segmentation is the task of determining when a topic starts and ends in a sequence of sentences.
• Statistical classification approaches are used for segmentation: they try to detect the presence of sentence and topic boundaries given human-annotated training data.
• These methods base their predictions on features of the input: local characteristics that give evidence toward the presence or absence of a sentence boundary, such as a period (.), a question mark (?), an exclamation mark (!), or another type of punctuation.
• Features are the core of classification approaches and require careful design and selection in order to be successful and to prevent overfitting and noise problems.
• Most statistical approaches described here are language independent, but every language poses its own challenges.
• For example, for processing Chinese documents, the processor may need to first segment the character sequences into words, as words are usually not separated by spaces.
• Similarly, for morphologically rich languages, the word structure may need to be analyzed to extract additional features.
• Such processing is usually done in a pre-processing step, where a sequence of
tokens is determined.
• Tokens can be word or sub-word units, depending on the task and language.
• The segmentation algorithms are then applied to these tokens, as in the sketch below.
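A minimal pre-processing sketch in Python (a hypothetical regex tokenizer, not any specific library's; it assumes a space-delimited language such as English):

```python
import re

def tokenize(text):
    # Keep each word together and make every punctuation mark its own token.
    # Assumes a space-delimited language such as English; languages written
    # without spaces (e.g., Chinese) need a dedicated word segmenter instead.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I spoke with Dr. Smith."))
# ['I', 'spoke', 'with', 'Dr', '.', 'Smith', '.']
```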
Sentence Boundary Detection
• Sentence boundary detection (also called sentence segmentation) deals with automatically segmenting a sequence of word tokens into sentence units.
• In written text in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the end of a sentence is explicitly marked with a period (.), a question mark (?), an exclamation mark (!), or another type of punctuation.
• In addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns, periods are used in abbreviations, and numbers and punctuation marks are used inside proper names.
• The period at the end of an abbreviation can mark a sentence boundary at the same time.
• Example: I spoke with Dr. Smith. and My house is on Mountain Dr.
• In the first sentence, the abbreviation Dr. does not end the sentence; in the second, it does.
• Quoted sentences are especially problematic, as the speaker may have uttered multiple sentences, and the sentence boundaries inside the quotes are also marked with punctuation marks.
• An automatic method that simply marks a sentence end wherever such punctuation occurs would therefore cut some sentences incorrectly, as the sketch below illustrates.
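For illustration, a deliberately naive splitter (hypothetical toy code) that cuts at every period shows the failure on the Dr. Smith example:

```python
import re

def naive_split(text):
    # Deliberately naive: cut after every '.', '?' or '!'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(naive_split("I spoke with Dr. Smith. My house is on Mountain Dr."))
# ['I spoke with Dr.', 'Smith.', 'My house is on Mountain Dr.']
# The abbreviation "Dr." is wrongly treated as a sentence end.
```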
• Ambiguous abbreviations and capitalizations are not the only problems for sentence segmentation in written text.
• Spontaneously written texts, such as short message service (SMS) texts or
instant messaging(IM) texts, tend to be nongrammatical and have poorly
used or missing punctuation, which makes sentence segmentation even
more challenging.
• Similarly, if the text to be segmented into sentences comes from an automatic system, such as optical character recognition (OCR) or ASR, that aims to convert images of handwritten, typewritten, or printed text, or spoken utterances, into machine-editable text, sentence boundary detection must deal with the errors of those systems as well.
• On the other hand, for conversational speech or text, such as multiparty meetings with ungrammatical sentences and disfluencies, in most cases it is not clear where the boundaries are.
• Code switching, that is, the use of words, phrases, or sentences from multiple languages by multilingual speakers, is another problem that can affect the characteristics of sentences.
• For example, when switching to a different language, the writer can either keep the
punctuation rules from the first language or resort to the code of the second
language.
• Conventional rule-based sentence segmentation systems in well-formed texts rely
on patterns to identify potential ends of sentences and lists of abbreviations for
disambiguating them.
• For example, if the word before the boundary is a known abbreviation, such as "Mr." or "Gov.," the text is not segmented at that position, even though some such periods do also end a sentence; a minimal sketch of this approach follows.
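A minimal sketch of such a rule-based approach, with a small hypothetical abbreviation list (real systems use much larger, curated lists):

```python
# A small, hypothetical abbreviation list; real systems use much larger,
# curated lists.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "gov.", "st.", "prof.", "inc."}

def rule_based_split(text):
    """Split after sentence-final punctuation unless the preceding
    token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(rule_based_split("I spoke with Dr. Smith. My house is on Mountain Dr."))
# ['I spoke with Dr. Smith.', 'My house is on Mountain Dr.']
```

Note that the final "Mountain Dr." is handled correctly here only because it ends the input; if another sentence followed, the abbreviation rule would miss that boundary, which is exactly the kind of exception mentioned above.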
• To improve on such a rule-based approach, sentence segmentation is stated as a
classification problem.
• Given training data in which all sentence boundaries are marked, we can train a classifier to recognize them; a minimal sketch follows.
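A minimal sketch of the classification view, assuming scikit-learn is available; the features, the tiny hand-labelled training set, and the helper candidate_features are illustrative, not a reference implementation:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def candidate_features(tokens, i):
    """Features describing the candidate boundary after tokens[i]."""
    word = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "ends_with_period": word.endswith("."),
        "abbrev_like": word.endswith(".") and len(word) <= 4 and word != ".",
        "next_capitalized": nxt[:1].isupper(),
    }

# Tiny hand-labelled toy training set: (features, 1 = boundary, 0 = not).
train = [
    (candidate_features(["with", "Dr.", "Smith."], 1), 0),
    (candidate_features(["with", "Dr.", "Smith.", "My"], 2), 1),
    (candidate_features(["left", "today", ".", "We"], 2), 1),
    (candidate_features(["the", "U.S.", "economy", "grew"], 1), 0),
    (candidate_features(["the", "cat", "sat", "here"], 2), 0),
]

vec = DictVectorizer()
X = vec.fit_transform([features for features, _ in train])
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

# Predict whether a boundary follows "Jones." in "... met Mr. Jones. He ..."
test = candidate_features(["met", "Mr.", "Jones.", "He"], 2)
print(clf.predict(vec.transform([test])))
```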
Topic Boundary Detection
• Topic segmentation (also called discourse or text segmentation) is the task of automatically dividing a stream of text or speech into topically homogeneous blocks.
• That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find the boundaries where topics change.
• Topic segmentation is an important task for various language understanding
applications, such as information extraction and retrieval and text
summarization.
• For example, in information retrieval, if a long document can be segmented into shorter, topically coherent segments, then only the segment that is about the user's query could be retrieved.
• During the late 1990s, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated the Topic Detection and Tracking (TDT) program to further the state of the art in finding and following new topics in a stream of broadcast news stories.
• One of the tasks in the TDT effort was segmenting a news stream into individual
stories.
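As a simple illustration of topic segmentation by lexical cohesion (in the spirit of TextTiling-style methods, not the actual TDT systems; the stopword list and threshold are arbitrary choices):

```python
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "a", "of", "and", "to", "in", "this"}

def bag(sentence):
    # Bag-of-words term counts, lowercased, punctuation stripped, stopwords removed.
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def topic_boundaries(sentences, threshold=0.1):
    """Propose a topic boundary wherever the lexical similarity of
    adjacent sentences drops below the threshold."""
    bags = [bag(s) for s in sentences]
    return [i + 1 for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]

news = [
    "The central bank raised interest rates again.",
    "Higher interest rates worry the housing market.",
    "The championship final drew a record crowd.",
    "The crowd celebrated the final whistle downtown.",
]
print(topic_boundaries(news))  # [2]: a topic change before the third sentence
```

In practice the comparison is made over windows of several sentences and the threshold is tuned, but the idea of low lexical cohesion signalling a topic change is the same.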
Methods
• Sentence segmentation and topic segmentation have both been treated as boundary classification problems.
• Given a boundary candidate (between two word tokens for sentence segmentation and between two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual boundary (sentence or topic boundary).
• Formally, let x ∈ X be the vector of features (the observation) associated with a candidate and y ∈ Y be the label predicted for that candidate.
• The label y can be b for a boundary and b̄ for a non-boundary.
• Classification problem: given a set of training examples (x, y)_train, find a function that will assign the most accurate possible label y to unseen examples x_unseen.
• As an alternative to the binary classification problem, it is possible to model boundary types using finer-grained categories.
• For example, sentence segmentation in text can be framed as a three-class problem: sentence boundary with an abbreviation (b_a), sentence boundary without an abbreviation (b_ā), and abbreviation not at a boundary (b̄_a).
• Similarly, for spoken language, a three-way classification can be made between non-boundaries (b̄), statement boundaries (b_s), and question boundaries (b_q).
• For sentence or topic segmentation, the problem is defined as
finding the most probable sentence or topic boundaries.
• The natural unit of sentence segmentation is the word, and that of topic segmentation is the sentence, as we can assume that topics typically do not change in the middle of a sentence.
• The words or sentences are then grouped into stretches belonging to one sentence or topic; that is, word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.
• The classification can be done at each potential boundary i (local modelling); then the aim is to estimate the most probable boundary type ŷ_i for each candidate x_i:
ŷ_i = argmax_{y_i ∈ Y} P(y_i | x_i)
Here, the hat (^) is used to denote estimated categories, and a variable without a hat is used to show possible categories.
• In this formulation, a category is assigned to each example in isolation; hence, the decision is made locally.
• However, consecutive boundary types can be related to each other. For example, in broadcast news speech, two consecutive sentence boundaries that would form a single-word sentence are very infrequent.
• In local modelling, features can be extracted from the context surrounding the candidate boundary to model such dependencies.
• It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types Ŷ = ŷ_1, ŷ_2, …, ŷ_n that has the maximum probability given the candidate examples X = x_1, …, x_n (a decoding sketch follows below):
Ŷ = argmax_Y P(Y | X)
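A minimal decoding sketch, assuming we already have per-candidate boundary probabilities from some local classifier and a hand-set transition model that discourages consecutive boundaries; Viterbi search then finds the jointly most probable label sequence (all numbers are illustrative):

```python
import math

# Hypothetical P(y_i = boundary | x_i) from a local classifier,
# for five candidate positions (all numbers are made up).
local_p = [0.1, 0.85, 0.6, 0.55, 0.05]

# Transition probabilities P(y_i | y_{i-1}) that discourage two
# boundaries in a row (single-word sentences are rare).
trans = {("nb", "nb"): 0.6, ("nb", "b"): 0.4,
         ("b", "nb"): 0.9, ("b", "b"): 0.1}

def viterbi(local_p, trans):
    states = ["nb", "b"]
    emit = lambda s, p: p if s == "b" else 1.0 - p
    # score[s] = best log-probability of any label sequence ending in state s
    score = {s: math.log(emit(s, local_p[0])) for s in states}
    backptr = []
    for p in local_p[1:]:
        prev, score, backptr = score, {}, backptr + [{}]
        for s in states:
            best = max(states, key=lambda q: prev[q] + math.log(trans[(q, s)]))
            backptr[-1][s] = best
            score[s] = prev[best] + math.log(trans[(best, s)]) + math.log(emit(s, p))
    # Trace back the jointly most probable sequence of boundary labels.
    path = [max(states, key=lambda s: score[s])]
    for b in reversed(backptr):
        path.append(b[path[-1]])
    return list(reversed(path))

print(viterbi(local_p, trans))
# ['nb', 'b', 'nb', 'b', 'nb']
```

Thresholding each local probability at 0.5 independently would mark positions 1, 2, and 3 all as boundaries; the transition model suppresses the unlikely back-to-back boundary at position 2.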
• We categorize the methods into local and sequence
classification.
• Another categorization of methods is done according to the
type of the machine learning algorithm: generative versus
discriminative.
• Generative sequence models estimate the joint distribution P(X, Y) of the observations (words, punctuation) and the labels (sentence boundary, topic boundary).
• Discriminative sequence models, in contrast, focus on features that characterize the differences between the possible labelings of the examples.
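A minimal contrast of the two families on the same toy feature matrix, assuming scikit-learn is available: BernoulliNB as a generative model (it estimates class priors and per-class feature likelihoods, i.e., the joint P(X, Y), and applies Bayes' rule) versus LogisticRegression as a discriminative model of P(Y | X). The feature encoding and labels are illustrative only:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

# Toy binary feature vectors for boundary candidates:
# [ends_with_period, next_word_capitalized, known_abbreviation]
X = np.array([
    [1, 1, 0],   # "Smith. My"    -> boundary
    [1, 1, 1],   # "Dr. Smith"    -> non-boundary
    [0, 0, 0],   # "cat sat"      -> non-boundary
    [1, 0, 1],   # "U.S. economy" -> non-boundary
    [1, 1, 0],   # "today. We"    -> boundary
])
y = np.array([1, 0, 0, 0, 1])

# Generative: models the joint P(X, Y) via class priors and per-class
# feature likelihoods, then applies Bayes' rule to classify.
generative = BernoulliNB().fit(X, y)

# Discriminative: directly models the conditional P(Y | X).
discriminative = LogisticRegression().fit(X, y)

candidate = np.array([[1, 1, 0]])
print(generative.predict_proba(candidate))
print(discriminative.predict_proba(candidate))
```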
