NLP 1.1
of
Documents
Introduction
• In human language, words and sentences do not appear randomly but
have structure.
• For example, combinations of words form sentences: meaningful
grammatical units, such as statements, requests, and commands.
• Automatic extraction of document structure helps subsequent NLP
tasks: for example, parsing, machine translation, and semantic role
labelling use sentences as the basic processing unit.
• Sentence boundary annotation (labelling) is also important for making
the output of automatic speech recognition (ASR) systems readable to humans.
• Sentence boundary detection is the task of deciding where sentences
start and end, given a sequence of characters (made of words and
typographical cues).
• Topic segmentation is the task of determining when a topic starts and
ends in a sequence of sentences.
• Statistical classification approaches to segmentation try to detect sentence
and topic boundaries, given human-annotated training data.
• These methods base their predictions on features of the input: local
characteristics that give evidence toward the presence or absence of a boundary,
such as a period (.), a question mark (?), an exclamation mark (!), or another type
of punctuation.
• Features are the core of classification approaches and require careful design and
selection in order to be successful and to avoid overfitting and noise problems.
• Although most statistical approaches described here are language independent,
every language is a challenge in itself.
• For example, for processing of Chinese documents, the processor may need to
first segment the character sequences into words, as the words usually are not
separated by a space.
• Similarly, for morphologically rich languages, the word structure may need to be
analyzed to extract additional features.
• Such processing is usually done in a pre-processing step, where a sequence of
tokens is determined.
• Tokens can be word or sub-word units, depending on the task and language.
• The segmentation algorithms are then applied to these tokens.
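The pre-processing and feature ideas above can be illustrated with a minimal Python sketch. It assumes plain whitespace tokenization, which would not work for Chinese or other languages without spaces, and a small hand-picked feature set. The function names `tokenize` and `boundary_features` and the feature names are hypothetical, chosen only for this illustration.

```python
def tokenize(text):
    """Whitespace tokenization: a stand-in for a language-specific tokenizer."""
    return text.split()


def boundary_features(tokens, i):
    """Local features for the candidate boundary right after tokens[i].

    The feature set is an assumption made for illustration; a real system
    would design and select features carefully.
    """
    current = tokens[i]
    following = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "ends_with_period": current.endswith("."),
        "ends_with_question": current.endswith("?"),
        "ends_with_exclamation": current.endswith("!"),
        "next_is_capitalized": following[:1].isupper(),
        "is_last_token": following == "",
        # Short tokens before a period are often abbreviations.
        "current_is_short": len(current.rstrip(".?!")) <= 3,
    }


tokens = tokenize("I spoke with Dr. Smith. My house is on Mountain Dr.")
features = [boundary_features(tokens, i) for i in range(len(tokens))]
```

In this sketch the candidates after "Dr." and after "Smith." share the same punctuation and capitalization features, so those two features alone cannot separate a true boundary from an abbreviation. Only the extra length feature differs between them.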
Sentence Boundary Detection
• Sentence boundary detection (sentence segmentation) deals with automatically segmenting
a sequence of word tokens into sentence units.
• In written text in English and some other languages, the beginning of a sentence is usually
marked with an uppercase letter, and the end of a sentence is explicitly marked with a
period (.), a question mark (?), an exclamation mark (!), or another type of punctuation.
• In addition to their role as sentence boundary markers, capitalized initial letters are used
to distinguish proper nouns, periods are used in abbreviations, and numbers and punctuation
marks are used inside proper names.
• The period at the end of an abbreviation can mark a sentence boundary at the same time.
• Example: "I spoke with Dr. Smith." and "My house is on Mountain Dr."
• In the first sentence, the abbreviation "Dr." does not end the sentence; in the second,
it does.
• Quoted sentences are especially problematic, as the speakers may have uttered
multiple sentences, and sentence boundaries inside the quotes are also marked with
punctuation marks.
• An automatic method that marks a sentence end at every word boundary where such a
punctuation mark appears would cut some sentences incorrectly.
• Ambiguous abbreviations and capitalizations are not the only problems of
sentence segmentation in written text.
• Spontaneously written texts, such as short message service (SMS) texts or
instant messaging (IM) texts, tend to be ungrammatical and have poorly
used or missing punctuation, which makes sentence segmentation even
more challenging.
• Similarly, if the text input to be segmented into sentences comes from an
automatic system, such as optical character recognition (OCR) or ASR,
that aims to translate images of handwritten, typewritten, or printed text
or spoken utterances into machine-editable text, sentence boundary
detection must deal with the errors of those systems as well.
• On the other hand, for conversational speech or text or multiparty
meetings with ungrammatical sentences and disfluencies, in most cases it
is not clear where the boundaries are.
• Code switching, that is, the use of words, phrases, or sentences from multiple
languages by multilingual speakers, is another problem that can affect the
characteristics of sentences.
• For example, when switching to a different language, the writer can either keep the
punctuation rules from the first language or resort to the code of the second
language.
• Conventional rule-based sentence segmentation systems in well-formed texts rely
on patterns to identify potential ends of sentences and lists of abbreviations for
disambiguating them.
• For example, if the word before the boundary is a known abbreviation, such as
“Mr.” or “Gov.,” the text is not segmented at that position, even though in some
cases the period also ends the sentence.
• To improve on such a rule-based approach, sentence segmentation can be framed as a
classification problem.
• Given the training data where all sentence boundaries are marked, we can train a
classifier to recognize them.
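The rule-based approach and its limits can be sketched in a few lines of Python. This is a minimal illustration, not a production segmenter: the function `segment`, the tiny `ABBREVIATIONS` list, and the third sentence of the test text ("It is a quiet street.") are assumptions added here so the effect of the abbreviation rule is visible.

```python
ABBREVIATIONS = {"dr.", "mr.", "gov."}  # illustrative list, far from exhaustive
END_MARKS = (".", "?", "!")


def segment(text, abbreviations=frozenset()):
    """Split text at tokens ending in '.', '?' or '!'.

    Tokens found in `abbreviations` never end a sentence (the rule-based
    disambiguation step). With an empty set this is the naive splitter.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(END_MARKS) and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "I spoke with Dr. Smith. My house is on Mountain Dr. It is a quiet street."
naive = segment(text)                   # punctuation only
with_rules = segment(text, ABBREVIATIONS)  # punctuation + abbreviation list
```

Here `naive` wrongly cuts after "Dr." in the first sentence, producing four pieces. `with_rules` repairs that cut but then fails to split after "Mountain Dr.", merging the last two sentences into one. This is the kind of exception that motivates training a classifier on marked boundaries instead.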
Topic Boundary Detection
• Topic segmentation (also called discourse or text segmentation) is the task of
automatically dividing a stream of text or speech into topically homogeneous blocks.
• That is, given a sequence of (written or spoken) words, the aim of topic
segmentation is to find the boundaries where topics change.
• Topic segmentation is an important task for various language understanding
applications, such as information extraction and retrieval and text
summarization.
• For example, in information retrieval, if a long document can be segmented
into shorter, topically coherent segments, then only the segment that is about
the user’s query needs to be retrieved.
• During the late 1990s, the U.S. Defense Advanced Research Projects Agency (DARPA)
initiated the Topic Detection and Tracking (TDT) program to further the state of the art
in finding and following new topics in a stream of broadcast news stories.
• One of the tasks in the TDT effort was segmenting a news stream into individual
stories.
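The information retrieval use case mentioned above can be sketched as follows, assuming the document has already been split into topical segments. The scoring by shared words, the helper names `words` and `best_segment`, and the two example segments are all assumptions made for illustration. Real retrieval systems use more robust ranking.

```python
def words(text):
    """Lowercased set of words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def best_segment(segments, query):
    """Return the segment that shares the most words with the query."""
    query_words = words(query)
    return max(segments, key=lambda seg: len(words(seg) & query_words))


# Hypothetical output of a topic segmenter run on one long news document.
segments = [
    "The council approved the new city budget. Spending on roads will rise.",
    "Heavy rain is expected this weekend. Coastal areas may see flooding.",
]
result = best_segment(segments, "weekend rain forecast")
```

With these inputs, `result` is the second segment, so only the weather passage would be returned for the query rather than the whole document.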
Methods
• Sentence segmentation and topic segmentation have both been
treated as boundary classification problems.
• Given a boundary candidate (between two word tokens for
sentence segmentation and between two sentences for
topic segmentation), the goal is to predict whether or not
the candidate is an actual boundary (sentence or topic
boundary).
• Formally, let x ∈ X be the vector of features (the observation)
associated with a candidate and y ∈ Y be the label predicted