Lecture notes HUM1012 (Logic & Language structure) AT

Brief Introduction to Computational Linguistics & Natural Language Processing

What is Computational Linguistics (CL)?

Computational linguistics (CL) is a discipline between linguistics and computer science concerned with the
computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field
of artificial intelligence (AI), a branch of computer science that aims at computational models of human cognition.
There are two components of CL:

1. Applied, and
2. Theoretical.

The applied component of CL is more interested in the practical outcome of modelling human language use. The
goal is to create software products that have some knowledge of human language. Such products are urgently
needed for improving human-machine interaction since the main obstacle in the interaction between human and
computer is communication. Contemporary computers do not understand our language, and humans have
difficulties understanding the computer's language, which does not correspond to the structure of human thought.

Natural language interfaces enable the user to communicate with the computer in German, English, Hindi or
another human language. Some applications of such interfaces are database queries, information retrieval from texts
and so-called expert systems. Current advances in recognition of spoken language improve the usability of many
types of natural language systems. Communication with computers using spoken language will have a lasting impact
upon the work environment, opening up completely new areas of application for information technology.

Although existing CL programs are far from achieving human ability, they have numerous possible applications.
Even if the language the machine understands and its domain of discourse are very restricted, the use of human
language can increase the acceptance of software and the productivity of its users. Much older than communication
problems between human beings and machines are those between people with different mother tongues. One of the
original goals of applied computational linguistics was fully automatic translation between human languages. From
bitter experience scientists have realized that they are far from achieving this. Nevertheless, computational linguists
have created software systems which can simplify the work of human translators and clearly improve their
productivity. The future of applied computational linguistics will be determined by the growing need for user-friendly
software. Even though the successful simulation of human language competence is not to be expected in the near
future, computational linguists have numerous immediate research goals involving the design, realization and
maintenance of systems which facilitate everyday work, such as grammar checkers for word processing programs.

Theoretical CL takes up issues in formal theories. It deals with formal theories about the linguistic knowledge that
humans need for generating and understanding language. Today these theories have reached a degree of complexity
that can only be managed by employing computers. Computational linguists develop formal models simulating
aspects of the human language faculty and implement them as computer programmes. These programmes constitute
the basis for the evaluation and further development of the theories. In addition to linguistic theories, findings from
cognitive psychology play a major role in simulating linguistic competence. Within psychology, it is mainly the area of
psycholinguistics that examines the cognitive processes constituting human language use. The special attraction of
computational linguistics lies in the combination of methods and strategies from the humanities, natural and
behavioural sciences, and engineering.

CL: Goals

The scientific goal of computational linguistics is to understand the acquisition, comprehension and production of
human languages in information processing terms. Because language is used to convey information, we assume that
these processes fundamentally involve the processing of information, i.e., that they are fundamentally computational
in nature. Computational linguistics also has a more applied, technological side: if we understand the information
processing involved in human language, we can also implement it on computers. Applications of computational
linguistics include:

• Machine translation (i.e., translating documents from one language to another by computer)
• Speech recognition (e.g., transcribing speech)
• Information extraction (e.g., automatically identifying the topic of a document, the things that it talks about, and
the important relationships between those things)


Even after the dot-com bubble, there is a steadily increasing demand for people with training in computational
linguistics in the software industry.

Computational Linguistics (CL) is a unique discipline which distinctively unites the field of Computer Science with the
findings in Linguistics. CL is most commonly grouped into the category of Artificial Intelligence. Applied CL does not
stop there, however. Computational Linguistics also includes applications from the fields of Statistics, Mathematics, and
Logic, all contributing to the computational nature of this field. CL has also been generally described as the “study
devoted to developing algorithms and software for intelligently processing language data”.

In addition, there is a theoretical aspect to the study of Computational Linguistics. Theoretical CL deals with the
Philosophy and Psychology of Language, which shows how interdisciplinary the field is. This theoretical aspect of CL,
together with the applications mentioned previously, classifies CL as an integral component of the
field of Cognitive Science. David Crystal describes computational linguistics as "A branch of linguistics in which
computational techniques and concepts are applied to the elucidation of linguistic and phonetic problems. Several
research areas have developed, including speech synthesis, speech recognition, automatic translation, the making of
concordances, the testing of grammars, and the many areas where statistical counts and analyses are required (e.g.,
in literary textual studies)".

Computational Linguistics has prevailed in the field of Cognitive Science for a long time and is closely tied to the
development of the digital computer. Computational linguistics originated with efforts in the United States in the 1950s
to have computers automatically translate texts from foreign languages (notably Russian) into English. An early form
of Computational Linguistics thus grew out of the need for machine-aided translation.

In the 1960s, with the onset of Artificial Intelligence (AI), it became evident that the highly complex and sometimes
ambiguous, recursive, and infinite aspects of language, such as syntax, morphology, semantics, and pragmatics,
would require the use of computers to aid in their study. Computational Linguistics thus officially emerged as the
sub-class or branch of AI that deals with the simulation and processing of natural language.

CL: An interdisciplinary subject

The subject of Computational Linguistics is an interdisciplinary program combining the fields of Linguistics and
Computer Science. It is part of the wider sphere of Cognition Studies and, to a certain extent, overlaps with the field of
artificial intelligence, a branch of Computer Science which examines computational models of human cognition.
While General Linguistics investigates the human language ability under theoretical and practical criteria, Computer
Science examines techniques for the mechanical processing and transmission of information. The synergy of these
two fields leads to an area of research which applies methods from Computer Science, Logic, Mathematics and
Formal Linguistics, among other disciplines. Therefore, Computational Linguistics is particularly attractive for
students who have a broad interest in both the humanities and the natural sciences.
Computational Linguistics has undergone enthusiastic development over the last few years, thereby establishing
itself as an innovative, academic discipline. In an age of electronic computer processing, much interest has surfaced
in the mechanical processing of human language. Research in Computational Linguistics deals with the theoretical
and practical aspects of mechanical language processing. This research is derived from two complementary spheres
of interest, which connect the following areas of specialization.
The central thematic area of applied Computational Linguistics can be described, in a first approximation, as the
simulation of human language ability in computer-supported applications. In these language-technology applications,
the thematic areas are similar in character to those of applied Computer Science. Additionally, the thematic area
includes the systematic investigation of the linguistic and algorithmic foundations necessary for such applications.
Consequently, Computational Linguistics has the integral components of an interdisciplinary field. As such,
Computational Linguistics is closely aligned with the disciplines of Computer Science, Mathematics, Logic, Cognitive
Psychology, General Linguistics and the individual philological branches.
In order to arrive at computer-aided applications or, as they are called, Lingware, an analysis of human language
capacity is essential, and this analysis makes use of methods from theoretical Linguistics. Yet a detailed investigation
of descriptive grammars and linguistic representations is also needed in order to make language accessible to
mechanical processing. The study of applied Computational Linguistics therefore also requires the acquisition of
other skills in Linguistics and Computer Science that are of direct relevance for mechanical language processing.
On the linguistic side, it is crucial to understand and master the most important sub-disciplines, their methods, and
their conversion and application in modern grammar formalisms. Only on a solid linguistic basis can a
computational-linguistic problem be appropriately assessed and adequately addressed, both in terms of definitions
and in practical systems that meet the modularity and maintenance requirements of modern software technology.
Achieving this aim requires an understanding of the relevant linguistic methods employed in the analysis of natural
languages, in the context of formally explicit grammar formalisms, as well as knowledge of the most important
sub-topics of modern grammar research and their relevant terminology and content.
In comparison with the basics of Linguistics, a substantial proportion of the program content is Computer Science.
This results directly from the needs of the vocational fields. On one hand, these include the language-technology
industry and the information sector, including the telecommunications industry, in which the technical infrastructure
for information management, provision and preparation is being developed. On the other hand, they include branches
of the production industry in which the management of linguistic information plays an important role, for instance in
document management, multilingual terminology work, and the quality assurance of product documentation.
Necessary prerequisites for work in these areas are mastery of the areas of discrete mathematics relevant to
Computer Science and an understanding of different types of data structures and their applications in algorithms for
language processing, for example in the parsing and generation of spoken or written language. Also necessary is
competence in the most common programming languages used in language technology. Additionally, this field
involves a significant area of application for quantitative and probabilistic methods; the supply and administration of
linguistic documents on the World Wide Web; the creation, maintenance and advancement of linguistic resources,
e.g. in the form of corpora; and experience with larger software projects executed in teams using methods of
software engineering. Current applications of these techniques are directed at machine translation; integrated
automatic spelling and grammar correction; sophisticated text-processing programs; and databases and information
structures for intelligent on-line dictionaries, thesauri and concordance programs.

Central to theoretical Computational Linguistics is, in a first approximation, human language under the aspect of its
computability. Although the computer is the realization par excellence of an effective model of computation, the
scope and guiding interests of this sub-discipline of Linguistics go beyond linguistics for the computer.
In this context, a computational method is understood as a procedure in which, starting from a finite set of objects, a
likewise finite object is obtained through the application of a set of pre-determined rules. One can compare this with a
set of building blocks, whose blocks are assembled according to the laws of gravity to construct a building. Obviously,
understanding a sentence involves a computational process in this sense, since words are assembled into complete
sentences based on linguistic rules. Drawing on results and methods of logic, theoretical Computational Linguistics
explores those characteristics of computational methods which provide insight into the structure of natural
languages. A finely graded hierarchy of computational methods has been developed in Logic and theoretical
Computer Science which classifies these processes according to their increasing strength. Against this background,
it is of great theoretical and practical interest to determine, by considering different linguistic theories, where exactly
on this hierarchy of computable procedures natural languages and their models are located. If it should turn out that
we can model our most important communication medium only by employing abstract computational specifications
which are beyond the practical computing power of current and foreseeable computers, then new questions would be
raised about the subject of linguistic research.
Apart from the practical relevance of these results for the complexity of natural language, insofar as that complexity
can be understood through theoretical considerations about computability, such results also bear closely on the
fundamental conceptions about the nature of language that have been developed in past decades under the
influence of Cognitive Science. The subject of Cognitive Science includes the functions of perception, knowledge
and motivated behavior in humans, animals and machines. The science of cognition integrates results and methods
of Cognitive Psychology, Artificial Intelligence, Linguistics, Philosophy, the neural sciences and Cognitive
Anthropology in its research program.
Starting from the assumption that possible solutions to problems within a given area must obey restrictive
hypotheses, cognition is ultimately to be understood as computation over (mental) representations. According to the
basic assumptions of Cognitive Science, human language belongs to the mental capabilities whose performance is
to be understood as computation over finite, discrete mental representations of the kind discussed above. Should the
procedures assumed in the linguistic sub-topics of Phonology, Morphology, Syntax, Semantics and Pragmatics turn
out to be of such extreme complexity that they not only exceed the practical computing power of conceivable
computers but also fall outside the framework of the abstract concept of a general computational procedure sketched
above, then not only would obstacles obstruct the desired applications of Computational Linguistics, but the
foundations of one of the vying paradigms by which Homo sapiens seeks self-understanding would be shaken.
The connection between Cognitive Science and Computational Linguistics becomes clearer still once one recognizes
that theoretical Computational Linguistics does not have, as its designation might suggest, the central goal of
reducing human language to a computer program. The area of research that views language under
computation-theoretical aspects is, however, of direct importance for the practical interests of computational linguists.

What is NLP?
Natural Language Processing (NLP) is both a modern computational technology and a method of investigating and
evaluating claims about human language itself. Some prefer the term Computational Linguistics in order to capture
this latter function, but NLP is a term that links back into the history of Artificial Intelligence (AI), the general study of
cognitive function by computational processes, normally with an emphasis on the role of knowledge representations,
that is to say the need for representations of our knowledge of the world in order to understand human language with
computers. Natural Language Processing (NLP) is the use of computers to process written and spoken language for
some practical, useful purpose: to translate languages, to get information from the web or from text data banks so as
to answer questions, to carry on conversations with machines so as to get advice about, say, pensions, and so on.
These are only examples of major types of NLP; there is also a huge range of lesser but interesting applications, e.g.
getting a computer to decide whether one newspaper story has been rewritten from another or not. NLP is not simply
applications but also the core technical methods and theories that the major tasks above divide up into, such as
machine learning techniques for automating the construction and adaptation of machine dictionaries, for modeling
human agents' beliefs and desires, etc. This last is closer to Artificial Intelligence, and is an essential component of
NLP if computers are to engage in realistic conversations: they must, like us, have an internal model of the humans
they converse with. Language processing can
be divided into two tasks:
1. Processing written text, using lexical, syntactic, and semantic knowledge of the language as well as any
required real world information.
2. Processing spoken language, using all the information needed above, plus additional knowledge about
phonology as well as enough additional information to handle the further ambiguities that arise in speech.
The steps in the process of natural language understanding are:

Morphological analysis: Individual words are analyzed into their components, and non-word tokens (such as
punctuation) are separated from the words. For example, in the phrase "Bill's house" the proper noun "Bill" is
separated from the possessive suffix "'s."
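
As a minimal sketch of this step (assuming the NLTK toolkit, which these notes do not prescribe), a standard
tokenizer performs exactly this separation:

    import nltk
    nltk.download('punkt', quiet=True)  # tokenizer models (newer NLTK versions may also need 'punkt_tab')

    print(nltk.word_tokenize("Bill's house"))
    # ['Bill', "'s", 'house'] -- the possessive suffix is split from the proper noun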

Syntactic analysis: Linear sequences of words are transformed into structures that show how the words relate to one
another. This parsing step converts the flat list of words of the sentence into a structure that defines the units
represented by that list. Constraints imposed include word order ("manager the key" is an illegal constituent in the
sentence "I gave the manager the key"); number agreement; case agreement.

Semantic analysis: The structures created by the syntactic analyzer are assigned meanings. In most universes, the
sentence "Colorless green ideas sleep furiously" [Chomsky, 1957] would be rejected as semantically anomalous. This
step must map individual words into appropriate objects in the knowledge base, and must create the correct structures
to correspond to the way the meanings of the individual words combine with each other.

Discourse integration: The meaning of an individual sentence may depend on the sentences that precede it and may
influence the sentences yet to come. The entities involved in the sentence must either have been introduced explicitly
or they must be related to entities that were. The overall discourse must be coherent.

Pragmatic analysis: The structure representing what was said is reinterpreted to determine what was actually meant.

Natural language is only one medium for human-machine interaction, but has several obvious and desirable
properties:
1. It provides an immediate vocabulary for talking about the contents of the computer.
2. It provides a means of accessing information in the computer independently of its structure and encodings.
3. It shields the user from the formal access language of the underlying system.
4. It is available with a minimum of training.

The Complexity of Natural Language


There are several major reasons why natural language understanding/processing is a difficult problem. They include:
1. The complexity of the target representation into which the matching is being done. Extracting meaningful
information often requires the use of additional knowledge.
2. The type of mapping: one-to-one, many-to-one, one-to-many, or many-to-many. One-to-many mappings
require a great deal of domain knowledge beyond the input to make the correct choice among target
representations. So for example, the word tall in the phrase "a tall giraffe" has a different meaning than in "a
tall poodle." English requires many-to-many mappings.

3. The level of interaction of the components of the source representation. In many natural language sentences,
changing a single word can alter the interpretation of the entire structure. As the number of interactions
increases, so does the complexity of the mapping.
4. The presence of noise in the input to the understander. We rarely listen to one another against a silent
background. Thus speech recognition is a necessary precursor to speech understanding.
5. The modifier attachment problem. (This arises because the linear order of words does not fully determine the hierarchical structure.) The
sentence Give me all the employees in a division making more than $50,000 doesn't make it clear whether the
speaker wants all employees making more than $50,000, or only those in divisions making more than
$50,000.
6. The quantifier scoping problem. Words such as "the," "each," or "what" can have several readings.
7. Elliptical utterances. The interpretation of a query may depend on previous queries and their interpretations.
E.g., asking Who is the manager of the automobile division and then saying, of aircraft?

Processing:
Syntactic Processing: Syntactic parsing determines the structure of the sentence being analyzed. Syntactic analysis
involves parsing the sentence to extract whatever information the word order contains. Syntactic parsing is
computationally less expensive than semantic
processing. A grammar is a declarative representation that defines the syntactic facts of a language. The most
common way to represent grammars is as a set of production rules, and the simplest structure for them to build is a
parse tree which records the rules and how they are matched. Sometimes backtracking is required (e.g., The horse
raced past the barn fell), and sometimes multiple interpretations may exist for the beginning of a sentence (e.g., Have
the students who missed the exam -- ). Example: Syntactic processing interprets the difference between "John hit
Mary" and "Mary hit John."

Semantic Processing: After (or sometimes in conjunction with) syntactic processing, we must still produce a
representation of the meaning of a sentence, based upon the meanings of the words in it.

Lexical processing: Look up the individual words in a dictionary. It may not be possible to choose a single correct
meaning, since there may be more than one. The process of determining the correct meaning of individual words is
called word sense disambiguation or lexical disambiguation. For example, "I'll meet you at the diamond" can be
understood since at requires either a time or a location. This usually leads to preference semantics when it is not clear
which definition we should prefer.

Sentence-level processing: There are several approaches to sentence-level processing. These include semantic
grammars, case grammars, and conceptual dependencies. Example: Semantic processing determines the differences
between such sentences as "The pig is in the pen" and "The ink is in the pen."

Discourse and Pragmatic Processing: To understand most sentences, it is necessary to know the discourse and
pragmatic context in which they were uttered. In general, for a program to participate intelligently in a dialog, it must be
able to represent its own beliefs about the world, as well as the beliefs of others (and their beliefs about its beliefs, and
so on). The context of goals and plans can be used to aid understanding. Plan recognition has served as the basis for
many understanding programs -- PAM is an early example. Speech acts can be axiomatized just as other operators in
written language, except that they require modal operators to describe states of belief, knowledge, etc.

CL vs NLP (essay by [email protected])


The intended distinction between computational linguistics and natural language processing is that Computational
Linguistics covers “works on the application of computers in processing and analyzing language,” whereas Natural
Language Processing covers “works on the computer processing of natural language for the purpose of enabling
humans to interact with computers in natural language.” The distinction, however, does not reflect current thought.
Computational linguists tend to agree that “natural language processing” (NLP) and “computational linguistics” (CL)
mean pretty much the same thing (or, if different, that the meaning of natural language processing is encompassed
within the meaning of computational linguistics). That means we can merge natural language processing and
computational linguistics relatively easily.
There was agreement that the relative contribution of computer science to computational linguistics is greater than
the contribution of linguistics. Similarly, there was agreement that a background in computer science is more essential
for computational linguistics than a background in linguistics. Further, computer scientists are much more likely than
linguists to embrace computational linguistics as part of their field. From these statements, classing the merged
natural language processing / computational linguistics might seem a no-brainer. On the other hand, however, some
of the observations shared suggest that the situation may not be so cut-and-dried: Computational linguistics really
belongs in linguistics, but linguists don’t realize it yet. Computer scientists sometimes change the field they apply their
skills to (that is, a junior computational linguist might not continue to work in computational linguistics). One may get
better results teaching computer science to a linguist than teaching linguistics to a computer scientist.
5
Lecture notes HUM1012 (Logic & Language structure) AT
There are at least two distinctions made in computational linguistics that are relevant here. The first is a distinction
between symbolic and statistical approaches to computational linguistics, the former emphasizing linguistics-based
representations of natural language, the latter emphasizing quantitative representations of natural language. Many
symbolic approaches could be classed comfortably within linguistics; however, the same could be said of statistical
approaches considerably less often.

A second distinction is made in computational linguistics between tasks and applications: Computational linguistics
tasks (e.g., part-of-speech tagging, parsing, word sense disambiguation, text segmentation) rely, wholly or in part, on
specific properties of language in their processing and analysis and may be combined to form applications of extrinsic
value; computational linguistics applications (e.g., question answering, information retrieval, automatic abstracting,
machine translation) are composed of components addressing multiple linguistic properties and are of extrinsic value.
Again, one end of our spectrum (in this case, tasks) is much more like linguistics than the other (in this case,
applications—unless the application is itself in linguistics, e.g., translation), but all applications carry out some number of
tasks.
It appears to us that the best solution would be to drop the distinction between natural language processing and
computational linguistics by relocating comprehensive and interdisciplinary works on computational linguistics.

Language Technology
Language technologies are information technologies that are specialized for dealing with the most complex
information medium in our world: human language. Therefore these technologies are also often subsumed under the
term Human Language Technology. Human language occurs in spoken and written form. Whereas speech is the
oldest and most natural mode of language communication, complex information and most of human knowledge is
maintained and transmitted in written texts. Speech and text technologies process or produce language in these two
modes of realization. But language also has aspects that are shared between speech and text such as dictionaries,
most of grammar and the meaning of sentences. Thus large parts of language technology cannot be subsumed under
speech and text technologies. Among those are technologies that link language to knowledge. We do not know how
language, knowledge and thought are represented in the human brain. Nevertheless, language technology had to
create formal representation systems that link language to concepts and tasks in the real world. This provides the
interface to the fast growing area of knowledge technologies.
In our communication we mix language with other modes of communication and other information media. We
combine speech with gesture and facial expressions. Digital texts are combined with pictures and sounds. Movies
may contain language in spoken and written form. Thus speech and text technologies overlap and interact with
many other technologies that facilitate processing of multimodal communication and multimedia documents.

Language Technology: Applications


Although existing LT systems are far from achieving human ability, they have numerous possible applications. The
goal is to create software products that have some knowledge of human language. Such products are going to change
our lives. They are urgently needed for improving human-machine interaction since the main obstacle in the
interaction between human and computer is merely a communication problem. Today's computers do not understand
our language, and computer languages are difficult to learn and do not correspond to the structure of human thought.
Even if the language the machine understands and its domain of discourse are very restricted, the use of human
language can increase the acceptance of software and the productivity of its users.

Friendly technology should listen and speak: Natural language interfaces enable the user to communicate with the
computer in French, English, German, or another human language. Some applications of such interfaces are
database queries, information retrieval from texts, so-called expert systems, and robot control. Current advances in
the recognition of spoken language improve the usability of many types of natural language systems. Communication
with computers using spoken language will have a lasting impact upon the work environment; completely new areas of
application for information technology will open up. However, spoken language needs to be combined with other
modes of communication such as pointing with a mouse or finger. If such multimodal communication is finally
embedded in an effective general model of cooperation, we will have succeeded in turning the machine into a partner.
The ultimate goal of research is the omnipresent access to all kinds of technology and to the global information
structure by natural interaction. In an ambitious but not too far-fetched scenario, language technology provides the
interface to an ambient intelligence providing assistance at work and in many situations of daily life.

Machines can also help people communicate with each other: Language technologies can also help people
communicate with each other. Much older than communication problems between human beings and machines are
those between people with different mother tongues. One of the original aims of language technology has always
been fully automatic translation between human languages. From bitter experience scientists have realized that they
are still far away from achieving the ambitious goal of translating unrestricted texts. Nevertheless, they have been able
to create software systems that simplify the work of human translators and clearly improve their productivity. Less than
perfect automatic translations can also be of great help to information seekers who have to search through large
amounts of texts in foreign languages. The most serious bottleneck for e-commerce is the volume of communication

between business and customers or among businesses. Language technology can help to sort, filter and route
incoming email. It can also assist the customer relationship agent to look up information and to compose a response.
In cases where questions have been answered before, language technology can find appropriate earlier replies and
automatically respond.

Language is the fabric of the web: The rapid growth of the Internet/WWW and the emergence of the information
society pose exciting new challenges to language technology. Although the new media combine text, graphics, sound
and movies, the whole world of multimedia information can only be structured, indexed and navigated through
language. For browsing, navigating, filtering and processing the information on the web, we need software that can
get at the contents of documents. Language technology for content management is a necessary precondition for
turning the wealth of digital information into collective knowledge. The increasing multilinguality of the web constitutes
an additional challenge for language technology. The global web can only be mastered with the help of multilingual
tools for indexing and navigating. Systems for cross-lingual information and knowledge management will surmount
language barriers for e-commerce, education and international cooperation.

Technologies
Speech recognition: Spoken language is recognized and transformed into text as in dictation systems, into
commands as in robot control systems, or into some other internal representation.

Speech synthesis: Utterances in spoken language are produced from text (text-to-speech systems) or from internal
representations of words or sentences (concept-to-speech systems).

Text categorization: This technology assigns texts to categories. Texts may belong to more than one category, and
categories may contain other categories. Filtering is a special case of categorization with just two categories.
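
A hedged sketch of such a categorizer, assuming the scikit-learn library; the two categories and the tiny training set
are invented for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts  = ["stocks fell sharply", "the team won the match",
                    "shares rose on earnings", "the striker scored twice"]
    train_labels = ["finance", "sports", "finance", "sports"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())   # bag of words + Naive Bayes
    clf.fit(train_texts, train_labels)
    print(clf.predict(["the goalkeeper saved a penalty"]))    # likely ['sports']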

Text Summarization: The most relevant portions of a text are extracted as a summary. The task depends on the
needed lengths of the summaries. Summarization is harder if the summary has to be specific to a certain query.

Text Indexing: As a precondition for document retrieval, texts are stored in an indexed database. Usually a text is
indexed for all word forms or – after lemmatization – for all lemmas. Sometimes indexing is combined with
categorization and summarization.

Text Retrieval: Texts that best match a given query or document are retrieved from a database. The candidate
documents are ordered with respect to their expected relevance. Indexing, categorization, summarization and retrieval
are often subsumed under the term information retrieval.
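
A minimal sketch of indexing and retrieval in plain Python (documents invented; naive tokenization, no lemmatization
or relevance ranking):

    from collections import defaultdict

    docs = {1: "language technology processes text",
            2: "speech technology processes spoken language"}

    index = defaultdict(set)                     # word -> set of document ids
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    query = "language technology"
    hits = set.intersection(*(index[w] for w in query.split()))
    print(sorted(hits))                          # [1, 2]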

Information Extraction: Relevant pieces of information are discovered and marked for extraction. The extracted pieces
can be: the topic, named entities such as company, place or person names, simple relations such as prices,
destinations, functions etc. or complex relations describing accidents, company mergers or football matches.

Data Fusion and Text Data Mining: Extracted pieces of information from several sources are combined in one
database. Previously undetected relationships may be discovered.

Question Answering: Natural language queries are used to access information in a database. The database may be a
base of structured data or a repository of digital texts in which certain parts have been marked as potential answers.

Report Generation: A report in natural language is produced that describes the essential contents or changes of a
database. The report can contain accumulated numbers, maxima, minima and the most drastic changes.

Spoken Dialogue Systems: The system can carry out a dialogue with a human user in which the user can solicit
information or conduct purchases, reservations or other transactions.

Translation Technologies: Technologies that translate texts or assist human translators. Automatic translation is called
machine translation. Translation memories use large amounts of texts together with existing translations for efficient
look-up of possible translations for words, phrases and sentences.
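
A toy sketch of a translation-memory look-up using fuzzy string matching from Python's standard library (the stored
segment pairs are invented):

    import difflib

    memory = {"How are you?": "Wie geht es dir?",
              "Good morning": "Guten Morgen"}

    segment = "How are you"
    match = difflib.get_close_matches(segment, memory.keys(), n=1, cutoff=0.8)
    if match:
        print(segment, "->", memory[match[0]])   # reuse the stored translation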

Spelling and grammar checking: These are some of the most common NLP applications, and they require linguistic
knowledge to accomplish.

Brief history
Early NLP – 1950’s
-Machine Translation (MT) one of the earliest applications of computers

- Major attempts in US and USSR
- Russian to English and reverse
Georgetown University (Washington) system:
- Translated sample texts in 1954
- Euphoria: lots of funding, many groups in US, USSR
But: the system could not be scaled up.

1966: The ALPAC report


- Assessed research results of groups working on MT
- Concluded: MT not possible in near future.
- Funding should cease for MT
- Basic research should be supported
- Word to word translation does not work
- Linguistic Knowledge is needed

1966 – ELIZA
- Eliza, the first chatterbot – a computer program that mimics human conversation.
- Developed by: Joseph Weizenbaum – Massachusetts Institute of Technology
- User types in some statement or set of statements in natural language
- ELIZA then analyzes the user’s statement and generates some response which it types out.

60-80: Linguistics and CL


- 1957 – Noam Chomsky’s Syntactic Structures
- A formal definition of grammars and languages
- Provides the basis for automatic syntactic processing of NL expressions
- Montague’s PTQ – Formal semantics for NL.
- Basis for logical treatment of NL meaning
- 1967 – Woods’ procedural semantics
- A procedural approach to the meaning of a sentence
- Provides the basis for automatic semantic processing of NL expressions

Some successful early CL systems


- 1970 – TAUM Meteo
- Machine translation of weather reports (Canada)
- 1970s – SYSTRAN: MT system; still used by Google

- 1973 – LUNAR
- To query an expert system on rock analyses from Moon samples
- 1973 – SHRDLU (Terry Winograd, MIT)
- Interaction with a robot in a toy-block world. The user can:
- ask the robot to manipulate the blocks
- ask it about the block configurations
- ask it about its reasoning
- update it with facts

1980s: Symbolic NLP


- Formally grounded and reasonably computationally tractable linguistic formalisms (Lexical Functional Grammar,
Head-Driven Phrase Structure Grammar, Tree Adjoining Grammar etc.)
- Linguistic/Logical paradigm extensively pursued
- Not robust enough

1980s: Corpora and Resources


- Disk space becomes cheap
- Machine readable text becomes ubiquitous
- US funding emphasises large scale evaluation on “real” data
- 1994 – The British National Corpus is made available
A balanced corpus of British English
- Mid 1990s – WordNet (Fellbaum & Miller)
A computational thesaurus developed by psycholinguists
- Early 2000s – The World Wide Web used as a corpus

1990s Statistical NLP


- The following factors promote the emergence of “statistical NLP”:
- Speech recognition shows that given enough data, simple statistical techniques work
- US funding emphasises speech-based interfaces and information extraction
- Large size digitised corpora are available

CL History – Summary
- 50s – Machine translation; ended by ALPAC report

- 60s – Applications (ELIZA, SHRDLU) use linguistic techniques from Chomsky (formal grammars, parsers); Procedural
semantics (Woods) also important. Approaches only work on restricted domains; not portable

- 70s/80s – Symbolic NLP. Applications based on extensive linguistic and real world knowledge. Not robust enough.
Lexical acquisition bottleneck

- 90s – Statistical NLP


Problems: Techniques are often very task specific; Sparse data problem (what to do when there is not enough training
data available?)

- Now – Combining statistical and symbolic approaches; Using machine learning to automate the acquisition of
expensive knowledge resources (e.g., grammars)

Methods and Resources


As the investigation and modelling of human language is a truly interdisciplinary endeavour, the methods of language
technology come from several disciplines: computer science, computational and theoretical linguistics, mathematics,
electrical engineering and psychology.

Generic CS Methods
Programming languages, algorithms for generic data types, and software engineering methods for structuring and
organizing software development and quality assurance

Specialized Algorithms
Dedicated algorithms have been designed for parsing, generation and translation, for morphological and syntactic
processing with finite state automata/transducers and many other tasks.
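
As a toy illustration of the finite-state idea (not any particular production system), the following DFA accepts exactly
the word forms "cat" and "cats", i.e. a stem plus an optional plural -s:

    # transition table: (state, input symbol) -> next state
    TRANS = {(0, 'c'): 1, (1, 'a'): 2, (2, 't'): 3, (3, 's'): 4}
    ACCEPT = {3, 4}                                # 'cat' ends in state 3, 'cats' in state 4

    def accepts(word):
        state = 0
        for ch in word:
            state = TRANS.get((state, ch))
            if state is None:                      # no transition defined: reject
                return False
        return state in ACCEPT

    print(accepts("cat"), accepts("cats"), accepts("cab"))   # True True False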

Non-discrete Mathematical Methods


Statistical techniques have become especially successful in speech processing, information retrieval, and the
automatic acquisition of language models. Other methods in this class are neural networks and powerful techniques
for optimization and search.
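
A minimal sketch of acquiring a statistical language model: unsmoothed maximum-likelihood bigram probabilities
estimated from an invented toy corpus:

    from collections import Counter

    corpus = "the dog barks the dog sleeps the cat sleeps".split()
    bigrams  = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p(w2, w1):
        """Maximum-likelihood estimate of P(w2 | w1), without smoothing."""
        return bigrams[(w1, w2)] / unigrams[w1]

    print(p("dog", "the"))   # 2/3: "the" is followed by "dog" in 2 of its 3 occurrences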

Logical and Linguistic Formalisms


For deep linguistic processing, constraint-based grammar formalisms are employed. Complex formalisms have been
developed for the representation of semantic content and knowledge.

Linguistic Knowledge
Linguistic knowledge resources for many languages are utilized: dictionaries, morphological and syntactic grammars,
rules for semantic interpretation, pronunciation and intonation.

Corpora and Corpus Tools


Large application-specific or generic collections of spoken and written language are exploited for the
acquisition and testing of statistical or rule-based language models.

NLP Modules and linguistic Knowledge


- Speech recogniser/text pre-processor: Phonetics/Phonology
- Morphological analyser: Morphology
- Part of Speech Tagger: Syntax
- Parsing: Syntax, (morphology, semantics)
- Disambiguation: Semantics, Discourse, Pragmatics
- Text planning: Discourse, Pragmatics
- Surface realisation: Syntax, (morphology, semantics)
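
As a hedged sketch (assuming the third-party spaCy library and its small English model "en_core_web_sm", neither
of which these notes prescribe), several of the modules above can be seen at work in a single pipeline:

    import spacy

    nlp = spacy.load("en_core_web_sm")   # bundles tokenizer, tagger, parser, lemmatizer
    doc = nlp("The astronomer saw the star.")

    for token in doc:
        # surface form, part of speech (tagger), syntactic role (parser), lemma (morphology)
        print(token.text, token.pos_, token.dep_, token.lemma_)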

Parsing
Input: Grammar, String
Output: Syntactic structure, Semantic representation
Parsing:
- Given a string S and a grammar G, a parser determines whether or not S is a valid sentence according to G (i.e., whether
S is generated by G)
- assigns a syntactic structure to S
- uses the rules of G to construct this syntactic structure

Control strategies
Depending on how a parser uses the grammar rules, two main control strategies can be distinguished:

Top-Down: the parser rewrites the left-hand side of the rules as the right-hand side
Example: Rewrite S as NP VP

Bottom-Up: the parser rewrites the right-hand side of the rules as the left-hand side.
Example: Rewrite NP VP as S
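
Both control strategies can be tried out with NLTK's demonstration parsers; a sketch with an invented toy grammar
(RecursiveDescentParser works top-down, ShiftReduceParser bottom-up):

    import nltk

    grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      VP -> V NP
      NP -> 'John' | 'Mary'
      V  -> 'saw'
    """)
    tokens = ['John', 'saw', 'Mary']

    top_down  = nltk.RecursiveDescentParser(grammar)   # expands S -> NP VP first
    bottom_up = nltk.ShiftReduceParser(grammar)        # reduces NP VP -> S last

    print(list(top_down.parse(tokens)))    # [Tree('S', [Tree('NP', ['John']), ...])]
    print(list(bottom_up.parse(tokens)))   # same tree, built in the opposite order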

Grammar
- Generative grammar (Chomsky 1957): describes (generates) all and only the valid sentences of a language
- Several types of grammar depending on the type of languages they can generate (Chomsky hierarchy)
- Context-free grammars are enough to describe most of natural language; indexed grammars seem to cover all of
natural language

Summary
- A parser uses a grammar to assign a syntactic structure to grammatical input strings
- The control strategy can be either bottom-up or top-down
- Bottom-up parsers are particularly inefficient when the grammar allows empty categories
- Top-down parsers might fail to terminate when the grammar is left recursive
- Chart parsers store intermediate results thereby avoiding the recomputation of previously parsed constituents
- The control strategy of a chart parser must ensure that the chart is complete for any constituent searched for

Lexical semantics
The study of word meanings and of their interaction with context
- Words have several possible meanings
- Early methods use selectional restrictions to identify meaning intended in given context
a. The astronomer saw the star.
b. The astronomer married the star.
- Modern techniques use statistical evidence derived from large corpora
c. John sat on the bank.
d. John went to the bank.
e. King Kong sat on the bank.

- Lexical relations, i.e., relations between word meanings, are also very important for CL-based applications
- The most used lexical relations are:
- Hyponymy (ISA) e.g., a dog is a hyponym of animal
- Meronymy (part of) e.g., arm is a meronym of body
- Synonymy e.g., eggplant and aubergine
- Antonymy e.g., big and little
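
These four relations can be queried from WordNet via NLTK (a sketch assuming the WordNet data package has
been downloaded; for meronymy it uses WordNet's tree/trunk entry, which parallels the arm/body example above):

    from nltk.corpus import wordnet as wn          # requires nltk.download('wordnet')

    dog = wn.synsets('dog')[0]
    print(dog.hypernyms())                         # hyponymy/ISA: a dog is a kind of canine
    tree = wn.synset('tree.n.01')
    print(tree.part_meronyms())                    # meronymy: trunk, limb, ... are parts of a tree
    print(wn.synsets('aubergine'))                 # synonymy: shares synsets with 'eggplant'
    big = wn.synsets('big', 'a')[0]
    print(big.lemmas()[0].antonyms())              # antonymy: big <-> small (cf. big/little above)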

Word sense disambiguation


- Word sense disambiguation is needed for most NL applications that involve semantics
- is a serious bottleneck for large coverage text processing
- is now mostly done using statistical or machine learning techniques
- has a best accuracy varying between 70% (Senseval 2) and 95% (Yarowsky)
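
A sketch of dictionary-based disambiguation with the simplified Lesk implementation shipped in NLTK (its accuracy
is modest, so treat the chosen sense as illustrative rather than authoritative):

    from nltk.wsd import lesk                      # requires nltk.download('wordnet')

    context = "John went to the bank to deposit his money".split()
    sense = lesk(context, "bank")
    print(sense, "-", sense.definition())          # the gloss of the sense Lesk selects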

Compositional Semantics
- Semantics of phrases
- Useful to reason about the meaning of an expression (e.g., to improve the accuracy of a question answering system)
a. John saw Mary.
b. Mary saw John.
- Same words, different meanings
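
A toy sketch of the idea in plain Python: the verb denotes a function that combines first with its object and then with
its subject, so the same words in different syntactic positions yield different predicate-argument structures:

    # 'saw' as a curried two-place predicate: object first, then subject
    saw = lambda obj: lambda subj: ("saw", subj, obj)

    print(saw("Mary")("John"))   # ('saw', 'John', 'Mary')  -- "John saw Mary."
    print(saw("John")("Mary"))   # ('saw', 'Mary', 'John')  -- "Mary saw John."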

Pragmatics
- Compositional semantics delivers the literal meaning of an utterance
- NL phrases are often used non-literally
Examples:
a. Can you pass the salt?
b. You are standing on my foot.

Speech act analysis, plan recognition are needed to determine the full meaning of an utterance

Discourse
Much of language interpretation is dependent on the preceding discourse/dialogue
Example: Anaphora resolution
a. The councillors refused the women a permit because they feared revolution.
b. The councillors refused the women a permit because they advocated revolution.

More generally, the various types of linguistic knowledge are put to work in deep NL processing systems.

Deep Natural Language Processing systems build a meaning representation (needed e.g., for NL interfaces to
databases, question answering and good MT) from user input and produce some feedback to the user.

In a deep NLP system, each type of linguistic knowledge is encoded in a knowledge base which can be used by one
or several modules of the system

Major publications in the field:


Computational Linguistics
Computational Linguistics is the only publication devoted exclusively to the design and analysis of natural language
processing systems. URL:http://mitpress.mit.edu/journal-home.tcl?issn=08912017

Journal of Natural Language Engineering (JNLE)


Natural Language Engineering is an international journal designed to meet the needs of professionals and
researchers working in all areas of computerised language processing, whether from the perspective of theoretical or
descriptive linguistics, lexicology, computer science or engineering. Its principal aim is to bridge the gap between
traditional computational linguistics research and the implementation of practical applications with potential real-world
use.

Computer Speech and Language (CS&L)


Machine Translation (MT)
Speech Technology
Natural Language & Linguistic Theory (NLLT)
Mind & Language
Journal of Logic, Language and Information

General Resources on CL/NLP


1. The Association for Computational Linguistics site: http://www.aclweb.org

2. The ACL NLP/CL Universe: http://www.aclweb.org/u/db/acl/

3. The Computation and Language E-Print Archive: http://xxx.lanl.gov/archive/cs/

4. The Survey of the State of the Art of Human Language Technology: http://www.cse.ogi.edu/CSLU/HLTsurvey/

5. The Linguistic Data Consortium: http://www.ldc.upenn.edu/

6. The Language Technology Helpdesk: http://www.ltg.ed.ac.uk/helpdesk/faq/index.html

Professional Organizations, Associations


1. Association for Computational Linguistics (ACL): http://www.aclweb.org

2. Association for Machine Translation in the Americas (AMTA): http://www.amtaweb.org/

3. Cognitive Science Society: http://cognitivesciencesociety.org/index.html

4. American Association of Artificial Intelligence (AAAI): http://www.aaai.org/home.html

5. Natural Language Processing Association India (NLPAI), at IIIT Hyderabad: http://nlpai.iiit.ac.in/

RESOURCE LINKS:
http://www1.cs.columbia.edu/~radev/nlpfaq.txt
http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/NaturalLanguage
http://www.gelbukh.com/clbook/
http://www.gelbukh.com/clbook/Computational-Linguistics.htm#_Toc86751649
http://en.wikipedia.org/wiki/Computational_linguistics
http://en.wikipedia.org/wiki/Natural_language_processing
& Oxford Handbook of Computational Linguistics
Assignments and Class Test
- Based on the reading materials, text-books and class-discussions

- Choose any one of the following questions for your assignment:

1. Sketch a brief history of the emergence of CL/NLP as an interdisciplinary subject.

2. Discuss the various applications of CL/NLP. List some of the systems that are being used.

3. Discuss the linguistic complexity in natural language processing, taking examples from your mother tongue or
English/Hindi.

4. How does a Word-count tool/software work? List the heuristic rules involved in counting words in Hindi.

GOOD LUCK!
