TEXT MINING
■ The primary source of functional data that links clinical medicine,
pharmacology, sequence data, and structure data takes the form of
biomedical documents in online bibliographic databases such as PubMed.
■ Text mining is defined as automatically extracting this
data from documents, which are published in the form of
unstructured free text, often in several languages.
■ Working with free text is one of the most challenging areas of computer
science because natural language is ambiguous and often references data
not contained in the document under study.
■ Data on a particular topic may appear in the main body of text, in a
footnote, in a table, or embedded in a graphic illustration.
NATURAL LANGUAGE PROCESSING
■ The most promising approaches to text mining online documents rely
on natural language processing (NLP).
■ NLP is a technology that involves a variety of computational methods
ranging from simple keyword extraction to semantic analysis.
■ The simplest NLP systems work by scanning documents and tagging those
that contain recognized keywords, such as "protein" or "amino
acid."
■ The contents of the tagged documents can then be copied to a local
database and later reviewed.
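The keyword-tagging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`find_keywords`, `tag_documents`), the sample documents, and the in-memory dictionary standing in for the local database are all hypothetical. Only the two keywords come from the slide.

```python
import re

# The two keywords come from the slide; a real list would be much longer.
KEYWORDS = ("protein", "amino acid")

def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur in text as whole words (case-insensitive)."""
    found = []
    for kw in keywords:
        # \b anchors keep "protein" from matching inside "lipoprotein".
        if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE):
            found.append(kw)
    return found

def tag_documents(docs, keywords=KEYWORDS):
    """Map document id -> matched keywords, keeping only documents with a match."""
    tagged = {}
    for doc_id, text in docs.items():
        hits = find_keywords(text, keywords)
        if hits:
            tagged[doc_id] = hits
    return tagged

# Made-up documents, for illustration only.
docs = {
    "doc1": "The protein folds into a compact structure.",
    "doc2": "Each amino acid contributes to the binding site of the protein.",
    "doc3": "Patients were followed for two years.",
}

# Hypothetical stand-in for the local database of tagged documents.
local_store = tag_documents(docs)
```

Here `local_store` keeps `doc1` and `doc2` and drops `doc3`, which contains neither keyword. Matching on whole words is a design choice; plain substring matching would also tag unrelated terms that merely contain a keyword.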
■ More advanced NLP systems use statistical methods to recognize not only
relevant keywords, but also their distribution within a document. In this
way, it's possible to infer context.
■ For example, an NLP system can identify documents with the keywords
"amino acid", "neurofibromatosis", and "clinical outcome" in the same
paragraph. The result of this more advanced analysis is document clusters,
each of which represents data on a specific topic in a particular context.
■ This capability of identifying documents or document clusters is used by
typical Web search engines, such as Google or Yahoo!, and by the native
PubMed interface.
■ This approach is also used in commercial bibliographic database systems,
such as EndNote®, ProCite®, and Reference Manager®, which
create a local subset of PubMed data by capturing the native field
definitions, such as author name, publication title, and MeSH keywords.
Figure - Text Mining with NLP. Simple keyword extraction is useful in identifying
documents, analysis of keyword distribution identifies document clusters, and
semantic analysis can reveal rules and trends.
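The idea of using keyword distribution to infer context can be sketched as follows. This is a deliberately simplified stand-in, not a statistical method: it only checks which keywords co-occur within a single paragraph and groups documents by that keyword set. The function names, the `min_keywords` threshold, and the sample documents are assumptions made for illustration. The three keywords are the ones used in the example above.

```python
import re
from collections import defaultdict

# Keywords taken from the slide's example.
KEYWORDS = ("amino acid", "neurofibromatosis", "clinical outcome")

def paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]

def cooccurring_keywords(paragraph, keywords=KEYWORDS):
    """Return the set of keywords present in one paragraph (case-insensitive substring match)."""
    lowered = paragraph.lower()
    return frozenset(kw for kw in keywords if kw in lowered)

def cluster_by_context(docs, keywords=KEYWORDS, min_keywords=2):
    """Group document ids by the largest keyword set found together in one paragraph."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        best = max(
            (cooccurring_keywords(p, keywords) for p in paragraphs(text)),
            key=len,
            default=frozenset(),
        )
        # Documents whose keywords never share a paragraph are left unclustered.
        if len(best) >= min_keywords:
            clusters[best].append(doc_id)
    return dict(clusters)

# Made-up documents, for illustration only.
docs = {
    "a": "Neurofibromatosis is linked to an amino acid substitution.\n\nClinical outcome varied.",
    "b": "An amino acid change in neurofibromatosis predicted clinical outcome.",
    "c": "Amino acid composition was measured.\n\nNeurofibromatosis was not studied.",
}

clusters = cluster_by_context(docs)
```

Document `b` lands in the cluster for all three keywords, `a` in the cluster for "amino acid" plus "neurofibromatosis", and `c` in none, because its keywords appear in different paragraphs. A simple keyword tagger would have treated `a`, `b`, and `c` alike.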
■ The most advanced NLP systems work at the semantic level, that is, the analysis of
how meaning is created by the use and interrelationships of words, phrases,
and sentences.
■ These systems, which represent the leading edge of NLP R&D, are less
reliable than systems based on keyword extraction and distribution
techniques because they sometimes formulate incorrect rules and trends,
resulting in erroneous search results.
Figure - The NLP Process
THE PROCESSING PHASE OF NLP
The processing phase of NLP involves one or more of a variety of techniques.
A heuristic method, for example, applies a rule such as:
IF <protein name>
AND <experimental method name> are in the same sentence
THEN the <experimental method name> refers to the <protein name>
■ This rule states that if a protein name, such as "hemoglobin", is in the same sentence as
an experimental method, such as "microarray spotting", then microarray spotting refers
to hemoglobin. One obvious problem with heuristic methods is that there are exceptions
to most rules.
■ For example, applying the preceding rule to a sentence starting with "Microarray spotting
was not used on the hemoglobin molecule because…" would wrongly conclude that
microarray spotting refers to hemoglobin, even though the sentence says the opposite.
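The rule and its failure on a negated sentence can be reproduced with a short Python sketch. The one-entry lexicons, the naive sentence splitter, and the function names are assumptions made for illustration. The rule itself and the negated example sentence are taken from the text above.

```python
import re

# Hypothetical one-entry lexicons; a real system would use curated term lists.
PROTEINS = ("hemoglobin",)
METHODS = ("microarray spotting",)

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def apply_rule(text):
    """IF a protein name AND a method name share a sentence THEN pair them."""
    pairs = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for method in METHODS:
            for protein in PROTEINS:
                if method in lowered and protein in lowered:
                    pairs.append((method, protein))
    return pairs

# A sentence where the rule gives the intended result (made up for illustration).
positive = "Microarray spotting was applied to hemoglobin."
# The negated sentence from the slide, left truncated as in the source.
negated = "Microarray spotting was not used on the hemoglobin molecule because..."

positive_pairs = apply_rule(positive)
negated_pairs = apply_rule(negated)  # false positive: the rule ignores "not"
```

Both calls return the same pair, which is the exception the slide warns about: the rule sees only co-occurrence and has no notion of negation.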
THE ANALYSIS PHASE OF NLP