TEXT MINING

Text mining involves the automatic extraction of data from unstructured biomedical documents, primarily using natural language processing (NLP) techniques. NLP encompasses various methods from simple keyword extraction to advanced semantic analysis, enabling the identification of relevant document clusters based on context. The processing and analysis phases of NLP utilize techniques such as stemming, tagging, tokenizing, and statistical methods to derive meaningful insights from the text.

Uploaded by

Anas Jamshed

TEXT MINING
■ The primary source of functional data linking clinical medicine,
pharmacology, sequence data, and structure data is biomedical documents
in online bibliographic databases such as PubMed.
■ Text mining is defined as the automatic extraction of this data from
documents, which are published as unstructured free text, often in
several languages.
■ Working with free text is one of the most challenging areas of computer
science, because natural language is ambiguous and often references data
not contained in the document under study.
■ Data on a particular topic may appear in the main body of text, in a
footnote, in a table, or embedded in a graphic illustration.
NATURAL LANGUAGE PROCESSING
■ The most promising approaches to text mining online documents rely
on natural language processing (NLP).
■ NLP is a technology that involves a variety of computational methods
ranging from simple keyword extraction to semantic analysis.
■ The simplest NLP systems work by identifying documents that contain
recognized keywords such as "protein" or "amino acid."
■ The contents of the tagged documents can then be copied to a local
database and later reviewed.
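The keyword-tagging approach above can be sketched in a few lines. This is a minimal illustration, not a production system; the keyword list and sample documents are assumptions for the example.

```python
# Minimal sketch of keyword-based document tagging.
# KEYWORDS and docs are illustrative assumptions.
KEYWORDS = {"protein", "amino acid"}

def tag_document(text):
    """Return the recognized keywords found in a document."""
    lowered = text.lower()
    return {kw for kw in KEYWORDS if kw in lowered}

docs = [
    "The amino acid sequence determines protein folding.",
    "Clinical outcomes were recorded for each patient.",
]
# Keep only documents tagged with at least one keyword.
tagged = [doc for doc in docs if tag_document(doc)]
```

Documents that match are the ones copied to the local database for later review.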
NATURAL LANGUAGE PROCESSING
■ More advanced NLP systems use statistical methods to recognize not only
relevant keywords, but also their distribution within a document. In this
way, it's possible to infer context.
■ For example, an NLP system can identify documents with the keywords
"amino acid", "neurofibromatosis", and "clinical outcome" in the same
paragraph. The result of this more advanced analysis is document clusters,
each of which represents data on a specific topic in a particular context.
■ This capability of identifying documents or document clusters is used by
the typical Web search engines, such as Google or Yahoo!, or the native
PubMed interface.
■ This approach is also used in commercial bibliographic database systems,
such as EndNote®, ProCite®, and Reference Manager®, which
create a local subset of PubMed data by capturing the native field
definitions, such as author name, publication title, and MeSH keywords.
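The keyword-distribution idea described above can be sketched as a paragraph-level co-occurrence check. The keyword set and sample text below are illustrative assumptions.

```python
# Sketch: flag documents where a set of keywords co-occur in the
# same paragraph, a crude proxy for shared context.
def cooccur_in_paragraph(text, keywords):
    """True if all keywords appear together in one paragraph."""
    for paragraph in text.split("\n\n"):
        lowered = paragraph.lower()
        if all(kw in lowered for kw in keywords):
            return True
    return False

doc = ("Neurofibromatosis is a genetic disorder.\n\n"
       "A single amino acid change in the gene product was linked "
       "to neurofibromatosis and a poorer clinical outcome.")
hit = cooccur_in_paragraph(
    doc, {"amino acid", "neurofibromatosis", "clinical outcome"})
```

Documents that pass such a check for the same keyword set would be grouped into one cluster.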
Figure - Text Mining with NLP. Simple keyword extraction is useful in identifying
documents, analysis of keyword distribution identifies document clusters, and
semantic analysis can reveal rules and trends.
NATURAL LANGUAGE PROCESSING
■ The most advanced NLP systems work at the semantic level: the analysis of
how meaning is created by the use and interrelationships of words and
phrases within a sentence.
■ These systems, which represent the leading edge of NLP R&D, are less
reliable than systems based on keyword extraction and distribution
techniques because they sometimes formulate incorrect rules and trends,
resulting in erroneous search results.
Figure - The NLP Process
THE PROCESSING PHASE OF NLP
The processing phase of NLP involves one or more of the following
techniques:

■ Stemming— Identifying the stem of each word. For example, "hybridized",
"hybridizing", and "hybridization" would be stemmed to "hybrid". As a result,
the analysis phase of the NLP process has to deal with only the stem of each
word, and not every possible permutation.

■ Tagging— Identifying the part of speech represented by each word, such as
noun, verb, or adjective.

■ Tokenizing— Segmenting sentences into words and phrases. This process
determines which words should be retained as phrases, and which ones should
be segmented into individual words. For example, "Type II Diabetes" should be
retained as a word phrase, whereas "A patient with diabetes" would be
segmented into individual words.
THE PROCESSING PHASE OF NLP
■ Core Terms— Significant terms, such as protein names and experimental
method names, are identified based on a dictionary of core terms. A related
process is ignoring insignificant words, such as "the", "and", and "a".

■ Resolving Abbreviations, Acronyms, and Synonyms— Replacing
abbreviations with the words they represent, and resolving acronyms and
synonyms to a controlled vocabulary. For example, "DM" and "Diabetes
Mellitus" could be resolved to "Type II Diabetes", depending on the controlled
vocabulary.
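The processing steps above can be combined into a toy pipeline. The stop-word list, suffix rules, and abbreviation table below are simplified assumptions, not a real stemmer or controlled vocabulary.

```python
# Toy sketch of the processing phase: tokenizing, abbreviation
# resolution, stop-word removal, and naive suffix stemming.
STOPWORDS = {"the", "and", "a", "with"}
ABBREVIATIONS = {"dm": "diabetes mellitus"}
SUFFIXES = ("ization", "izing", "ized")  # longest first

def process(sentence):
    """Return the significant, stemmed tokens of a sentence."""
    tokens = sentence.lower().replace(".", "").split()
    # Resolve abbreviations to the controlled vocabulary.
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    result = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue  # drop insignificant words
        for suffix in SUFFIXES:
            if tok.endswith(suffix):
                tok = tok[: -len(suffix)]  # crude stemming
                break
        result.append(tok)
    return result

processed = process("The patient with DM underwent hybridization.")
```

The analysis phase then works on this reduced token stream instead of the raw text.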
THE ANALYSIS PHASE OF NLP
The analysis phase of NLP typically involves the use of heuristics, grammar, or statistical
methods.
■ Heuristic approaches rely on a knowledge base of rules that are applied to the
processed text.
■ Heuristic or rule-based analysis uses IF-THEN rules on the processed words and sentences
to infer association or meaning. Consider the following rule:

IF <protein name>
AND <experimental method name> are in the same sentence
THEN the <experimental method name> refers to the <protein name>

■ This rule states that if a protein name, such as "hemoglobin", is in the same sentence as
an experimental method, such as "microarray spotting", then microarray spotting refers
to hemoglobin. One obvious problem with heuristic methods is that there are exceptions
to most rules.
■ For example, using the preceding rule on a sentence starting with "Microarray spotting
was not used on the hemoglobin molecule because…" would improperly evaluate the
sentence.
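The IF-THEN rule above, and its negation pitfall, can be sketched as follows. The protein and method vocabularies are illustrative assumptions.

```python
# Sketch of the heuristic rule: IF a protein name and an
# experimental method name share a sentence, THEN associate them.
PROTEINS = {"hemoglobin"}
METHODS = {"microarray spotting"}

def apply_rule(sentence):
    """Return (method, protein) pairs co-occurring in the sentence."""
    lowered = sentence.lower()
    return [(m, p) for p in PROTEINS for m in METHODS
            if p in lowered and m in lowered]

# The rule fires even though this sentence negates the association:
hits = apply_rule(
    "Microarray spotting was not used on the hemoglobin molecule.")
```

The spurious match on the negated sentence shows why purely heuristic systems formulate incorrect associations.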
THE ANALYSIS PHASE OF NLP

■ Grammar-based methods use language models to extract information from
the processed text.
■ Language models serve as templates for sentence- and phrase-level
analysis. These templates tend to be domain-specific. For example, a typical
patient case report submitted by a clinician might read:
■ "The patient was a 45-year-old white male with a chief complaint of
abdominal pain for three days."
■ A template that would be compatible with the sentence is
<patient> <patient age> <race> <sex> <chief complaint><complaint
duration>
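One simple way to realize the template above is as a regular expression with named fields. The field patterns (age, race, sex, complaint) are simplified assumptions for illustration.

```python
import re

# Regex sketch of the case-report template:
# <patient> <age> <race> <sex> <chief complaint> <duration>
TEMPLATE = re.compile(
    r"patient was a (?P<age>\d+)-year-old (?P<race>\w+) "
    r"(?P<sex>male|female) with a chief complaint of "
    r"(?P<complaint>.+?) for (?P<duration>.+?)\.",
    re.IGNORECASE,
)

sentence = ("The patient was a 45-year-old white male with a chief "
            "complaint of abdominal pain for three days.")
m = TEMPLATE.search(sentence)
fields = m.groupdict() if m else {}
```

A sentence that matches the template yields structured fields; one that does not is passed to other templates or left unanalyzed.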

■ Statistical methods use mathematical models to derive context and meaning
from words.
THE ANALYSIS PHASE OF NLP
■ Most statistical approaches to the analysis phase of NLP include an
assessment of word frequency at the sentence, paragraph, and
document level.
■ Word frequency is relevant because words with the lowest frequency
of occurrence tend to have the greatest meaning and significance in a
document.
■ On the other hand, words with the highest frequency of occurrence,
such as "and", "the", and "a", have relatively little meaning.
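The frequency observation above can be sketched with a simple word count. The sample text is an illustrative assumption.

```python
from collections import Counter

# Frequent words ("the", "and") carry little meaning; the rare
# terms are the ones most likely to be significant.
text = ("the protein and the gene and the assay "
        "showed hemoglobin spotting")
counts = Counter(text.split())
rarest = [w for w, c in counts.items() if c == 1]
```

Here the high-frequency tokens are function words, while the singletons are the content-bearing biomedical terms.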
Figure - Documents Represented as Word Frequency Vectors. The
vector of a document under analysis (left) is compared to the
standard vector (right) that represents spotting of hemoglobin from
patients
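The vector comparison in the figure can be sketched with cosine similarity over word-frequency vectors. The vocabulary and sample documents are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two word-frequency vectors (Counters)."""
    vocab = set(a) | set(b)
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in vocab)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Standard vector for the topic vs. a document under analysis.
standard = Counter("hemoglobin spotting patient hemoglobin".split())
doc = Counter("spotting of hemoglobin from patient samples".split())
score = cosine(standard, doc)
```

A document whose vector scores close to the standard vector is assigned to that topic.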
