Introduction to Text Mining and Natural Language Processing

BIF-30806 January 2010

Judith Risse

Outline

Literature and Databases
Natural Language Processing
Information Retrieval
Question Answering
Information Extraction
Indexing
Document Classification
Exercises

Definitions

Natural Language Processing (NLP)

the study of automated generation and understanding of natural human languages (Wikipedia)

Text Mining

extract high quality (previously unknown) information from large amounts of unstructured text

Biomedical Literature

communication of scientific discoveries
peer-reviewed and community reviewed
provides additional information on experimental results
base for annotation of biological databases

Literature Databases

NCBI Bookshelf
PubMed Central
PubMed

currently 19476540 citations (Jan 27, 2010)
5414 journals in Medline
unique identifier PMID
entries contain author, journal and title info
more than 50% also abstracts
links to full-text articles
Medical Subject Headings (MeSH)

PubMed growth

[Figure: number of publications in millions (0 to 21) by year, 1950 to 2007, shown as entries per year and total number of entries. Source: NLM 2008.]

A scientific article

journal specific format: sections, print style
type of article: review, letter
document format: html, pdf

Article content

Full-text

title
authors
abstract
body
Tables
Figures
References

Biomedical Language

domain specific terminology: cytosolic, erythroid precursor
polysemic words, e.g. Drosophila gene names: coitus interruptus, lost in space
acronyms: APC (activated protein C), mdh (malate dehydrogenase)
low frequency words
anaphora (references):

Overexpression of FumRs and Frds1 resulted in the best citrate-producing strain in the presence of trace manganese concentrations. This strain gave a maximum yield of .

Biomedical Language (2)

synonyms/creating new terms
typographical variants

malic dehydrogenase
L-malate dehydrogenase
NAD-L-malate dehydrogenase
malic acid dehydrogenase
NAD-dependent malic dehydrogenase
NAD-malate dehydrogenase
NAD-malic dehydrogenase
malate (NAD) dehydrogenase
MDH
L-malate-NAD+ oxidoreductase

Natural Language Processing

create computational models of language
multi-disciplinary: information technology, linguistics, artificial intelligence, statistics
machine learning, rule-based, regular expressions
statistical properties of language
grammatical, morphological, syntactic and semantic features

Grammatical Features

Grammar

rules governing a language
syntax and morphology

Part of speech (POS)

noun, verb, adjective, adverb, preposition
depends on context in sentence
Brill tagger (Eric Brill, PhD thesis, 1993)
http://www.cst.dk/online/pos_tagger/uk/index.html
http://en.wikipedia.org/wiki/Brill_Tagger

Morphological Features

structure of words

inflection

enzyme and enzymes (plural form)
catalyse, catalyses, catalysing (verb inflection)

word-formation

earth, earthworm (compounding)
dependent, independent (derivation)

stemming and lemmatisation

reduction of words to common base form
am, are, is → be
catalyse, catalyses, catalysing → catalys

Porter Stemmer (tartarus.org/martin/PorterStemmer)
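The two reductions on this slide can be sketched in a few lines of Python. This is a toy illustration only: the suffix list and the lemma table are hypothetical choices made for this example, and the code is not the Porter algorithm, which applies a much longer, ordered set of rules.

```python
# Toy suffix stripper; NOT the Porter algorithm.
SUFFIXES = ("ing", "es", "e", "s")  # hypothetical rule list, checked in order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny lookup table standing in for a real lemmatiser's dictionary.
LEMMAS = {"am": "be", "are": "be", "is": "be"}


def toy_lemma(word: str) -> str:
    """Return the dictionary form if known, else the lower-cased word."""
    return LEMMAS.get(word.lower(), word.lower())


stems = [toy_stem(w) for w in ("catalyse", "catalyses", "catalysing")]
lemmas = [toy_lemma(w) for w in ("am", "are", "is")]
print(stems)   # all three forms share one stem
print(lemmas)  # all three forms share one lemma
```

A stem such as "catalys" need not be a real word, whereas a lemma such as "be" always is; that is the practical difference between the two approaches.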

Syntactic Features

relationships between words in a sentence

noun-phrase, verb-phrase
subject object relationships

POS Tagged Sentence

(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) (CD three) (NNS months) (IN in) (NNS rats) (WP who) (VBD were) (VBN injected) (IN in) (DT the) (NN spine) (IN with) (DT a) (NN gene) (IN that) (NNS triggers) (VBZ endorphins) (. .)

Pain - Proper singular noun
vanished - Verb, past tense
for - Preposition
at - Preposition
least - Superlative adjective
three - Cardinal number
months - Plural noun
in - Preposition
rats - Plural noun
who - wh-pronoun
were - Verb, past tense
injected - Verb, past participle
in - Preposition
the - Determiner
spine - Singular noun
with - Preposition
a - Determiner
gene - Singular noun
that - Preposition
triggers - Plural noun
endorphins - Verb, 3rd ps. sing. present
. - Final punctuation
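Tagger output in this bracketed form is easy to post-process. The sketch below assumes the "(TAG word)" layout shown on the slide and uses only the standard library; the function name `parse_tagged` is made up for this example.

```python
import re

# First part of the tagged sentence from the slide.
tagged = ("(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) "
          "(CD three) (NNS months) (IN in) (NNS rats) (. .)")

# One "(TAG word)" group: a tag, a single space, then the token.
TOKEN = re.compile(r"\((\S+) (\S+)\)")


def parse_tagged(text):
    """Return a list of (word, tag) pairs from bracketed tagger output."""
    return [(word, tag) for tag, word in TOKEN.findall(text)]


pairs = parse_tagged(tagged)
# Every tag starting with NN is some kind of noun (NN, NNS, NNP).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```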

Semantic Features

meaning of words given the context
dictionaries, thesauri
Gene Ontology

Contextual Analysis

Guilt by association

Co-occurrence analysis
bag of words
statistical analysis of word frequency

Word frequency
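A minimal sketch of the bag-of-words and co-occurrence ideas, using only the standard library. The three example sentences are invented for illustration, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations

# Invented toy "documents"; real input would be abstracts.
docs = [
    "mdh encodes malate dehydrogenase",
    "malate dehydrogenase activity requires NAD",
    "mdh expression and NAD levels",
]


def bag_of_words(text):
    """Word counts for one document; word order is discarded."""
    return Counter(text.lower().split())


# Word frequency over the whole collection.
word_freq = sum((bag_of_words(d) for d in docs), Counter())

# Co-occurrence: how many documents contain both terms of a pair.
cooccurrence = Counter()
for d in docs:
    terms = sorted(set(d.lower().split()))
    cooccurrence.update(combinations(terms, 2))

print(word_freq.most_common(3))
print(cooccurrence[("dehydrogenase", "malate")])
```

Terms that co-occur in many documents are candidates for "guilt by association", although raw counts would normally be compared against what chance alone predicts.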

Exercise 1

take a gene/protein name of your interest
query PubMed and retrieve 1 abstract
Take a look at what the Porter stemmer does using the abstract
Describe what problems might occur from stemming

Porter Stemmer

http://maya.cs.depaul.edu/~classes/ds575/porter.html

Coffee Break

Tasks of NLP

Information Extraction (IE)
Question Answering (QA)
Information Retrieval (IR)
machine translation
text proofing
speech recognition
optical character recognition (OCR)

Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
(Introduction to Information Retrieval, Cambridge University Press, 2008)

Indexing

Tokenization
Case folding (TNFalpha, Tnfalpha → tnfalpha)
Stemming
Stop-word removal (e.g. at, be, from, this)

Queries: Boolean queries, vector space model
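The indexing steps can be chained into a small preprocessing function. This is a sketch under simple assumptions: tokens are runs of letters and digits, the stop-word list contains only the slide's four examples, and stemming is left out.

```python
import re

STOP_WORDS = {"at", "be", "from", "this"}  # only the examples from the slide


def preprocess(text):
    """Tokenise, case-fold and drop stop words; returns index terms."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)   # tokenisation
    folded = [t.lower() for t in tokens]          # case folding
    return [t for t in folded if t not in STOP_WORDS]


terms = preprocess("TNFalpha and Tnfalpha from this sample")
print(terms)
```

Both spellings of the gene name end up as the same index term, which is the purpose of case folding.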

Zipf's Law

A small number of words occur very often
Those high frequency words are often function words (e.g. prepositions)
Most words have a low frequency

Boolean Queries

Combination of query terms with boolean operators

AND OR NOT

Google, PubMed
high recall, low precision
unranked result
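Boolean retrieval maps directly onto set operations over the document ids stored for each term. The index contents below are invented for illustration.

```python
# Hypothetical index: term -> set of ids of documents containing it.
index = {
    "prostaglandin": {1, 2, 4},
    "inhibit": {2, 3, 4},
    "review": {4},
}

and_hits = index["prostaglandin"] & index["inhibit"]   # AND = intersection
or_hits = index["prostaglandin"] | index["inhibit"]    # OR  = union
not_hits = and_hits - index["review"]                  # NOT = set difference

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

A document either matches or it does not, so the result is a set with no ranking among the hits.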

The vector space model

term weight = (1 + log TF) × log(N / DF)

term frequency (TF)
inverse document frequency (IDF) = log(N / DF), where DF is the number of documents containing the term
corpus size (N)

the vector points in word space
each dimension corresponds to a word or phrase
Nat Rev Gen (2002):3 pp 601-610
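A worked version of the term weight, as a sketch. The slide does not state the base of the logarithm; base 10 is assumed here, and returning 0 for an absent term is a convention chosen for this example.

```python
import math


def term_weight(tf, df, n_docs):
    """(1 + log TF) * log(N / DF), with base-10 logs (an assumption)."""
    if tf == 0 or df == 0:
        return 0.0  # term absent from the document or from the corpus
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


# A term occurring 10 times in a document and found in 100 of 1000 documents.
print(term_weight(tf=10, df=100, n_docs=1000))
# A term found in every document carries no weight.
print(term_weight(tf=10, df=1000, n_docs=1000))
```

The first factor dampens the effect of repeated occurrences, and the second rewards terms that are rare in the collection.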

IR Evaluation

A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
(Introduction to Information Retrieval, Cambridge University Press, 2008)

document collection
test cases of information need, as queries
measure of relevance

Evaluation (2)

Precision

What fraction of the returned results are relevant to the information need?

Recall

What fraction of the relevant documents in the collection were returned by the system?

F-score

harmonic mean of precision and recall: F = 2pr / (p + r)
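The three measures can be computed from two sets of document ids. A minimal sketch; the ids in the example call are invented.

```python
def evaluate(returned, relevant):
    """Return (precision, recall, F-score) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)               # relevant AND returned
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0     # harmonic mean
    return p, r, f


# 4 documents returned, 3 relevant in the collection, 2 in common.
p, r, f = evaluate(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(round(p, 3), round(r, 3), round(f, 3))
```

Here precision is 2/4, recall is 2/3 and the F-score is 4/7, which lies closer to the lower of the two values, as a harmonic mean does.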

Exercise 2

Compare the retrieval of abstracts between PubMed and Phasar (www.bioinformatics.nl/biometa/applet.html or twoquid.cs.ru.nl/applet.html) given the question:

What does prostaglandin inhibit?

How many results do you get?
Give examples of answers to the question.
Give 5 PMIDs of papers you would read given the results in each search.
Which of the systems was more helpful and why?

Coffee Break

Question Answering

question posed in human language
answer extracted from unstructured text
more developed in generic domain
difficult in biomedical domain

Information Extraction & Text Mining

extract structured information from unstructured text
Named Entity Recognition
identify relationships, e.g. protein-protein interactions

Information Extraction

extract meaning from a text
combines:

pos-tagging
ontologies
regular expressions

Nat Rev Gen (2002):3 pp 601-610

Named Entity Recognition

tagging of biological entities
high precision in generic NLP (0.9 F-score)
difficult in biology

gene/protein names: complex terms, synonyms, disambiguation, typographical variations
gene symbols: no use of official symbols

Challenges of NLP

Abbreviation

punctuation can be confused with end of sentence
Wash. (Washington) vs. wash.

Decimal points
Apostrophes: to split or not to split?
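The abbreviation and decimal-point problems show up as soon as a text is split into sentences. The sketch below contrasts a naive split on full stops with a token-based splitter that consults an abbreviation list; the list and the example text are invented for illustration.

```python
import re

ABBREVIATIONS = {"wash.", "mr.", "e.g."}  # hypothetical, far from complete


def split_sentences(text):
    """End a sentence at a token ending in '.' unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Brown moved to Wash. last year. The yield was 0.9 overall."
naive = re.split(r"\.\s+", text)   # breaks after "Mr" and "Wash"
better = split_sentences(text)
print(len(naive), len(better))
```

The list-based splitter still fails when an abbreviation really does end a sentence, or when "wash." is the verb at the end of one, which is why this remains a challenge.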

Challenges (2)

hyphens

single or multiple words?
data-base vs. data base vs. database
carry-over?

simple stemming

operate, operating, operates, operation, operative, operatives, operational → oper

case folding

brown car vs. Mr. Brown

Anaphora

co-references
one expression referring to another

The monkey took the banana and ate it.

strictly only local antecedent statements

Sortal anaphora

this gene, the virus

resolution required for increased recall

Exercise 3

compare NER programmes

retrieve one PubMed abstract
http://biocreative.sourceforge.net/bionlp_tools_links.html

NLProt
TerMine
Whatizit (http://www.ebi.ac.uk/webservices/whatizit/info.jsf)

What are the differences in recognized entities?
Do they miss any obvious entities?

Indexing

Inverted Index (Inverted File)

for each word in the collection (dictionary) list occurrence and frequency

size of index is proportional to size of corpus
remove stopwords, use stemming for more efficient index
classic version is a boolean index
can also contain positional information
sparse matrix

Example

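Each entry lists: term, number of docs containing the term, document ids, total # of occurrences, term positions in counted words.

deterministic
  number of docs containing the term: 20
  document ids: 73 89 90 106 173 194 233 243 251 252 255 257 258 267 276 281 304 312 315 326
  total # of occurrences: 27
  term positions: 36822 44643 45285 53003 53061 86740 86743 97082 116618 121984 125750 125952 125968 126039 127633 128882 128978 129048 133781 133789 138493 140946 140947 152011 156191 157881 163490

deterrence
  number of docs containing the term: 1
  document ids: 60
  total # of occurrences: 4
  term positions: 30309 30345 30444 30452

detonation
  number of docs containing the term: 2
  document ids: 263 264
  total # of occurrences: 4
  term positions: 131781 131956 131995 132303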
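An index entry of this shape can be built with nested dictionaries. This is a sketch: the three documents are invented, and positions are counted in words across the whole collection, which is one reading of "term position in counted words".

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [word positions counted over the collection]}."""
    index = defaultdict(lambda: defaultdict(list))
    position = 0
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id].append(position)
            position += 1
    return index


# Invented documents, keyed by document id.
docs = {
    60: "deterrence by detonation",
    263: "detonation test",
    264: "detonation detonation",
}
index = build_index(docs)

entry = index["detonation"]
doc_freq = len(entry)                                    # docs containing the term
doc_ids = sorted(entry)                                  # document ids
total = sum(len(pos) for pos in entry.values())          # total occurrences
positions = [p for d in doc_ids for p in entry[d]]       # term positions
print(doc_freq, doc_ids, total, positions)
```

Dropping the position lists and keeping only the document ids gives the classic boolean index.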

Suffix Array

A suffix array is an array that contains all the pointers to the text suffixes listed in lexicographical order.

Text is seen as one long string
A text suffix is a substring from a given position till the end of the string
position refers to beginning of word
return all occurrences of string W in large text A

Example: the word abracadabra

1. create all suffixes
2. sort suffixes alphabetically
3. resulting suffix array

Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring
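The three steps for abracadabra can be sketched as follows. Positions are 0-based and every character position is indexed, as in the single-word example; for a large text the slide restricts positions to word beginnings. Sorting full suffix strings is fine for a demonstration but too slow and memory-hungry for a real corpus.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographical order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the block of suffixes that begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "abracadabra"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "abra"))
```

Because the suffixes are sorted, all suffixes that begin with the query sit next to each other, so two binary searches find every occurrence.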

Document Classification

assign a document to a class given its content

manual (ad hoc)
rule-based
decision tree
machine learning approaches

Statistical Text Classification

training documents for each class
supervised learning
test data or new data
training data and test data have to be similar

Naïve Bayes

Naïve: all words in text are considered independent
Bayes: uses Bayes' theorem

P(A | B) = P(B | A) P(A) / P(B)

prior probability: P(A)
posterior probability: P(A | B)

Basic Probability Theory

Given A represents an event, the probability of A occurring is 0 ≤ P(A) ≤ 1
Joint probability P(A,B) = P(A ∩ B)
Conditional probability P(A | B)
Chain rule P(A,B) = P(A | B)P(B) = P(B | A)P(A)

Application to Document Classification

probability of a word belonging to category C
probability of a document belonging to category C given its words

wikipedia.org
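A compact sketch of a naïve Bayes text classifier built from these two probabilities. The training sentences are invented, the class and method names are made up for this example, and add-one (Laplace) smoothing is an addition not mentioned on the slides; it keeps unseen words from producing a zero probability.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word counts per category
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # log prior: P(C)
            score = math.log(self.doc_counts[label] / total_docs)
            n_words = sum(self.word_counts[label].values())
            for w in text.lower().split():
                # log P(word | C) with add-one smoothing (an assumption)
                count = self.word_counts[label][w] + 1
                score += math.log(count / (n_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("rugby", "scrum try tackle")
nb.train("rugby", "tackle lineout scrum")
nb.train("tennis", "serve ace racket")
nb.train("tennis", "racket volley serve")
print(nb.predict("a hard tackle in the scrum"))
print(nb.predict("ace serve"))
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer documents. P(B), the probability of the document itself, is the same for every category and can be left out when only the best category is needed.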

Coffee Break

Exercise 4

Try to apply naïve Bayes to a selection of sentences using

http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes/ with rugby.txt and tennis.txt as training and test data. If you have it implemented, try using this in combination with the Porter Stemmer (http://bionlp.stanford.edu/bionlp.pl)

Added Challenge

From sequence to abstract to NER

MSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQR DEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGM DLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAE LKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL

retrieve UniprotID via BLAST (take best hit)
retrieve gene name using getz (GeneName field)
retrieve relevant abstracts from PubMed in Medline format using eSearch and eFetch with the gene name
extract all protein/gene names from these abstracts

http://bionlp.stanford.edu/webservices.html

how do they relate to the original protein?
compare to the output of ebiMed using the gene name (http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp)
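The eSearch and eFetch step can be sketched by building the two request URLs for the NCBI E-utilities. Only the URLs are constructed here and nothing is fetched. The gene name and PMIDs in the example calls are placeholders, not the answer to the challenge, and the helper names are made up for this example.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(gene_name, retmax=20):
    """URL that asks PubMed for the PMIDs matching a gene name."""
    params = {"db": "pubmed", "term": gene_name, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """URL that returns the given PMIDs as Medline-format text."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "medline",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("mdh"))            # placeholder gene name
print(efetch_url(["111", "222"]))    # placeholder PMIDs
```

The eSearch response is XML containing the matching PMIDs; those ids are then passed to eFetch. The URLs can be opened with `urllib.request.urlopen` or pasted into a browser.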

Helpful resources
http://www-nlp.stanford.edu/links/statnlp.html
http://nlp.stanford.edu/IRbook/html/htmledition/mybook.html
www.biocreative.org
Drosophila gene names: http://www.curioustaxonomy.net/gene/fly.html

Introduction to Information Retrieval, Cambridge University Press, ISBN 978-0-521-86571-5
The Text Mining Handbook, Cambridge University Press, ISBN-13 978-0-521-83657-9
