Pattern Mining for Chinese Unknown Word Extraction — 楊傑程 (CSIE, 3rd-year master's student, 955202037), 2008/10/14
Outline Introduction Related Works Unknown Word Detection Unknown Word Extraction Experiments Conclusions
Introduction With the growing popularity of Chinese, Chinese text processing has drawn a great amount of interest in recent years. Before the knowledge in Chinese texts can be utilized, some preprocessing such as Chinese word segmentation must be done: there are no blanks to mark word boundaries in Chinese text.
Introduction Chinese word segmentation encounters two major problems: ambiguity and unknown words. Ambiguity: one un-segmented Chinese character string can have different segmentations depending on context. Unknown words: also known as out-of-vocabulary (OOV) words, mostly unfamiliar proper nouns or newly coined words. Ex: the sentence “王義氣熱衷於研究生命” would be segmented into “王 義氣 熱衷 於 研究 生命” because “王義氣” is an uncommon personal name that is not in the vocabulary.
Introduction- types of unknown words In this paper, we focus on the Chinese unknown word problem. Types of Chinese unknown words: Proper names — organization names (Ex: 華碩電腦), personal names (Ex: 王小明), abbreviations (Ex: 中油、中大); Derived words (Ex: 總經理、電腦化); Compounds (Ex: 電腦桌、搜尋法); Numeric-type compounds (Ex: 1986 年、19 巷).
Introduction- unknown word identification Chinese word segmentation process: Initial segmentation (dictionary-assisted) — correctly identified words are called known words, while unknown words are wrongly segmented into two or more parts. Ex: the personal name 王小明 becomes 王 小 明 after initial segmentation. Unknown word identification — characters belonging to one unknown word should be re-combined. Ex: re-combine 王 小 明 into 王小明.
Introduction- unknown word identification How does unknown word identification work? A character can be a word (馬) or part of an unknown word (馬 + 英 + 九). Unknown word detection: find detection rules to distinguish monosyllabic words from monosyllabic morphemes. Unknown word extraction: focus on the detected morphemes and combine them.
Introduction- applied techniques In this paper, we apply continuity pattern mining to discover unknown word detection rules. Then we apply machine-learning-based methods (classification algorithms and sequential learning) to extract unknown words, utilizing syntactic information, context information, and heuristic statistical information. Our unknown word identification method is a general method, not limited to specific types of unknown words.
Related Works- particular methods Research on Chinese word segmentation has been going on for over a decade. At first, researchers applied different kinds of information to discover particular kinds of unknown words. Proper nouns (Chinese personal names, transliteration names, organization names) <[Chen & Li, 1996], [Chen & Chen, 2000]>: patterns, frequency, context information.
Related Works- general methods (Rule-based) Later, researchers turned to methods for extracting all kinds of unknown words. Rule-based detection and extraction: <[Chen et al., 1998]> distinguish monosyllabic words from monosyllabic morphemes. <[Chen et al., 2002]> combine morphological rules with statistical rules to extract personal names, transliteration names, and compound nouns (Precision: 89%, Recall: 68%). <[Ma et al., 2003]> utilize the context-free-grammar concept, propose a bottom-up merging algorithm, and adopt morphological rules and general rules to extract all kinds of unknown words (Precision: 76%, Recall: 57%).
Related Works- general methods (Machine Learning-based) Sequential learning: <[T. G. Dietterich, 2002]> transform the sequential learning problem into a classification problem. Direct methods, like HMM and CRF: <[Goh et al., 2006]> HMM+SVM (Precision: 63.8%, Recall: 58.3%); <[Tsai et al., 2006]> CRF (Recall: 73%). Indirect methods, like sliding windows and recurrent sliding windows.
Related Works – Imbalanced Data The imbalanced data problem. Ensemble method <C. Li, 2007>: combine the learning ability of multiple base classifiers using voting. Cost-sensitive learning and sampling <G. M. Weiss et al., 2007>: focus more on minority-class examples. <C. Drummond et al., 2003>: under-sampling is more sensitive than over-sampling. <[Seyda et al., 2007]>: select the most informative instances.
Unknown Word Detection & Extraction Our idea is similar to [Chen et al., 2002]. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning-based classification algorithms and sequential learning (indirect). We call unknown word detection “Phase 1” and unknown word extraction “Phase 2”.
Unknown Word Detection & Extraction System flow. Phase 1, detection rule mining: 8/10 of the corpus, after initial segmentation and POS tagging, is fed to the mining tool (Prowl) to derive detection rules; 1/10 of the corpus is used as validation data to judge the rules. Phase 2, machine-learning classification: the 8/10 corpus with detection tags is used for training the model; 1/10 of the corpus (after initial segmentation, POS tagging, and adding detection tags) is used for testing, with another 1/10 for validation, before the final classification decision.
Unknown Word Detection Mine detection rules from 8/10 of the corpus by continuity pattern mining, focusing on monosyllables.
Unknown word detection- Pattern Mining Sequential pattern: “因為…, 所以…” — required items must appear in the pattern order, but noise is allowed between them. Continuity pattern: “打 * 球” => “打棒球” matches, “打躲避球” does not — a strict definition of each item and its position, which allows efficient pattern mining.
Unknown word detection- Continuity Pattern Mining Prowl <[Huang et al., 2004]>: starts with 1-frequent patterns; extends to 2-patterns by joining two adjacent 1-frequent patterns, then evaluates their frequency; iteratively extends to longer patterns.
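The level-wise extension above can be sketched as follows. This is a minimal re-implementation of the idea, not the actual Prowl tool: contiguous patterns only, each frequent k-pattern extended by an adjacent frequent 1-pattern.

```python
from collections import Counter

def mine_continuity_patterns(sequences, min_support):
    """Level-wise continuity-pattern mining in the spirit of Prowl
    [Huang et al., 2004]: patterns must be contiguous (no gaps)."""
    counts = Counter(item for seq in sequences for item in seq)
    freq1 = {(item,) for item, c in counts.items() if c >= min_support}
    results, frequent, k = set(freq1), freq1, 1
    while frequent:
        # Extend each frequent k-pattern with the frequent 1-pattern
        # that immediately follows it, then re-count the candidates.
        cand = Counter()
        for seq in sequences:
            for i in range(len(seq) - k):
                p = tuple(seq[i:i + k + 1])
                if p[:-1] in frequent and (p[-1],) in freq1:
                    cand[p] += 1
        frequent = {p for p, c in cand.items() if c >= min_support}
        results |= frequent
        k += 1
    return results
```

Because every sub-pattern of a frequent continuity pattern is itself frequent, the candidate space stays small and mining remains efficient.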
Encoding After initial segmentation, label each word by lexicon matching: known (Y) or unknown (N). “葡萄” is in the lexicon => “葡萄” is labeled as a known word (Y). “葡萄皮” is not in the lexicon => “葡萄皮” is labeled as an unknown word (N). Encoding examples: 葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y; 葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N.
Create Detection Rules Rule pattern: (character, pos, label), max length = 3; the character within “{ }” is the primary character of the rule. Ex: ({葡}, 萄): “葡” is a known word when “葡萄” appears. Rule accuracy, Ex: ({葡 (Na)}, 萄 (Na)) = P(葡 (Na) is a known word | (葡 (Na), 萄 (Na)) appears). Counts: (葡 (Na), 萄 (Na)): 2; (葡 (Na) Y, 萄 (Na)): 1; (葡 (Na) N, 萄 (Na) N): 1; (葡 (Na) Y, 萄 (Na) Y): 1; (葡, 萄): 2; (葡 (Na), 萄): 2; (葡, 萄 (Na)): 2.
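Rule accuracy is just a conditional probability estimated from counts. A minimal sketch (the observation format is illustrative, not from the paper):

```python
def rule_accuracy(occurrences, rule):
    """Accuracy of a detection rule = P(primary char is a known word, Y
    | rule pattern matched). `occurrences` is a list of
    (pattern, primary_label) observations gathered from the corpus."""
    matched = [label for pat, label in occurrences if pat == rule]
    if not matched:
        return 0.0
    return matched.count("Y") / len(matched)

# The slide's example: (葡(Na), 萄(Na)) occurs twice — once with 葡 as a
# known word (in 葡萄) and once inside an unknown word (in 葡萄皮).
obs = [(("葡(Na)", "萄(Na)"), "Y"), (("葡(Na)", "萄(Na)"), "N")]
print(rule_accuracy(obs, ("葡(Na)", "萄(Na)")))  # → 0.5
```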
Unknown Word Extraction Machine Learning Classification Sequential learning
Unknown Word Extraction-  feature ( Pos) We use TnT POS tagger to detect part-of-speech (pos) tags of terms. Kinds of pos tags : Nouns (Na, Nb,…) Verbs (VA, VB, VC,…) Adjectives (A…) Punctuations (Comma, Period,…) …
Unknown Word Extraction- feature (term_attribute) After initial segmentation and applying the detection rules, each term carries a “term_attribute” label. The six term_attributes are: ms() — monosyllabic word, Ex: 你、我、他; ms(?) — morpheme of an unknown word, Ex: “王”, “小”, “明” in “王小明”; ds() — disyllabic word, Ex: 學校; ps() — polysyllabic word, Ex: 筆記型電腦; dot() — punctuation, Ex: “,”, “。”; none() — no such information, or a new term. Target of unknown word extraction: at least one ms(?). Example: 運動會 ps() ‧ dot() 四年 ds() 甲班 ds() 王 ms(?) 姿 ms(?) 分 ms(?) ‧ dot() 本校 ds() 為 ms() 響 ms() 應 ms()
Data Processing- Sliding Window Sequential supervised learning, indirect method: transform sequential learning into classification learning via sliding windows. We build three SVM models to extract unknown words of different lengths, n = 2, 3, 4. Each time we take n+2 terms (the n-gram plus a prefix term and a suffix term) as one window, then shift one token to the right to generate the next window, and so on. A window is kept only if its n-gram contains at least one ms(?). Window layout for n = 3: prefix t0 | 3-gram t1 t2 t3 | suffix t4.
EX: 3-gram Model Sentence: 運動會() ‧() 四年() 甲班() 王(?) 姿(?) 分(?) ‧() 本校() 為() 響() 應(). Windows (prefix | 3-gram | suffix → label): 運動會 ‧ 四年 甲班 王(?) → discard (no ms(?) in the 3-gram); ‧ 四年 甲班 王(?) 姿(?) → negative; 四年 甲班 王(?) 姿(?) 分(?) → negative; 甲班 王(?) 姿(?) 分(?) ‧ → positive (the 3-gram 王姿分 is the unknown word); 王(?) 姿(?) 分(?) ‧ 本校 → negative.
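The window generation above can be sketched as follows; this is a minimal illustration of the sliding-window step, with function and variable names of my own choosing:

```python
def make_windows(terms, attrs, n=3):
    """Recast sequence labeling as classification via sliding windows:
    each window is prefix + n-gram + suffix (n + 2 terms). Windows whose
    n-gram contains no detected morpheme ms(?) are discarded."""
    windows = []
    for i in range(len(terms) - (n + 2) + 1):
        window = terms[i:i + n + 2]
        ngram_attrs = attrs[i + 1:i + 1 + n]   # attributes of the middle n terms
        if "ms(?)" in ngram_attrs:             # keep only candidate n-grams
            windows.append(window)
    return windows

terms = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校"]
attrs = ["ps()", "dot()", "ds()", "ds()", "ms(?)", "ms(?)", "ms(?)", "dot()", "ds()"]
for w in make_windows(terms, attrs):
    print(w)   # the first window (no ms(?) in its 3-gram) is discarded
```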
Unknown Word Extraction- feature (Statistical Information) Statistical information (exemplified by the 3-gram model, window layout prefix t0 | 3-gram t1 t2 t3 | suffix t4): frequency of the 3-gram; p(prefix | 3-gram), e.g. p(t0 | t1~t3); p(suffix | 3-gram), e.g. p(t4 | t1~t3); p(first term | other n-1 consecutive terms), e.g. p(t1 | t2~t3); p(last term | other n-1 preceding terms), e.g. p(t3 | t1~t2); pos_freq(prefix) / pos_freq(prefix in positive training examples); pos_freq(suffix) / pos_freq(suffix in positive training examples).
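The conditional-probability features can be estimated from raw n-gram counts. A sketch, assuming a hypothetical `counts` store that maps term tuples to corpus frequencies (the POS-frequency ratios are omitted since they need a separate POS count table):

```python
def stat_features(counts, prefix, ngram, suffix):
    """Heuristic statistics for one window; names are illustrative."""
    def cond(joint, given):
        # conditional probability p(joint | given) from raw counts
        return counts.get(joint, 0) / counts[given] if counts.get(given) else 0.0
    return {
        "freq": counts.get(ngram, 0),                    # frequency of the n-gram
        "p(prefix|ngram)": cond((prefix,) + ngram, ngram),
        "p(suffix|ngram)": cond(ngram + (suffix,), ngram),
        "p(first|rest)": cond(ngram, ngram[1:]),         # p(t1 | t2~tn)
        "p(last|preceding)": cond(ngram, ngram[:-1]),    # p(tn | t1~tn-1)
    }
```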
Data presentation Feature vector format for machine learning. Each term is encoded by its term_attribute (6 dimensions) and its pos (55 dimensions); the vector concatenates the encodings of the prefix, t1, t2, …, and the suffix, followed by the 7 statistical features.
Experiments Unknown word detection. Unknown word extraction.
Unknown Word Detection 8/10 of the balanced corpus (460m words) as training data, mined with the pattern mining tool Prowl [Huang et al., 2004]; 1/10 of the balanced corpus as validation data. Accuracy and frequency are used as thresholds for the detection rules. On the remaining 1/10 of the balanced corpus (the real test data for Phase 2): 60.3% precision and 93.6% recall.

Threshold (Accuracy) | Precision | Recall | F-measure (our system) | F-measure (AS system)
0.70 | 0.9324 | 0.4305 | 0.5890 | 0.7125
0.80 | 0.9008 | 0.5289 | 0.6665 | 0.7524
0.90 | 0.8343 | 0.7148 | 0.7699 | 0.7696
0.95 | 0.7640 | 0.8288 | 0.7951 | 0.7655
0.98 | 0.6860 | 0.8786 | 0.7704 | 0.7440

Freq >= | Precision | Recall | F-measure
3 | 0.7640 | 0.8288 | 0.7951
7 | 0.7113 | 0.8819 | 0.7875
11 | 0.6924 | 0.8932 | 0.7801
19 | 0.6736 | 0.8995 | 0.7703
29 | 0.6552 | 0.9092 | 0.7616
Unknown Word Extraction 8/10 of the balanced corpus (460m words) as training data; 1/10 of the balanced corpus as testing data. Imbalanced data solution: ensemble method (voting) + random under-sampling. Another 1/10 of the balanced corpus is used as validation data to find the sampling ratio (positive : negative): 2-gram 1:2, 3-gram 1:3, 4-gram 1:6.
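The under-sampling-plus-voting scheme can be sketched as follows. This is a generic illustration under assumed interfaces (each base classifier is any callable returning a class label), not the actual SVM setup:

```python
import random

def under_sample(pos, neg, ratio, rng=random):
    """Randomly under-sample the majority (negative) class so that
    positive : negative = 1 : ratio, as found on the validation set."""
    k = min(len(neg), ratio * len(pos))
    return pos + rng.sample(neg, k)

def ensemble_vote(classifiers, x):
    """Ensemble decision: majority vote over the base classifiers,
    each trained on a different under-sampled subset (C1, C2, ...)."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)
```

Training several base classifiers on different random negative subsets lets the ensemble see most of the negative data while each individual learner trains on a balanced set.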
Unknown Word Extraction Judging overlap and conflict among different combinations of unknown words: <[Chen et al., 2002]> use frequency(w) * length(w). Ex: “律師 班 奈 特” => compare freq(律師+班)*3 with freq(班+奈+特)*3. Our method: first solve overlap between identical-length n-grams by P(combine | overlap). Ex: “單 親 家庭”: compare P(單親 | 親) with P(親家庭 | 親). Then solve conflict between different-length n-grams by real frequency: freq(X) - freq(Y) if X is included in Y. Ex: X = “醫學”, “學院”; Y = “醫學院”.
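The two resolution steps can be sketched as below; the function names and the probability store are illustrative, not from the paper:

```python
def resolve_overlap(cand_a, cand_b, p_combine):
    """Two same-length candidates share a morpheme (e.g. 單親 vs 親家庭
    sharing 親): keep the one with the higher P(combine | overlap)."""
    return cand_a if p_combine[cand_a] >= p_combine[cand_b] else cand_b

def real_freq(word, freq, all_words):
    """Real frequency for conflicts between different lengths: discount
    occurrences absorbed by longer candidates that contain `word`,
    e.g. freq(醫學) minus freq(醫學院) since 醫學 is included in 醫學院."""
    return freq[word] - sum(freq[w] for w in all_words
                            if word in w and w != word)

freq = {"醫學": 10, "學院": 8, "醫學院": 6}
print(real_freq("醫學", freq, freq.keys()))  # → 4
```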
Extraction result Comparison: <[Ma et al., 2003]> morphological rules + statistical rules + context-free grammar rules: Precision 76%, Recall 57%. Our result:

n-gram | Precision | Recall | F1-score
2-gram | 56.7% | 67.1% | 0.614
3-gram | 63.3% | 80.0% | 0.707
4-gram | 30.6% | 70.3% | 0.426
Total | 58.1% | 68.2% | 0.627
Ensemble Method Improvement

Classifier | 2-gram P / R / F1 | 3-gram P / R / F1 | 4-gram P / R / F1
C1 | 0.518 / 0.640 / 0.572 | 0.542 / 0.808 / 0.649 | 0.252 / 0.419 / 0.315
C2 | 0.569 / 0.657 / 0.610 | 0.627 / 0.791 / 0.700 | 0.219 / 0.743 / 0.338
C3 | 0.535 / 0.633 / 0.580 | 0.563 / 0.810 / 0.664 | 0.222 / 0.378 / 0.280
C4 | 0.557 / 0.645 / 0.598 | 0.574 / 0.796 / 0.667 | 0.305 / 0.676 / 0.420
C5 | 0.555 / 0.660 / 0.603 | 0.549 / 0.779 / 0.644 | 0.205 / 0.554 / 0.299
C6 | 0.536 / 0.636 / 0.582 | 0.568 / 0.735 / 0.641 | 0.230 / 0.608 / 0.333
C7 | 0.557 / 0.660 / 0.604 | 0.611 / 0.691 / 0.648 | 0.211 / 0.703 / 0.325
C8 | 0.541 / 0.673 / 0.600 | 0.579 / 0.813 / 0.676 | 0.226 / 0.486 / 0.309
C9 | 0.548 / 0.657 / 0.598 | 0.587 / 0.715 / 0.645 | 0.215 / 0.635 / 0.321
C10 | 0.543 / 0.661 / 0.596 | 0.599 / 0.723 / 0.655 | 0.232 / 0.662 / 0.344
C11 | 0.533 / 0.668 / 0.593 | 0.607 / 0.740 / 0.667 | 0.240 / 0.554 / 0.335
C12 | 0.538 / 0.645 / 0.587 | 0.587 / 0.776 / 0.669 | 0.299 / 0.662 / 0.412
C_average | 0.544 / 0.653 / 0.594 | 0.583 / 0.765 / 0.660 | 0.238 / 0.590 / 0.336
C_ensemble | 0.567 / 0.671 / 0.614 | 0.633 / 0.800 / 0.707 | 0.306 / 0.703 / 0.426
Experiment- One phase What if we skip the unknown word detection phase? Two phases do work better:

Classification | Precision | Recall | F-score
One Phase | 40.8% | 71.4% | 0.520
Two Phases | 58.1% | 68.2% | 0.627
Conclusions We adopt a two-phase method to solve the unknown word problem. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning-based classification algorithms and sequential learning (indirect), with an imbalanced data solution. Our experiments show that two phases work better than one phase. Future work: utilize machine learning in detection; utilize more information (patterns, rules) to improve extraction precision.
Ad

Recommended

Unknown Word 08
Unknown Word 08
Jason Yang
 
Logic programming (1)
Logic programming (1)
Nitesh Singh
 
Chaps 1-3-ai-prolog
Chaps 1-3-ai-prolog
saru40
 
10 logic+programming+with+prolog
10 logic+programming+with+prolog
baran19901990
 
Framester and WFD
Framester and WFD
Aldo Gangemi
 
Information extraction for Free Text
Information extraction for Free Text
butest
 
Artificial intelligence and first order logic
Artificial intelligence and first order logic
parsa rafiq
 
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic Rules
Sho Takase
 
Lecture: Summarization
Lecture: Summarization
Marina Santini
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
Vsevolod Dyomkin
 
PROLOG: Introduction To Prolog
PROLOG: Introduction To Prolog
DataminingTools Inc
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
Marina Santini
 
Language Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Knowledge Patterns SSSW2016
Knowledge Patterns SSSW2016
Aldo Gangemi
 
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
Rommel Carvalho
 
Plc part 4
Plc part 4
Taymoor Nazmy
 
Crash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Bhaskar Mitra
 
L3 v2
L3 v2
Ekaterina Chernyak
 
Predicate calculus
Predicate calculus
Rajendran
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
KozoChikai
 
1909 paclic
1909 paclic
WarNik Chow
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
WarNik Chow
 
Information Extraction
Information Extraction
Rubén Izquierdo Beviá
 
A look inside the distributionally similar terms
A look inside the distributionally similar terms
Kow Kuroda
 
First order logic
First order logic
Chinmay Patel
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language Semantics
Daisuke BEKKI
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法
台灣資料科學年會
 
LiDAR processing for road network asset inventory
LiDAR processing for road network asset inventory
Conor Mc Elhinney
 
Object segmentation in images using EEG signals
Object segmentation in images using EEG signals
Universitat Politècnica de Catalunya
 

More Related Content

What's hot (20)

Lecture: Summarization
Lecture: Summarization
Marina Santini
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
Vsevolod Dyomkin
 
PROLOG: Introduction To Prolog
PROLOG: Introduction To Prolog
DataminingTools Inc
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
Marina Santini
 
Language Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Knowledge Patterns SSSW2016
Knowledge Patterns SSSW2016
Aldo Gangemi
 
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
Rommel Carvalho
 
Plc part 4
Plc part 4
Taymoor Nazmy
 
Crash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Bhaskar Mitra
 
L3 v2
L3 v2
Ekaterina Chernyak
 
Predicate calculus
Predicate calculus
Rajendran
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
KozoChikai
 
1909 paclic
1909 paclic
WarNik Chow
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
WarNik Chow
 
Information Extraction
Information Extraction
Rubén Izquierdo Beviá
 
A look inside the distributionally similar terms
A look inside the distributionally similar terms
Kow Kuroda
 
First order logic
First order logic
Chinmay Patel
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language Semantics
Daisuke BEKKI
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法
台灣資料科學年會
 
Lecture: Summarization
Lecture: Summarization
Marina Santini
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
Vsevolod Dyomkin
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
Marina Santini
 
Language Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Knowledge Patterns SSSW2016
Knowledge Patterns SSSW2016
Aldo Gangemi
 
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
Rommel Carvalho
 
Crash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Bhaskar Mitra
 
Predicate calculus
Predicate calculus
Rajendran
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
KozoChikai
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
WarNik Chow
 
A look inside the distributionally similar terms
A look inside the distributionally similar terms
Kow Kuroda
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language Semantics
Daisuke BEKKI
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法
台灣資料科學年會
 

Viewers also liked (10)

LiDAR processing for road network asset inventory
LiDAR processing for road network asset inventory
Conor Mc Elhinney
 
Object segmentation in images using EEG signals
Object segmentation in images using EEG signals
Universitat Politècnica de Catalunya
 
A machine learning approach to building domain specific search
A machine learning approach to building domain specific search
Niharjyoti Sarangi
 
Wearable Computing - Part III: The Activity Recognition Chain (ARC)
Wearable Computing - Part III: The Activity Recognition Chain (ARC)
Daniel Roggen
 
Text independent speaker recognition system
Text independent speaker recognition system
Deepesh Lekhak
 
Automatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approach
Abdullah al Mamun
 
Module15: Sliding Windows Protocol and Error Control
Module15: Sliding Windows Protocol and Error Control
gondwe Ben
 
Track 1 session 1 - st dev con 2016 - contextual awareness
Track 1 session 1 - st dev con 2016 - contextual awareness
ST_World
 
Track 2 session 1 - st dev con 2016 - avnet - making things real
Track 2 session 1 - st dev con 2016 - avnet - making things real
ST_World
 
Digital Image Processing
Digital Image Processing
Sahil Biswas
 
LiDAR processing for road network asset inventory
LiDAR processing for road network asset inventory
Conor Mc Elhinney
 
A machine learning approach to building domain specific search
A machine learning approach to building domain specific search
Niharjyoti Sarangi
 
Wearable Computing - Part III: The Activity Recognition Chain (ARC)
Wearable Computing - Part III: The Activity Recognition Chain (ARC)
Daniel Roggen
 
Text independent speaker recognition system
Text independent speaker recognition system
Deepesh Lekhak
 
Automatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approach
Abdullah al Mamun
 
Module15: Sliding Windows Protocol and Error Control
Module15: Sliding Windows Protocol and Error Control
gondwe Ben
 
Track 1 session 1 - st dev con 2016 - contextual awareness
Track 1 session 1 - st dev con 2016 - contextual awareness
ST_World
 
Track 2 session 1 - st dev con 2016 - avnet - making things real
Track 2 session 1 - st dev con 2016 - avnet - making things real
ST_World
 
Digital Image Processing
Digital Image Processing
Sahil Biswas
 
Ad

Similar to Pattern Mining To Unknown Word Extraction (10 (20)

Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
Andi Wu
 
Part of speech tagger English - By sadak pramodh
Part of speech tagger English - By sadak pramodh
sadakpramodh
 
Parts of Speect Tagging
Parts of Speect Tagging
theyaseen51
 
NLP Concepts detail explained in details.pptx
NLP Concepts detail explained in details.pptx
FaizRahman56
 
word level analysis
word level analysis
tjs1
 
Parts of speech tagger
Parts of speech tagger
sadakpramodh
 
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Andi Wu
 
7 probability and statistics an introduction
7 probability and statistics an introduction
ThennarasuSakkan
 
Unknown Words Analysis in POS Tagging of Sinhala Language
Unknown Words Analysis in POS Tagging of Sinhala Language
mlaij
 
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language
IRJET Journal
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
Editor IJARCET
 
Adding morphological information to a connectionist Part-Of-Speech tagger
Adding morphological information to a connectionist Part-Of-Speech tagger
Francisco Zamora-Martinez
 
Summary distributed representations_words_phrases
Summary distributed representations_words_phrases
Yue Xiangnan
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Christophe Tricot
 
GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of...
GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of...
Lifeng (Aaron) Han
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
IDES Editor
 
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
IJCSES Journal
 
Improving accuracy of part-of-speech (POS) tagging using hidden markov model ...
Improving accuracy of part-of-speech (POS) tagging using hidden markov model ...
IJECEIAES
 
Adhyann – a hybrid part of-speech tagger
Adhyann – a hybrid part of-speech tagger
ijitjournal
 
Natural Language Processing
Natural Language Processing
GeekNightHyderabad
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
Andi Wu
 
Part of speech tagger English - By sadak pramodh
Part of speech tagger English - By sadak pramodh
sadakpramodh
 
Parts of Speect Tagging
Parts of Speect Tagging
theyaseen51
 
NLP Concepts detail explained in details.pptx
NLP Concepts detail explained in details.pptx
FaizRahman56
 
word level analysis
word level analysis
tjs1
 
Parts of speech tagger
Parts of speech tagger
sadakpramodh
 
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Andi Wu
 
7 probability and statistics an introduction
7 probability and statistics an introduction
ThennarasuSakkan
 
Unknown Words Analysis in POS Tagging of Sinhala Language
Unknown Words Analysis in POS Tagging of Sinhala Language
mlaij
 
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language
IRJET Journal
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
Editor IJARCET
 
Adding morphological information to a connectionist Part-Of-Speech tagger
Adding morphological information to a connectionist Part-Of-Speech tagger
Francisco Zamora-Martinez
 
Summary distributed representations_words_phrases
Summary distributed representations_words_phrases
Yue Xiangnan
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Christophe Tricot
 
GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of...
GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of...
Lifeng (Aaron) Han
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
IDES Editor
 
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
IJCSES Journal
 
Improving accuracy of part-of-speech (POS) tagging using hidden markov model ...
Improving accuracy of part-of-speech (POS) tagging using hidden markov model ...
IJECEIAES
 
Adhyann – a hybrid part of-speech tagger
Adhyann – a hybrid part of-speech tagger
ijitjournal
 
Ad

Recently uploaded (20)

LDMMIA Yoga S10 Free Workshop Grad Level
LDMMIA Yoga S10 Free Workshop Grad Level
LDM & Mia eStudios
 
2025 June Year 9 Presentation: Subject selection.pptx
2025 June Year 9 Presentation: Subject selection.pptx
mansk2
 
Q1_TLE 8_Week 1- Day 1 tools and equipment
Q1_TLE 8_Week 1- Day 1 tools and equipment
clairenotado3
 
How to Customize Quotation Layouts in Odoo 18
How to Customize Quotation Layouts in Odoo 18
Celine George
 
Tanja Vujicic - PISA for Schools contact Info
Tanja Vujicic - PISA for Schools contact Info
EduSkills OECD
 
Paper 106 | Ambition and Corruption: A Comparative Analysis of ‘The Great Gat...
Paper 106 | Ambition and Corruption: A Comparative Analysis of ‘The Great Gat...
Rajdeep Bavaliya
 
CRYPTO TRADING COURSE BY FINANCEWORLD.IO
CRYPTO TRADING COURSE BY FINANCEWORLD.IO
AndrewBorisenko3
 
June 2025 Progress Update With Board Call_In process.pptx
June 2025 Progress Update With Board Call_In process.pptx
International Society of Service Innovation Professionals
 
Aprendendo Arquitetura Framework Salesforce - Dia 02
Aprendendo Arquitetura Framework Salesforce - Dia 02
Mauricio Alexandre Silva
 
ENGLISH-5 Q1 Lesson 1.pptx - Story Elements
ENGLISH-5 Q1 Lesson 1.pptx - Story Elements
Mayvel Nadal
 
Filipino 9 Maikling Kwento Ang Ama Panitikang Asiyano
Filipino 9 Maikling Kwento Ang Ama Panitikang Asiyano
sumadsadjelly121997
 
How payment terms are configured in Odoo 18
How payment terms are configured in Odoo 18
Celine George
 
List View Components in Odoo 18 - Odoo Slides
List View Components in Odoo 18 - Odoo Slides
Celine George
 
Code Profiling in Odoo 18 - Odoo 18 Slides
Code Profiling in Odoo 18 - Odoo 18 Slides
Celine George
 
ECONOMICS, DISASTER MANAGEMENT, ROAD SAFETY - STUDY MATERIAL [10TH]
ECONOMICS, DISASTER MANAGEMENT, ROAD SAFETY - STUDY MATERIAL [10TH]
SHERAZ AHMAD LONE
 
NSUMD_M1 Library Orientation_June 11, 2025.pptx
NSUMD_M1 Library Orientation_June 11, 2025.pptx
Julie Sarpy
 
How to Manage Different Customer Addresses in Odoo 18 Accounting
How to Manage Different Customer Addresses in Odoo 18 Accounting
Celine George
 
HistoPathology Ppt. Arshita Gupta for Diploma
HistoPathology Ppt. Arshita Gupta for Diploma
arshitagupta674
 
M&A5 Q1 1 differentiate evolving early Philippine conventional and contempora...
M&A5 Q1 1 differentiate evolving early Philippine conventional and contempora...
ErlizaRosete
 
THE PSYCHOANALYTIC OF THE BLACK CAT BY EDGAR ALLAN POE (1).pdf
THE PSYCHOANALYTIC OF THE BLACK CAT BY EDGAR ALLAN POE (1).pdf
nabilahk908
 
LDMMIA Yoga S10 Free Workshop Grad Level
LDMMIA Yoga S10 Free Workshop Grad Level
LDM & Mia eStudios
 
2025 June Year 9 Presentation: Subject selection.pptx
2025 June Year 9 Presentation: Subject selection.pptx
mansk2
 
Q1_TLE 8_Week 1- Day 1 tools and equipment
Q1_TLE 8_Week 1- Day 1 tools and equipment
clairenotado3
 
How to Customize Quotation Layouts in Odoo 18
How to Customize Quotation Layouts in Odoo 18
Celine George
 
Tanja Vujicic - PISA for Schools contact Info
Tanja Vujicic - PISA for Schools contact Info
EduSkills OECD
 
Paper 106 | Ambition and Corruption: A Comparative Analysis of ‘The Great Gat...
Paper 106 | Ambition and Corruption: A Comparative Analysis of ‘The Great Gat...
Rajdeep Bavaliya
 
CRYPTO TRADING COURSE BY FINANCEWORLD.IO
CRYPTO TRADING COURSE BY FINANCEWORLD.IO
AndrewBorisenko3
 
Aprendendo Arquitetura Framework Salesforce - Dia 02
Aprendendo Arquitetura Framework Salesforce - Dia 02
Mauricio Alexandre Silva
 
ENGLISH-5 Q1 Lesson 1.pptx - Story Elements
ENGLISH-5 Q1 Lesson 1.pptx - Story Elements
Mayvel Nadal
 
Filipino 9 Maikling Kwento Ang Ama Panitikang Asiyano
Filipino 9 Maikling Kwento Ang Ama Panitikang Asiyano
sumadsadjelly121997
 
How payment terms are configured in Odoo 18
How payment terms are configured in Odoo 18
Celine George
 
List View Components in Odoo 18 - Odoo Slides
List View Components in Odoo 18 - Odoo Slides
Celine George
 
Code Profiling in Odoo 18 - Odoo 18 Slides
Code Profiling in Odoo 18 - Odoo 18 Slides
Celine George
 
ECONOMICS, DISASTER MANAGEMENT, ROAD SAFETY - STUDY MATERIAL [10TH]
ECONOMICS, DISASTER MANAGEMENT, ROAD SAFETY - STUDY MATERIAL [10TH]
SHERAZ AHMAD LONE
 
NSUMD_M1 Library Orientation_June 11, 2025.pptx
NSUMD_M1 Library Orientation_June 11, 2025.pptx
Julie Sarpy
 
How to Manage Different Customer Addresses in Odoo 18 Accounting
How to Manage Different Customer Addresses in Odoo 18 Accounting
Celine George
 
HistoPathology Ppt. Arshita Gupta for Diploma
HistoPathology Ppt. Arshita Gupta for Diploma
arshitagupta674
 
M&A5 Q1 1 differentiate evolving early Philippine conventional and contempora...
M&A5 Q1 1 differentiate evolving early Philippine conventional and contempora...
ErlizaRosete
 
THE PSYCHOANALYTIC OF THE BLACK CAT BY EDGAR ALLAN POE (1).pdf
THE PSYCHOANALYTIC OF THE BLACK CAT BY EDGAR ALLAN POE (1).pdf
nabilahk908
 

Pattern Mining To Unknown Word Extraction (10

  • 1. Pattern Mining to Chinese Unknown word Extraction 資工碩三 955202037 楊傑程 2008/10/14
  • 2. Outline Introduction Related Works Unknown Word Detection Unknown Word Extraction Experiments Conclusions
  • 3. Introduction Since the growing popularity of Chinese, Chinese Text Processing has drawn a great amount of Interests in recent years. Before utilizing knowledge of Chinese texts, some preprocessing work should be done, such as Chinese Word Segmentation. There is no blank to mark word boundaries in Chinese texts.
  • 4. Introduction Chinese Word Segmentation encounters two major problems: Ambiguity and Unknown Words. Ambiguity One un-segmented Chinese character string has different segmentations according to different context information. Unknown Words Also known as Out-Of-Vocabulary words (OOV words), mostly unfamiliar proper nouns or new-born words. Ex: the sentence “ 王義氣熱衷於研究生命” would be segmented into “ 王 義氣 熱衷 於 研究 生命” because “ 王義氣” is a uncommon personal name, which is not in vocabularies.
  • 5. Introduction- types of unknown words In this paper, we focus on Chinese unknown word problem. Types of Chinese unknown words Organization names Ex: 華碩電腦 Ex: 總經理、電腦化 Abbreviation Proper Names Ex: 中油、中大 Personal names Ex: 王小明 Derived Words Compounds Ex: 電腦桌、搜尋法 Numeric type compounds Ex: 1986 年、 19 巷
  • 6. Introduction- unknown word identification Chinese Word Segmentation Process: Initial Segmentation (Dictionary assisted) Correctly identified words are called known words. Unknown words are wrongly segmented into two or more parts. Ex: personal name 王小明 after initial segmentation, become 王 小 明 Unknown word identification Characters belong to one unknown word should re-combine together. Ex: re-combine 王 小 明 together as 王小明
  • 7. Introduction- unknown word identification How does unknown word identification work? A character can be a word ( 馬 ) or part of an unknown word ( 馬 + 英 + 九 ). Unknown word detection: find detection rules that distinguish monosyllabic words from monosyllabic morphemes. Unknown word extraction: focus on the detected morphemes and combine them into unknown words.
  • 8. Introduction- applied techniques In this paper, we apply continuity pattern mining to discover unknown word detection rules. Then we apply machine-learning-based methods, classification algorithms and sequential learning, to extract unknown words, utilizing syntactic information, context information and heuristic statistical information. Our unknown word identification method is a general method, not limited to specific types of unknown words.
  • 9. Related Works- particular methods So far, research on Chinese word segmentation has lasted for a decade. At first, researchers applied different kinds of information to discover particular kinds of unknown words. Proper nouns (Chinese personal names, transliterated names, organization names) <[Chen & Li, 1996], [Chen & Chen, 2000]>: patterns, frequency, context information.
  • 10. Related Works- general methods (Rule-based) Researchers then began to develop methods that extract all kinds of unknown words. Rule-based detection and extraction: <[Chen et al., 1998]> distinguish monosyllabic words from monosyllabic morphemes. <[Chen et al., 2002]> combine morphological rules with statistical rules to extract personal names, transliterated names and compound nouns (Precision: 89%, Recall: 68%). <[Ma et al., 2003]> utilize the context-free grammar concept and propose a bottom-up merging algorithm, adopting morphological rules and general rules to extract all kinds of unknown words (Precision: 76%, Recall: 57%).
  • 11. Related Works- general methods (Machine Learning-based) Sequential learning: <[T. G. Dietterich, 2002]> transforms the sequential learning problem into a classification problem. Direct methods, such as HMM and CRF: <[Goh et al., 2006]> HMM+SVM (Precision: 63.8%, Recall: 58.3%); <[Tsai et al., 2006]> CRF (Recall: 73%). Indirect methods, such as sliding windows and recurrent sliding windows.
  • 12. Related Works – Imbalanced Data The imbalanced data problem. Ensemble method <C. Li, 2007>: combine the learning ability of multiple base classifiers by voting. Cost-sensitive learning and sampling <G. M. Weiss et al., 2007>: focus more on minority-class examples. <C. Drummond et al., 2003>: under-sampling performs better than over-sampling. <[Seyda et al., 2007]>: select the most informative instances.
  • 13. Unknown Word Detection & Extraction Our idea is similar to [Chen et al., 2002]. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning-based classification algorithms and (indirect) sequential learning. We call unknown word detection “Phase 1” and unknown word extraction “Phase 2”.
  • 14. Unknown Word Detection & Extraction [System flow diagram] Phase 1 (detection rule mining): 8/10 of the corpus is initially segmented and POS-tagged, the mining tool (Prowl) derives detection rules, and the rules are judged on a 1/10 validation corpus. Phase 2 (machine-learning classification): the 8/10 corpus with detection tags is used for training a model, another 1/10 corpus for validation, and a final 1/10 corpus (initially segmented, POS-tagged, with detection tags) for testing, where the model makes the classification decision.
  • 15. Unknown Word Detection Mine detection rules from 8/10 of the corpus using continuity pattern mining, focusing on monosyllables.
  • 16. Unknown word detection- Pattern Mining Sequential pattern: “因為… , 所以…” requires the items to match the pattern order but allows noise between the required items. Continuity pattern: “打 * 球” matches “打棒球” but not “打躲避球”; it places strict constraints on each item and its position, which enables efficient pattern mining.
  • 17. Unknown word detection- Continuity Pattern Mining Prowl <[Huang et al., 2004]> starts with frequent 1-patterns, extends them to 2-patterns by joining two adjacent frequent 1-patterns and evaluating the frequency, and iteratively extends to longer patterns.
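The extension loop described on this slide can be sketched as follows. This is an illustrative reimplementation of the idea, not the actual Prowl tool; the function name, the toy sequences and the `min_freq` threshold are assumptions.

```python
from collections import Counter

def mine_continuity_patterns(sequences, min_freq):
    """Prowl-style continuity pattern mining: start from frequent
    length-1 patterns, then repeatedly extend each frequent pattern
    by one adjacent item and keep the extensions that stay frequent."""
    # frequent 1-patterns
    counts = Counter(item for seq in sequences for item in seq)
    frequent = {(item,): c for item, c in counts.items() if c >= min_freq}
    result = dict(frequent)
    k = 1
    while frequent:
        # count length-(k+1) candidates whose length-k prefix is frequent
        cand = Counter()
        for seq in sequences:
            for i in range(len(seq) - k):
                pat = tuple(seq[i:i + k + 1])
                if pat[:-1] in frequent:
                    cand[pat] += 1
        frequent = {p: c for p, c in cand.items() if c >= min_freq}
        result.update(frequent)
        k += 1
    return result
```

Because every extension must contain a frequent prefix, infrequent candidates are pruned early, which is what makes the contiguous (continuity) variant efficient.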
  • 18. Encoding Label the words of the initial segmentation by lexicon matching: known (Y) or unknown (N). “葡萄” is in the lexicon, so “葡萄” is labeled as a known word (Y); “葡萄皮” is not in the lexicon, so “葡萄皮” is labeled as an unknown word (N). Encoding examples: 葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y; 葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N
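The encoding step amounts to a lexicon lookup per segmented word, copying the word's label and POS tag down to each character. A minimal sketch, where the function name and data layout are assumptions:

```python
def encode_segments(words, lexicon):
    """Encode initially segmented (word, pos) pairs into per-character
    (char, pos, label) triples: label 'Y' if the word is in the lexicon
    (known word), 'N' otherwise (unknown word), as on the slide."""
    encoded = []
    for word, pos in words:
        label = 'Y' if word in lexicon else 'N'
        encoded.extend((ch, pos, label) for ch in word)
    return encoded
```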
  • 19. Create Detection Rules Rule pattern: (character, pos, label), with max length 3; the character within “{ }” is the primary character of the rule. Ex: ( { 葡 }, 萄 ): “葡” is a known word when “葡萄” appears. Rule accuracy, e.g. for ( { 葡 (Na)}, 萄 (Na) ): P( 葡 (Na) is a known word | ( 葡 (Na), 萄 (Na) ) appears ). Example counts: ( 葡 , 萄 ): 2; ( 葡 (Na), 萄 ): 2; ( 葡 , 萄 (Na) ): 2; ( 葡 (Na), 萄 (Na) ): 2; ( 葡 (Na) Y, 萄 (Na) ): 1; ( 葡 (Na) Y, 萄 (Na) Y ): 1; ( 葡 (Na) N, 萄 (Na) N ): 1.
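Rule accuracy as defined above, the fraction of pattern occurrences in which the primary character carries a known-word label, could be computed over the encoded corpus like this. A sketch under assumed data layout (the per-character triples from the encoding step):

```python
def rule_accuracy(encoded, pattern, primary=0):
    """encoded: list of (char, pos, label) triples from the encoding step.
    pattern: list of (char, pos) pairs; primary: index of the primary char.
    Returns P(primary char is a known word | pattern matches)."""
    matches = hits = 0
    n = len(pattern)
    for i in range(len(encoded) - n + 1):
        window = encoded[i:i + n]
        if all((c, p) == pat for (c, p, _), pat in zip(window, pattern)):
            matches += 1
            if window[primary][2] == 'Y':  # primary char labelled known
                hits += 1
    return hits / matches if matches else 0.0
```

With the slide's counts (the pattern 葡 (Na), 萄 (Na) occurs twice, once with 葡 labelled Y), the accuracy is 1/2.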
  • 20. Unknown Word Extraction Machine learning: classification and sequential learning.
  • 21. Unknown Word Extraction- feature (Pos) We use the TnT POS tagger to assign part-of-speech (POS) tags to terms. Kinds of POS tags: nouns (Na, Nb, …), verbs (VA, VB, VC, …), adjectives (A, …), punctuation (comma, period, …), …
  • 22. Unknown Word Extraction- feature ( term_attribute) After initial segmentation and applying the detection rules, each term carries a “ term_attribute ” label. The six “ term_attributes ” are: ms() monosyllabic word, Ex: 你、我、他; ms(?) morpheme of an unknown word, Ex: “王”、“小”、“明” in “王小明”; ds() double-syllabic word, Ex: 學校; ps() poly-syllabic word, Ex: 筆記型電腦; dot() punctuation, Ex: “,”、“。”…; none() no information above, or a new term. Target of unknown word extraction: at least one ms(?). Example: 運動會 ps() ‧ dot() 四年 ds() 甲班 ds() 王 ms(?) 姿 ms(?) 分 ms(?) ‧ dot() 本校 ds() 為 ms() 響 ms() 應 ms()
  • 23. Data Processing- Sliding Window Sequential supervised learning, indirect method: transform sequential learning into classification learning with a sliding window. We build three SVM models to extract different lengths of unknown words, n = 2, 3, 4. Each time we take n+2 terms (the n-gram plus prefix and suffix) as one window, then shift one token to the right to generate the next window, and so on. Window layout: prefix t0, core terms t1 … tn, suffix tn+1; the n core terms must contain at least one ms(?).
  • 24. Ex: 3-gram Model, on 運動會 () ‧ () 四年 () 甲班 () 王 (?) 姿 (?) 分 (?) ‧ () 本校 () 為 () 響 () 應 (): 運動會 ‧ 四年 甲班 王 (?) → discard; ‧ 四年 甲班 王 (?) 姿 (?) → negative; 四年 甲班 王 (?) 姿 (?) 分 (?) → negative; 甲班 王 (?) 姿 (?) 分 (?) ‧ → positive; 王 (?) 姿 (?) 分 (?) ‧ 本校 → negative
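The window generation and labeling illustrated above can be sketched as follows. Representing the gold unknown words as index spans is an assumption for the sketch, not the paper's format; a window is discarded when its core n-gram contains no detected morpheme.

```python
def sliding_windows(terms, n, gold_spans):
    """terms: list of (word, attr) after detection, attr like 'ms(?)'.
    gold_spans: set of (start, end) index spans of true unknown words.
    Yields (window_words, label): 'positive' when the n core terms
    exactly cover a gold unknown word, else 'negative'; windows whose
    core contains no ms(?) morpheme are discarded."""
    for i in range(len(terms) - n - 1):
        window = terms[i:i + n + 2]      # prefix + n core terms + suffix
        core = window[1:-1]
        if not any(attr == 'ms(?)' for _, attr in core):
            continue                      # discard: no detected morpheme
        label = 'positive' if (i + 1, i + 1 + n) in gold_spans else 'negative'
        yield [w for w, _ in window], label
```

On the slide's sentence with gold unknown word 王姿分, this yields one discarded window, one positive window and three negative windows, matching the example.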
  • 25. Unknown Word Extraction- feature (Statistical Information) Statistical information (exemplified by the 3-gram model, with window prefix t0, core t1 t2 t3, suffix t4): frequency of the 3-gram; p( prefix | 3-gram ), e.g. p( t0 | t1~t3 ); p( suffix | 3-gram ), e.g. p( t4 | t1~t3 ); p( first term of n | other n-1 consecutive terms ), e.g. p( t1 | t2~t3 ); p( last term of n | other n-1 preceding terms ), e.g. p( t3 | t1~t2 ); pos_freq(prefix) / pos_freq(prefix in positive training examples); pos_freq(suffix) / pos_freq(suffix in positive training examples).
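The count-based features above can be collected in one pass over the training windows. This sketch covers only the frequency and conditional probabilities, not the POS-frequency ratios; names and the 5-tuple window layout are assumptions.

```python
from collections import Counter

def ngram_stats(windows):
    """Collect counts for the slide's statistical features (3-gram case):
    freq(3-gram), p(prefix | 3-gram), p(suffix | 3-gram),
    p(t1 | t2 t3) and p(t3 | t1 t2). Returns a feature function."""
    gram, pre, suf = Counter(), Counter(), Counter()
    tail, head = Counter(), Counter()
    for prefix, t1, t2, t3, suffix in windows:
        gram[(t1, t2, t3)] += 1
        pre[(prefix, t1, t2, t3)] += 1
        suf[(t1, t2, t3, suffix)] += 1
        tail[(t2, t3)] += 1
        head[(t1, t2)] += 1

    def features(prefix, t1, t2, t3, suffix):
        g = gram[(t1, t2, t3)]
        return {
            'freq': g,
            'p_prefix_given_gram': pre[(prefix, t1, t2, t3)] / g if g else 0.0,
            'p_suffix_given_gram': suf[(t1, t2, t3, suffix)] / g if g else 0.0,
            'p_t1_given_t2t3': g / tail[(t2, t3)] if tail[(t2, t3)] else 0.0,
            'p_t3_given_t1t2': g / head[(t1, t2)] if head[(t1, t2)] else 0.0,
        }
    return features
```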
  • 26. Data presentation Format for machine learning usage: the feature vector concatenates, for each term in the window (prefix, t1, t2, …, suffix), its term_attribute (6 dimensions) and POS tag (55 dimensions), followed by the statistical information (7 dimensions).
  • 27. Experiments Unknown word detection. Unknown word extraction.
  • 28. Unknown Word Detection 8/10 balanced corpus (460m words) as training data, using the pattern mining tool Prowl [Huang et al., 2004]; 1/10 balanced corpus as validation data. Accuracy and frequency are used as thresholds for the detection rules. On the 1/10 balanced corpus used as real test data (for Phase 2): 60.3% precision and 93.6% recall.
    Accuracy threshold   Precision  Recall  F-measure (our system)  F-measure (AS system)
    0.70                 0.9324     0.4305  0.589035                0.71250
    0.80                 0.9008     0.5289  0.66648                 0.752447
    0.90                 0.8343     0.7148  0.769941                0.76955
    0.95                 0.7640     0.8288  0.795082                0.76553
    0.98                 0.6860     0.8786  0.770446                0.744036
    Fre >=      3         7         11        19       29
    Precision   0.764     0.7113    0.6924    0.6736   0.6552
    Recall      0.8288    0.8819    0.8932    0.8995   0.9092
    F-measure   0.795082  0.787466  0.780085  0.77033  0.76158
  • 29. Unknown Word Extraction 8/10 balanced corpus (460m words) as training data, 1/10 balanced corpus as testing data. Imbalanced data solution: ensemble method (voting) + random under-sampling. We use another 1/10 balanced corpus as validation data to find the sampling ratios (positive : negative): 2-gram 1:2, 3-gram 1:3, 4-gram 1:6.
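The imbalanced-data recipe, random under-sampling of negatives at a tuned ratio plus majority voting over base classifiers trained on different samples, can be sketched as follows. Base classifiers are stood in for by plain callables; the function names are illustrative.

```python
import random

def undersample(positives, negatives, ratio, rng):
    """Random under-sampling: keep all positive examples and draw
    len(positives) * ratio negative examples at random."""
    k = min(len(negatives), len(positives) * ratio)
    return positives + rng.sample(negatives, k)

def ensemble_vote(classifiers, x):
    """Majority voting over base classifiers (each returns 0 or 1)."""
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0
```

For the 3-gram model, for example, `ratio=3` reproduces the 1:3 positive-to-negative sampling found on the validation corpus.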
  • 30. Unknown Word Extraction Judging overlap and conflict among different combinations of unknown words: <[Chen et al., 2002]> use frequency(w) * length(w). Ex: for “律師 班 奈 特” , compare freq( 律師 + 班 ) * 3 against freq( 班 + 奈 + 特 ) * 3. Our method: first solve overlaps within identical n-grams by P( combine | overlap ). Ex: for “單 親 家庭” , compare P( 單親 | 親 ) against P( 親家庭 | 親 ). Then solve conflicts between different n-grams by real frequency: freq(X) - freq(Y) if X is included in Y, ex: X = “醫學”、“學院” , Y = “醫學院”.
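The “real frequency” discount for conflicts between n-grams of different lengths can be sketched as below. The counts in the test are hypothetical, chosen only to illustrate the discount.

```python
def real_freq(candidate, raw_freq):
    """Real frequency of a candidate string: its raw frequency minus
    the raw frequency of every longer candidate that contains it,
    e.g. freq(醫學) is discounted by freq(醫學院), as on the slide."""
    f = raw_freq[candidate]
    for other, count in raw_freq.items():
        if other != candidate and candidate in other:
            f -= count
    return f
```

Occurrences already claimed by the longer candidate 醫學院 thus no longer inflate the counts of 醫學 or 學院 when resolving the conflict.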
  • 31. Extraction result Comparison: <[Ma et al., 2003]> morphological rules + statistical rules + context-free grammar rules achieve Precision: 76%, Recall: 57%. Our result:
    n-gram   Precision  Recall  F1-score
    2-gram   56.7%      67.1%   0.614
    3-gram   63.3%      80%     0.707
    4-gram   30.6%      70.3%   0.426
    Total    58.1%      68.2%   0.627
  • 32. Ensemble Method Improvement (per base classifier and ensemble; each cell is Precision / Recall / F1-score):
    Classifier  2-gram                 3-gram                 4-gram
    C1          0.518 / 0.640 / 0.572  0.542 / 0.808 / 0.649  0.252 / 0.419 / 0.315
    C2          0.569 / 0.657 / 0.610  0.627 / 0.791 / 0.700  0.219 / 0.743 / 0.338
    C3          0.535 / 0.633 / 0.580  0.563 / 0.810 / 0.664  0.222 / 0.378 / 0.280
    C4          0.557 / 0.645 / 0.598  0.574 / 0.796 / 0.667  0.305 / 0.676 / 0.420
    C5          0.555 / 0.660 / 0.603  0.549 / 0.779 / 0.644  0.205 / 0.554 / 0.299
    C6          0.536 / 0.636 / 0.582  0.568 / 0.735 / 0.641  0.230 / 0.608 / 0.333
    C7          0.557 / 0.660 / 0.604  0.611 / 0.691 / 0.648  0.211 / 0.703 / 0.325
    C8          0.541 / 0.673 / 0.600  0.579 / 0.813 / 0.676  0.226 / 0.486 / 0.309
    C9          0.548 / 0.657 / 0.598  0.587 / 0.715 / 0.645  0.215 / 0.635 / 0.321
    C10         0.543 / 0.661 / 0.596  0.599 / 0.723 / 0.655  0.232 / 0.662 / 0.344
    C11         0.533 / 0.668 / 0.593  0.607 / 0.740 / 0.667  0.240 / 0.554 / 0.335
    C12         0.538 / 0.645 / 0.587  0.587 / 0.776 / 0.669  0.299 / 0.662 / 0.412
    Caverage    0.544 / 0.653 / 0.594  0.583 / 0.765 / 0.660  0.238 / 0.590 / 0.336
    Censemble   0.567 / 0.671 / 0.614  0.633 / 0.800 / 0.707  0.306 / 0.703 / 0.426
  • 33. Experiment- One phase What if we skip unknown word detection? Two phases do work better.
    Classification  Precision  Recall  F-score
    One Phase       40.8%      71.4%   0.52
    Two Phases      58.1%      68.2%   0.627
  • 34. Conclusions We adopt a two-phase method to solve the unknown word problem: unknown word detection by continuity pattern mining to derive detection rules, and unknown word extraction by machine-learning-based classification algorithms and (indirect) sequential learning, together with an imbalanced-data solution. Our experiments prove that two phases work better than one phase. Future work: utilize machine learning in detection; utilize more information (patterns, rules) to improve extraction precision.