Information Extraction from Web Documents
CS 652 Information Extraction and Integration
Li Xu, Yihong Ding
IR and IE
  • IR (Information Retrieval)
      • Retrieves relevant documents from collections
      • Information theory, probabilistic theory, and statistics
  • IE (Information Extraction)
      • Extracts relevant information from documents
      • Machine learning, computational linguistics, and natural language processing
History of IE
  • Large amount of both online and offline textual data
  • Message Understanding Conference (MUC)
      • Quantitative evaluation of IE systems
      • Tasks
          • Latin American terrorism
          • Joint ventures
          • Microelectronics
          • Company management changes
Evaluation Metrics
  • Precision
  • Recall
  • F-measure
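The three metrics can be made concrete with a small sketch. It assumes the usual definitions (precision = correct / extracted, recall = correct / gold, F-measure = weighted harmonic mean of the two). The function name `precision_recall_f` and the toy item sets are illustrative and do not come from the slides.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Score a set of extracted items against a gold-standard answer key.

    beta weights recall relative to precision; beta=1.0 gives the
    balanced F-measure (F1).
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Toy example (made-up items): 4 items extracted, 3 in the answer key, 2 correct.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f)
```

With these toy sets, precision is 2/4, recall is 2/3, and the balanced F-measure is 4/7.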
Web Documents
  • Unstructured (free) text
      • Regular sentences and paragraphs
      • Linguistic techniques, e.g., NLP
  • Structured text
      • Itemized information
      • Uniform syntactic clues, e.g., table understanding
  • Semistructured text
      • Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
      • Specialized programs, e.g., wrappers
Approaches to IE
  • Knowledge Engineering
      • Grammars are constructed by hand
      • Domain patterns are discovered by human experts through introspection and inspection of a corpus
      • Much laborious tuning and “hill climbing”
  • Machine Learning
      • Use statistical methods when possible
      • Learn rules from annotated corpora
      • Learn rules from interaction with the user
Knowledge Engineering
  • Advantages
      • With skill and experience, well-performing systems are not conceptually hard to develop
      • The best-performing systems have been hand-crafted
  • Disadvantages
      • Very laborious development process
      • Some changes to specifications can be hard to accommodate
      • Required expertise may not be available
Machine Learning
  • Advantages
      • Domain portability is relatively straightforward
      • System expertise is not required for customization
      • “Data-driven” rule acquisition ensures full coverage of examples
  • Disadvantages
      • Training data may not exist, and may be very expensive to acquire
      • Large volume of training data may be required
      • Changes to specifications may require reannotation of large quantities of training data
Wrapper
  • A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables)
  • Challenge: recognizing the data of interest among many other uninteresting pieces of text
  • Tasks
      • Source understanding
      • Data processing
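As a rough illustration of the wrapper idea, the sketch below picks data of interest out of a page and maps it to XML. Everything specific in it is an assumption made for the example: the page fragment, the hand-written pattern `ITEM`, and the element names `cars`, `car`, `model`, and `price`. It is not a wrapper from any of the systems surveyed here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment: two records of interest among other text.
PAGE = """
<html><body><h1>Used cars</h1>
<p>Call now for great deals!</p>
<li><b>Honda Civic</b> - <i>$4,500</i></li>
<li><b>Ford Focus</b> - <i>$3,900</i></li>
</body></html>
"""

# Hand-written pattern that recognizes one record (source understanding).
ITEM = re.compile(r"<li><b>(?P<model>[^<]+)</b> - <i>\$(?P<price>[\d,]+)</i></li>")


def wrap(page):
    """Map every recognized record to an XML element (data processing)."""
    root = ET.Element("cars")
    for match in ITEM.finditer(page):
        car = ET.SubElement(root, "car")
        ET.SubElement(car, "model").text = match.group("model")
        ET.SubElement(car, "price").text = match.group("price").replace(",", "")
    return root


root = wrap(PAGE)
print(ET.tostring(root, encoding="unicode"))
```

The heading and the advertising paragraph are ignored because they do not match the record pattern, which is the "recognizing the data of interest" challenge in miniature.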
Free Text
  • AutoSlog
  • LIEP
  • PALKA
  • HASTEN
  • CRYSTAL
  • Webfoot
  • WHISK
AutoSlog [1993]
  • The Parliament building was bombed by Carlos.
LIEP [1995]
  • The Parliament building was bombed by Carlos.
PALKA [1995]
  • The Parliament building was bombed by Carlos.
HASTEN [1995]
  • The Parliament building was bombed by Carlos.
  • Egraphs (SemanticLabel, StructuralElement)
CRYSTAL [1995]
  • The Parliament building was bombed by Carlos.
CRYSTAL + Webfoot [1997]
WHISK [1999]
  • The Parliament building was bombed by Carlos.
  • WHISK rule: *( PhyObj )*@passive *F ‘bombed’ * {PP ‘by’ *F ( Person )}
  • Context-based patterns
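The rule above can be loosely approximated with a regular expression plus two small semantic lexicons. This is only a sketch of the context-based idea, not WHISK's rule language or learner. The lexicons `PHYS_OBJ_HEADS` and `PERSON_NAMES`, the slot names `target` and `perpetrator`, and the function `extract_bombing` are all hypothetical.

```python
import re

# Hypothetical stand-ins for the semantic classes PhyObj and Person.
PHYS_OBJ_HEADS = {"building", "bridge", "embassy"}
PERSON_NAMES = {"Carlos"}

# Passive verb context: <subject> was/were bombed by <agent>.
PATTERN = re.compile(
    r"^(?:The\s+|A\s+|An\s+)?(?P<target>.+?)\s+(?:was|were)\s+bombed\s+by\s+(?P<agent>\w+)",
    re.IGNORECASE,
)


def extract_bombing(sentence):
    """Return the filled slots, or None if the context or classes do not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    target, agent = match.group("target"), match.group("agent")
    # Semantic-class checks on the two extracted fillers.
    if target.split()[-1].lower() not in PHYS_OBJ_HEADS:
        return None
    if agent not in PERSON_NAMES:
        return None
    return {"target": target, "perpetrator": agent}


print(extract_bombing("The Parliament building was bombed by Carlos."))
print(extract_bombing("Carlos bombed the Parliament building."))
```

The second call returns None because the sentence is active, so the passive context required by the pattern is missing. Both slots are filled by one rule, which is what "multi-slot" refers to.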
Web Documents
  • Semistructured and Unstructured
      • RAPIER (E. Califf, 1997)
      • SRV (D. Freitag, 1998)
      • WHISK (S. Soderland, 1998)
  • Semistructured and Structured
      • WIEN (N. Kushmerick, 1997)
      • SoftMealy (C-H. Hsu, 1998)
      • STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
Inductive Learning
  • Task: inductive inference
  • Learning systems
      • Zero-order
      • First-order, e.g., Inductive Logic Programming (ILP)
RAPIER [1997]
  • Inductive Logic Programming
  • Extraction rules
      • Syntactic information
      • Semantic information
  • Advantage
      • Efficient learning (bottom-up)
  • Drawback
      • Single-slot extraction
RAPIER Rule
SRV [1998]
  • Relational algorithm (top-down)
  • Features
      • Simple features (e.g., length, character type, …)
      • Relational features (e.g., next-token, …)
  • Advantages
      • Expressive rule representation
  • Drawbacks
      • Single-slot rule generation
      • Needs a large volume of training data
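The two kinds of features can be pictured as functions over tokens: simple features look at one token, relational features move from one token to a neighbour. The sketch below only illustrates that distinction. The helper names `simple_features`, `prev_token`, and `next_token`, and the hand-written condition at the end, are assumptions and not SRV's actual feature set or a learned rule.

```python
def simple_features(token):
    """Simple features: properties of a single token."""
    return {
        "length": len(token),
        "capitalized": token[:1].isupper(),
        "numeric": token.isdigit(),
    }


def prev_token(tokens, i):
    """Relational feature: the token before position i, if any."""
    return tokens[i - 1] if i > 0 else None


def next_token(tokens, i):
    """Relational feature: the token after position i, if any."""
    return tokens[i + 1] if i + 1 < len(tokens) else None


tokens = "The Parliament building was bombed by Carlos .".split()

# A hand-written condition combining both kinds of feature:
# a capitalized token whose previous token is "by".
candidates = [
    tok
    for i, tok in enumerate(tokens)
    if simple_features(tok)["capitalized"] and prev_token(tokens, i) == "by"
]
print(candidates)
```

The condition selects a single filler, which matches the single-slot drawback listed above.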
SRV Rule
WHISK [1998]
  • Covering algorithm (top-down)
  • Advantages
      • Learns multi-slot extraction rules
      • Handles various orderings of the items to be extracted
      • Handles document types from free text to structured text
  • Drawbacks
      • Must see all the permutations of items
      • Less expressive feature set
      • Needs a large volume of training data
WHISK Rule
WIEN [1997]
  • Assumes items are always in a fixed, known order
  • Introduces several types of wrappers
  • Advantages
      • Fast to learn and extract
  • Drawbacks
      • Cannot handle permutations and missing items
      • Must label entire pages
      • Does not use semantic classes
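One way to picture a wrapper that relies on a fixed item order is a list of (left, right) delimiter pairs, one per attribute, applied repeatedly across the page. The sketch below follows that left-right delimiter idea under simplifying assumptions. The function name `exec_lr`, the delimiters, and the sample pages are illustrative, and the delimiters are supplied by hand rather than learned.

```python
def exec_lr(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    Attributes are assumed to appear in the same fixed order in every record.
    """
    tuples, pos = [], 0
    while page.find(delimiters[0][0], pos) != -1:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


DELIMS = [("<b>", "</b>"), ("<i>", "</i>")]

PAGE = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
print(exec_lr(PAGE, DELIMS))

# Same page with the first record's second attribute missing.
PAGE_MISSING = "<b>Congo</b><br><b>Egypt</b> <i>20</i><br>"
print(exec_lr(PAGE_MISSING, DELIMS))
```

On the second page the scan pairs "Congo" with "20", because nothing tells it that an item is absent. That is the missing-item drawback listed above.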
WIEN Rule
SoftMealy [1998]
  • Learns a transducer
  • Advantages
      • Learns order of items
      • Allows item permutations and missing items
      • Allows both the use of semantic classes and disjunctions
  • Drawbacks
      • Must see all possible permutations
      • Cannot use delimiters that do not immediately precede and follow the relevant items
SoftMealy Rule
STALKER [1998, 1999, 2001]
  • Hierarchical information extraction
  • Embedded Catalog Tree (ECT) formalism
  • Advantages
      • Extracts nested data
      • Allows item permutations and missing items
      • Need not see all of the permutations
      • One hard-to-extract item does not affect others
  • Drawbacks
      • Does not exploit item order
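The point that one item does not affect the others can be sketched by giving every item its own independent rule. The example below uses a simplified "skip to these landmarks, then read up to an end landmark" rule for each item. It is an assumption-laden illustration: the functions `skip_to` and `extract`, the sample document, and the landmarks are hypothetical, and the end of an item is found by a plain forward search, which may differ from how the real system locates it.

```python
def skip_to(text, landmarks, pos=0):
    """Consume each landmark in turn; return the position after the last one."""
    for landmark in landmarks:
        index = text.find(landmark, pos)
        if index == -1:
            return None
        pos = index + len(landmark)
    return pos


def extract(text, start_landmarks, end_landmark):
    """Extract one item with its own rule, independently of all other items."""
    start = skip_to(text, start_landmarks)
    if start is None:
        return None
    end = text.find(end_landmark, start)
    return None if end == -1 else text[start:end]


DOC = "Name: Yala<p>Cuisine: Thai<p>Address: <i>4000 Colfax</i><p>"

print(extract(DOC, ["Address", "<i>"], "</i>"))  # found via two landmarks
print(extract(DOC, ["Cuisine: "], "<p>"))        # unaffected by other items
print(extract(DOC, ["Phone: "], "<p>"))          # missing item yields None
```

Because each rule starts from the beginning of the text, a missing or reordered item leaves the other extractions unchanged. The sketch also never uses the order of items, which corresponds to the drawback listed above.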
STALKER Rule
Web IE Tools (main technique used)
  • Wrapper languages (TSIMMIS, Web-OQL)
  • HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
  • NLP-based (RAPIER, SRV, WHISK)
  • Inductive learning (WIEN, SoftMealy, STALKER)
  • Modeling-based (NoDoSE, DEByE)
  • Ontology-based (BYU ontology)
Degree of Automation
  • Trade-off: dependence on page layout
  • RoadRunner
      • Assumes target pages were automatically generated from some data source
      • The only fully automatic wrapper generator
  • BYU ontology
      • Ontology is created manually with a graphical editing tool
      • Extraction process is fully automatic
Support of Complex Objects
  • Complex objects: nested objects, graphs, trees, complex tables, …
  • Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extraction of complex objects
  • BYU ontology: supports complex objects
Page Contents
  • Semistructured data (table type, richly tagged)
  • Semistructured text (text type, rarely tagged)
  • NLP-based tools: text type only
  • Other tools (except ontology-based): table type only
  • BYU ontology: both types
Ease of Use
  • HTML-aware tools: easiest to use
  • Wrapper languages: hardest to use
  • Other tools: in the middle
Output XML is the best output format for data sharing on the Web.
Support for Non-HTML Sources
  • NLP-based and ontology-based tools: supported automatically
  • Other tools: may support them, but need additional helpers such as syntactic and semantic analyzers
  • BYU ontology: supported
Resilience and Adaptiveness
  • Resilience: continuing to work properly when the target pages change
  • Adaptiveness: working properly with pages from other sources in the same application domain
  • Only BYU ontology has both features
Summary of Qualitative Analysis
Graphical Perspective of Qualitative Analysis
Legend
  • X: the information extraction system has the capability.
  • X*: the system has the capability as long as the training corpus contains the required training data.
  • ?: the system has the capability to some degree.
  • *: the extraction pattern itself does not show the capability, but the overall system has it.
Column headers: Nested data, Free, Resilient, Permutations, Missing items, Multi-slot, Single-slot, Semi, Structure, Name
Cells: X X X ? X X ? ? X X X X RoadRunner X X X AutoSlog X X X X X X X BYU Onto ? X* X X X X X WHISK ? X X X X X SRV ? X X X X X RAPIER X X * X X X STALKER X* X X X X X SoftMealy X X X WIEN
Problem of IE (unstructured documents)
  [Diagram labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]
Problem of IE (structured documents)
  [Diagram labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]
Problem of IE (semistructured documents)
  [Diagram labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]
Solution of IE (the Semantic Web)
  [Diagram labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]


osm.cs.byu.edu
