Language as a Social Sensor
to operate with Knowledge
Marko Grobelnik
Jozef Stefan Institute, Slovenia
Marko.Grobelnik@ijs.si
Dubrovnik, Sep 30th 2016
Reflection on what should be the goal of NLP
• The (mostly) forgotten long term aim of NLP is to understand the text
• …and not so much ‘processing’ itself (as NLP suggests)
• The curse of shallow solutions: they work well enough for too many
problems, which has kept people (and researchers) happy for too long
• …as much as information retrieval and text mining are useful, they delayed
development of “text understanding”
Language vs. World
• …if we agree with the above statement, then at this point in time, we
have ‘language’, but the ‘world’ is more or less missing
• So, what could a ‘world’ or ‘world model’ be?
Language is really a social sensor…
• Nature’s physical reality is very complex…
• …but manifests itself in a simple and structured way
• Humans need a mechanism to capture the complexity they need to
survive, evolve and communicate
• …that’s why language appeared as a necessity
• Consequently, human language is a reflection of the world in which
we live and our perception of it:
• Some of the key properties: Uncertainty, dynamics, compressed information
[Figure: layers from Nature, through the Perception of several Humans and the Language between them, to Common Understanding]
Nature is complex, but whenever Nature gets optimized it moves towards
a simple and clear structure (crystallization is an obvious example of
a process that produces structure)
Human perception is just a simplified reflection
of how Nature shows itself
Language is a means of communicating the perception, a kind of sensor
for the structures beneath (since it is optimized, it has
the form of a crystal)
The common understanding of Nature is what we call Knowledge; it still
shows clear structures (clear Knowledge has a nice crystal structure)
Crystallization of Nature, Perception, Language and Knowledge
Positioning language towards knowledge
• Language has the difficult task of encoding Nature’s complexity in an
efficient way for humans…
• …to describe Nature
• …to express uncertainty, since we do not fully understand Nature’s complexity
• …to be efficient when communicating
• …to reflect the dynamics of a changing environment
• …to abstract physical reality into abstract forms, which we call Knowledge
Why do we need to represent knowledge in a
formal way?
• The key element to operate with knowledge is “Reasoning”
• Since we cannot express all the facts in a formalized way, we need a
mechanism to combine knowledge fragments to derive new
knowledge
• …this is called reasoning
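A minimal sketch of what "combining knowledge fragments to derive new knowledge" can look like, assuming a toy fact store with only two relations, isa (instance of) and genls (subclass of), which also appear in the Cyc examples later in the deck. The facts and the function name derive_isa are illustrative assumptions, not Cyc content or the Cyc API.

```python
# Toy knowledge fragments: only explicitly stated (materialized) facts.
isa = {("Earth", "Planet")}                 # Earth is an instance of Planet
genls = {("Planet", "CelestialBody"),       # Planet is a subclass of CelestialBody
         ("CelestialBody", "Thing")}        # CelestialBody is a subclass of Thing


def derive_isa(isa_facts, genls_facts):
    """Return all isa facts implied by following genls links transitively."""
    derived = set(isa_facts)
    changed = True
    while changed:                          # repeat until no new fact appears
        changed = False
        for inst, cls in list(derived):
            for sub, sup in genls_facts:
                if sub == cls and (inst, sup) not in derived:
                    derived.add((inst, sup))
                    changed = True
    return derived


all_isa = derive_isa(isa, genls)
new_facts = all_isa - isa                   # facts that were never stated explicitly
print(sorted(new_facts))
```

Neither derived fact was stored; both follow from combining one isa fragment with the genls fragments.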
Popular ways to encode knowledge and reason with it
• In current science we have several ways to express knowledge,
with the aim of encoding the complexity of the world:
• …simple forms of knowledge expressed as a collection of points in
high-dimensional spaces
• Efficient, thanks to linear algebra (and other algebras) and the corresponding tools
• Most popular nowadays – machine learning, statistics, text mining and statistical NLP
mostly use these forms
• Reasoning is often straightforward
• …probabilistic structures such as Bayesian networks
• Expressive, but more expensive to encode; still manageable for reasoning
• …various kinds of logic to formulate ontological knowledge
• Very expressive, but not always easy to use for reasoning
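To make the first of these forms concrete, here is a small sketch of documents as points in a high-dimensional space, where "reasoning" reduces to comparing vectors. It assumes a plain bag-of-words representation and cosine similarity; the example sentences and function names are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = bag_of_words("the cat sat on the mat")
d2 = bag_of_words("the cat lay on the rug")
d3 = bag_of_words("stock markets fell sharply")

print(cosine(d1, d2), cosine(d1, d3))
```

The comparison is cheap and robust, but the vectors carry no explicit statement that could be combined with others to derive a new fact, which is the trade-off the list above describes.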
CYC KNOWLEDGE BASE
[Figure: a small hierarchy linked by isa, subclass and "located in" relations (Thing, Universe, Celestial Body, Planet, Earth, Animal, Human), together with repeated groups of common-sense concepts such as Physics, Money, Mathematics, Chemistry, Time, Learning, Food, Vehicles, Event, Education, School, Language, Love, Emotions, Going for a walk, Death, Cat, Euro, Working, Words, Driving, Rain, Stabbing someone, Nature, Tree, Hatred and Fear]
Structure of Common Sense Knowledge
(CycKB at http://opencyc.org/)
Model of the world…
• …beyond surface knowledge
• …to interconnect contextualized fragments
Why?
• To make reasoning capable of connecting
isolated fragments of knowledge
• To derive new knowledge beyond
materialized factual knowledge
[Figure: World model, with top-down KA (knowledge acquisition), bottom-up KA and multimodal data]
Why do we need a World model?
Simple forms of knowledge
extraction and reasoning
What can be extracted from a document?
• Lexical level
• Tokenization – extracting tokens from a document (words, separators, …)
• Sentence splitting – set of sentences to be further processed
• Linguistic level
• Part-of-Speech – assigning word types (nouns, verbs, adjectives, …)
• Deep Parsing – constructing parse trees from sentences
• Triple extraction – subject-predicate-object triple extraction
• Named entity extraction – identifying names of people, places, organizations
• Semantic level
• Co-reference resolution – replacing pronouns with corresponding names;
merging different surface forms of names into single entity
• Semantic labeling – assigning semantic identifiers to names (e.g.
LOD/DBpedia/Freebase) including disambiguation
• Topic classification – assigning topic categories to a document (e.g. DMoz)
• Summarization – assigning importance to parts of a document
• Fact extraction – extracting relevant facts from a document
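The two lexical-level steps in the list above can be sketched with the standard library alone. This is a deliberately naive illustration: the regular expressions are assumptions made for the example, and real pipelines handle abbreviations, quotes and numbers far more carefully.

```python
import re


def split_sentences(text):
    """Naive sentence splitting: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Galileo was born in Pisa. He was an Italian physicist and astronomer."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, toks)
```

The linguistic and semantic levels build on this output; for instance, co-reference resolution would have to link "He" in the second sentence back to "Galileo".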
Wikipedia as a World model
(http://wikifier.org) [Demo]
Annotation and disambiguation of general texts into Wikipedia concepts, with a changing vocabulary, in 100 languages
Global Media as a playground to understand social
dynamics through shallow knowledge extraction
(http://eventregistry.org/) [Demo]
Imported articles: 150M
Identified events: 5M (2014-2016)
News sources: 154,969
Unique concepts: 2,698,213
Categories: 5,015
Event description through entities and Semantic keywords
Collection of events
described through
Entity relatedness
Collection of events
described through
trending concepts
Collection of events
described through
three level categorization
Events identified across languages
Collection of events
described through
a story-line of related events
Linguistic processing on Semantically
augmented texts
• The goal is to use traditional corpus linguistic tools on the top of
semantically enriched texts
• Example: “UN” string -> “United Nations” concept -> “Organization” higher level
concept -> …
• The purpose is to reuse existing tools for many languages to accurately extract
micro-context within the text
• Using SketchEngine (https://www.sketchengine.co.uk/) to preprocess the
NewsFeed.ijs.si documents (100M+ docs)
• Covering the following languages: Arabic, Catalan, Czech, German, English, film,
French, Croatian, Hungarian, Italian, Korean, Dutch, Polish, Russian, Spanish, Serbian
and Swedish
• Login: https://ondra.sketchengine.co.uk/ / username: test / password: preview
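The enrichment chain from the example above ("UN" string, then "United Nations" concept, then "Organization" higher-level concept) can be sketched as extra annotation layers attached to each token, which is the shape corpus tools typically consume. The two lookup tables below are hypothetical stand-ins for a real annotator; they are not the Wikifier or SketchEngine API.

```python
# Hypothetical lookup tables standing in for a real semantic annotator.
SURFACE_TO_CONCEPT = {"UN": "United Nations", "EU": "European Union"}
CONCEPT_TO_TYPE = {"United Nations": "Organization",
                   "European Union": "Organization"}


def enrich(tokens):
    """Attach a concept and a higher-level type to every token that has one."""
    enriched = []
    for tok in tokens:
        concept = SURFACE_TO_CONCEPT.get(tok)
        ctype = CONCEPT_TO_TYPE.get(concept) if concept else None
        enriched.append({"token": tok, "concept": concept, "type": ctype})
    return enriched


rows = enrich(["The", "UN", "met", "today"])
for row in rows:
    print(row)
```

With the extra columns in place, a corpus query can ask for "Organization + verb" patterns rather than for a specific string.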
Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep 2016
Infobox extraction for events:
(structured event representation)
• Structured event representation describes an event
by its “Event Type” and corresponding information
slots to be filled
• Event Types should be taken from “Event Taxonomy”
• …at this stage of development this level of
representation still requires human intervention to
achieve high accuracy (Precision/Recall) extraction
• Example on the right – Wikipedia event infobox:
• 2011 Tōhoku earthquake and tsunami
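A structured event representation of this kind can be sketched as an event type plus a dictionary of slots, checked against the slots the taxonomy expects for that type. The taxonomy entry and slot names below are hypothetical; only the date of the 2011 Tōhoku earthquake is filled in, to show how unfilled slots can be detected.

```python
from dataclasses import dataclass, field

# Hypothetical mini taxonomy: event type -> slots expected for that type.
EVENT_TAXONOMY = {
    "Earthquake": ["date", "magnitude", "location"],
}


@dataclass
class EventRecord:
    event_type: str
    slots: dict = field(default_factory=dict)

    def missing_slots(self):
        """Slots the taxonomy expects for this event type that are still empty."""
        expected = EVENT_TAXONOMY[self.event_type]
        return [s for s in expected if s not in self.slots]


event = EventRecord("Earthquake", {"date": "2011-03-11"})
print(event.missing_slots())
```

The list of missing slots is where, per the slide, human intervention is still needed to reach high precision and recall.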
Deeper means to model and
reason with knowledge
One of the challenges for the future: Micro-reading
• It is “easier” to understand millions of documents than a single document
• …reading and understanding a single document is micro-reading
• The following experiment is on how much knowledge we can extract from
individual documents
• …extraction is in the form of first-order, inferentially productive Cyc logic
• …allowing us full reasoning to identify new facts
• …minimizing human involvement, optimizing precision and recall
[Figure: pipeline Document -> Assertions -> Reasoning -> Dialogue]
Disambiguation with a
world model (CycKB)
World model used as a set of common-sense semantic
constraints to disambiguate text
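One simple reading of "semantic constraints" is argument typing: a predicate expects arguments of a certain type, and only the senses of an ambiguous mention that satisfy that type survive. The sketch below assumes a hypothetical sense inventory and hypothetical type constraints; the predicate names are borrowed from examples elsewhere in the deck, but nothing here is taken from the actual CycKB.

```python
# Hypothetical sense inventory: mention -> candidate (concept, type) pairs.
SENSES = {
    "Lincoln": [("AbrahamLincoln", "Person"), ("LincolnNebraska", "City")],
}

# Hypothetical constraints: the type each predicate requires of its second argument.
ARG_TYPE = {"birthPlace": "City", "hasLeaders": "Person"}


def disambiguate(mention, predicate):
    """Keep only the candidate senses whose type fits the predicate's constraint."""
    required = ARG_TYPE[predicate]
    return [concept for concept, ctype in SENSES[mention] if ctype == required]


print(disambiguate("Lincoln", "birthPlace"))
print(disambiguate("Lincoln", "hasLeaders"))
```

A full world model supplies many more constraints than argument types, but the filtering principle is the same.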
Cyc Knowledge Base and
Reasoning
Cycorp © 2006
The Cyc Ontology
[Figure: layered diagram of the Cyc ontology. Top: Thing, Intangible Thing, Individual, Temporal Thing, Spatial Thing, Partially Tangible Thing. Below: topic areas including Paths, Sets, Relations, Logic, Math, Space, Time, Events, Scripts, Actors, Actions, Plans, Goals, Agents, Physical Objects, Living Things, Human Beings, Human Activities, Organizations, Artifacts, Social Behavior, Natural and Political Geography, Weather, Law, Business & Commerce, Politics, Warfare, Transportation & Logistics, Sports, Recreation and Entertainment. Bottom layers: General Knowledge about Various Domains; Specific data, facts, and observations]
Cycorp © 2006
Cyc High-level Architecture
[Figure: components of the architecture: Cyc Ontology & Knowledge Base, Cyc Reasoning Modules, CycAPI, Knowledge Entry Tools, User Interface (with Natural Language Dialog), and an Interface to External Data Sources (databases, web pages, text sources, other KBs)]
Cycorp © 2006
[Figure: the Cyc ontology diagram again, with the bottom layers now labelled "General Knowledge about Terrorism" and "Specific data, facts, and observations about terrorist groups and activities"]
General Knowledge about Terrorism:
Terrorist groups are capable of directing assassinations:
(implies
  (isa ?GROUP TerroristGroup)
  (behaviorCapable ?GROUP AssassinatingSomeone directingAgent))
…
If a terrorist group considers an agent an enemy, that agent is vulnerable to an attack by that group:
(implies
  (and
    (isa ?GROUP TerroristGroup)
    (considersAsEnemy ?GROUP ?TARGET))
  (vulnerableTo ?GROUP ?TARGET TerroristAttack))
Cyc KB Extended w/Domain Knowledge
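How an implies rule of this shape produces a new assertion can be sketched with a few lines of pattern matching. This is a toy forward-chaining step over tuples, not Cyc's inference engine; the constants GroupX and AgentY are hypothetical placeholders, and the rule is copied from the second example on the slide.

```python
def unify(pattern, fact, bindings):
    """Match one pattern against one fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):                  # variable: bind it or check consistency
            if result.setdefault(p, f) != f:
                return None
        elif p != f:                           # constant: must match exactly
            return None
    return result


def apply_rule(antecedents, consequent, facts):
    """Derive every instance of the consequent supported by the given facts."""
    candidates = [{}]
    for pattern in antecedents:
        extended = []
        for bindings in candidates:
            for fact in facts:
                new_bindings = unify(pattern, fact, bindings)
                if new_bindings is not None:
                    extended.append(new_bindings)
        candidates = extended
    return {tuple(b.get(term, term) for term in consequent) for b in candidates}


facts = {
    ("isa", "GroupX", "TerroristGroup"),       # hypothetical constants
    ("considersAsEnemy", "GroupX", "AgentY"),
}
antecedents = [("isa", "?GROUP", "TerroristGroup"),
               ("considersAsEnemy", "?GROUP", "?TARGET")]
consequent = ("vulnerableTo", "?GROUP", "?TARGET", "TerroristAttack")

derived = apply_rule(antecedents, consequent, facts)
print(derived)
```

The derived assertion was never entered by hand; it exists only because the general rule and the two specific facts were combined.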
Cycorp © 2006
[Figure: the Cyc ontology diagram, repeated, with the bottom layers labelled "General Knowledge about Terrorism" and "Specific data, facts, and observations about terrorist groups and activities"]
Specific Facts about Al Qaida:
(basedInRegion AlQaida Afghanistan) Al-Qaida is based in Afghanistan.
(hasBeliefSystems AlQaida IslamicFundamentalistBeliefs) Al-Qaida has Islamic fundamentalist beliefs.
(hasLeaders AlQaida OsamaBinLaden) Al-Qaida is led by Osama bin Laden.
…
(affiliatedWith AlQaida AlQudsMosqueOrganization) Al-Qaida is affiliated with the Al Quds Mosque.
(affiliatedWith AlQaida SudaneseIntelligenceService) Al-Qaida is affiliated with the Sudanese Intell Service
…
(sponsors AlQaida HarakatUlAnsar) Al-Qaida sponsors Harakat ul-Ansar.
(sponsors AlQaida LaskarJihad) Al-Qaida sponsors Laskar Jihad.
…
(performedBy EmbassyBombingInNairobi AlQaida) Al-Qaida bombed the Embassy in Nairobi.
(performedBy EmbassyBombingInTanzania AlQaida) Al-Qaida bombed the Embassy in Tanzania.
Cyc KB Extended w/Domain Knowledge
Example of automatically translating
text into Cyc logic and back to text
Source: “Galileo Galilei was an Italian physicist and astronomer.”
Learn Logic:(#$and (#$isa #$GalileoGalilei #$ItalianPerson)
(#$isa #$GalileoGalilei #$Physicist)
(#$isa #$GalileoGalilei #$Astronomer))
Fact: Galileo was an Italian, a physicist, and an astronomer.
Source: “Galileo was born in Pisa on February 15, 1564.”
Learn Logic:(#$and (#$birthDate #$GalileoGalilei
(#$DayFn 15
(#$MonthFn #$February
(#$YearFn 1564))))
(#$birthPlace #$GalileoGalilei #$CityOfPisaItaly))
Fact: Galileo was born on February 15, 1564 and he was born in Pisa.
Source: “Albert Einstein was born in 1879 in Ulm, Germany.”
Learn Logic: (#$birthDate #$AlbertEinstein (#$YearFn 1879))
Fact: Albert Einstein was born in 1879.
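The logic forms above are nested S-expressions, so a program can read them with a very small parser. The sketch below is an assumption about how one might load such strings into Python lists for further processing; it is not part of Cyc, and it does no error handling for unbalanced parentheses.

```python
import re


def parse_cycl(text):
    """Parse one CycL-style S-expression into nested Python lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1                            # skip the closing ")"
            return items
        if tok.startswith("#$"):                # drop the constant prefix
            tok = tok[2:]
        return int(tok) if tok.isdigit() else tok

    return read()


expr = parse_cycl("(#$birthDate #$AlbertEinstein (#$YearFn 1879))")
print(expr)
```

Once in this form, the predicate is the first element of each list and the arguments follow, which makes it straightforward to index assertions or to generate the English "Fact:" sentences from templates.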
Example of text and extracted Cyc assertions (1/2)
Automatically Extracted Assertions:
• (isa ?V1 ProsecutingEvent)
• (agent ?V1 RudyGiuliani)
• (genls Entity Agent)
• (isa RudyGiuliani Agent)
• (isa RudyGiuliani Entity)
• (isa ?V3 OrganizingEvent)
• (patient ?V3 (IntersectionFn
OrganizedCrime WallStreet))
• (isa (IntersectionFn OrganizedCrime
WallStreet) Patient)
• (genls Entity Patient)
• (isa OrganizedCrime Patient)
• (isa OrganizedCrime Entity)
• (isa WallStreet Patient)
• (isa WallStreet Entity)
Sentence:
He prosecuted a number of high-profile cases, including ones
against organized crime and Wall_Street financiers.
Example of text and extracted Cyc assertions (2/2)
Automatically Extracted Assertions:
• (isa ?V1 SubstitutingEvent)
• (temporal ?V1 Lincoln)
• (genls Entity Agent)
• (isa Lincoln Agent)
• (genls Person Entity)
• (isa Lincoln Entity)
• (isa Lincoln Person)
• (isa ?V3 SucceedingEvent)
• (temporal ?V3 Grant)
• (isa Grant Agent)
• (isa Grant Entity)
• (isa Grant Person)
Sentence:
Each time a general failed, Lincoln substituted another
until finally Grant succeeded in 1865.
Reasoning on extracted assertions (Cyc)
Query:
(and
(isa ?Per Person)
(birthDate ?Per ?BD)
(occursBefore ?BD WorldWarII)
(thereExistsAtLeast 2 ?Role
(lifeRole ?Per ?Role)
(roleInIndustry ?Role FilmIndustry)
)
)
Answers:
Sir Derek_George_Jacobi
Sir Alexander_Korda
Victor Lonzo_Fleming
John_Francis_Junkin
Cornel_Wilde
George_Stevens
Bertrand_Blier
NL Query:
People born before World War II who had at least two roles in the film industry KB?
Text query
Query (semi-)automatically
translated into
first-order logic
Answers to the query
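The structure of the query above (a person, a birth date before World War II, and at least two life roles in the film industry) can be mirrored over a toy in-memory knowledge base. Everything in the data below is a hypothetical placeholder, and reading "before World War II" as "born before 1939" is an assumption made only for this sketch; Cyc evaluates the query by inference over its own KB.

```python
# Assumption for this sketch: "before World War II" means born before 1939.
WWII_START_YEAR = 1939

# Hypothetical toy KB; the names and values are placeholders, not Cyc content.
people = {
    "PersonA": {"birth_year": 1905, "roles": ["Director", "Producer"]},
    "PersonB": {"birth_year": 1950, "roles": ["Actor", "Director"]},
    "PersonC": {"birth_year": 1910, "roles": ["Actor", "Physicist"]},
}
film_roles = {"Actor", "Director", "Producer"}   # roles with roleInIndustry FilmIndustry


def answers(people, film_roles, min_roles=2):
    """People born before WWII with at least min_roles distinct film-industry roles."""
    result = []
    for name, info in people.items():
        born_before = info["birth_year"] < WWII_START_YEAR
        n_film = len(set(info["roles"]) & film_roles)
        if born_before and n_film >= min_roles:
            result.append(name)
    return sorted(result)


print(answers(people, film_roles))
```

PersonB fails the date condition and PersonC has only one film-industry role, so only PersonA satisfies the conjunction.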
Cyc’s front-end: “Cyc Analytic Environment” – querying (1/2)
Who has a motive
for the
assassination of
Rafik Hariri?
Query & Answer
Justification
Sources for
Reasoning and
Justification
Cyc’s front-end: “Cyc Analytic Environment” – justification (2/2)
Some of the challenges for the future
• Background knowledge in the form of a World Model
• …to have knowledge contextualized
• Representing knowledge and reasoning over it at scale with
operational soft logic
• …to decrease brittleness of logic and increase scale
• Economically viable structured knowledge acquisition with
high precision and recall
• …to increase the reach of what we can acquire
• Emphasizing understanding vs. applying black box models

More Related Content

PPTX
Global Media Monitor - Marko Grobelnik
PPTX
From Text To Reasoning - Marko Grobelnik - SWANK Workshop Stanford - 16 Apr 2014
PDF
Lecture: Semantic Word Clouds
PPTX
semantic web & natural language
PPTX
General Introduction for Semantic Web and Linked Open Data
PDF
Information Extraction
PDF
Digital Humanities and “Digital” Social Sciences
PPTX
Semantic engagement
Global Media Monitor - Marko Grobelnik
From Text To Reasoning - Marko Grobelnik - SWANK Workshop Stanford - 16 Apr 2014
Lecture: Semantic Word Clouds
semantic web & natural language
General Introduction for Semantic Web and Linked Open Data
Information Extraction
Digital Humanities and “Digital” Social Sciences
Semantic engagement

What's hot (14)

PPT
Digital Humanities Research
PDF
State of Tools for NLP in Danish: 2018
PPTX
Digital Humanities: An Introduction
PDF
Relation Extraction
PDF
Vuorikari Multilingual Tagging behaviour by teachers
PPTX
Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
PDF
Semantic engagement handouts
PPTX
Introduction to nlp
PDF
Modern text mining – understanding a million comments in 60 minutes
PPTX
MA in Digital Humanities
PPTX
Information Extraction
PDF
Intro to nlp
PPTX
Zoss High-Level Text Analysis and Techniques
PDF
Best Practices for Large Scale Text Mining Processing
Digital Humanities Research
State of Tools for NLP in Danish: 2018
Digital Humanities: An Introduction
Relation Extraction
Vuorikari Multilingual Tagging behaviour by teachers
Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
Semantic engagement handouts
Introduction to nlp
Modern text mining – understanding a million comments in 60 minutes
MA in Digital Humanities
Information Extraction
Intro to nlp
Zoss High-Level Text Analysis and Techniques
Best Practices for Large Scale Text Mining Processing
Ad

Viewers also liked (20)

PPT
Shipping Damage-Coated Pipe
PPTX
Entertainmentofthe80s
PPTX
Managerial economics
PDF
CII -TNTDPC DISCOVER INNOVATION - For Automotive & Automobile Industry
PPT
Writing Workshop PPT
PPSX
EdVard munch presentazione
PPTX
3d Views Portfolio
PDF
Olivia 2009
PDF
Aula android 03
PPTX
¿Querer es poder?
PPTX
Presentation Purdue Springer Lecture on Economics & Innovation March 2016
PPT
3 -Day end of the year slide show.
PDF
Tenth India Innovation Summit 2014 - Innovation for Inclusive Growth
PPT
HSC Multimedia
PPTX
Mobile marketing-basics-101
PPTX
Adicción o Libertad. El bienestar emocional y las adicciones
PDF
As caras do entroido ourensan
PPTX
Strategische inzet ict 100610
PDF
Augmented Reality Overview
PPT
Cámbiate Transvulcania
Shipping Damage-Coated Pipe
Entertainmentofthe80s
Managerial economics
CII -TNTDPC DISCOVER INNOVATION - For Automotive & Automobile Industry
Writing Workshop PPT
EdVard munch presentazione
3d Views Portfolio
Olivia 2009
Aula android 03
¿Querer es poder?
Presentation Purdue Springer Lecture on Economics & Innovation March 2016
3 -Day end of the year slide show.
Tenth India Innovation Summit 2014 - Innovation for Inclusive Growth
HSC Multimedia
Mobile marketing-basics-101
Adicción o Libertad. El bienestar emocional y las adicciones
As caras do entroido ourensan
Strategische inzet ict 100610
Augmented Reality Overview
Cámbiate Transvulcania
Ad

Similar to Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep 2016 (20)

PDF
Natural language processing (nlp)
PPT
Introduction
PPTX
67adbec38d786876897979898797946f_ppt.pptx
PDF
artificial intelligence Chapter 6 - NLP.pdf
PDF
NOVA Data Science Meetup 1/19/2017 - Presentation 2
PPTX
Chapter #1 Introduction to NConfigure and administer Server LP.pptx
PPTX
lecture 1 intro NLP_lecture 1 intro NLP.pptx
PPT
NLP introduced and in 47 slides Lecture 1.ppt
PPT
1 Introduction.ppt
PPTX
Natural Language Processing (NLP).pptx
PPT
Spiral of Knowledge by Nitin Desai.ppt
PPTX
AI material for you computer science.pptx
PPTX
Visual literacy
PDF
Natural Language Processing
PPTX
Beyond document retrieval using semantic annotations
PDF
Deep Learning for NLP: An Introduction to Neural Word Embeddings
PDF
Natural language processing module 1 chapter 1
PPTX
Introduction to NLP.pptx
PPTX
Ontology
Natural language processing (nlp)
Introduction
67adbec38d786876897979898797946f_ppt.pptx
artificial intelligence Chapter 6 - NLP.pdf
NOVA Data Science Meetup 1/19/2017 - Presentation 2
Chapter #1 Introduction to NConfigure and administer Server LP.pptx
lecture 1 intro NLP_lecture 1 intro NLP.pptx
NLP introduced and in 47 slides Lecture 1.ppt
1 Introduction.ppt
Natural Language Processing (NLP).pptx
Spiral of Knowledge by Nitin Desai.ppt
AI material for you computer science.pptx
Visual literacy
Natural Language Processing
Beyond document retrieval using semantic annotations
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Natural language processing module 1 chapter 1
Introduction to NLP.pptx
Ontology

Recently uploaded (20)

PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PDF
Linux OS guide to know, operate. Linux Filesystem, command, users and system
PDF
Foundation of Data Science unit number two notes
PDF
Master Databricks SQL with AccentFuture – The Future of Data Warehousing
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
Data Science Trends & Career Guide---ppt
PPT
Quality review (1)_presentation of this 21
PPTX
batch data Retailer Data management Project.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Moving the Public Sector (Government) to a Digital Adoption
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Computer network topology notes for revision
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
Azure Data management Engineer project.pptx
PPT
Miokarditis (Inflamasi pada Otot Jantung)
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Reliability_Chapter_ presentation 1221.5784
Linux OS guide to know, operate. Linux Filesystem, command, users and system
Foundation of Data Science unit number two notes
Master Databricks SQL with AccentFuture – The Future of Data Warehousing
1_Introduction to advance data techniques.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Clinical guidelines as a resource for EBP(1).pdf
oil_refinery_comprehensive_20250804084928 (1).pptx
Data Science Trends & Career Guide---ppt
Quality review (1)_presentation of this 21
batch data Retailer Data management Project.pptx
Introduction-to-Cloud-ComputingFinal.pptx
Moving the Public Sector (Government) to a Digital Adoption
Business Ppt On Nestle.pptx huunnnhhgfvu
Computer network topology notes for revision
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Azure Data management Engineer project.pptx
Miokarditis (Inflamasi pada Otot Jantung)

Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep 2016

  • 1. Language as a Social Sensor to operate with Knowledge Marko Grobelnik Jozef Stefan Institute, Slovenia [email protected] Dubrovnik, Sep 30th 2016
  • 2. Reflection on what should be the goal of NLP • The (mostly) forgotten long term aim of NLP is to understand the text • …and not so much ‘processing’ itself (as NLP suggests) • The curse of shallow solutions working well enough for too many problems, made people (and researchers) happy for too long • …as much as information retrieval and text mining are useful, they delayed development of “text understanding”
  • 3. Language vs. World • …if we agree with the above statement, then at this point in time, we have ‘language’, but the ‘world’ is more or less missing • So – so what a ‘world’ or ‘world model’ could be?
  • 4. Language is really a social sensor… • Nature’s physical reality is very complex… • …but manifests itself in a simple and structured way • Humans need a mechanism to capture the complexity they need to survive, evolve and communicate • …that’s why the language appeared as a necessity • Consequently, human language is a reflection of the world in which we live and our perception of it: • Some of the key properties: Uncertainty, dynamics, compressed information
  • 5. Nature Human Human Human Perception PerceptionPerception Language Language Common Understanding Nature is complex – but whenever Nature gets optimized it gets towards a simple and clear structure (crystallization as an obvious process of getting structure) Human perception is just a simplified reflection of how Nature shows itself Language is a means how to communicate the perception – kind of a sensor for the structures beneath (since it is optimized, it has a form of a crystal) Common understanding of the Nature we call Knowledge – it still emits clear structures (clear Knowledge has nice crystal structure) Crystallization of the Nature, Perception, Language and Knowledge
  • 6. Positioning language towards knowledge • Language has a difficult task to encode the Nature’s complexity in an efficient way for humans… • …to describe the Nature • …to express uncertainty, not fully understanding the Nature’s complexity • …to be efficient when communicate • …to reflect dynamics of the changing environment • …to abstract physical reality in an abstract forms, what we call Knowledge
  • 7. Why we need representing knowledge in a formal way? • The key element to operate with knowledge is “Reasoning” • Since we cannot express all the facts in a formalized way, we need a mechanism to combine knowledge fragments to derive new knowledge • …this is called reasoning
  • 8. Popular ways to encode and reasoning with knowledge? • In the current science we have several ways to express the knowledge, with an aim to encode the complexity of the world: • …simple forms of knowledge expressed as a collection of points in high dimensional spaces • Efficient, due to linear and other algebras and corresponding tools • Most popular nowadays – machine learning, statistics, text-mining, statistical NLP are using mostly these forms • Reasoning is often straightforward • …probabilistic structures such as Bayesian networks • Expressive, but more expensive to encode and still manageable to be used for reasoning • …various kinds of logic to formulate ontological knowledge • Very expressive, not always easy to be used for reasoning
  • 9. CYC KNOWLEDGE BASE Thing Universe isa isa Celestial Body isa located in Planet subclass Earth isa Animal isa Human subclas s Physics Money Mathematics Chemistry Time Learning FoodVehicles Event Education School Language LoveEmotions Going for a walk Death Cat Euro Working Words Driving RainStabbing someone Nature Tree Hatred Fear Physics Time Learning Vehicles Event Education School Emotions Going for a walk Death Cat EuroWords Driving Rain Stabbing someone Nature Tree Hatred Fear Planet Earth isa Human Physics Money Mathematics Chemistry Time Learning FoodVehicles Event Education Languag e LoveEmotions Going for a walk Cat Euro Working Words Driving Rain Tree Hatred Fear Learning Vehicles Event Education School Emotions Euro Driving Stabbing someone Hatred Fear Structure of a Common Sense Knowledge (CycKB at http://opencyc.org/)
  • 10. Model of the world… • …beyond surface knowledge • …to interconnect contextualized fragments Why? • To make reasoning capable of connecting isolated fragments of knowledge • To derive new knowledge beyond materialized factual knowledge World model Top-down KA Bottom-up KA Multimodal data Why we need a World model?
  • 11. Simple forms of knowledge extraction and reasoning
  • 12. What can be extracted from a document? • Lexical level • Tokenization – extracting tokens from a document (words, separators, …) • Sentence splitting – set of sentences to be further processed • Linguistic level • Part-of-Speech – assigning word types (nouns, verbs, adjectives, …) • Deep Parsing – constructing parse trees from sentences • Triple extraction – subject-predicate-object triple extraction • Name entity extraction – identifying names of people, places, organizations • Semantic level • Co-reference resolution – replacing pronouns with corresponding names; merging different surface forms of names into single entity • Semantic labeling – assigning semantic identifiers to names (e.g. LOD/DBpedia/Freebase) including disambiguation • Topic classification – assigning topic categories to a document (e.g. DMoz) • Summarization – assigning importance to parts of a document • Fact extraction – extracting relevant facts from a document
  • 13. Wikipedia as a World model (http://wikifier.org) [Demo] Annotation, Disambiguation of general texts into Wikipedia Concepts with a changing vocabulary in 100 language
  • 14. Global Media as a playground to understand social dynamics through shallow knowledge extraction (http://eventregistry.org/) [Demo] Imported articles: 150M Identified events: 5M (2014-2016) News sources: 154,969 Unique concepts: 2,698,213 Categories: 5,015
  • 15. Event description through entities and Semantic keywords
  • 16. Collection of events described through Entity relatedness
  • 17. Collection of events described through trending concepts
  • 18. Collection of events described through three level categorization
  • 20. Collection of events described through a story-line of related events
  • 21. Linguistic processing on Semantically augmented texts • The goal is to use traditional corpus linguistic tools on the top of semantically enriched texts • Exmaple: “UN” string -> “United Nations” concept -> “Organization” higher level concept -> … • The purpose is to reuse existing tools for many languages to accurately extract micro- context within the text • Using SketchEngine (https://www.sketchengine.co.uk/) to preprocess the NewsFeed.ijs.si documents (100M+ docs) • Covering the following languages: Arabic, Catalan, Czech, German, English, film, French, Croatian, Hungarian, Italian, Korean, Dutch, Polish, Russian, Spanish, Serbian and Swedish • Login: https://ondra.sketchengine.co.uk/ / username: test / password: preview
  • 27. Infobox extraction for events: (structured event representation) • Structured event representation describes an event by its “Event Type” and corresponding information slots to be filled • Event Types should be taken from “Event Taxonomy” • …at this stage of development this level of representation still requires human intervention to achieve high accuracy (Precision/Recall) extraction • Example on the right – Wikipedia event infobox: • 2011 Tōhoku earthquake and tsunami
  • 28. Deeper means to model and reason with knowledge
  • 29. One of the challenges for the future: Micro-reading
      • It is “easier” to understand millions of documents than a single document
      • …reading and understanding a single document is micro-reading
      • The following experiment is on how much knowledge we can extract from individual documents
      • …extraction is in the form of first-order, inferentially productive Cyc logic
      • …allowing us full reasoning to identify new facts
      • …minimizing human involvement, optimizing precision and recall
      • [Diagram stages: Document, Assertions, Reasoning, Dialogue]
  • 30. Disambiguation with a world model (CycKB) World model used as a set of common-sense semantic constraints to disambiguate text
  • 31. Cyc Knowledge Base and Reasoning
  • 32. Cycorp © 2006 The Cyc Ontology [Diagram: an ontology rooted at “Thing” (with “Intangible Thing”, “Individual”, “Temporal Thing”, “Spatial Thing”, “Partially Tangible Thing”), spanning topics from logic, math, time and space to human activities, organizations and geography, plus the layers “General Knowledge about Various Domains” and “Specific data, facts, and observations”]
  • 33. Cyc High-level Architecture [Diagram components: Cyc Ontology & Knowledge Base; Cyc Reasoning Modules; Interface to External Data Sources (Data Bases, Web Pages, Text Sources, Other KBs); CycAPI; Knowledge Entry Tools; User Interface (with Natural Language Dialog)]
  • 34. Cyc KB Extended w/Domain Knowledge [Diagram: the Cyc ontology, with the layers “General Knowledge about Terrorism” and “Specific data, facts, and observations about terrorist groups and activities”] General Knowledge about Terrorism:
      • Terrorist groups are capable of directing assassinations:
        (implies (isa ?GROUP TerroristGroup) (behaviorCapable ?GROUP AssassinatingSomeone directingAgent))
      • …
      • If a terrorist group considers an agent an enemy, that agent is vulnerable to an attack by that group:
        (implies (and (isa ?GROUP TerroristGroup) (considersAsEnemy ?GROUP ?TARGET)) (vulnerableTo ?GROUP ?TARGET TerroristAttack))
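To show how `implies` rules like the two on slide 34 can produce new facts, here is a minimal forward-chaining sketch. It is not Cyc's inference engine: assertions are modelled as plain Python tuples, variables are strings starting with `?`, and the constants `GroupX` and `AgentY` are hypothetical placeholders.

```python
def match(pattern, fact, bindings):
    """Match one pattern against one ground fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def match_all(patterns, facts, bindings):
    """Yield every binding that satisfies all patterns (a conjunction)."""
    if not patterns:
        yield bindings
        return
    for fact in facts:
        extended = match(patterns[0], fact, bindings)
        if extended is not None:
            yield from match_all(patterns[1:], facts, extended)


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            # list(...) finishes matching before the fact set is modified.
            for bindings in list(match_all(antecedents, facts, {})):
                derived = tuple(bindings.get(term, term) for term in consequent)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# The two rules from the slide, written as (antecedents, consequent).
RULES = [
    ([("isa", "?GROUP", "TerroristGroup")],
     ("behaviorCapable", "?GROUP", "AssassinatingSomeone", "directingAgent")),
    ([("isa", "?GROUP", "TerroristGroup"),
      ("considersAsEnemy", "?GROUP", "?TARGET")],
     ("vulnerableTo", "?GROUP", "?TARGET", "TerroristAttack")),
]

# Hypothetical ground facts, used only to exercise the rules.
FACTS = {
    ("isa", "GroupX", "TerroristGroup"),
    ("considersAsEnemy", "GroupX", "AgentY"),
}

if __name__ == "__main__":
    for fact in sorted(forward_chain(FACTS, RULES) - FACTS):
        print(fact)
```

Exhaustive matching like this is fine for a handful of facts; it says nothing about how reasoning is made to scale in a large knowledge base.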
  • 35. Cyc KB Extended w/Domain Knowledge [Diagram: the Cyc ontology, with the layers “General Knowledge about Terrorism” and “Specific data, facts, and observations about terrorist groups and activities”] Specific Facts about Al Qaida:
      • (basedInRegion AlQaida Afghanistan): Al-Qaida is based in Afghanistan.
      • (hasBeliefSystems AlQaida IslamicFundamentalistBeliefs): Al-Qaida has Islamic fundamentalist beliefs.
      • (hasLeaders AlQaida OsamaBinLaden): Al-Qaida is led by Osama bin Laden.
      • …
      • (affiliatedWith AlQaida AlQudsMosqueOrganization): Al-Qaida is affiliated with the Al Quds Mosque.
      • (affiliatedWith AlQaida SudaneseIntelligenceService): Al-Qaida is affiliated with the Sudanese Intell Service
      • …
      • (sponsors AlQaida HarakatUlAnsar): Al-Qaida sponsors Harakat ul-Ansar.
      • (sponsors AlQaida LaskarJihad): Al-Qaida sponsors Laskar Jihad.
      • …
      • (performedBy EmbassyBombingInNairobi AlQaida): Al-Qaida bombed the Embassy in Nairobi.
      • (performedBy EmbassyBombingInTanzania AlQaida): Al-Qaida bombed the Embassy in Tanzania.
  • 36. Example of automatically translating text into Cyc Logic and back to text
      • Source: “Galileo Galilei was an Italian physicist and astronomer.”
        Learn Logic: (#$and (#$isa #$GalileoGalilei #$ItalianPerson) (#$isa #$GalileoGalilei #$Physicist) (#$isa #$GalileoGalilei #$Astronomer))
        Fact: Galileo was an Italian, a physicist, and an astronomer.
      • Source: “Galileo was born in Pisa on February 15, 1564.”
        Learn Logic: (#$and (#$birthDate #$GalileoGalilei (#$DayFn 15 (#$MonthFn #$February (#$YearFn 1564)))) (#$birthPlace #$GalileoGalilei #$CityOfPisaItaly))
        Fact: Galileo was born on February 15, 1564 and he was born in Pisa.
      • Source: “Albert Einstein was born in 1879 in Ulm, Germany.”
        Learn Logic: (#$birthDate #$AlbertEinstein (#$YearFn 1879))
        Fact: Albert Einstein was born in 1879.
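The nested date terms on slide 36, such as `(#$DayFn 15 (#$MonthFn #$February (#$YearFn 1564)))`, wrap a coarser date inside a finer one. The sketch below decodes such a term, assuming it has already been parsed into nested Python tuples with the `#$` prefixes removed; the function `decode_date` and that tuple encoding are illustrative assumptions, not part of Cyc.

```python
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def decode_date(term):
    """Return (year, month, day) for a nested date term; unknown parts are None."""
    functor = term[0]
    if functor == "YearFn":
        # ("YearFn", 1879)
        return (term[1], None, None)
    if functor == "MonthFn":
        # ("MonthFn", "February", <year term>)
        year, _, _ = decode_date(term[2])
        return (year, MONTHS[term[1]], None)
    if functor == "DayFn":
        # ("DayFn", 15, <month term>)
        year, month, _ = decode_date(term[2])
        return (year, month, term[1])
    raise ValueError(f"unsupported date functor: {functor!r}")


galileo_birth = ("DayFn", 15, ("MonthFn", "February", ("YearFn", 1564)))
einstein_birth = ("YearFn", 1879)

if __name__ == "__main__":
    print(decode_date(galileo_birth))
    print(decode_date(einstein_birth))
```

Returning `None` for the unspecified parts keeps the granularity of the original assertion: the Einstein example is known only to the year, and the decoded value says so.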
  • 37. Example of text and extracted Cyc assertions (1/2)
      Sentence: He prosecuted a number of high-profile cases, including ones against organized crime and Wall_Street financiers.
      Automatically Extracted Assertions:
      • (isa ?V1 ProsecutingEvent)
      • (agent ?V1 RudyGiuliani)
      • (genls Entity Agent)
      • (isa RudyGiuliani Agent)
      • (isa RudyGiuliani Entity)
      • (isa ?V3 OrganizingEvent)
      • (patient ?V3 (IntersectionFn OrganizedCrime WallStreet))
      • (isa (IntersectionFn OrganizedCrime WallStreet) Patient)
      • (genls Entity Patient)
      • (isa OrganizedCrime Patient)
      • (isa OrganizedCrime Entity)
      • (isa WallStreet Patient)
      • (isa WallStreet Entity)
  • 38. Example of text and extracted Cyc assertions (2/2)
      Sentence: Each time a general failed, Lincoln substituted another until finally Grant succeeded in 1865.
      Automatically Extracted Assertions:
      • (isa ?V1 SubstitutingEvent)
      • (temporal ?V1 Lincoln)
      • (genls Entity Agent)
      • (isa Lincoln Agent)
      • (genls Person Entity)
      • (isa Lincoln Entity)
      • (isa Lincoln Person)
      • (isa ?V3 SucceedingEvent)
      • (temporal ?V3 Grant)
      • (isa Grant Agent)
      • (isa Grant Entity)
      • (isa Grant Person)
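Several of the assertions on slide 38 follow from the others: given `(isa Lincoln Person)`, `(genls Person Entity)` and `(genls Entity Agent)`, the `isa` facts for `Entity` and `Agent` can be derived by following `genls` links upward. The sketch below shows that propagation; the function `infer_types` and the pair-based encoding of assertions are assumptions made for illustration only.

```python
def infer_types(isa_facts, genls_facts):
    """Close (instance, collection) pairs under (sub, super) genls links."""
    supers = {}
    for sub, sup in genls_facts:
        supers.setdefault(sub, set()).add(sup)

    inferred = set(isa_facts)
    frontier = list(isa_facts)
    while frontier:
        instance, collection = frontier.pop()
        # Every instance of a collection is also an instance of its generalizations.
        for sup in supers.get(collection, ()):
            fact = (instance, sup)
            if fact not in inferred:
                inferred.add(fact)
                frontier.append(fact)
    return inferred


# Encoded from the extracted assertions on the slide.
GENLS = {("Entity", "Agent"), ("Person", "Entity")}
ISA = {("Lincoln", "Person"), ("Grant", "Person")}

if __name__ == "__main__":
    for instance, collection in sorted(infer_types(ISA, GENLS)):
        print(f"(isa {instance} {collection})")
```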
  • 39. Reasoning on extracted assertions (Cyc)
      NL Query: People born before World War II who had at least two roles in the film industry KB?
      Query: (and (isa ?Per Person) (birthDate ?Per ?BD) (occursBefore ?BD WorldWarII) (thereExistsAtLeast 2 ?Role (lifeRole ?Per ?Role) (roleInIndustry ?Role FilmIndustry)))
      Answers:
      • Sir Derek_George_Jacobi
      • Sir Alexander_Korda
      • Victor Lonzo_Fleming
      • John_Francis_Junkin
      • Cornel_Wilde
      • George_Stevens
      • Bertrand_Blier
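The query on slide 39 combines a type test, a temporal constraint, and a counting quantifier (`thereExistsAtLeast 2`). The sketch below evaluates the same shape of query over a tiny in-memory table. Everything in it is hypothetical: the people, their roles, and the simplification of `occursBefore ?BD WorldWarII` to "birth year earlier than 1939" are assumptions, and this is not how Cyc answers the query.

```python
WORLD_WAR_II_START_YEAR = 1939  # simplification: compare at year granularity

# Hypothetical people; stand-ins for (isa ?Per Person), birthDate and lifeRole.
PEOPLE = {
    "PersonA": {"birth_year": 1900, "roles": ["Director", "Producer"]},
    "PersonB": {"birth_year": 1950, "roles": ["Director", "Actor"]},
    "PersonC": {"birth_year": 1910, "roles": ["Actor", "Physician"]},
}

# Hypothetical stand-in for (roleInIndustry ?Role ?Industry).
ROLE_INDUSTRY = {
    "Director": "FilmIndustry",
    "Producer": "FilmIndustry",
    "Actor": "FilmIndustry",
    "Physician": "HealthCare",
}


def answer(people, role_industry, industry, before_year, at_least):
    """People born before `before_year` with at least `at_least` roles in `industry`."""
    results = []
    for name, info in people.items():
        if info["birth_year"] >= before_year:
            continue  # fails the occursBefore constraint
        matching = {role for role in info["roles"]
                    if role_industry.get(role) == industry}
        if len(matching) >= at_least:  # the thereExistsAtLeast constraint
            results.append(name)
    return sorted(results)


if __name__ == "__main__":
    print(answer(PEOPLE, ROLE_INDUSTRY, "FilmIndustry",
                 WORLD_WAR_II_START_YEAR, 2))
```

In this toy data only `PersonA` satisfies both constraints: `PersonB` was born too late and `PersonC` has a single film-industry role.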
  • 40. Cyc’s front-end: “Cyc Analytic Environment” – querying (1/2) [Screenshot callouts: text query (“Who has a motive for the assassination of Rafik Hariri?”); query (semi-)automatically translated into first-order logic; answers to the query]
  • 41. Cyc’s front-end: “Cyc Analytic Environment” – justification (2/2) [Screenshot callouts: query and answer; justification; sources for reasoning and justification]
  • 42. Some of the challenges for the future
      • Background knowledge in the form of a World Model
      • …to have knowledge contextualized
      • Representing knowledge and reasoning over it at scale with operational soft logic
      • …to decrease brittleness of logic and increase scale
      • Economically viable structured knowledge acquisition with high precision and recall
      • …to increase the reach of what we can acquire
      • Emphasizing understanding vs. applying black box models