Analysis of ‘Unstructured’ Data. Seth Grimes, Alta Plana Corporation, 301-270-0795, http://altaplana.com. ASA Chicago Chapter, Proliferation of Digital Information and Recent Uses in Statistical Applications, May 15, 2009
Introduction Seth Grimes – Principal Consultant with Alta Plana Corporation. Contributing Editor, IntelligentEnterprise.com. Channel Expert, B-Eye-Network.com. Founding Chair, Text Analytics Summit, textanalyticsnews.com. ASA member since ???. Board of Directors, the Council of Professional Associations on Federal Statistics (COPAFS).
Perspectives Assumption #1: You’re a statistician, business analyst, or other “end user.”  You have lots of text, and you want an automated way to deal with it. Assumption #2: Enriching analysis. You’re interested in enriching an existing business intelligence (BI) / data-mining / analytical initiative to encompass information from textual sources. Caveat: I’m going to look exclusively at text.
Context What’s statistics?  My definition(s): Summary characterization of a data set. Identification of the mathematical model, accommodating randomness and uncertainty, that best suits a data set. The application of a probabilistic data model for predictive purposes. Text analytics applies stats to text in +/- all 3 senses.  But what’s data?
Are these data sets? “Unstructured” data
More “unstructured” data www.stanford.edu/%7ernusse/wntwindow.html
Axin and Frat1 interact with dvl and GSK, bridging Dvl to GSK in Wnt-mediated regulation of LEF-1.
Wnt proteins transduce their signals through dishevelled (Dvl) proteins to inhibit glycogen synthase kinase 3beta (GSK), leading to the accumulation of cytosolic beta-catenin and activation of TCF/LEF-1 transcription factors. To understand the mechanism by which Dvl acts through GSK to regulate LEF-1, we investigated the roles of Axin and Frat1 in Wnt-mediated activation of LEF-1 in mammalian cells. We found that Dvl interacts with Axin and with Frat1, both of which interact with GSK. Similarly, the Frat1 homolog GBP binds Xenopus Dishevelled in an interaction that requires GSK. We also found that Dvl, Axin and GSK can form a ternary complex bridged by Axin, and that Frat1 can be recruited into this complex probably by Dvl. The observation that the Dvl-binding domain of either Frat1 or Axin was able to inhibit Wnt-1-induced LEF-1 activation suggests that the interactions between Dvl and Axin and between Dvl and Frat may be important for this signaling pathway. Furthermore, Wnt-1 appeared to promote the disintegration of the Frat1-Dvl-GSK-Axin complex, resulting in the dissociation of GSK from Axin. Thus, formation of the quaternary complex may be an important step in Wnt signaling, by which Dvl recruits Frat1, leading to Frat1-mediated dissociation of GSK from Axin.
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=10428961&dopt=Abstract
 
 
Text descriptive statistics Search Engine Optimization and Web analytics are two sides of a  findability  coin. SEO relies on text descriptive statistics. Web analytics looks at site visits, at transactional and behavioural information.  Put this aside for now. Text descriptive statistics can give us an idea of the “whatness” of a document.
New York Times, September 8, 1957
“Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.” H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
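The idea in the quotation can be sketched in a few lines. This is a simplified illustration of frequency-based sentence scoring, not Luhn's published procedure: the stop list, the tokenizer, the scoring rule (sum of word frequencies), and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "it", "of", "on", "the", "to", "with"}

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def auto_abstract(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word significance: frequency of each non-stopword across the whole text.
    freq = Counter(w for w in tokenize(text) if w not in STOPWORDS)

    def score(sentence):
        # Sentence significance: total frequency of its significant words.
        return sum(freq[w] for w in tokenize(sentence) if w not in STOPWORDS)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return [sentences[i] for i in keep]

# Invented sample text for illustration.
sample = (
    "Text analytics turns text into data. "
    "The weather was pleasant. "
    "Analysts apply statistics to text data to find patterns in text."
)
summary = auto_abstract(sample, n_sentences=1)
print(summary)
```

Summing raw frequencies favors long sentences; normalizing by sentence length is one possible refinement.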
Text-BI:  Back to the Future Side note:  What is business intelligence (BI)?  A 1958 definition: In this paper,  business is a collection of activities carried on for whatever purpose , be it science, technology, commerce, industry, law, government, defense, et cetera...  The notion of intelligence  is also defined here... as  “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Hans Peter Luhn,  A Business Intelligence System ,  IBM Journal , October 1958
Document input and processing Knowledge handling is key
 
Text modelling The text content of a document can be considered an unordered “bag of words.” Particular documents are points in a high-dimensional vector space. Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
Text modelling We might construct a document-term matrix...
D1 = "I like databases"
D2 = "I hate hate databases"

      I  like  hate  databases
D1    1     1     0          1
D2    1     0     2          1

and use a weighting such as TF-IDF (term frequency–inverse document frequency) in computing the cosine of the angle between weighted doc-vectors to determine similarity. http://en.wikipedia.org/wiki/Term-document_matrix
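A minimal sketch of that calculation for D1 and D2, using only the Python standard library. The slide does not say which TF-IDF variant it has in mind; the smoothed IDF formula below is one common choice and is an assumption here, as are the function names.

```python
import math
from collections import Counter

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Vocabulary in the column order used on the slide.
vocab = ["I", "like", "hate", "databases"]

def term_counts(text):
    """Raw term frequencies for one document, as a row of the matrix."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

matrix = {name: term_counts(text) for name, text in docs.items()}

def idf(term_index):
    """Assumed smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(matrix)
    df = sum(1 for row in matrix.values() if row[term_index] > 0)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(row):
    """Weight each raw count by the IDF of its term."""
    return [tf * idf(i) for i, tf in enumerate(row)]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

raw_similarity = cosine(matrix["D1"], matrix["D2"])
weighted_similarity = cosine(tfidf(matrix["D1"]), tfidf(matrix["D2"]))
print(matrix)
print(round(raw_similarity, 3), round(weighted_similarity, 3))
```

On raw counts the cosine is 2/√18, about 0.47. With this weighting the terms shared by both documents ("I", "databases") count for less than the distinguishing ones ("like", "hate"), so the weighted similarity comes out lower.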
Text modelling Analytical methods make text tractable. Latent semantic indexing utilizing singular value decomposition for term reduction / feature selection. Creates a new, reduced concept space. Takes care of synonymy, polysemy, stemming, etc. Classification technologies / methods: Naive Bayes. Support Vector Machine. K-nearest neighbor.
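As an illustration of the first classification method listed, here is a small multinomial Naive Bayes classifier over bag-of-words features. It is a sketch under stated assumptions: the training texts and labels are invented, the tokenizer is a plain whitespace split, and add-one (Laplace) smoothing is one choice among several.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set; labels and texts are invented for illustration.
TRAINING = [
    ("positive", "excellent location friendly staff"),
    ("positive", "clean room excellent breakfast"),
    ("negative", "room smelt of stale smoke"),
    ("negative", "cold room rude staff"),
]

def train(examples):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Pick the class with the highest log posterior under the bag-of-words assumption."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
print(classify("excellent staff", model))
print(classify("stale smoke in a cold room", model))
```

Working in log space keeps the product of many small probabilities from underflowing on longer documents.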
Text modelling In the form of  query-document similarity , this is Information Retrieval 101. See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988. If we want to get more out of text, we have to do more...
Unstructured sources Consider: Web pages, e-mail, news & blog articles, forum postings, and other social media. Contact-center notes and transcripts. Surveys, feedback forms, warranty claims. And every kind of corporate document imaginable. These sources may contain “traditional” data: The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.
Unstructured sources Sources may mix fact and sentiment:
When you walk in the foyer of the hotel it seems quite inviting but the room was very basis and smelt very badly of stale cigarette smoke, it would have been nice to be asked if we wanted a non smoking room, I know the room was very cheap but I found this very off putting to have to sleep with the smell, and it was to cold to leave the window open. Excellent location for restaurants and bars
Overall I would never sell/buy a Motorola V3 unless it is demanded. My life would be way better without this phone being around (I am being 100% serious) Motorola should pay me directly for all the problems I have had with these phones. :-(
The “unstructured” data challenge “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” – Prabhakar Raghavan, Yahoo Research, former CTO of enterprise-search vendor Verity (now part of Autonomy). Yet it’s a truism that 80% of enterprise information is in “unstructured” form.
Applications What do people do with electronic documents?
1. Publish, Manage, and Archive.
2. Index and Search.
3. Categorize and Classify according to metadata & contents.
4. Information Extraction.
For textual documents, text analytics enhances #1 & #2 and enables #3 & #4. You need linguistics to do #1 & #4 well, to deal with Semantics.
Search is not the answer Relevance? Concepts? Articles from a forum site Articles from 1987
Search Search involves – Words & phrases: search terms & natural language. Qualifiers: include/exclude, and/or, not, etc. Answers involve – Entities: names, e-mail addresses, phone numbers Concepts: abstractions of entities. Facts and relationships. Abstract attributes, e.g., “expensive,” “comfortable” Opinions, sentiments: attitudinal information. Data.
Search Search is not enough. Search helps you find things you already know about.  It doesn’t help you  discover  things you’re unaware of. Search results often lack  relevance . Search finds documents, not  knowledge . Search doesn’t enable  unified analytics  that links data from textual and transactional sources. Text analytics can make search better...
Smarter search Text analytics enables results that suit the information and the user, e.g., answers –
Presentation of search results can be enhanced by discovery. This slide and the next show dynamic, clustered search results from Grokker… live.grokker.com/grokker.html?query=text%20analytics&Yahoo=true&Wikipedia=true&numResults=250
… with a zoomable display. Clustering here utilizes statistical (text) data mining techniques to identify cohesive groupings of retrieved documents.
More results clustering... A dynamic network viz.: the Touch-Graph Google-Browser applet, touchgraph.com/TGGoogleBrowser.php?start=text%20analytics
Beyond search
Text mining
                                 Fielded Data      Documents
Search/Query (goal-oriented)     Data Retrieval    Information Retrieval
Discovery (opportunistic)        Data Mining       Text Mining
Based on Je Wei Liang, www.database.cis.nctu.edu.tw/seminars/2003F/TWM/slides/p.ppt
Text analytics Where’s Text Analytics? The quadrant again: Fielded Data vs. Documents, Search/Query (goal-oriented) vs. Discovery (opportunistic), covering Data Retrieval, Information Retrieval, Data Mining, and Text Mining, now with Semantic Search and BI Search added.
Text analytics Text analytics automates what researchers, writers, scholars, and all the rest of us have been doing for years.  Text analytics – Applies linguistic and/or statistical techniques to extract concepts and patterns  that can be applied to categorize and classify documents, audio, video, images. Transforms “unstructured” information into data  for application of traditional analysis techniques. Unlocks meaning and relationships  in large volumes of information that were previously unprocessable by computer.
Text analytics Typical steps in text analytics, carried out via a pipeline of statistical and linguistic steps, include:
1. Retrieve documents for analysis.
2. Apply statistical, linguistic, and/or structural techniques to identify, tag, and extract entities, concepts, relationships, and events (features) within document sets.
3. Apply statistical pattern-matching and similarity techniques to classify documents and organize extracted features according to a specified or generated categorization / taxonomy.
Text analytics Why do we need linguistics?
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.
The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite fell 6.84, or 0.32 percent, to 2,162.78.
John pushed Max. He fell.
John pushed Max. He laughed.
(Examples from Luca Scagliarini, Expert System; Laure Vieu and Patrick Saint-Dizier.)
New York Times, September 8, 1957 Anaphora / coreference External reference
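The two market reports make the point concretely: they say different things, yet an unordered bag of words cannot tell them apart. A short check, assuming a simple regular-expression tokenizer (the tokenizer is an illustrative choice, not part of the original examples):

```python
import re
from collections import Counter

sentence_a = (
    "The Dow fell 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, "
    "and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78."
)
sentence_b = (
    "The Dow gained 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, "
    "and the Nasdaq composite fell 6.84, or 0.32 percent, to 2,162.78."
)

def bag_of_words(text):
    """Unordered token counts: words and numbers, case-folded."""
    return Counter(re.findall(r"[a-z']+|\d[\d,.]*\d|\d", text.lower()))

bag_a = bag_of_words(sentence_a)
bag_b = bag_of_words(sentence_b)
print(bag_a == bag_b)            # the bags are identical
print(sentence_a == sentence_b)  # the texts are not
```

Both bags contain "fell" twice and "gained" once. Which index each verb applies to is carried only by word order and syntax, which is where linguistic analysis comes in.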
 
 
 
Information extraction When we understand, for instance, parts of speech – <subject> <verb> <object> – we’re in a position to discern facts and relationships. Let's see text augmentation (tagging) in action.  We'll use GATE, an open-source tool...
 
 
 
 
Example: E-mail What else can we extract? Let’s look at an e-mail message –
Date: Sun, 13 Mar 2005 19:58:39 -0500
From: Adam L. Buchsbaum <alb@research.att.com>
To: Seth Grimes <grimes@altaplana.com>
Subject: Re: Papers on analysis on streaming data

seth, you should contact divesh srivastava, divesh@research.att.com regarding at&t labs data streaming technology.

adam
Example: E-mail An e-mail message is “semi-structured.” From semi-structured text, it’s especially easy to extract metadata. There are many forms of s-s information...
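A minimal sketch of metadata extraction from the message shown earlier, using Python's standard `email` package. The header parsing relies on the message's own structure; the regular expression for addresses in the body is a deliberately simple assumption, not a full address grammar.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime
import re

raw_message = """\
Date: Sun, 13 Mar 2005 19:58:39 -0500
From: Adam L. Buchsbaum <alb@research.att.com>
To: Seth Grimes <grimes@altaplana.com>
Subject: Re: Papers on analysis on streaming data

seth, you should contact divesh srivastava, divesh@research.att.com
regarding at&t labs data streaming technology.

adam
"""

msg = message_from_string(raw_message)

# Structured part: headers parse directly into typed metadata.
sender_name, sender_address = parseaddr(msg["From"])
recipient_name, recipient_address = parseaddr(msg["To"])
sent_at = parsedate_to_datetime(msg["Date"])
subject = msg["Subject"]

# Free-text part: a simple pattern for e-mail addresses mentioned in the body.
body = msg.get_payload()
mentioned_addresses = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", body)

print(sender_address, recipient_address, sent_at.isoformat(), subject)
print(mentioned_addresses)
```

The headers yield sender, recipient, timestamp, and subject with no linguistic analysis at all. Recognizing that "divesh srivastava" in the body is a person, and that the address belongs to him, is the harder, unstructured part.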
Example: Survey The respondent is invited to explain his/her attitude:
Example: Survey A survey of this type, like an e-mail message, is “semi-structured.” Exploit what is structured in interpreting and using the free text. Use the  metadata  that describes the information and its provenance. Sentiment extraction comes into play for  Voice of the Customer  /  Customer Experience Management  applications.
Attitudinal data Sentiment / opinion extraction – Applications include: Reputation management. Competitive intelligence. Quality improvement. Trend spotting. Sources include: Wikis, blogs, forums, and newsgroups. Media stories and product reviews. Contact-center notes and transcripts. Customer feedback via Web-site forms and e-mail. Survey verbatims.
Attitudinal data We need to – Identify and access candidate sources. Extract sentiment to databases. Correlate expressed sentiment to measures such as: Sales by product, location, time, etc. Defects by part, circumstances, etc. And information such as – Customer information and customer’s transactions. Correlation depends on semantic agreement: are we talking about the same things?
Sentiment and opinion Consider text from Dell’s IdeaStorm.com – “Dell really... REALLY need to stop overcharging... and when i say overcharing... i mean atleast double what you would pay to pick up the ram yourself.” What sentiment is expressed or implied? Subject? Polarity / Valence? Intensity? Mood? Opinion?
Example: law enforcement Take law enforcement as an example – Sources: case files, crime reports, incident and victimization databases, legal documents. Targets: crime patterns, criminal investigation, networks.
Example: law enforcement An Attensity law-enforcement example – NLP to identify roles and relationships.
Example: law enforcement
Case study: IBM’s MedTAKMI MedTAKMI = Text Analysis and Knowledge MIning for Biomedical Documents (ibm.com). (The project dates back several years, but it’s a great example.) The goal is to extract relationships among biomedical entities (e.g., proteins and genes) from patterns such as “A inhibits B” and “A activates B.” Work starts with a “syntactic parser” that identifies entities and basic binary (a noun and a verb) and ternary (two nouns and a verb) relationships.
Case study: IBM’s MedTAKMI MEDLINE from the National Center for Biotechnology Information hosts links to many widely used information sources such as the PubMed database of 18 million biomedical journal abstracts. Visit www.ncbi.nlm.nih.gov.
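To make the “A inhibits B” idea concrete, here is a toy pattern matcher for ternary (two nouns and a verb) relationships. It is not IBM's method: MedTAKMI is described as starting from a syntactic parser, whereas this sketch uses a surface regular expression. The entity list, the verb list, and the sample sentences are all hypothetical.

```python
import re

# Hypothetical entity dictionary; a real system would draw on curated gene/protein lexicons.
ENTITIES = ["Axin", "Frat1", "Dvl", "GSK", "LEF-1", "Wnt-1"]
RELATION_VERBS = ["inhibits", "activates", "binds"]

entity_alt = "|".join(re.escape(e) for e in ENTITIES)
verb_alt = "|".join(RELATION_VERBS)
# Ternary pattern: <entity> <verb> <entity>, with the entities adjacent to the verb.
PATTERN = re.compile(rf"\b({entity_alt})\s+({verb_alt})\s+({entity_alt})(?![\w-])")

def extract_relations(text):
    """Return (subject, verb, object) triples found by the surface pattern."""
    return [m.groups() for m in PATTERN.finditer(text)]

# Invented sentences for illustration only; they make no biological claims.
text = "Dvl inhibits GSK. In this toy example, Wnt-1 activates LEF-1, and Frat1 binds Dvl."
relations = extract_relations(text)
print(relations)
```

A surface pattern like this misses passives, negation, and anything where words intervene between entity and verb, which is why a parser-based approach is the more robust starting point.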
 
 
VOC research study I conducted a study on Voice of the Customer (VOC) text analytics last year. I polled individuals with experience applying VOC text analytics to real-world business problems at their organizations, plus a number of vendor representatives and industry analysts.
Information analyzed
ROI measured, planned & achieved
Solution providers What should a prospective user look for?
Response                                                                          Percent
deep sentiment/opinion extraction                                                 80%
ability to use specialized dictionaries or taxonomies                             76%
broad information extraction capability                                           60%
adaptation for particular sectors, e.g., hospitality, retail, health care,
  communications                                                                  56%
predictive-analytics integration                                                  48%
BI (business intelligence) integration                                            48%
support for multiple languages                                                    48%
ability to create custom workflows                                                32%
low cost                                                                          32%
hosted or "as a service" option                                                   32%
specialized VoC analysis interface                                                24%
Key message If you analyze only transactional data, you miss opportunity or incur risk...
“Industries such as travel and hospitality and retail live and die on customer experience.” -- Clarabridge CEO Sid Banerjee
“Organizations embracing text analytics all report having an epiphany moment when they suddenly knew more than before.” -- Philip Russom, the Data Warehousing Institute
The vendor marketplace

Analysis of ‘Unstructured’ Data

  • 1. Analysis of ‘Unstructured’ Data Seth Grimes Alta Plana Corporation 301-270-0795 -- http://altaplana.com ASA Chicago Chapter Proliferation of Digital Information and Recent Uses in Statistical Applications May 15, 2009
  • 2. Introduction Seth Grimes – Principal Consultant with Alta Plana Corporation. Contributing Editor, IntelligentEnterprise.com . Channel Expert, B-Eye-Network.com . Founding Chair, Text Analytics Summit, textanalyticsnews.com . ASA member since ???. Board of Directors, the Council of Professional Associations on Federal Statistics (COPAFS).
  • 3. Perspectives Assumption #1: You’re a statistician, business analyst, or other “end user.” You have lots of text, and you want an automated way to deal with it. Assumption #2: Enriching analysis. You’re interested in enriching an existing business intelligence (BI) / data-mining / analytical initiative to encompass information from textual sources. Caveat: I’m going to look exclusively at text.
  • 4. Context What’s statistics? My definition(s): Summary characterization of a data set. Identification of the mathematical model, accommodating randomness and uncertainty, that best suits a data set. The application of a probabilistic data model for predictive purposes. Text analytics applies stats to text in +/- all 3 senses. But what’s data?
  • 5. Are these data sets? “ Unstructured” data
  • 6. Are these data sets? “ Unstructured” data
  • 7. www.stanford.edu/%7ernusse/wntwindow.html Axin and Frat1 interact with dvl and GSK, bridging Dvl to GSK in Wnt-mediated regulation of LEF-1. Wnt proteins transduce their signals through dishevelled (Dvl) proteins to inhibit glycogen synthase kinase 3beta (GSK), leading to the accumulation of cytosolic beta-catenin and activation of TCF/LEF-1 transcription factors. To understand the mechanism by which Dvl acts through GSK to regulate LEF-1, we investigated the roles of Axin and Frat1 in Wnt-mediated activation of LEF-1 in mammalian cells. We found that Dvl interacts with Axin and with Frat1, both of which interact with GSK. Similarly, the Frat1 homolog GBP binds Xenopus Dishevelled in an interaction that requires GSK. We also found that Dvl, Axin and GSK can form a ternary complex bridged by Axin, and that Frat1 can be recruited into this complex probably by Dvl. The observation that the Dvl-binding domain of either Frat1 or Axin was able to inhibit Wnt-1-induced LEF-1 activation suggests that the interactions between Dvl and Axin and between Dvl and Frat may be important for this signaling pathway. Furthermore, Wnt-1 appeared to promote the disintegration of the Frat1-Dvl-GSK-Axin complex, resulting in the dissociation of GSK from Axin. Thus, formation of the quaternary complex may be an important step in Wnt signaling, by which Dvl recruits Frat1, leading to Frat1-mediated dissociation of GSK from Axin. www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=10428961&dopt=Abstract More “unstructured” data
  • 8.  
  • 9.  
  • 10. Text descriptive statistics Search Engine Optimization and Web analytics are two sides of a findability coin. SEO relies on text descriptive statistics. Web analytics looks at site visits, at transactional and behavioural information. Put this aside for now. Text descriptive statistics can give us an idea of the “whatness” of a document.
  • 11. New York Times , September 8, 1957
  • 12. “ Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.” H.P. Luhn, The Automatic Creation of Literature Abstracts , IBM Journal , 1958.
  • 13. Text-BI: Back to the Future Side note: What is business intelligence (BI)? A 1958 definition: In this paper, business is a collection of activities carried on for whatever purpose , be it science, technology, commerce, industry, law, government, defense, et cetera... The notion of intelligence is also defined here... as “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Hans Peter Luhn, A Business Intelligence System , IBM Journal , October 1958
  • 14. Document input and processing Knowledge handling is key
  • 15.  
  • 16. Text modelling The text content of a document can be considered an unordered “bag of words.” Particular documents are points in a high-dimensional vector space. Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
  • 17. Text modelling We might construct a document-term matrix ... D1 = &quot;I like databases&quot; D2 = &quot;I hate hate databases&quot; and use a weighting such as TF-IDF (term frequency–inverse document frequency)… in computing the cosine of the angle between weighted doc-vectors to determine similarity. http://en.wikipedia.org/wiki/Term-document_matrix I like hate databases D1 1 1 0 1 D2 1 0 2 1
  • 18. Text modelling Analytical methods make text tractable. Latent semantic indexing utilizing singular value decomposition for term reduction / feature selection. Creates a new, reduced concept space. Takes care of synonymy, polysemy, stemming, etc. Classification technologies / methods: Naive Bayes. Support Vector Machine. K-nearest neighbor.
  • 19. Text modelling In the form of query-document similarity , this is Information Retrieval 101. See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988. If we want to get more out of text, we have to do more...
  • 20. Consider: Web pages, E-mail, news & blog articles, forum postings, and other social media. Contact-center notes and transcripts. Surveys, feedback forms, warranty claims. And every kind of corporate documents imaginable. These sources may contain “traditional” data. The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78. Unstructured sources
  • 21. Unstructured sources Sources may mix fact and sentiment: When you walk in the foyer of the hotel it seems quite inviting but the room was very basis and smelt very badly of stale cigarette smoke, it would have been nice to be asked if we wanted a non smoking room, I know the room was very cheap but I found this very off putting to have to sleep with the smell, and it was to cold to leave the window open. Excellent location for restaurants and bars Overall I would never sell/buy a Motorola V3 unless it is demanded. My life would be way better without this phone being around (I am being 100% serious) Motorola should pay me directly for all the problems I have had with these phones. :-(
  • 22. The “unstructured” data challenge “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” – Prabhakar Raghavan, Yahoo Research, former CTO of enterprise-search vendor Verity (now part of Autonomy). Yet it’s a truism that 80% of enterprise information is in “unstructured” form.
  • 23. Applications What do people do with electronic documents? (1) Publish, Manage, and Archive. (2) Index and Search. (3) Categorize and Classify according to metadata & contents. (4) Information Extraction. For textual documents, text analytics enhances #1 & #2 and enables #3 & #4. You need linguistics to do #1 & #4 well, to deal with Semantics.
  • 24. Search is not the answer Relevance? Concepts? Articles from a forum site Articles from 1987
  • 25. Search Search involves – Words & phrases: search terms & natural language. Qualifiers: include/exclude, and/or, not, etc. Answers involve – Entities: names, e-mail addresses, phone numbers. Concepts: abstractions of entities. Facts and relationships. Abstract attributes, e.g., “expensive,” “comfortable.” Opinions, sentiments: attitudinal information. Data.
  • 26. Search Search is not enough. Search helps you find things you already know about. It doesn’t help you discover things you’re unaware of. Search results often lack relevance. Search finds documents, not knowledge. Search doesn’t enable unified analytics that links data from textual and transactional sources. Text analytics can make search better...
  • 27. Smarter search Text analytics enables results that suit the information and the user, e.g., answers –
  • 28. Presentation of search results can be enhanced by discovery. This slide and the next show dynamic, clustered search results from Grokker… live.grokker.com/grokker.html?query=text%20analytics&Yahoo=true&Wikipedia=true&numResults=250
  • 29. … with a zoomable display. Clustering here utilizes statistical (text) data mining techniques to identify cohesive groupings of retrieved documents.
  • 30. More results clustering... A dynamic network viz.: the Touch-Graph Google-Browser applet touchgraph.com/TGGoogleBrowser.php?start=text%20analytics
  • 32. Text mining A two-by-two grid of goals against data types. Search/Query (goal-oriented): Data Retrieval over fielded data, Information Retrieval over documents. Discovery (opportunistic): Data Mining over fielded data, Text Mining over documents. Based on Je Wei Liang, www.database.cis.nctu.edu.tw/seminars/2003F/TWM/slides/p.ppt
  • 33. Text analytics Diagram labels: Semantic Search; BI Search; Data Mining; Text Mining; Data Retrieval; Information Retrieval; Search/Query (goal-oriented); Discovery (opportunistic); Fielded Data; Documents. Where’s Text Analytics?
  • 34. Text analytics Text analytics automates what researchers, writers, scholars, and all the rest of us have been doing for years. Text analytics – Applies linguistic and/or statistical techniques to extract concepts and patterns that can be applied to categorize and classify documents, audio, video, images. Transforms “unstructured” information into data for application of traditional analysis techniques. Unlocks meaning and relationships in large volumes of information that were previously unprocessable by computer.
  • 35. Text analytics Typical steps in text analytics include – Retrieve documents for analysis. Apply statistical and/or linguistic and/or structural techniques to identify, tag, and extract entities, concepts, relationships, and events (features) within document sets. Apply statistical pattern-matching & similarity techniques to classify documents and organize extracted features according to a specified or generated categorization / taxonomy. – via a pipeline of statistical & linguistic steps.
  • 36. Text analytics Why do we need linguistics? The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78. The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite fell 6.84, or 0.32 percent, to 2,162.78. John pushed Max. He fell. John pushed Max. He laughed. (Examples from Luca Scagliarini, Expert System; Laure Vieu and Patrick Saint-Dizier.)
  • 37. New York Times, September 8, 1957 Anaphora / coreference External reference
  • 41. Information extraction When we understand, for instance, parts of speech – <subject> <verb> <object> – we’re in a position to discern facts and relationships. Let's see text augmentation (tagging) in action. We'll use GATE, an open-source tool...
  • 46. Example: E-mail What else can we extract? Let’s look at an e-mail message – Date: Sun, 13 Mar 2005 19:58:39 -0500 From: Adam L. Buchsbaum <[email protected]> To: Seth Grimes <[email protected]> Subject: Re: Papers on analysis on streaming data seth, you should contact divesh srivastava, [email protected] regarding at&t labs data streaming technology. adam
  • 47. Example: E-mail An e-mail message is “semi-structured.” From semi-structured text, it’s especially easy to extract metadata. There are many forms of semi-structured information...
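To make the point about metadata concrete, here is a sketch that parses a message shaped like the one on the previous slide with Python's standard `email` package. The addresses in the source are obscured, so the `example.com` addresses below are placeholders invented for this sketch, and the address regex is a simple approximation rather than a full RFC 5322 matcher.

```python
import re
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Message modeled on the slide; all addresses are hypothetical placeholders.
raw = """\
Date: Sun, 13 Mar 2005 19:58:39 -0500
From: Adam L. Buchsbaum <adam@example.com>
To: Seth Grimes <seth@example.com>
Subject: Re: Papers on analysis on streaming data

seth, you should contact divesh srivastava, divesh@example.com
regarding at&t labs data streaming technology.

adam
"""

msg = message_from_string(raw)

# Structured part: header fields parse directly into typed metadata.
sender_name, sender_addr = parseaddr(msg["From"])
recipient_name, recipient_addr = parseaddr(msg["To"])
sent = parsedate_to_datetime(msg["Date"])
subject = msg["Subject"]

# Free-text part: a rough pattern pulls e-mail addresses out of the body.
ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
mentioned = ADDRESS.findall(msg.get_payload())

print(sender_name, sender_addr, sent.isoformat(), subject, mentioned)
```

The headers yield sender, recipient, date, and subject with no linguistic processing at all. The body still needs pattern matching (and, for the person and organization names, entity extraction) to give up its contents.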
  • 48. Example: Survey The respondent is invited to explain his/her attitude:
  • 49. Example: Survey A survey of this type, like an e-mail message, is “semi-structured.” Exploit what is structured in interpreting and using the free text. Use the metadata that describes the information and its provenance. Sentiment extraction comes into play for Voice of the Customer / Customer Experience Management applications.
  • 50. Attitudinal data Sentiment / opinion extraction – Applications include: Reputation management. Competitive intelligence. Quality improvement. Trend spotting. Sources include: Wikis, blogs, forums, and newsgroups. Media stories and product reviews. Contact-center notes and transcripts. Customer feedback via Web-site forms and e-mail. Survey verbatims.
  • 51. Attitudinal data We need to – Identify and access candidate sources. Extract sentiment to databases. Correlate expressed sentiment to measures such as: Sales by product, location, time, etc. Defects by part, circumstances, etc. And information such as – Customer information and customer’s transactions. Correlation depends on semantic agreement: are we talking about the same things?
  • 52. Sentiment and opinion Consider text from Dell’s IdeaStorm.com – “Dell really... REALLY need to stop overcharging... and when i say overcharing... i mean atleast double what you would pay to pick up the ram yourself.” What Sentiment is expressed or implied? Subject? Polarity / Valence? Intensity? Mood? Opinion?
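A crude, lexicon-based sketch of two of the questions on this slide, polarity and intensity. The word lists, weights, and the scoring rule are all assumptions made up for illustration; it is not how any particular product works, and it ignores subject, mood, negation, and misspellings such as "overcharing".

```python
import re

# Tiny hypothetical lexicon: word -> polarity weight.
LEXICON = {"overcharging": -1.0, "stale": -1.0, "excellent": 1.0, "inviting": 1.0}
# Hypothetical intensifier list.
INTENSIFIERS = {"really", "very"}

def score_sentiment(text):
    """Return (polarity, intensity) for a snippet of text.

    polarity  = sum of lexicon weights of the words present
    intensity = 1.0, plus 0.5 per intensifier, plus 0.5 per all-caps word
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    polarity = sum(LEXICON.get(t.lower(), 0.0) for t in tokens)
    boosts = sum(1 for t in tokens if t.lower() in INTENSIFIERS)
    shouted = sum(1 for t in tokens if len(t) > 1 and t.isupper())
    intensity = 1.0 + 0.5 * boosts + 0.5 * shouted
    return polarity, intensity

polarity, intensity = score_sentiment(
    "Dell really... REALLY need to stop overcharging"
)
print(polarity, intensity)
```

Even this toy version shows why the slide asks several separate questions: the negative polarity comes from one word, while the intensity comes from repetition and capitalization, which a plain bag-of-words count would discard.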
  • 53. Example: law enforcement Take law enforcement as an example – Sources: case files, crime reports, incident and victimization databases, legal documents. Targets: crime patterns, criminal investigation, networks.
  • 54. Example: law enforcement An Attensity law-enforcement example – NLP to identify roles and relationships.
  • 56. Case study: IBM’s MedTAKMI MedTAKMI = Text Analysis and Knowledge MIning for Biomedical Documents (ibm.com). (Project dates back several years, but it’s a great example.) Goal is to extract relationships among biomedical entities (e.g., proteins and genes) from patterns such as “A inhibits B” and “A activates B.” Work starts with a “syntactic parser” that identifies entities and basic binary (a noun and a verb) and ternary (two nouns and a verb) relationships.
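The “A inhibits B” / “A activates B” patterns can be illustrated with a regular expression that pulls out ternary (noun, verb, noun) relationships. This is a deliberate simplification swapped in for the syntactic parser the slide mentions, not a description of MedTAKMI itself; the sample sentences and the function name are invented for the sketch.

```python
import re

# Entity tokens: a letter followed by letters, digits, underscores, or hyphens
# (enough for names like "Wnt-1" or "LEF-1"). Verbs are limited to the two
# patterns named on the slide.
RELATION = re.compile(
    r"\b([A-Za-z][\w-]*)\s+(inhibits|activates)\s+([A-Za-z][\w-]*)"
)

def extract_relations(text):
    """Return (subject, verb, object) triples for each pattern match."""
    return [(m.group(1), m.group(2), m.group(3)) for m in RELATION.finditer(text)]

# Made-up sentences in the style of a biomedical abstract.
sample = "Dvl inhibits GSK. Wnt-1 activates LEF-1 in mammalian cells."
relations = extract_relations(sample)
print(relations)
```

A surface pattern like this misses passive constructions ("GSK is inhibited by Dvl"), nominalizations, and multi-word entity names, which is where a real parser and entity dictionaries earn their keep.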
  • 57. Case study: IBM’s MedTAKMI MEDLINE from the National Center for Biotechnology Information hosts links to many widely used information sources such as the PubMed database of 18 million biomedical journal abstracts. Visit www.ncbi.nlm.nih.gov.
  • 60. VOC research study I conducted a study on Voice of the Customer (VOC) text analytics last year. I polled: individuals with experience applying VOC text analytics to real-world business problems at their organizations; a number of vendor representatives and industry analysts.
  • 62. ROI measured, planned & achieved
  • 63. Solution providers What should a prospective user look for? Response percent: deep sentiment/opinion extraction, 80%; ability to use specialized dictionaries or taxonomies, 76%; broad information extraction capability, 60%; adaptation for particular sectors, e.g., hospitality, retail, health care, communications, 56%; predictive-analytics integration, 48%; BI (business intelligence) integration, 48%; support for multiple languages, 48%; ability to create custom workflows, 32%; low cost, 32%; hosted or "as a service" option, 32%; specialized VoC analysis interface, 24%.
  • 64. Key message If you analyze only transactional data, you miss opportunity or incur risk... “Industries such as travel and hospitality and retail live and die on customer experience.” -- Clarabridge CEO Sid Banerjee “Organizations embracing text analytics all report having an epiphany moment when they suddenly knew more than before.” -- Philip Russom, the Data Warehousing Institute

Editor's Notes

  • #14: This course is, in essence, about the information enterprises have and how they use it and how they could better use it. First we look at enterprise information in light of business goals in order to characterize the “unstructured” information gap. We then look at how that information, or at least the textual variety, may be structured for use. Then we look at a few uses, at enriching search, surely one of today’s killer apps, and at enhancing business intelligence via search.
  • #34: We earlier used a diagram that showed the relationship between search and discovery and operations on fielded data and on free-text documents. We will take those two methods, search and discovery, and add a third, analysis, to the picture. In the intersection of search and analysis we have BI search, and in the intersection of search and discovery we have semantic search.