© 2019 The MITRE Corporation. All rights reserved.
Apache Tika
Tim Allison
tallison@apache.org, @_tallison
April 24, 2019
Haystack Conference
Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
| 2 |
Overview
▪ What is Tika
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond
| 3 |
Text/Metadata Extraction
| 4 |
Things Can Happen
▪ Tired:
– Exceptions
– Unsupported file formats
– Encrypted files
– Garbled text
– Missing text
▪ Wired:
– OOM
– Seg fault
– Infinite loops
– Multithreaded garbage collector pegging all CPU resources
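
The "wired" failures above (OOM, seg faults, infinite loops) cannot be caught with an ordinary exception handler, so one common defence is to run each extraction in a separate process with a time limit. The sketch below shows that idea in Python. It is not Tika's own mechanism; `run_isolated` and the status strings are hypothetical names chosen for illustration, and the command passed in stands for whatever extractor process you actually launch.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run one extraction command in a child process.

    Returns (status, stdout), where status is "ok", "crash", or "timeout".
    The names and status values are illustrative, not part of any Tika API.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so a parser stuck in an infinite loop does not linger.
        return "timeout", ""
    if done.returncode != 0:
        # Covers non-zero exits and, on POSIX, deaths by signal
        # (reported as a negative return code), e.g. a seg fault.
        return "crash", done.stdout
    return "ok", done.stdout
```

For example, `run_isolated([sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=2)` returns a `"timeout"` status after about two seconds, and the calling process carries on with the next file.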
| 5 |
Stands up on Soap Box
| 6 |
Upgrade from PDFBox 1.8.6->1.8.7
| 9 |
Soap Box
If your search system can’t tell the difference between those two…
👍You’ve got a neat, little demo!👍
You don’t have a search system.
| 10 |
Steps Off of Soap Box
| 11 |
tika-eval
▪ Profile individual runs
▪ Compare two runs
▪ Exceptions by mime
▪ Out of vocabulary (OOV) statistics
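
An out-of-vocabulary statistic can flag garbled extractions: if most tokens in the extracted text are not known words, the text is probably junk. Below is a minimal sketch of that idea. It assumes a simple "share of tokens not in a common-words list" definition, which may differ from how tika-eval computes its figure; `COMMON_WORDS` is a tiny hypothetical stand-in for a real per-language word list.

```python
import re

# Hypothetical, tiny stand-in for a per-language list of common words.
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "text"}

# Runs of letters only (Unicode-aware); digits and punctuation are ignored.
TOKEN = re.compile(r"[^\W\d_]+")


def oov_rate(text, vocab=COMMON_WORDS):
    """Fraction of alphabetic tokens not found in vocab; None if no tokens."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if not tokens:
        return None
    missing = sum(1 for t in tokens if t not in vocab)
    return missing / len(tokens)
```

A clean extract such as `"the text of the text"` scores 0.0 against this list, while a string of unknown tokens scores 1.0. Comparing the score for the same file across two parser versions is one way to spot a regression without reading the output by hand.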
| 12 |
tika-eval: Eating our own dog food
▪ 3 million files (~1 TB) from Common Crawl and govdocs1 hosted on a public virtual machine, provided by Rackspace
▪ Code to profile a single run or compare two runs before release
▪ Evaluation methodology co-developed with and now co-run by open source colleagues (around the world) on the MSOffice parser project and the PDF parser project
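
Comparing two runs before a release can be pictured as a diff over per-file records. The sketch below counts exceptions by MIME type for each run and lists files that started or stopped failing. The record layout (a dict with `mime` and `exception` keys) and the function names are assumptions made for this example, not tika-eval's actual schema.

```python
from collections import Counter


def exceptions_by_mime(run):
    """Count failed files per MIME type. `run` maps file id -> record dict."""
    return Counter(rec["mime"] for rec in run.values() if rec.get("exception"))


def compare_runs(run_a, run_b):
    """Summarise what changed between two extraction runs over the same files."""
    shared = run_a.keys() & run_b.keys()
    # Files that parsed cleanly in run A but threw in run B.
    new_failures = sorted(
        f for f in shared
        if run_b[f].get("exception") and not run_a[f].get("exception")
    )
    # Files that threw in run A but parse cleanly in run B.
    fixed = sorted(
        f for f in shared
        if run_a[f].get("exception") and not run_b[f].get("exception")
    )
    return {
        "new_failures": new_failures,
        "fixed": fixed,
        "by_mime_a": exceptions_by_mime(run_a),
        "by_mime_b": exceptions_by_mime(run_b),
    }
```

With run A as the current release and run B as the release candidate, a non-empty `new_failures` list is the signal to investigate before shipping.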
| 13 |
Tika 1.21 and beyond
▪ Tika 1.21
– csv/tsv detector and parser (Apache commons-csv)
– Improved zip-based (.docx, .pptx, .xlsx) file detection and parsing
▪ Beyond
– Modularize tika-eval and include stats within the extract for scalability and aggregation of stats within Solr/Elastic
– Increase coverage/speed of zip-based file detection; can we move entirely to streaming detection?
– Improve language coverage/language ID component within tika-eval
▪ Help!
– What do you need?
– How can you help us help you?
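
To make the zip-based detection item concrete: .docx, .pptx, and .xlsx files are all zip containers, so telling them apart means looking inside the archive. The sketch below is a simplified illustration that checks for one well-known entry per format. The `MARKERS` table and `detect_zip_based` are assumptions for this example; Tika's detector is more thorough.

```python
import io
import zipfile

# One marker entry per format, assumed sufficient for this sketch only.
MARKERS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def detect_zip_based(data):
    """Return a MIME type guess for bytes that may be a zip container."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "application/octet-stream"
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
    for marker, mime in MARKERS.items():
        if marker in names:
            return mime
    # A zip archive with none of the known markers.
    return "application/zip"
```

This version reads the archive's central directory, which sits at the end of the file, so it needs the whole file in hand. That is the cost the streaming question above is about: a streaming detector would have to decide from the entries as they arrive.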

Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
