Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison

This talk covers Apache Tika, a tool for extracting text and metadata from many file formats. It describes what can go wrong during extraction: exceptions, unsupported formats, encrypted files, garbled or missing text, and more severe failures such as out-of-memory errors and infinite loops. It also mentions tika-eval, a tool that profiles the output and exceptions of Tika runs. Plans for Tika 1.21 and beyond include a CSV/TSV detector and parser, improved detection and parsing of zip-based files, a more modular tika-eval, and better language identification.

© 2019 The MITRE Corporation. All rights reserved.
Apache Tika
Tim Allison
tallison@apache.org, @_tallison
April 24, 2019
Haystack Conference
Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6

Overview
▪ What is Tika
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond

Text/Metadata Extraction

Things Can Happen
▪ Tired:
– Exceptions
– Unsupported file formats
– Encrypted files
– Garbled text
– Missing text
▪ Wired:
– OOM
– Seg fault
– Infinite loops
– Multithreaded garbage collector pegging all CPU resources
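
One common way to survive the "wired" failures above (OOM, seg faults, infinite loops) is to run extraction in a separate process with a timeout, so a bad file can only take down a child. The sketch below illustrates that pattern only; it is not Tika's own mechanism. The names `run_isolated` and `ExtractResult` are hypothetical, and the extractor command is whatever you pass in.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExtractResult:
    status: str       # "ok", "timeout", or "crash"
    text: str = ""    # extracted text when status == "ok"
    detail: str = ""  # short explanation otherwise


def run_isolated(cmd, timeout_s=30.0):
    """Run an extractor command in a child process and classify the outcome.

    cmd is an argument list for any command that writes extracted text to
    stdout (an assumption of this sketch).
    """
    try:
        # subprocess.run kills the child if the timeout expires.
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Covers infinite loops and runaway garbage collection.
        return ExtractResult("timeout", detail=f"no result after {timeout_s}s")
    if proc.returncode != 0:
        # Covers crashes such as OOM or a seg fault in native code.
        return ExtractResult("crash", detail=f"exit code {proc.returncode}")
    return ExtractResult("ok", text=proc.stdout)


if __name__ == "__main__":
    # Stand-in "extractor": a child Python process that just prints text.
    print(run_isolated([sys.executable, "-c", "print('extracted text')"]))
```

Recording the status per file, rather than silently skipping failures, is what lets an indexing pipeline tell "no text in this file" apart from "the parser died on this file".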

Upgrade from PDFBox 1.8.6->1.8.7

Soap Box
If your search system can’t tell the difference between those two…
You don’t have a search system.
👍 You’ve got a neat, little demo! 👍

Tika 1.21 and beyond
▪ Tika 1.21
– csv/tsv detector and parser (Apache commons-csv)
– Improved zip-based (.docx, .pptx, .xlsx) file detection and parsing
▪ Beyond
– Modularize tika-eval and include stats within the extract for scalability and aggregation of stats w/in Solr/Elastic
– Increase coverage/speed of zip-based file detection; can we move entirely to streaming detection?
– Improve language coverage/lang id component w/in tika-eval
▪ Help!
– What do you need?
– How can you help us help you?
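
To make the zip-based detection item concrete: .docx, .pptx, and .xlsx files are all zip containers, so the container's contents have to be inspected to tell them apart. The sketch below is one simple, non-streaming way to do that by reading the package's `[Content_Types].xml` entry. It is not Tika's implementation; the function name `detect_zip_based` and the returned labels are hypothetical.

```python
import io
import zipfile

# Substrings of the OOXML main-part content types, mapped to a short label.
OOXML_MAIN_PARTS = {
    "wordprocessingml.document.main+xml": "docx",
    "presentationml.presentation.main+xml": "pptx",
    "spreadsheetml.sheet.main+xml": "xlsx",
}


def detect_zip_based(data: bytes):
    """Return "docx", "pptx", "xlsx", "zip", or None if data is not a zip.

    This reads the zip central directory, so it needs the whole file in
    memory; it is not a streaming detector.
    """
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return None
    try:
        with zipfile.ZipFile(buf) as zf:
            raw = zf.read("[Content_Types].xml")
    except KeyError:
        # A valid zip with no OOXML content-types entry.
        return "zip"
    except zipfile.BadZipFile:
        # Looked like a zip but the archive is damaged.
        return None
    content_types = raw.decode("utf-8", "replace")
    for marker, label in OOXML_MAIN_PARTS.items():
        if marker in content_types:
            return label
    return "zip"
```

Because the central directory sits at the end of a zip file, this approach cannot decide until the whole file is available, which is the trade-off behind the streaming-detection question above.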
