Dealing with Unstructured Data: Scaling to Infinity
Image: Boykung/Shutterstock
Image: John Hammink
There are many sources of information
Copyright ©2014 Treasure Data. All Rights Reserved.
Big Data Simplified: One Approach
Multi-structured events:
• register
• login
• start_event
• purchase
• etc
Collected through:
• App servers
• Mobile SDKs
• Web SDK
• Embedded SDKs
• Server-side agents
Infinite & Economical Cloud Data Store:
✓ App log data
✓ Mobile event data
✓ Sensor data
✓ Telemetry
Familiar & Table-oriented SQL access:
• SQL-based ad-hoc queries
• SQL-based dashboards
• Results push to DBs & data marts
• Results push to other apps
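The flow on this slide (multi-structured events collected into one store, then queried with SQL) can be sketched in a few lines. This is a minimal illustration, not Treasure Data's implementation: the sample events, the `events` table, and the use of an in-memory SQLite database as the "data store" are all assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical multi-structured events: each event type carries different fields.
events = [
    {"event": "register", "user": "u1", "plan": "free"},
    {"event": "login", "user": "u1", "device": "ios"},
    {"event": "login", "user": "u2", "device": "web"},
    {"event": "purchase", "user": "u2", "item": "sword", "price": 4.99},
]


def load(conn, records):
    """Store each event's name plus its raw JSON payload (no fixed schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (event TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(r["event"], json.dumps(r)) for r in records],
    )


def counts_by_event(conn):
    """Ad-hoc SQL query: how many events of each type were collected?"""
    rows = conn.execute(
        "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")  # stand-in for the cloud data store
load(conn, events)
print(counts_by_event(conn))
```

Keeping the raw payload next to a few extracted columns is one common way to stay table-oriented for queries without forcing every event type into the same shape.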
What is the point of all this data?
BI: Business Intelligence Using Very Large Sets of Data
Copyright ©2015 Treasure Data. All Rights Reserved.
Data Records Stored in the Treasure Data Cloud Service
[Chart: records stored, Aug-12 to Dec-14 (last 2 years); vertical axis from 0 to 14 trillion records]
Milestones marked on the chart:
• Service Launched
• Series A Funding
• 100 Customers
• Selected by Gartner as Cool Vendor in Big Data
• 5 Trillion Records
• 10 Trillion Records
• 13 Trillion Records
• Series B Funding
Treasure Data By the Numbers (Jan-2015):
• 13T+ records of data imported since launch
• 500K+ records imported each second
• 1.5 Trillion+ records imported each month
• 12B records sent per day by one customer
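The per-second, per-day, and per-month figures on this slide can be related with simple arithmetic. The sketch below assumes a 30-day month and a perfectly steady ingest rate, neither of which the slide states, so treat the results as order-of-magnitude checks only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_month(records_per_second, days=30):
    """Records accumulated at a steady rate over `days` days (assumed 30)."""
    return records_per_second * SECONDS_PER_DAY * days


def per_second(records_per_day):
    """Average rate implied by a daily total."""
    return records_per_day / SECONDS_PER_DAY


monthly = per_month(500_000)                 # "500K+ records imported each second"
customer_rate = per_second(12_000_000_000)   # "12B records sent per day by one customer"

print(f"{monthly:,} records per 30-day month")
print(f"{customer_rate:,.0f} records per second from one customer")
```

At exactly 500K records per second this gives about 1.3 trillion records in 30 days, which is the same order as the "1.5 Trillion+" monthly figure; the "+" on both numbers leaves room for the difference.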
Statistics
• Total records stored: 25 trillion
• Managed & supported: 24 * 7 * 365
• Uptime: 99.99%
• New records per second: 1 million
• Daily Twitter volume: 100x
A solution?
• There are trade-offs to consider
• Any trade-off should make it easy to collect data
• Easy does it! Un- and semi-structured data (multi-structured data)
• Open source means it’s free; it also means that you need someone on hand to implement and maintain it
• Cloud storage means you don’t have to scale and/or shard; the trade-off is a performance hit compared with bare metal
Image: John Hammink
Image: Dreamstime
Images: Lightspring/Shutterstock, John Hammink, Treasure Data
There are a few introductory data science posts at blog.treasuredata.com!
What does a pipeline need?
Open vs. Closed source
Image: Heather Craig/Shutterstock
Images: PC World, Data-Hive, Wallpapersmela
LAMBDA ARCHITECTURE
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match *.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  # the HDFS <store> section is cut off on the slide
</match>
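The routing idea in the configuration above (a `<match *.*>` pattern selecting events by tag, and `type copy` sending each matched event to every store) can be mimicked in plain Python. This is a simplified sketch, not Fluentd's code: it handles only the `*` wildcard (one tag part per `*`), and `ListStore` and `route` are hypothetical names standing in for real outputs such as Elasticsearch or HDFS.

```python
def tag_matches(pattern, tag):
    """Simplified matching: '*' stands for exactly one dot-separated tag part."""
    p_parts = pattern.split(".")
    t_parts = tag.split(".")
    if len(p_parts) != len(t_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(p_parts, t_parts))


class ListStore:
    """Hypothetical in-memory stand-in for an output plugin."""

    def __init__(self, name):
        self.name = name
        self.records = []

    def write(self, tag, record):
        self.records.append((tag, record))


def route(pattern, stores, tag, record):
    """Mimic 'type copy': a matching event is written to every store."""
    if not tag_matches(pattern, tag):
        return False
    for store in stores:
        store.write(tag, dict(record))  # each store gets its own copy
    return True


es, hdfs = ListStore("es"), ListStore("hdfs")
route("*.*", [es, hdfs], "web.access", {"path": "/index.html", "code": 200})
route("*.*", [es, hdfs], "web", {"note": "one-part tag, not matched"})
print(len(es.records), len(hdfs.records))
```

Only the `web.access` event reaches the stores, because `*.*` requires a tag with exactly two parts in this simplified model.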
LESS SIMPLE FORWARDING
Before fluentd
Multi-structured data
• Unstructured data: better for data ultimately used in statistics
fluentd!
http://www.fluentd.org/
http://msgpack.org/
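MessagePack is a compact binary serialization format that Fluentd uses for its events. To show why it helps at high event volumes, the sketch below hand-encodes a tiny subset of the format (booleans, integers 0 to 127, strings up to 31 bytes, maps up to 15 entries) and compares the size with compact JSON. The `pack` function is a hypothetical illustration written for this example; real code would use a MessagePack library.

```python
import json


def pack(obj):
    """Encode a tiny subset of MessagePack: bool, small int, short str, small dict."""
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int) and 0 <= obj <= 127:
        return bytes([obj])                      # positive fixint
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) > 31:
            raise ValueError("only fixstr (<= 31 bytes) is supported here")
        return bytes([0xA0 | len(data)]) + data  # fixstr
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])           # fixmap
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError(f"unsupported value: {obj!r}")


event = {"compact": True, "schema": 0}
as_json = json.dumps(event, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(event)

print(len(as_json), "bytes as JSON")
print(len(as_msgpack), "bytes as MessagePack:", as_msgpack.hex(" "))
```

For this small record the binary form is 18 bytes against 27 bytes of JSON; the saving comes from one-byte type and length markers replacing quotes, colons, and braces.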
Embulk: an open-source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services
embulk.org/docs
Hivemall
Hivemall is a scalable machine learning library that runs on Apache Hive.
It is designed to scale with both the number of training instances and the number of training features.
• Classification
• Regression
• Recommendation
• k-nearest neighbor
• Anomaly Detection
• Feature Engineering
https://github.com/myui/hivemall
The Hadoop Story on MongoDB
Image courtesy of Steven Francia @ Docker
Questions?
