Dealing with Unstructured Data: Scaling to Infinity

The document discusses the challenges and strategies for managing unstructured and semi-structured data in the context of big data, highlighting the importance of utilizing cloud services and SQL-based queries for effective data analysis. It provides statistics on Treasure Data's performance, including vast amounts of data imported and stored, and emphasizes the need for trade-offs between ease of data collection and system performance. Additionally, it mentions various tools and libraries like Fluentd and Hivemall that facilitate data processing and machine learning in big data environments.

Image: Boykung/Shutterstock

Copyright ©2014 Treasure Data. All Rights Reserved.
Big Data Simplified: One Approach
[Diagram: collection-to-analysis pipeline]
Sources: App Servers, Mobile SDKs, Web SDK (Embedded SDKs and Server-side Agents), each sending Multi-structured Events
• register
• login
• start_event
• purchase
• etc
Infinite & Economical Cloud Data Store
✓ App log data
✓ Mobile event data
✓ Sensor data
✓ Telemetry
Familiar & Table-oriented access via SQL
• SQL-based Ad-hoc Queries
• SQL-based Dashboards
Results Push to DBs & Data Marts and Other Apps
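
To make the "multi-structured events in, table-oriented SQL out" idea on this slide concrete, here is a minimal sketch in Python. It is not Treasure Data's implementation: the in-memory SQLite store, the `load` helper, and the sample field names (`user_id`, `amount`, and so on) are all assumptions chosen for illustration. It shows events with differing fields landing in one table whose columns widen as new keys appear, followed by an ordinary SQL query.

```python
import sqlite3

# Hypothetical multi-structured events: each event type carries different fields.
events = [
    {"event": "register", "user_id": "u1", "plan": "free"},
    {"event": "login", "user_id": "u1", "device": "ios"},
    {"event": "purchase", "user_id": "u1", "amount": 9.99, "item": "sword"},
    {"event": "purchase", "user_id": "u2", "amount": 4.5},
]


def load(conn, events):
    """Store events in one table, adding a column whenever a new key appears."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (event TEXT)")
    # PRAGMA table_info returns (cid, name, type, ...); index 1 is the column name.
    known = {row[1] for row in conn.execute("PRAGMA table_info(events)")}
    for ev in events:
        for key in ev:
            if key not in known:
                # Keys are trusted here; real input would need validation.
                conn.execute(f'ALTER TABLE events ADD COLUMN "{key}"')
                known.add(key)
        cols = ", ".join(f'"{k}"' for k in ev)
        marks = ", ".join("?" for _ in ev)
        conn.execute(
            f"INSERT INTO events ({cols}) VALUES ({marks})", list(ev.values())
        )


conn = sqlite3.connect(":memory:")
load(conn, events)

# An ad-hoc, table-oriented query over the collected events.
revenue = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "WHERE event = 'purchase' GROUP BY user_id ORDER BY user_id"
).fetchall()
print(revenue)
```

Fields that an event does not carry are simply NULL in its row, which is what lets senders add new fields without a schema migration on their side.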

What is the point of all this data?
BI: Business Intelligence Using Very Large Sets of Data

Copyright ©2015 Treasure Data. All Rights Reserved.
Data Records Stored in the Treasure Data Cloud Service
[Chart: records stored, Aug-12 to Dec-14 (last 2 years); y-axis from 0 to 14 trillion records]
Labels on the chart: Service Launched, Series A Funding, 100 Customers, Selected by Gartner as Cool Vendor in Big Data, 5 Trillion Records, 10 Trillion Records, 13 Trillion Records, Series B Funding
Treasure Data By the Numbers (Jan-2015):
• 13T+ records of data imported since launch
• 500K+ records imported each second
• 1.5 Trillion+ records imported each month
• 12B records sent per day by one customer
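
As a rough sanity check on how these figures relate, the per-second and per-month numbers can be converted into each other. The sketch below assumes a 30-day month and treats "500K+" as a floor; the helper names are hypothetical and the slide itself gives only the headline figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_month(records_per_second, days=30):
    """Records accumulated over a month at a constant per-second rate."""
    return records_per_second * SECONDS_PER_DAY * days


def per_second(records_per_day):
    """Average per-second rate implied by a daily total."""
    return records_per_day / SECONDS_PER_DAY


# 500K records/second sustained for 30 days.
monthly_floor = per_month(500_000)

# 12B records/day from one customer, expressed as an average rate.
one_customer_rate = per_second(12_000_000_000)

print(f"{monthly_floor:,} records/month at 500K records/second")
print(f"{one_customer_rate:,.0f} records/second from the 12B/day customer")
```

At exactly 500K records per second the monthly total comes to roughly 1.3 trillion, which is in line with the "1.5 Trillion+" figure once the "+" on the per-second rate is taken into account.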

Statistics
• Total Records Stored: 25 Trillion
• Managed & Supported: 24 * 7 * 365
• Uptime: 99.99%
• New Records / second: 1 Million
• Daily Twitter volume: 100x
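
For a sense of scale, two of these figures can be turned into more tangible numbers: what 99.99% uptime allows in downtime per year, and what 1 million new records per second adds up to per day. This is plain arithmetic on the slide's numbers, assuming a 365-day year and a constant ingest rate; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def downtime_minutes_per_year(uptime_percent):
    """Minutes per year a service may be down at the given uptime percentage."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


allowed = downtime_minutes_per_year(99.99)
daily_records = 1_000_000 * SECONDS_PER_DAY

print(f"99.99% uptime allows about {allowed:.2f} minutes of downtime per year")
print(f"1 million records/second is {daily_records:,} records per day")
```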

A solution?
• There are trade-offs to consider
• Any trade-off should make it easy to collect data
• Easy does it! Un- and semi-structured data (multi-structured data)
• Open source means it’s free; it also means you need someone on hand to implement and maintain it
• Cloud storage means you don’t have to scale and/or shard; the trade-off is a performance hit compared with bare metal
Image: John Hammink

Images: Lightspring/Shutterstock, John Hammink, Treasure Data
There are a few intro to Data Science blog posts at blog.treasuredata.com!

Open vs. Closed source
Image: Heather Craig/Shutterstock

Images: PC World, Data-Hive, Wallpapersmela
or
or
?

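# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match *.*>
  type copy
  <store>
    type elasticsearch
    # value assumed; the slide shows only the parameter name
    logstash_format true
  </store>
  # the slide is cut off here; the HDFS <store> block is not shown
</match>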
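
The `<match *.*>` line in the configuration routes events by tag. As documented for Fluentd, `*` matches exactly one tag part and `**` matches zero or more parts, so `*.*` catches two-part tags such as `web.access`. The following Python sketch imitates that rule for illustration only. It is not Fluentd's own matcher, the function names are hypothetical, and it ignores other pattern forms such as `{a,b}` alternatives.

```python
def tag_matches(pattern, tag):
    """Return True if a dot-separated tag matches a simplified match pattern."""
    return _match(pattern.split("."), tag.split("."))


def _match(pat, parts):
    if not pat:
        # Pattern exhausted: it matches only if the tag is exhausted too.
        return not parts
    head, rest = pat[0], pat[1:]
    if head == "**":
        # "**" absorbs zero or more tag parts; try every possible split.
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    if head == "*" or head == parts[0]:
        # "*" absorbs exactly one part; a literal must equal the part.
        return _match(rest, parts[1:])
    return False


for tag in ("web.access", "web", "web.access.error"):
    print(tag, tag_matches("*.*", tag))
```

Under this rule the file source tagged `web.access` is picked up by `*.*`, while a one-part or three-part tag would need a different pattern such as `**`.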

Multi-structured data
• Unstructured data is better for data ultimately used in statistics

fluentd!
http://www.fluentd.org/

http://msgpack.org/
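
MessagePack is a binary serialization format that covers the same kinds of values as JSON in fewer bytes. To show where the savings come from, here is a hand-written Python sketch that encodes only a tiny subset of the format (nil, booleans, integers 0 to 127, strings up to 31 bytes, maps up to 15 entries) using the single-byte type prefixes from the MessagePack specification. The `pack` function is a hypothetical teaching aid, not the official library, and it omits every other type.

```python
import json


def pack(obj):
    """Encode a small subset of MessagePack; raise ValueError for anything else."""
    if obj is None:
        return b"\xc0"                             # nil
    if obj is True:
        return b"\xc3"                             # true
    if obj is False:
        return b"\xc2"                             # false
    if isinstance(obj, int) and 0 <= obj <= 127:
        return bytes([obj])                        # positive fixint
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: 101xxxxx + bytes
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])             # fixmap: 1000xxxx + pairs
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise ValueError(f"unsupported value in this sketch: {obj!r}")


record = {"compact": True, "schema": 0}
packed = pack(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(packed.hex(), len(packed), "bytes as MessagePack")
print(as_json.decode(), len(as_json), "bytes as JSON")
```

The length and type of each value travel in a single prefix byte, so no quotes, colons, or braces are needed.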

Embulk: an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services
embulk.org/docs

Hivemall
Hivemall is a scalable machine learning library that runs on Apache Hive.
Hivemall is designed to be scalable to the number of training instances as well as the number of training features.
• Classification
• Regression
• Recommendation
• k-nearest neighbor
• Anomaly Detection
• Feature Engineering
https://github.com/myui/hivemall

The Hadoop Story on MongoDB
Image courtesy of Steven Francia @ Docker

Dealing with Unstructured Data: Scaling to Infinity

2. Image: John Hammink

4. There are many sources of information

11. Image: Dreamstime

13. What does a pipeline need?

16. LAMBDA ARCHITECTURE

18. LESS SIMPLE FORWARDING

19. Before fluentd

28. Questions?