1© StreamSets, Inc. All rights reserved.
Project Ouroboros
Using StreamSets Data Collector to Help Manage
the StreamSets Open Source Community
Pat Patterson / Director of Evangelism
@metadaddy / pat@streamsets.com
2© StreamSets, Inc. All rights reserved.
Who Am I?
Pat Patterson / pat@streamsets.com / @metadaddy
Past: Sun Microsystems, Salesforce
Present: Director of Evangelism, StreamSets
I run far 🏃♂️
3© StreamSets, Inc. All rights reserved.
Who is StreamSets?
Seasoned leadership team
Customer base from Global 8000
50%
Unique commercial downloaders: 2000+
Open source downloads worldwide: 3,000,000+
Broad connectivity: 50+
History of innovation
streamsets.com/about-us
4© StreamSets, Inc. All rights reserved.
The StreamSets DataOps Platform
Data Lake
5© StreamSets, Inc. All rights reserved.
A Swiss Army Knife for Data
6© StreamSets, Inc. All rights reserved.
Objective
Parse Fastly CDN logs
Extract records relating to downloads
Gain insights:
Companies downloading the binaries
Geographic reach
Metrics for different binary artifacts
7© StreamSets, Inc. All rights reserved.
Before
Bash script to download S3 objects using AWS CLI tool
sed, grep, sort, uniq, awk, diff, xargs, curl
Complex, hard to maintain, slow, essentially ‘write-only’ code
cut -f 1 -d ' ' merge.log|sort|uniq > ips
diff --new-line-format="" --unchanged-line-format="" ips allips > newips
cat newips|xargs -L 1 -I% curl -s http://ipinfo.io/%/org|cut -f 2- -d ' '|sort|uniq>orgs && subl orgs
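As a rough illustration of what the first two shell commands on this slide compute, here is a minimal Python sketch. The function names `extract_ips` and `new_ips` are hypothetical, and the sketch assumes, as the `cut -f 1 -d ' '` call implies, that the IP address is the first space-delimited field of each line in merge.log.

```python
def extract_ips(log_lines):
    """Sorted, unique first space-delimited field of each line (cut | sort | uniq)."""
    return sorted(
        {line.rstrip("\n").split(" ", 1)[0] for line in log_lines if line.strip()}
    )


def new_ips(current_ips, known_ips):
    """IPs in current_ips that are absent from known_ips (the diff step)."""
    known = set(known_ips)
    return [ip for ip in current_ips if ip not in known]


if __name__ == "__main__":
    # Illustrative input only; not real log data.
    sample = ["10.0.0.2 GET /a\n", "10.0.0.1 GET /b\n", "10.0.0.2 GET /c\n"]
    ips = extract_ips(sample)
    print(new_ips(ips, ["10.0.0.1"]))
```

The third command (one ipinfo.io request per new IP) is left out of the sketch because it needs network access.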
8© StreamSets, Inc. All rights reserved.
Mission creep
Inertia
Why???
Image Nyah S / Pexels / Pexels License
9© StreamSets, Inc. All rights reserved.
Data Flow
StreamSets
Data Collector
↘
↘
Amazon S3
MySQL
10© StreamSets, Inc. All rights reserved.
Let’s Get Started!
Parse Fastly CDN log lines, send data to MySQL
<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: 104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET /datacollector/latest/parcel/manifest.json 200 1295
11© StreamSets, Inc. All rights reserved.
Simple, Right?
Grok Patterns are designed for exactly this!
Standard patterns for timestamps, HTTP verbs, filenames
<%{NUMBER:priority}>%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:cachenode} %{WORD:logname}\[%{NUMBER:pid}\]: %{IP:ip} "-" "-" %{DATESTAMP_FASTLY:datestamp} %{WORD:verb} %{PATH:file} %{NUMBER:code} %{SIZE_OR_NULL}
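To make the field layout concrete, the grok pattern above can be approximated with a plain regular expression. This is only a sketch: the sub-patterns are simplified stand-ins for the grok definitions, `DATESTAMP_FASTLY` and `SIZE_OR_NULL` are custom patterns whose definitions are not shown in the deck (assumed here to be an HTTP-style date and "digits or a dash"), and the names `FASTLY_LINE`, `parse_line` and the `size` group are hypothetical.

```python
import re

# Simplified stand-in for the grok pattern; each named group mirrors a grok field.
FASTLY_LINE = re.compile(
    r"<(?P<priority>\d+)>"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"(?P<cachenode>\S+) "
    r"(?P<logname>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r'"-" "-" '
    # Assumed shape of DATESTAMP_FASTLY, e.g. "Sun, 09 Jul 2017 12:01:12 GMT".
    r"(?P<datestamp>\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT) "
    r"(?P<verb>\w+) "
    # Strict, PATH-like rule: the requested file must start with "/".
    r"(?P<file>/\S*) "
    r"(?P<code>\d+) "
    # Assumed shape of SIZE_OR_NULL: a byte count or "-".
    r"(?P<size>\d+|-)$"
)


def parse_line(line):
    """Return a dict of fields for a matching log line, or None if it does not match."""
    match = FASTLY_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: '
        '104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET '
        '/datacollector/latest/parcel/manifest.json 200 1295'
    )
    print(parse_line(sample))
```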
12© StreamSets, Inc. All rights reserved.
First Cut
13© StreamSets, Inc. All rights reserved.
But...
Record1-Error SERVICE_ERROR_001 - Cannot parse record from message 'rawData': com.streamsets.pipeline.api.service.dataformats.DataParserException: LOG_PARSER_03 - Log line '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: 104.155.191.102 "-" "- Sun, 09 Jul 2017 12:01:12 GMT GET https://archives.streamsets.com/datacollector/latest/parcel/STREAMSETS_DATACOLLECTOR-1.1.4-el6.parcel 404 0' does not conform to 'Grok Format
What??? An HTTP request isn’t supposed to include the protocol like that!
Fastly records whatever the client sends, no matter how dumb.
14© StreamSets, Inc. All rights reserved.
Solution: Be Permissive with your Input
<%{NUMBER:priority}>%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:cachenode} %{WORD:logname}\[%{NUMBER:pid}\]: %{IP:ip} "-" "-" %{DATESTAMP_FASTLY:datestamp} %{WORD:verb} %{NOTSPACE:file} %{NUMBER:code} %{SIZE_OR_NULL}
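The effect of swapping PATH for NOTSPACE can be sketched in a few lines of Python. `STRICT_PATH` below is a deliberately simplified stand-in for a PATH-style rule (it is not the actual grok definition), while `PERMISSIVE` uses `\S+`, which is how the standard grok pattern set defines NOTSPACE. The helper name `matches` is hypothetical.

```python
import re

# Simplified stand-in for a PATH-style rule: the value must start with "/".
STRICT_PATH = re.compile(r"/\S*")
# NOTSPACE in the standard grok pattern set: any run of non-space characters.
PERMISSIVE = re.compile(r"\S+")


def matches(pattern, field):
    """True if the whole field satisfies the pattern."""
    return pattern.fullmatch(field) is not None


if __name__ == "__main__":
    plain = "/datacollector/latest/parcel/manifest.json"
    absolute = (
        "https://archives.streamsets.com/datacollector/latest/parcel/"
        "STREAMSETS_DATACOLLECTOR-1.1.4-el6.parcel"
    )
    for value in (plain, absolute):
        print(matches(STRICT_PATH, value), matches(PERMISSIVE, value))
```

The strict rule rejects the absolute URL that the client sent, whereas the permissive one accepts both forms and leaves any clean-up to a later stage.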
15© StreamSets, Inc. All rights reserved.
First Lesson Learned
Even if you think you know the data schema - test with real data!
16© StreamSets, Inc. All rights reserved.
Second Cut
17© StreamSets, Inc. All rights reserved.
But
Performance SUCKED!
18© StreamSets, Inc. All rights reserved.
Solution: Duplicate the Data
CREATE TABLE download (
id int(11) AUTO_INCREMENT,
ip varchar(64),
date datetime,
file varchar(767),
PRIMARY KEY (`id`),
KEY `date_idx` (`date`),
KEY `file_idx` (`file`)
);
19© StreamSets, Inc. All rights reserved.
Third Cut
20© StreamSets, Inc. All rights reserved.
30x Better Performance!
21© StreamSets, Inc. All rights reserved.
Filtering Downloads
22© StreamSets, Inc. All rights reserved.
Second Lesson Learned
Fit the data model to the data
23© StreamSets, Inc. All rights reserved.
What’s Next?
Look up company details from IP via KickFire API
24© StreamSets, Inc. All rights reserved.
Fourth Cut
25© StreamSets, Inc. All rights reserved.
But...
com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per second org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
KickFire API is rate limited!
To deliver optimum performance to all of our API customers, KickFire balances transaction loads by using rate limits
26© StreamSets, Inc. All rights reserved.
Solution - Rate Limit
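The deck does not spell out how the rate limit was configured, so the following is only a generic sketch of the technique: space out calls so that no more than a fixed number happen per second. The `Throttle` class and its parameters are hypothetical and are not part of Data Collector or the KickFire API; the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Allow at most max_per_second calls per second by sleeping between calls."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


if __name__ == "__main__":
    throttle = Throttle(max_per_second=5)
    for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:  # illustrative addresses
        throttle.wait()
        print("would look up", ip)
```

Calling `wait()` before each lookup keeps the request rate under the per-second ceiling, at the cost of slowing the whole pipeline to that rate.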
27© StreamSets, Inc. All rights reserved.
But...
com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per month org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
KickFire API has a monthly call limit!
28© StreamSets, Inc. All rights reserved.
Solution - Don’t Ask For Data We Already Have
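The idea of not asking for data we already have amounts to a lookup cache in front of the API. A minimal sketch follows, with a plain dict standing in for the MySQL table and a caller-supplied `fetch` function standing in for the API call; both, along with the name `lookup_company`, are assumptions for illustration.

```python
def lookup_company(ip, cache, fetch):
    """Return company details for ip, calling fetch(ip) only on a cache miss."""
    if ip in cache:
        return cache[ip]
    details = fetch(ip)
    cache[ip] = details  # remember the answer so the API is not asked again
    return details


if __name__ == "__main__":
    api_calls = []

    def fake_fetch(ip):
        # Stand-in for the real API request; records each call it receives.
        api_calls.append(ip)
        return {"company": "Example Corp"}

    cache = {}
    lookup_company("10.0.0.1", cache, fake_fetch)
    lookup_company("10.0.0.1", cache, fake_fetch)
    print(len(api_calls))
```

Only the first lookup for an address reaches the API; repeats are served from the cache, which is what conserves a monthly call quota.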
29© StreamSets, Inc. All rights reserved.
Third Lesson Learned
Know your API’s non-functional constraints!
30© StreamSets, Inc. All rights reserved.
Fifth Cut
31© StreamSets, Inc. All rights reserved.
Leave to run for a few weeks...
Image © Itzuvit / Wikimedia Commons / CC-BY-SA-3.0
32© StreamSets, Inc. All rights reserved.
But...
com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per month org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
KickFire’s monthly call limit strikes again!
33© StreamSets, Inc. All rights reserved.
Root Cause
Large numbers of downloads come from the same few IP addresses
Data Collector has a microbatch architecture - database writes are committed at the end of the batch
A new IP address isn’t visible in the database until the start of the next batch
So the pipeline was still making repeated requests to KickFire for the same IP address within a batch!
34© StreamSets, Inc. All rights reserved.
Solution - Deduplicate records on IP Address
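A small sketch of why deduplication within the batch matters: if the store of known IPs is only updated when the batch commits, every repeat of a new IP inside that batch looks like a miss unless it is tracked separately. The function `process_batch`, the record shape and the dict used as the "database" are all hypothetical stand-ins, not Data Collector APIs.

```python
def process_batch(batch, known_ips, fetch):
    """Look up each previously unknown IP in the batch at most once.

    known_ips stands in for the database table. It is only updated at the
    end of the function, mimicking a commit at the end of the batch.
    """
    seen_in_batch = set()
    results = {}
    for record in batch:
        ip = record["ip"]
        if ip in known_ips or ip in seen_in_batch:
            continue  # already stored, or already requested in this batch
        seen_in_batch.add(ip)
        results[ip] = fetch(ip)
    known_ips.update(results)  # the "commit" at the end of the batch
    return results


if __name__ == "__main__":
    api_calls = []

    def fake_fetch(ip):
        # Stand-in for the real API request; records each call it receives.
        api_calls.append(ip)
        return {"company": "Example Corp"}

    known = {}
    batch = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}]
    process_batch(batch, known, fake_fetch)
    print(api_calls)
```

Without the `seen_in_batch` set, the repeated address in the batch would trigger a second API call, because the first result is not yet visible in `known_ips`.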
35© StreamSets, Inc. All rights reserved.
Fourth Lesson Learned
Data Collector operates batch-by-batch - design your pipelines accordingly!
36© StreamSets, Inc. All rights reserved.
The Finished Article
37© StreamSets, Inc. All rights reserved.
A Closer Look
38© StreamSets, Inc. All rights reserved.
Ultimate Lesson Learned
No plan survives first contact with the enemy
Helmuth von Moltke the Elder, "On Strategy" (1871)
Image in the public domain
39© StreamSets, Inc. All rights reserved.
or
Ultimate Lesson Learned
40© StreamSets, Inc. All rights reserved.
Ultimate Lesson Learned
Everybody has a plan until they get punched in the mouth
Mike Tyson (1987)
Image © Abelito Roldan / Flickr / CC BY 2.0
41© StreamSets, Inc. All rights reserved.
September 3-5, 2019
Tue, Sep 3 - Training & Tutorials
Wed-Thu, Sep 4-5, Keynote & Breakouts
Hilton Financial District
(Tue|Wed|Thur)
42© StreamSets, Inc. All rights reserved.
Questions?
43© StreamSets, Inc. All rights reserved.
Thank you
Pat Patterson / Director of Evangelism
@metadaddy / pat@streamsets.com

