SlideShare a Scribd company logo
The Road to Uncovering Botnets
From Python Scikit-Learn
to Scala Spark
whoami
• Avi Aminov
– ~2 years Security Researcher at Akamai
– Physics PhD student
• Asaf Nadler
– ~1.5 years Security Researcher at Akamai
– CS PhD student
Enterprise Threat Protection
• Detect malware presence from outbound traffic
– Behavioral pattern analysis
– Domain blacklisting
• Availability – End of June ’17
Akamai
Recursive
DNS
Branch / HQ
Enterprise
DNS
Data
• Akamai Data
– 20-30% of internet traffic
– Customer ISP/Enterprise logs – 20B DNS queries/day
• Third party data
– e.g. Authoritative DNS log lines
• Open data sources
– e.g. WHOIS information
Bot Networks – IP Fluxing
• Goal – Evasion
– Regular bots: waiting for orders
– Proxies: concealing origin server
Command
& Control
server
Bots
Proxy Bots
Bot Networks Detection
• Detect illegitimate IP fluxing
• Features
– IP dispersity (Geo, systems)
– TTL features
– Lexical
Domain Description #Systems #Countries
astro-travels.net PoS CNC Host 157 11
Decision Tree Model
Malicious with high confidence
• Spread across systems
• Unpopular
Benign with high confidence
• IPs in the same system
• Contains meaningful words
Challenge – Going to Production
Feature
Extraction
Scoring Blacklist
Feature
Extraction
Model
Training Model
Model
Evaluation
Data
Sources
What have we done so far?
• Flow
– Researcher describes an algorithm (document + Hive query)
– Dev rewrites the code in MapReduce (now Scala/Spark)
• Problems
– Not applicable to ML pipelines
– Prone to mistakes
– Longer development cycle
Can We Do Better? Option #1
• Research side – Pipeline in Scala/Spark
• Dev side – Implement the algorithms
• Pros
– Greater flexibility
– Research scale
• Cons
– Learning curve
– Lose sklearn/R benefits
Can We Do Better? Option #2
• Research side – Train locally and export model
• Dev side – Transform data using imported model
• Pros
– Quick implementation
– Unified procedure
• Cons
– No support for all models
Export scheme
• Predictive Model Markup Language
• General scheme for ML pipelines
– Data transformations
– Scoring models
• XML format – Readable
• Supported by major data science / ML
frameworks using jPMML (R, sklearn)
PMML Simple Boilerplate
Python (Research side) Scala (Dev side)
Credit: jpmml lib http://openscoring.io/ , https://github.com/jpmml/
Maintained by Villu Ruusmann
Lessons Learned
• Work process adjusted to the task
– Training locally? Export the model
– Training on larger scales? Better to use Spark
• Use jpmml for model export
• When applicable, reduce workload in production
– Example – only look at domains with many IPs
Challenge solved
Feature
Extraction
Scoring Blacklist
Data
Collection
Model
Training Model
Model
Evaluation
Data
Sources PMML
Thank you!
@AviBachsh

More Related Content

PDF
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Databricks
 
PDF
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Databricks
 
PDF
SSR: Structured Streaming for R and Machine Learning
felixcss
 
PPTX
Spark r under the hood with Hossein Falaki
Databricks
 
PDF
Huawei Advanced Data Science With Spark Streaming
Jen Aman
 
PDF
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 
PDF
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
PDF
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Databricks
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Databricks
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Databricks
 
SSR: Structured Streaming for R and Machine Learning
felixcss
 
Spark r under the hood with Hossein Falaki
Databricks
 
Huawei Advanced Data Science With Spark Streaming
Jen Aman
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Databricks
 

What's hot (20)

PDF
Spark Summit EU talk by Jakub Hava
Spark Summit
 
PDF
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
PDF
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Databricks
 
PDF
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
PDF
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Databricks
 
PDF
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit
 
PDF
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Spark Summit
 
PDF
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Databricks
 
PDF
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
PDF
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
PDF
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Spark Summit
 
PDF
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
PDF
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Databricks
 
PDF
Spark Summit EU talk by Heiko Korndorf
Spark Summit
 
PDF
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
PDF
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Databricks
 
PDF
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
PDF
Spark Summit EU talk by Berni Schiefer
Spark Summit
 
PPTX
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit
 
PDF
Spark Streaming and MLlib - Hyderabad Spark Group
Phaneendra Chiruvella
 
Spark Summit EU talk by Jakub Hava
Spark Summit
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Databricks
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Databricks
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Spark Summit
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Databricks
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Spark Summit
 
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Databricks
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Databricks
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Spark Summit EU talk by Berni Schiefer
Spark Summit
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit
 
Spark Streaming and MLlib - Hyderabad Spark Group
Phaneendra Chiruvella
 
Ad

Similar to From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets with Avi Aminov (20)

PDF
Presto as a Service - Tips for operation and monitoring
Taro L. Saito
 
PPTX
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
PPTX
Big Data Introduction - Solix empower
Durga Gadiraju
 
PPTX
A machine learning and data science pipeline for real companies
DataWorks Summit
 
PDF
Performance and Abstractions
Metosin Oy
 
PPTX
On SDN Research Topics - Christian Esteve Rothenberg
CPqD
 
PDF
Hadoop Ecosystem and Low Latency Streaming Architecture
InSemble
 
PDF
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
PPTX
Strata NY 2017 Parquet Arrow roadmap
Julien Le Dem
 
PDF
Internals of Presto Service
Treasure Data, Inc.
 
PPTX
Apache Con 2021 Structured Data Streaming
Shivji Kumar Jha
 
PDF
PinTrace Advanced AWS meetup
Suman Karumuri
 
PPTX
Apache Spark sql
aftab alam
 
PPTX
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Chris Fregly
 
PDF
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Spark Summit
 
PPTX
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
 
PDF
SOHOpelessly Broken
The Security of Things Forum
 
PDF
Swt
Ngoc Anh
 
PDF
Simulating the behavior of satellite Internet links to small islands
APNIC
 
PDF
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
Databricks
 
Presto as a Service - Tips for operation and monitoring
Taro L. Saito
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Big Data Introduction - Solix empower
Durga Gadiraju
 
A machine learning and data science pipeline for real companies
DataWorks Summit
 
Performance and Abstractions
Metosin Oy
 
On SDN Research Topics - Christian Esteve Rothenberg
CPqD
 
Hadoop Ecosystem and Low Latency Streaming Architecture
InSemble
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
Strata NY 2017 Parquet Arrow roadmap
Julien Le Dem
 
Internals of Presto Service
Treasure Data, Inc.
 
Apache Con 2021 Structured Data Streaming
Shivji Kumar Jha
 
PinTrace Advanced AWS meetup
Suman Karumuri
 
Apache Spark sql
aftab alam
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Chris Fregly
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Spark Summit
 
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
 
SOHOpelessly Broken
The Security of Things Forum
 
Simulating the behavior of satellite Internet links to small islands
APNIC
 
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
Databricks
 
Ad

More from Databricks (20)

PPTX
DW Migration Webinar-March 2022.pptx
Databricks
 
PPTX
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
PPT
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
PPTX
Data Lakehouse Symposium | Day 2
Databricks
 
PPTX
Data Lakehouse Symposium | Day 4
Databricks
 
PDF
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
PDF
Democratizing Data Quality Through a Centralized Platform
Databricks
 
PDF
Learn to Use Databricks for Data Science
Databricks
 
PDF
Why APM Is Not the Same As ML Monitoring
Databricks
 
PDF
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
PDF
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
PDF
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
PDF
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
PDF
Sawtooth Windows for Feature Aggregations
Databricks
 
PDF
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
PDF
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
PDF
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
PDF
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
PDF
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Recently uploaded (20)

PPTX
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
PPTX
short term project on AI Driven Data Analytics
JMJCollegeComputerde
 
PPTX
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
PPTX
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PPTX
INFO8116 -Big data architecture and analytics
guddipatel10
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PPTX
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
PPTX
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
PPTX
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
PDF
oop_java (1) of ice or cse or eee ic.pdf
sabiquntoufiqlabonno
 
PDF
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
PPTX
INFO8116 - Week 10 - Slides.pptx big data architecture
guddipatel10
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PDF
Classifcation using Machine Learning and deep learning
bhaveshagrawal35
 
PPT
Grade 5 PPT_Science_Q2_W6_Methods of reproduction.ppt
AaronBaluyut
 
PDF
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
PPTX
Introduction to Biostatistics Presentation.pptx
AtemJoshua
 
PDF
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
short term project on AI Driven Data Analytics
JMJCollegeComputerde
 
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
INFO8116 -Big data architecture and analytics
guddipatel10
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
oop_java (1) of ice or cse or eee ic.pdf
sabiquntoufiqlabonno
 
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
INFO8116 - Week 10 - Slides.pptx big data architecture
guddipatel10
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
Classifcation using Machine Learning and deep learning
bhaveshagrawal35
 
Grade 5 PPT_Science_Q2_W6_Methods of reproduction.ppt
AaronBaluyut
 
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
Introduction to Biostatistics Presentation.pptx
AtemJoshua
 
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 

From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets with Avi Aminov

  • 1. The Road to Uncovering Botnets From Python Scikit-Learn to Scala Spark
  • 2. whoami • Avi Aminov – ~2 years Security Researcher at Akamai – Physics PhD student • Asaf Nadler – ~1.5 years Security Researcher at Akamai – CS PhD student
  • 3. Enterprise Threat Protection • Detect malware presence from outbound traffic – Behavioral pattern analysis – Domain blacklisting • Availability – End of June ’17 Akamai Recursive DNS Branch / HQ Enterprise DNS
  • 4. Data • Akamai Data – 20-30% of internet traffic – Customer ISP/Enterprise logs – 20B DNS queries/day • Third party data – e.g. Authoritative DNS log lines • Open data sources – e.g. WHOIS information
  • 5. Bot Networks – IP Fluxing • Goal – Evasion – Regular bots: waiting for orders – Proxies: concealing origin server Command & Control server Bots Proxy Bots
  • 6. Bot Networks Detection • Detect illegitimate IP fluxing • Features – IP dispersity (Geo, systems) – TTL features – Lexical Domain Description #Systems #Countries astro-travels.net PoS CNC Host 157 11
  • 7. Decision Tree Model Malicious with high confidence • Spread across systems • Unpopular Benign with high confidence • IPs in the same system • Contains meaningful words
  • 8. Challenge – Going to Production Feature Extraction Scoring Blacklist Feature Extraction Model Training Model Model Evaluation Data Sources
  • 9. What have we done so far? • Flow – Researcher describes an algorithm (document + Hive query) – Dev rewrites the code in MapReduce (now Scala/Spark) • Problems – Not applicable to ML pipelines – Prone to mistakes – Longer development cycle
  • 10. Can We Do Better? Option #1 • Research side – Pipeline in Scala/Spark • Dev side – Implement the algorithms • Pros – Greater flexibility – Research scale • Cons – Learning curve – Lose sklearn/R benefits
  • 11. Can We Do Better? Option #2 • Research side – Train locally and export model • Dev side – Transform data using imported model • Pros – Quick implementation – Unified procedure • Cons – No support for all models
  • 12. Export scheme • Predictive Model Markup Language • General scheme for ML pipelines – Data transformations – Scoring models • XML format – Readable • Supported by major data science / ML frameworks using jPMML (R, sklearn)
  • 13. PMML Simple Boilerplate Python (Research side) Scala (Dev side) Credit: jpmml lib http://openscoring.io/ , https://github.com/jpmml/ Maintained by Villu Ruusmann
  • 14. Lessons Learned • Work process adjusted to the task – Training locally? Export the model – Training on larger scales? Better to use Spark • Use jpmml for model export • When applicable, reduce workload in production – Example – only look at domains with many IPs