AI on Spark for Malware Analysis and Anomalous Threat Detection

Jakub Sanojca (Researcher) & João Da Silva (Data Engineer), Avast

Goal
Demonstrate how Avast leverages AI and big data to burn malware.

Agenda
• What Avast does
• Malware research
• Structured Streaming
• AI anomaly detection
• Demo

Thank you
• Big Data Systems
• AI team - especially Yura, Olga and Dmitry
• Threat researchers and analysts

Avast is dedicated to creating a world
that provides safety and privacy for all,
no matter who you are, where you are,
or how you connect.

Global reach
#UnifiedDataAnalytics #SparkAISummit
Portfolio of security, privacy
and utility applications

World’s Largest Detection Network
• 300M+ new files monthly
• 10,000+ globally distributed servers
• 200B+ URLs

Training the Avast Machine Learning Engine
Purpose-built approach that takes < 12 hours to add
new features, train, and deploy into production

Malware classification
Data
● >500 handcrafted features from binary files, designed by our experts
Task
● Classify files as clean, malware or PUP
Two step ML pipeline:
● Cluster data with custom k-means
● Classification inside each cluster is done by Random Forest
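
The two-step idea (route a file to a cluster, then let that cluster's own model decide) can be sketched in plain Python. This is only an illustration under assumptions: the centroids, the two-dimensional feature vectors and the threshold rules are hypothetical, and the rules merely stand in for trained Random Forests. Avast's custom k-means is not reproduced here.

```python
from math import dist
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector], str]


def nearest_cluster(x: Vector, centroids: List[Vector]) -> int:
    """Step 1: index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))


def classify(x: Vector, centroids: List[Vector],
             per_cluster: Dict[int, Classifier]) -> str:
    """Step 2: the model owning x's cluster decides the label."""
    return per_cluster[nearest_cluster(x, centroids)](x)


# Hypothetical toy setup: two clusters, each with a trivial rule
# standing in for a Random Forest trained on that cluster's files.
centroids = [(0.0, 0.0), (10.0, 10.0)]
per_cluster = {
    0: lambda x: "malware" if x[0] > 1.0 else "clean",
    1: lambda x: "pup" if x[1] > 10.0 else "clean",
}

label = classify((2.0, 0.5), centroids, per_cluster)
```

Splitting the data first means each classifier only has to separate files that already look alike, which is one plausible reason for the two-step design.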

Infrastructure: Underlying data lake - Burger

Architecture: Malware classification
Pipeline stages: Data → Features → Clustering → Training → Validation → Production
Stage durations shown in the diagram: 3 h, 4.5 h, 24 h, 24 h, 24 h, 6 h
● ~700TB of binary files
● patented tailor-made solution

Custom application vs. Spark
Custom application
• optimised & performant
• takes months to develop
• not that easy to change
Spark
• slower
• easy to experiment with
• very fast development

Threat Detections Streaming

3 step threat approach
1. Identify - threat researcher
2. Block - operator
3. Analyze and automate - data / AI researcher + engineers

Time series of detections
• Thousands of detection time series
• Where should the operator focus?
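
One simple way to decide where an operator should look first is to rank the series by how unusual their latest value is. The sketch below uses a robust score (distance from the median in units of median absolute deviation). It is an assumption for illustration, not the method from the talk, and the series names and counts are made up.

```python
from statistics import median


def robust_score(history, latest):
    """Distance of `latest` from the median of `history`, in MAD units."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    return abs(latest - med) / (mad or 1.0)  # fall back to 1.0 on flat series


def rank_series(series_by_name):
    """Return detection names, most unusual latest point first."""
    scores = {
        name: robust_score(values[:-1], values[-1])
        for name, values in series_by_name.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical detection counts per time bucket.
detections = {
    "detection_A": [10, 11, 9, 10, 12, 10],
    "detection_B": [5, 6, 5, 4, 5, 40],
}
ranking = rank_series(detections)
```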

Short response time is necessary

First idea - custom streaming app
• Python because of ML models
• A large part of the code deals with already solved problems
• POC written by researchers
• Gets the job done, but not easy to maintain or experiment with

Adopted solution:
Spark Structured Streaming

Advantages of Structured Streaming for fast threat detection
• Unified processing engine
• End to end AI with multiple sinks
• Window aggregations and Watermarking out of the box
• Resilient streams
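
To make "window aggregations and watermarking" concrete, here is a plain-Python sketch of the idea. It is not the Spark API; in PySpark the same thing is usually written with withWatermark and groupBy(window(...)). The window width, allowed lateness and event tuples are assumptions. One simplification: the watermark here advances with every event, whereas Spark advances it between micro-batches.

```python
from collections import defaultdict

WINDOW = 300    # tumbling window width in seconds (assumed)
LATENESS = 600  # how long to wait for late events, in seconds (assumed)


def windowed_counts(events):
    """Count events per (window_start, key), dropping events that arrive too late.

    `events` is an iterable of (event_time_seconds, key) in arrival order.
    Returns (counts, number_of_dropped_events).
    """
    counts = defaultdict(int)
    max_seen = 0
    dropped = 0
    for ts, key in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - LATENESS
        start = ts - ts % WINDOW          # start of the window ts falls into
        if start + WINDOW <= watermark:   # window already closed by the watermark
            dropped += 1
            continue
        counts[(start, key)] += 1
    return dict(counts), dropped


# The last event belongs to window [0, 300) but arrives after the
# watermark has moved to 400, so it is dropped.
arrivals = [(10, "a"), (20, "a"), (310, "a"), (1000, "a"), (15, "a")]
counts, dropped = windowed_counts(arrivals)
```

The watermark is what lets the engine discard old window state instead of keeping every window open forever.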

Structured Streaming Adoption
• Unbounded table
• Triggers
>>> writer = sdf.writeStream.trigger(processingTime='5 seconds')  # micro-batch every 5 s
>>> writer = sdf.writeStream.trigger(once=True)  # process available data once, then stop
>>> writer = sdf.writeStream.trigger(continuous='5 seconds')  # continuous mode, 5 s checkpoint interval
• Micro Batch Processing vs Continuous processing
– org.apache.spark.sql.execution.streaming.MicroBatchExecution
– org.apache.spark.sql.execution.streaming.ContinuousExecution (experimental)
Before / After

AI driven anomaly detection on time series
How to quickly identify campaigns of malware and potentially unwanted programs:
• Traditional approaches - find outliers
• Machine learning - predict and compare
– Neural networks - LSTMs vs CNNs
– Other - auto-regressive models etc.
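
"Predict and compare" can be illustrated with the simplest auto-regressive model, an AR(1) fitted by least squares. This is a sketch under assumptions, standing in for the LSTM/CNN models mentioned above: the function names, the threshold k and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev


def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]; returns (a, b, residual_std)."""
    prev, curr = series[:-1], series[1:]
    mean_prev, mean_curr = mean(prev), mean(curr)
    var = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    b = cov / var if var else 0.0
    a = mean_curr - b * mean_prev
    residuals = [c - (a + b * p) for p, c in zip(prev, curr)]
    return a, b, pstdev(residuals)


def is_anomaly(model, previous, actual, k=3.0):
    """Predict the next value from `previous` and compare it with `actual`.

    Flags the point when the prediction error exceeds k residual
    standard deviations (k = 3 is an assumed threshold).
    """
    a, b, sigma = model
    predicted = a + b * previous
    return abs(actual - predicted) > k * max(sigma, 1e-9)


# Hypothetical detection counts for one series.
history = [10, 12, 11, 13, 12, 14, 13, 15]
model = fit_ar1(history)
normal = is_anomaly(model, previous=15, actual=14)
spike = is_anomaly(model, previous=15, actual=100)
```

Swapping fit_ar1 for a neural network changes how the prediction is made, not the compare step that follows it.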

Threat anomaly detection: training
• Sequential
• Parallel! mapPartitions / pandas_udf
• Distributed - TensorflowOnSpark
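
A sketch of the step from sequential to parallel training, assuming one small model per time series. train_partition takes an iterator and yields results, which is the shape rdd.mapPartitions expects, so the same function can be run locally or per partition. fit_baseline is a hypothetical stand-in for real model training, and series_rdd in the comments is an assumed RDD of (name, values) pairs.

```python
from statistics import mean, pstdev


def fit_baseline(values):
    """Hypothetical stand-in for real model training: mean and spread."""
    return {"mean": mean(values), "std": pstdev(values)}


def train_partition(rows):
    """Train one model per (name, values) row.

    Takes an iterator and yields results lazily, so it can be passed
    directly to rdd.mapPartitions.
    """
    for name, values in rows:
        yield name, fit_baseline(values)


# Sequential: a single loop over all series on one machine.
rows = [("detection_A", [1, 2, 3]), ("detection_B", [4, 4, 4])]
models = dict(train_partition(iter(rows)))

# Parallel (assumes an RDD of (name, values) pairs called series_rdd):
# models = dict(series_rdd.mapPartitions(train_partition).collect())
```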

Threat anomaly detection: stream serving
• pandas_udf for parallel predictions
• super easy to test on already stored data as batch job
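
A sketch of serving a score through a scalar pandas UDF, assuming Spark 3.x with pandas and pyarrow installed. The scoring logic lives in a plain function that can be tested without Spark; the wrapper imports pyspark lazily. The z-score, the function names and the column names in the usage comments are hypothetical and stand in for a real model's prediction.

```python
def zscore(count, mean, std):
    """How many standard deviations `count` lies from `mean`.

    Works on a single number and, element-wise, on a pandas Series.
    """
    return abs(count - mean) / max(std, 1e-9)


def make_score_udf(mean, std):
    """Wrap zscore as a scalar pandas UDF (assumes Spark 3.x, pandas, pyarrow)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def score(counts: pd.Series) -> pd.Series:
        # Spark passes each Arrow batch of the column as one Series.
        return zscore(counts, mean, std)

    return score


# Usage sketch (DataFrame and column names are hypothetical):
# score = make_score_udf(mean=10.0, std=1.5)
# batch_df.withColumn("score", score("detections"))    # stored data, batch job
# stream_df.withColumn("score", score("detections"))   # same code on a stream
```

Because the UDF is applied the same way to batch and streaming DataFrames, a model can be checked against stored history before it is attached to the live stream.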

Demo + Code Walkthrough

Challenges
• Multiple potential incompatibility surfaces
• Unexpected behavior / Unknowns
• Silent failures

Takeaways
• Easier collaboration between Science and Engineering teams
• An excellent toolbox to do anomaly detection in near real time
• Easy ML/AI/DL integration
• Parallelism

Questions?
