SlideShare a Scribd company logo
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection
with Spark and Databricks
Josh Gillner
Apple Detection Engineering
▪ Protecting Apple’s Systems
▪ Finding & responding to security
threats using log data
▪ Threat research and hunting
^^^ Looking for this guy
Who are we? - Apple Detection Engineering
Which Technologies?
Alert Orchestration
System
CI System
What is a detection?
Detection === Code That Finds Bad Stuff
▪ Get an input dataset(s)
▪ Apply logic
▪ Output
▪ 1 notebook -> 1 job -> 1 detection
What Happens Next?
Analyst Review
Suggestion, Enrichment &
Automated Containment
Alert Orchestration System
Contain
Issue+
This needs to be fast
Standardizing Detections
Problem #1 — Development Overhead
▪ Average time to write, test, and deploy a
basic detection === 1 week
▪ New ideas/week > deployed jobs/week
(unsustainable)
▪ Writing scalatests, preserving test
samples…testing is too cumbersome
▪ > 60% of new code is boilerplate (!!)
Problem #2 — Mo’ Detections, Mo’ Problems
Want to add a cool new
feature to all detections?
Refactor many different
notebooks
Config all over the place in
disparate notebooks
Want to configure multiple
detections at once?
Ongoing tuning and
maintenance?
One-off tuning doesn’t scale
to hundreds of detections
Problem #3 — No Support for Common Patterns
▪ Common enrichments or exclusions
▪ Creating and using statistical
baselines
▪ Write detection test using scalatest
Things People Often Do
(but must write code for)
…everyone implements in
a different way
…fixes/updates must be
applied in 10 places
DetectionKit
Auto-Tuning Alerts
Modular Postprocessing
Automated Enrichments
Complex Exclusions
Notebook-Based CI
Centralized Configuration
Test Generation
Alert Standardization
Future-Looking Abstraction
Multi-Stage Alerting
Modular Investigation Templates
Signal-Based Detections
Rate Limit Failsafes
Preprocessing Transformations
Automated Tagging
Statistics Tables
Entity-Based Deduplication
Asset Attribution
Components
▪ Input
▪ Detection and Alert abstractions
▪ Emitters
▪ Configuration
▪ Tuning
▪ Modular Pre/Postprocessing
▪ Functional Testing
▪ Complex Exclusions
▪ Templatized Investigations
Input
▪ All detection begins with input loading
▪ Pass in inputs through config object
▪ External control through config
▪ decide spark.read vs .readStream
▪ path, schema, format
▪ no hardcoding -> dynamic input
behavior
▪ Abstracts away details of getting data
^^^ This should not change if
someDataset is a production table
or test sample file
Detection and Alert Abstraction
▪ Logic is described
in form of Spark
DataFrame
▪ Supports additional
post-processing
transformation
▪ Basic interface for
consumption by
other code
Detection
val alerts: Map[String, Alert] =
Alert
val modules: ArrayBuffer[Transformer] =
def PostProcessor(input: DataFrame): DataFrame = ???
def df: DataFrame = /* alert logic here */
val config: DetectionConfig
Input and other runtime configs
Test generation
Emitter
▪ Takes output from Alert and send them elsewhere
▪ Also schedules the job in Spark cluster
Alert
MemoryEmitter
FileEmitter
KinesisEmitter
DBFS on AWS S3
In-memory Table
AWS Kinesis
Config Inference
▪ If things can (and should) be changed, move it outside of code
▪ eg. detection name, description, input dataset, emitter
▪ Where possible, supply a sane default or infer them
val checkpointLocation: String =
"dbfs:/mnt/defaultbucket/chk/detection/ / / .chk/"
name = "CodeRed: Something Has Happened"
alertName = "JoshsCoolDetection"
version = "1"
DetectionConfigInfer
Config Inheritance
▪ Fine-grained configurability
▪ Could be multiple Alerts in
same Detection
▪ Individually configurable,
otherwise inherit parent
config
Detection
Alert
val config: DetectionConfig
Alert
Alert
Modular Pre/PostProcessing
▪ DataFrame -> DataFrame transform
applied to input dataset
▪ Supplied in config
▪ Useful for things like date filtering
without changing detection
Preprocessing
Postprocessing
▪ Mutable Seq of transform functions
inside Detection
▪ Applied sequentially to output
foreachBatch Transformers
▪ Some operations not stream-safe
▪ Where the crazy stuff happens
Manual Tuning Lifecycle
▪ Tuning overhead scales
with number of detections
▪ Feedback loop can take
days while analysts
suffer :(
▪ This need to be faster…
ideally automated and self-
service
The data/
environment
changes
DE tweaks
detection
False positive
alerts
Analyst
requests
tuning pain
Self-Tuning Alerts
Detection
Analyst Review
Alert
Labels
(FP, TP, etc.)
Analyst
Consensus!
Modify Behavior
Alert
Orchestration
System
Complex Exclusions
▪ Arbitrary SQL expressions applied
on all results in forEachBatch
▪ Stored in rev-controlled TSV
▪ Integrated into Detection Test
CI…malformed or over-selective
items will fail tests
▪ Preservation of excluded alerts in
a separate table
Eventually, detections look like this >>>
So….
Repetitive Investigations…What Happens?
• Analysts run queries
in notebooks to
investigate
• Most of these
queries look the
same, just different
filter
Analyst Review
Alert Orchestration System
Automated Investigation Templates
▪ Find corresponding
template notebook
▪ Fill it out
▪ Attach to cluster
▪ Execute
Alert Orchestration
System
Workspace API
This lets us automate useful things like…
Interactive Process Trees in D3 Baselines of Typical Activity
Automated Containment
Machines can find, investigate, and contain issues without humans
Automated Investigation
Alert Orchestration System
ODBC API
• Run substantiating
queries via ODBC
• Render verdict
Contain
Issue
Detection Testing
Why is it so painful?
▪ Preserving/exporting JSON
samples
▪ Local SparkSession isn’t a real
cluster
▪ Development happens in
notebooks, testing happens in
IDE
▪ Brittle to even small changes
to schema, etc
Detection Functional Tests
▪ 85% reduction in test LoC
▪ write and run tests in
notebooks!
▪ use Delta sample files in
dbfs, no more exporting
JSON
▪ scalatest generation using
config and convention
Trait: DetectionTest
^^ this is a complete test ^^
Detection Test CI
Git PR
CI System
Test
Notebooks
Workspace API
/Alerts/Test/PRs/<Git PR
number>_<Git commit
hash>
Jobs API
Build
Scripts pass/fail
“Testing has never been this fun!!”
— detection engineers, probably
Jobs CI — Why?
▪ Managing hundreds of jobs in Databricks UI
▪ Each job has associated notebook, config, dbfs files
▪ No inventory of which jobs should be running, where
▪ We need job linting >>>
Databricks Stacks!
Deploy/Reconfigure Jobs with Single PR
CI System
Config Linter
Stacks CLI
Jobs Helper
Deploy Job/
Notebooks/Files
Kickstart/Restart
Set Permissions
Cool Things with Jobs CI!
▪ Deploy or reconfigure many
jobs concurrently
▪ Auto job restarts on notebook/
config change
▪ Standardization of retries,
timeout, permissions
▪ Automate alarm creation for
new jobs
^^^ No one likes manually crafting
Stacks JSON — so we generate it
Saving Time with
Automated Historical Insight
Problem #1 — Cyclical Investigations
▪ Alert comes in, analysts spend hours
looking into it
▪ But the same thing happened 3
months ago and was determined to be
benign
▪ Lots of wasted cycles on duplicative
investigations
Problem #2 — Disparate Context
▪ Want to find historical incident
data?
▪ look in many different places
▪ many search UIs, syntaxes
▪ Manual, slow & painful
▪ New analysts won’t have
historical knowledge
Problem #3 — Finding Patterns
Which incidents relate to other
incidents?
Do we see common infrastructure,
actors?
How much work is repeated?
Case #55557
Case #44447
Case #33337
}(Some IP Address)
Solution: Document Recommendations
▪ Collect all incident-related
tickets, correspondence, and
investigations
▪ Normalize them into a Delta
table
▪ Automate suggestion of
related knowledge using our
own corpus of documents
Emails
Tickets
Alerts
Notebooks
Detection Code
Wikis
“Has This Happened Before?” -> Automated
Includes analyst comments and
verdicts
displayHTML suggestions,
clickable links to original document
Automated Suggestions
alert_hash 112233445
serial = C12345678ABC
src_ip = 88.88.88.123
mime_type = [“text/html"]
dt = 2020-03-15
C12345678ABC
88.88.88.123
Entities
C12345678ABC
88.88.88.123
00:de:ad:00:be:ef
01:8b:ad:00:f0:0d
joshsaccount
joshshostname
Enriched Entities
{
Emails
Tickets
Alerts
Notebooks
Detections
Wikis
Document Search
Suggestion
Anatomy of an Alert
These are not valuable for search! (too
common)
These are good indicators of document
relevance
Entity Tokenization and Enrichment
IP Address
Regex
Domain
Hashes
Accounts
Serials
UDIDs
File Path
Emails
MAC Addresses
Alert Payload
VPN Sessions
Enrichments
DHCP Sessions
Asset Data
Account Data
Suggestion Algorithm
▪ Gather match statistics for each
entity:
▪ historical rarity
▪ document count rarity
▪ doc type distribution
▪ Compute entity weight based on
average ranked percentiles of those
features
▪ More common terms == less
valuable
▪ Return the best n hits by confidence
▪ Not That Expensive™
Q & A ?
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Ad

More Related Content

What's hot (20)

Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
Databricks
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuning
elliando dias
 
How to use histograms to get better performance
How to use histograms to get better performanceHow to use histograms to get better performance
How to use histograms to get better performance
MariaDB plc
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
metsarin
 
Red Team Revenge - Attacking Microsoft ATA
Red Team Revenge - Attacking Microsoft ATARed Team Revenge - Attacking Microsoft ATA
Red Team Revenge - Attacking Microsoft ATA
Nikhil Mittal
 
How to improve ELK log pipeline performance
How to improve ELK log pipeline performanceHow to improve ELK log pipeline performance
How to improve ELK log pipeline performance
Steven Shim
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
Xiang Fu
 
Image Processing on Delta Lake
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta Lake
Databricks
 
Postgres MVCC - A Developer Centric View of Multi Version Concurrency Control
Postgres MVCC - A Developer Centric View of Multi Version Concurrency ControlPostgres MVCC - A Developer Centric View of Multi Version Concurrency Control
Postgres MVCC - A Developer Centric View of Multi Version Concurrency Control
Reactive.IO
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
 
CNIT 129S: Ch 5: Bypassing Client-Side Controls
CNIT 129S: Ch 5: Bypassing Client-Side ControlsCNIT 129S: Ch 5: Bypassing Client-Side Controls
CNIT 129S: Ch 5: Bypassing Client-Side Controls
Sam Bowne
 
A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...
Noppadol Songsakaew
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
 
Automate the boring stuff with python
Automate the boring stuff with pythonAutomate the boring stuff with python
Automate the boring stuff with python
DEEPAKSINGHBIST1
 
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 ViennaAutovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
PostgreSQL-Consulting
 
Pentesting GraphQL Applications
Pentesting GraphQL ApplicationsPentesting GraphQL Applications
Pentesting GraphQL Applications
Neelu Tripathy
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking VN
 
Securing AEM webapps by hacking them
Securing AEM webapps by hacking themSecuring AEM webapps by hacking them
Securing AEM webapps by hacking them
Mikhail Egorov
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
Databricks
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuning
elliando dias
 
How to use histograms to get better performance
How to use histograms to get better performanceHow to use histograms to get better performance
How to use histograms to get better performance
MariaDB plc
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
metsarin
 
Red Team Revenge - Attacking Microsoft ATA
Red Team Revenge - Attacking Microsoft ATARed Team Revenge - Attacking Microsoft ATA
Red Team Revenge - Attacking Microsoft ATA
Nikhil Mittal
 
How to improve ELK log pipeline performance
How to improve ELK log pipeline performanceHow to improve ELK log pipeline performance
How to improve ELK log pipeline performance
Steven Shim
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
Xiang Fu
 
Image Processing on Delta Lake
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta Lake
Databricks
 
Postgres MVCC - A Developer Centric View of Multi Version Concurrency Control
Postgres MVCC - A Developer Centric View of Multi Version Concurrency ControlPostgres MVCC - A Developer Centric View of Multi Version Concurrency Control
Postgres MVCC - A Developer Centric View of Multi Version Concurrency Control
Reactive.IO
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
 
CNIT 129S: Ch 5: Bypassing Client-Side Controls
CNIT 129S: Ch 5: Bypassing Client-Side ControlsCNIT 129S: Ch 5: Bypassing Client-Side Controls
CNIT 129S: Ch 5: Bypassing Client-Side Controls
Sam Bowne
 
A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...A2 - broken authentication and session management(OWASP thailand chapter Apri...
A2 - broken authentication and session management(OWASP thailand chapter Apri...
Noppadol Songsakaew
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
 
Automate the boring stuff with python
Automate the boring stuff with pythonAutomate the boring stuff with python
Automate the boring stuff with python
DEEPAKSINGHBIST1
 
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 ViennaAutovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
PostgreSQL-Consulting
 
Pentesting GraphQL Applications
Pentesting GraphQL ApplicationsPentesting GraphQL Applications
Pentesting GraphQL Applications
Neelu Tripathy
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking VN
 
Securing AEM webapps by hacking them
Securing AEM webapps by hacking themSecuring AEM webapps by hacking them
Securing AEM webapps by hacking them
Mikhail Egorov
 

Similar to Scaling Security Threat Detection with Apache Spark and Databricks (20)

Testing 101
Testing 101Testing 101
Testing 101
Noam Barkai
 
SELJE_Database_Unit_Testing_Slides.pdf
SELJE_Database_Unit_Testing_Slides.pdfSELJE_Database_Unit_Testing_Slides.pdf
SELJE_Database_Unit_Testing_Slides.pdf
Eric Selje
 
Apache Solr - An Experience Report
Apache Solr - An Experience ReportApache Solr - An Experience Report
Apache Solr - An Experience Report
Netcetera
 
Property based testing - Less is more
Property based testing - Less is moreProperty based testing - Less is more
Property based testing - Less is more
Ho Tien VU
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Chelberg ptcuser 2010
Chelberg ptcuser 2010Chelberg ptcuser 2010
Chelberg ptcuser 2010
Clay Helberg
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
javier ramirez
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
CODE BLUE
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
lpgauth
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
SharabiNaif
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
Anonymous9etQKwW
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
MumitAhmed1
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
rschuppe
 
Illuminate - Performance Analystics driven by Machine Learning
Illuminate - Performance Analystics driven by Machine LearningIlluminate - Performance Analystics driven by Machine Learning
Illuminate - Performance Analystics driven by Machine Learning
jClarity
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Databricks
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
Sriskandarajah Suhothayan
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
Srinath Perera
 
DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...
DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...
DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...
Paris Salesforce Developer Group
 
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges
BIOVIA
 
SELJE_Database_Unit_Testing_Slides.pdf
SELJE_Database_Unit_Testing_Slides.pdfSELJE_Database_Unit_Testing_Slides.pdf
SELJE_Database_Unit_Testing_Slides.pdf
Eric Selje
 
Apache Solr - An Experience Report
Apache Solr - An Experience ReportApache Solr - An Experience Report
Apache Solr - An Experience Report
Netcetera
 
Property based testing - Less is more
Property based testing - Less is moreProperty based testing - Less is more
Property based testing - Less is more
Ho Tien VU
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Chelberg ptcuser 2010
Chelberg ptcuser 2010Chelberg ptcuser 2010
Chelberg ptcuser 2010
Clay Helberg
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
javier ramirez
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
CODE BLUE
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
lpgauth
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
rschuppe
 
Illuminate - Performance Analystics driven by Machine Learning
Illuminate - Performance Analystics driven by Machine LearningIlluminate - Performance Analystics driven by Machine Learning
Illuminate - Performance Analystics driven by Machine Learning
jClarity
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Databricks
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
Sriskandarajah Suhothayan
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
Srinath Perera
 
DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...
DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...
DX@Scale: Optimizing Salesforce Development and Deployment for large scale pr...
Paris Salesforce Developer Group
 
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges
BIOVIA
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
Taqyea
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
...lab.Lab123456789123456789123456789123456789
...lab.Lab123456789123456789123456789123456789...lab.Lab123456789123456789123456789123456789
...lab.Lab123456789123456789123456789123456789
Ghh
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
Taqyea
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Process Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial IndustryProcess Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial Industry
Process mining Evangelist
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhhChapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
ChrisjohnAlfiler
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
Taqyea
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
...lab.Lab123456789123456789123456789123456789
...lab.Lab123456789123456789123456789123456789...lab.Lab123456789123456789123456789123456789
...lab.Lab123456789123456789123456789123456789
Ghh
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
Taqyea
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Process Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial IndustryProcess Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial Industry
Process mining Evangelist
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhhChapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
Chapter-3-PROBLEM-SOLVING.pdf hhhhhhhhhh
ChrisjohnAlfiler
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 

Scaling Security Threat Detection with Apache Spark and Databricks

  • 2. Scaling Security Threat Detection with Spark and Databricks Josh Gillner Apple Detection Engineering
  • 3. ▪ Protecting Apple’s Systems ▪ Finding & responding to security threats using log data ▪ Threat research and hunting ^^^ Looking for this guy Who are we? - Apple Detection Engineering
  • 5. What is a detection?
  • 6. Detection === Code That Finds Bad Stuff ▪ Get an input dataset(s) ▪ Apply logic ▪ Output ▪ 1 notebook -> 1 job -> 1 detection
  • 7. What Happens Next? Analyst Review Suggestion, Enrichment & Automated Containment Alert Orchestration System Contain Issue+ This needs to be fast
  • 9. Problem #1 — Development Overhead ▪ Average time to write, test, and deploy a basic detection === 1 week ▪ New ideas/week > deployed jobs/week (unsustainable) ▪ Writing scalatests, preserving test samples…testing is too cumbersome ▪ > 60% of new code is boilerplate (!!)
  • 10. Problem #2 — Mo’ Detections, Mo’ Problems Want to add a cool new feature to all detections? Refactor many different notebooks Config all over the place in disparate notebooks Want to configure multiple detections at once? Ongoing tuning and maintenance? One-off tuning doesn’t scale to hundreds of detections
  • 11. Problem #3 — No Support for Common Patterns ▪ Common enrichments or exclusions ▪ Creating and using statistical baselines ▪ Write detection test using scalatest Things People Often Do (but must write code for) …everyone implements in a different way …fixes/updates must be applied in 10 places
  • 12. DetectionKit Auto-Tuning Alerts Modular Postprocessing Automated Enrichments Complex Exclusions Notebook-Based CI Centralized Configuration Test Generation Alert Standardization Future-Looking Abstraction Multi-Stage Alerting Modular Investigation Templates Signal-Based Detections Rate Limit Failsafes Preprocessing Transformations Automated Tagging Statistics Tables Entity-Based Deduplication Asset Attribution
  • 13. Components ▪ Input ▪ Detection and Alert abstractions ▪ Emitters ▪ Configuration ▪ Tuning ▪ Modular Pre/Postprocessing ▪ Functional Testing ▪ Complex Exclusions ▪ Templatized Investigations
  • 14. Input ▪ All detection begins with input loading ▪ Pass in inputs through config object ▪ External control through config ▪ decide spark.read vs .readStream ▪ path, schema, format ▪ no hardcoding -> dynamic input behavior ▪ Abstracts away details of getting data ^^^ This should not change if someDataset is a production table or test sample file
  • 15. Detection and Alert Abstraction ▪ Logic is described in form of Spark DataFrame ▪ Supports additional post-processing transformation ▪ Basic interface for consumption by other code Detection val alerts: Map[String, Alert] = Alert val modules: ArrayBuffer[Transformer] = def PostProcessor(input: DataFrame): DataFrame = ??? def df: DataFrame = /* alert logic here */ val config: DetectionConfig Input and other runtime configs Test generation
  • 16. Emitter ▪ Takes output from Alert and send them elsewhere ▪ Also schedules the job in Spark cluster Alert MemoryEmitter FileEmitter KinesisEmitter DBFS on AWS S3 In-memory Table AWS Kinesis
  • 17. Config Inference ▪ If things can (and should) be changed, move it outside of code ▪ eg. detection name, description, input dataset, emitter ▪ Where possible, supply a sane default or infer them val checkpointLocation: String = "dbfs:/mnt/defaultbucket/chk/detection/ / / .chk/" name = "CodeRed: Something Has Happened" alertName = "JoshsCoolDetection" version = "1" DetectionConfigInfer
  • 18. Config Inheritance ▪ Fine-grained configurability ▪ Could be multiple Alerts in same Detection ▪ Individually configurable, otherwise inherit parent config Detection Alert val config: DetectionConfig Alert Alert
  • 19. Modular Pre/PostProcessing ▪ DataFrame -> DataFrame transform applied to input dataset ▪ Supplied in config ▪ Useful for things like date filtering without changing detection Preprocessing Postprocessing ▪ Mutable Seq of transform functions inside Detection ▪ Applied sequentially to output foreachBatch Transformers ▪ Some operations not stream-safe ▪ Where the crazy stuff happens
  • 20. Manual Tuning Lifecycle ▪ Tuning overhead scales with number of detections ▪ Feedback loop can take days while analysts suffer :( ▪ This need to be faster… ideally automated and self- service The data/ environment changes DE tweaks detection False positive alerts Analyst requests tuning pain
  • 21. Self-Tuning Alerts Detection Analyst Review Alert Labels (FP, TP, etc.) Analyst Consensus! Modify Behavior Alert Orchestration System
  • 22. Complex Exclusions ▪ Arbitrary SQL expressions applied on all results in forEachBatch ▪ Stored in rev-controlled TSV ▪ Integrated into Detection Test CI…malformed or over-selective items will fail tests ▪ Preservation of excluded alerts in a separate table Eventually, detections look like this >>> So….
  • 23. Repetitive Investigations…What Happens? • Analysts run queries in notebooks to investigate • Most of these queries look the same, just different filter Analyst Review Alert Orchestration System
  • 24. Automated Investigation Templates ▪ Find corresponding template notebook ▪ Fill it out ▪ Attach to cluster ▪ Execute Alert Orchestration System Workspace API
  • 25. This lets us automate useful things like… Interactive Process Trees in D3 Baselines of Typical Activity
  • 26. Automated Containment Machines can find, investigate, and contain issues without humans Automated Investigation Alert Orchestration System ODBC API • Run substantiating queries via ODBC • Render verdict Contain Issue
  • 27. Detection Testing Why is it so painful? ▪ Preserving/exporting JSON samples ▪ Local SparkSession isn’t a real cluster ▪ Development happens in notebooks, testing happens in IDE ▪ Brittle to even small changes to schema, etc
  • 28. Detection Functional Tests ▪ 85% reduction in test LoC ▪ write and run tests in notebooks! ▪ use Delta sample files in dbfs, no more exporting JSON ▪ scalatest generation using config and convention Trait: DetectionTest ^^ this is a complete test ^^
  • 29. Detection Test CI Git PR CI System Test Notebooks Workspace API /Alerts/Test/PRs/<Git PR number>_<Git commit hash> Jobs API Build Scripts pass/fail “Testing has never been this fun!!” — detection engineers, probably
  • 30. Jobs CI — Why? ▪ Managing hundreds of jobs in Databricks UI ▪ Each job has associated notebook, config, dbfs files ▪ No inventory of which jobs should be running, where ▪ We need job linting >>>
  • 32. Deploy/Reconfigure Jobs with Single PR CI System Config Linter Stacks CLI Jobs Helper Deploy Job/ Notebooks/Files Kickstart/Restart Set Permissions
  • 33. Cool Things with Jobs CI! ▪ Deploy or reconfigure many jobs concurrently ▪ Auto job restarts on notebook/ config change ▪ Standardization of retries, timeout, permissions ▪ Automate alarm creation for new jobs ^^^ No one likes manually crafting Stacks JSON — so we generate it
  • 34. Saving Time with Automated Historical Insight
  • 35. Problem #1 — Cyclical Investigations ▪ Alert comes in, analysts spend hours looking into it ▪ But the same thing happened 3 months ago and was determined to be benign ▪ Lots of wasted cycles on duplicative investigations
  • 36. Problem #2 — Disparate Context ▪ Want to find historical incident data? ▪ look in many different places ▪ many search UIs, syntaxes ▪ Manual, slow & painful ▪ New analysts won’t have historical knowledge
  • 37. Problem #3 — Finding Patterns Which incidents relate to other incidents? Do we see common infrastructure, actors? How much work is repeated? Case #55557 Case #44447 Case #33337 }(Some IP Address)
  • 38. Solution: Document Recommendations ▪ Collect all incident-related tickets, correspondence, and investigations ▪ Normalize them into a Delta table ▪ Automate suggestion of related knowledge using our own corpus of documents Emails Tickets Alerts Notebooks Detection Code Wikis
  • 39. “Has This Happened Before?” -> Automated Includes analyst comments and verdicts displayHTML suggestions, clickable links to original document
  • 40. Automated Suggestions alert_hash 112233445 serial = C12345678ABC src_ip = 88.88.88.123 mime_type = [“text/html"] dt = 2020-03-15 C12345678ABC 88.88.88.123 Entities C12345678ABC 88.88.88.123 00:de:ad:00:be:ef 01:8b:ad:00:f0:0d joshsaccount joshshostname Enriched Entities { Emails Tickets Alerts Notebooks Detections Wikis Document Search Suggestion
  • 41. Anatomy of an Alert These are not valuable for search! (too common) These are good indicators of document relevance
  • 42. Entity Tokenization and Enrichment IP Address Regex Domain Hashes Accounts Serials UDIDs File Path Emails MAC Addresses Alert Payload VPN Sessions Enrichments DHCP Sessions Asset Data Account Data
  • 43. Suggestion Algorithm ▪ Gather match statistics for each entity: ▪ historical rarity ▪ document count rarity ▪ doc type distribution ▪ Compute entity weight based on average ranked percentiles of those features ▪ More common terms == less valuable ▪ Return the best n hits by confidence ▪ Not That Expensive™
  • 44. Q & A ?
  • 45. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.