Towards Data Operations
Dr. Andrea Monacchi
Streaming data, models and code as first-class citizens
1. About Myself
2. Big Data Architectures
3. Stream-based computing
4. Integrating the Data Science workflow
Summary
About
● BS Computer Science (2010)
● MS Computer Science (2012)
● PhD Information Technology (2016)
● Consultancy experience @ Reply DE (2016-2018)
● Independent Consultant
Big Data
Architectures
Big Data: why are we here?
Loads:
● Transactional (OLTP)
○ all operations
○ ACID properties: atomicity, consistency, isolation,
and durability
● Analytical (OLAP)
○ append-heavy loads
○ aggregations and explorative queries (analytics)
○ hierarchical indexing (OLAP hyper-cubes)
Scalability:
● ACID properties costly
● CAP Theorem
○ a distributed data store cannot simultaneously
fulfill all 3 properties: consistency, availability,
partition tolerance (tolerance to communication
errors)
○ CA are classic RDBMS - vertical scaling only
○ CP (e.g. quorum-based) and AP are NoSQL DBs
○ e.g. Amazon DynamoDB (eventual consistency, AP)
● NoSQL databases
○ relax ACID properties to achieve horizontal scaling
Scalability
● Key/Value Storage (DHT)
○ decentralised, scalable, fault-tolerant
○ easy to partition and distribute data by key
■ e.g. p = hash(k) % num_partitions
○ replication (partition redundancy)
■ 2: error detection
■ 3: error recovery
● Parallel collections
○ e.g. Spark RDD based on underlying HDFS blocks
○ master-slave (or driver-worker) coordination
○ no shared variables (except accumulators, broadcast vars)
Parallel computation:
1. Split dataset into multiple partitions/shards
2. Independently process each partition
3. Combine partitions into result
● MapReduce (Functional Programming)
● Split-apply-combine
● Google’s MapReduce
● Cluster Manager
○ Yarn, Mesos, K8s
○ resource isolation and scheduling
○ security, fault-tolerance, monitoring
● Data Serialization formats
○ Text: CSV, JSON, XML,..
○ Binary: SeqFile, Avro, Parquet, ORC, ..
● Batch Processing
○ Hadoop MapReduce variants
■ Pig, Hive
○ Apache Spark
○ Python Dask
● Workflow management tools
○ Fault-tolerant task coordination
○ Oozie, Airflow, Argo (K8s)
Architectures for Data Analytics
● Stages
○ Ingestion (with retention)
○ (re)-processing
○ Presentation/Serving (indexed data, OLAP)
● Lambda Vs. Kappa architecture
○ batch for correctness, streaming for speed
○ mix of technologies (CAP theorem)
○ complexity & operational costs
Stream-based
computing
1st phase - Ingestion: MQTT
● Pub/Sub
○ fully decoupled clients
○ messages can be explicitly retained (flag)
● QoS
○ multilevel (0: at most once, 1: at least once, 2: exactly once)
● Persistent sessions
○ broker keeps further info of clients to speed up reconnections
○ queuing of messages in QoS 1 and 2 (disconnections)
○ queuing of all acks for QoS 2
● Last will and Testament messages (LWT)
○ stored by the broker and published on ungraceful disconnect; discarded once a proper client disconnect is sent to the broker
● fault-tolerant pub/sub system with message retention
● exactly-once semantics (since v0.11.0) using the transactional.id setting (and acks)
● topic as an append-only file log
○ ordering by arrival time (offset)
○ stores changes on the data source (deltas)
○ topic/log consists of multiple partitions
■ partitioning instructed by producer
■ guaranteed message ordering within partition
■ based on message key
● hash(k) % num_partitions
● if k == null, then round robin is used
■ distribution and replication by partition
● 1 elected active (leader) and n replicas
● Zookeeper
1st phase - Ingestion: Kafka
[Diagram: a producer P writing to Partition0-Partition5, distributed across Broker1, Broker2 and Broker3]
● topic maps to local directory
○ 1 log file per topic partition
○ log rolling: when a size or time limit is reached, the
partition file is rolled and a new one is created
■ log.roll.ms or log.roll.hours = 168
○ log retention:
■ save rolled logs as segment files
■ older segments deleted after log.retention.hours
or beyond log.retention.bytes
○ log compaction:
■ deletion of older records by key (latest value)
■ log.cleanup.policy=compact on topic
● number of topic partitions (howto)
○ more data -> more partitions
○ proportional to desired throughput
○ overhead for open file handles and TCP connections
■ n partitions, each with m replicas -> n*m replicas to maintain
■ rule of thumb: total partitions in a cluster < 100 *
num_brokers * replication_factor (shared across
all topics)
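The keep-latest-value-per-key behavior of log compaction can be simulated in a few lines (a sketch, not Kafka's actual log cleaner; here None models a tombstone record that deletes its key):

```python
def compact(log):
    """Simulate log compaction: keep only the latest value per key,
    preserving offset order of the surviving records.
    A value of None is a tombstone and removes the key entirely."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later offsets overwrite earlier ones
    survivors = sorted(
        (off, key, val) for key, (off, val) in latest.items() if val is not None
    )
    return [(key, val) for _, key, val in survivors]

log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", None)]
assert compact(log) == [("k1", "c"), ("k3", "d")]
```

The compacted log is exactly the changelog a downstream table would need to rebuild its latest state, which is why compaction pairs naturally with the stream/table duality discussed later.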
1st phase - Ingestion: Kafka
● APIs
○ consumer / producer (single thread)
○ Kafka connect
○ KSQL
● Producer (Stateless wrt Broker)
○ ProducerRecord: (topic : String, partition : int,
timestamp : long, key : K, value : V)
○ partition id and timestamp are optional (for manual override)
● Kafka Connect
○ API for connectors (Source Vs Sink)
○ automatic offset management (commit and restore)
○ at least once delivery
○ exactly once only for certain connectors
○ standalone Vs. distributed modes
○ connectors configurable via REST interface
1st phase - Ingestion: Kafka
● Consumer
○ stateful (tracks its own offset per topic partition)
○ earliest Vs. latest recovery
○ offset committing: manual or periodic (default 5 secs)
○ consumers can be organized in load-balanced groups (one consumer per partition)
○ ideally: number of consumer threads = number of topic partitions
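The one-consumer-per-partition balancing can be illustrated with a toy round-robin-style assignor (a simplification; real Kafka consumer groups negotiate assignments through the group coordinator using pluggable strategies):

```python
def assign_partitions(partitions, consumers):
    """Sketch of consumer-group balancing: each partition has exactly one
    owner within the group; with more consumers than partitions, the
    extras sit idle, which is why threads = partitions is the sweet spot."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign_partitions(partitions=[0, 1, 2, 3], consumers=["c1", "c2"])
assert a == {"c1": [0, 2], "c2": [1, 3]}

# with as many consumers as partitions, each one owns exactly one partition
b = assign_partitions([0, 1, 2], ["c1", "c2", "c3"])
assert all(len(owned) == 1 for owned in b.values())
```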
2nd phase - Stream processing
● streams
○ bounded - can be ingested and batch processed
○ unbounded - processed per event
● stream partitions
○ unit of parallelism
● stream and table duality
○ log changes <-> table
○ Declarative SQL APIs (e.g. KSQL, TableAPI)
● time
○ event time, ingestion time, processing time
○ late message handling (pre-buffering)
● windowing
○ bounded aggregations
○ time or data driven (e.g. count) windows
○ tumbling, sliding and session windows
● stateful operations/transformations
○ state: intermediate result, in-memory key-value store
used across windows or microbatches
○ e.g. RocksDB (Flink+KafkaStreams), LSM-tree
○ e.g. changeLogging back to Kafka topic (KafkaStreams)
● checkpointing
○ save app status for failure recovery (stream replay)
● frameworks
○ Spark Streaming (micro-batching), Kafka Streams and Flink
○ Apache Storm, Apache Samza
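Tumbling windows over event time can be sketched as bucketing each event by its window start (an illustrative in-memory analogue of what Flink or Kafka Streams do with their state stores; field names are made up):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Bounded aggregation over an unbounded stream: bucket each event by
    its event time into fixed, non-overlapping (tumbling) windows and
    count occurrences per (window, key)."""
    counts = defaultdict(int)
    for event_time_ms, key in events:
        window_start = (event_time_ms // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(100, "a"), (450, "a"), (999, "b"), (1000, "a"), (1700, "b")]
assert tumbling_window_counts(events, window_ms=1000) == {
    (0, "a"): 2, (0, "b"): 1, (1000, "a"): 1, (1000, "b"): 1,
}
```

Because the bucketing uses the event's own timestamp rather than arrival time, a late message still lands in its correct window, which is the behavior the pre-buffering / watermarking machinery of real engines exists to support.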
Code Examples
Flink
● processing webserver logs for frauds
● count downloaded assets per user
● https://github.com/pilillo/flink-quickstart
● Deploy to Kubernetes
SparkStreaming & Kafka
● https://github.com/pilillo/sparkstreaming-quickstart
Kafka Streams
● project skeleton
Integrating the
Data Science
workflow
Data Science workflow
Technical gaps potentially resulting from this process!
● Data Analytics projects
○ stream of research questions and feedback
● Data forked for exploration
○ Data versioning
○ Periodic data quality assessment
● Team misalignment
○ scientists working aside dev team
○ information gaps w/ team & stakeholders
○ unexpected behaviors upon changes
● Results may not be reproducible
● Releases are not frequent
○ Value misalignment (waste of resources)
● CICD only used for data preparation
Data Operations (DataOps)
● DevOps approaches
○ lean product development (continuous feedback and value delivery) using CICD approaches
○ cross-functional teams with mix of development & operations skillset
● DataOps
○ devops for streaming data and analytics as a manufacturing process - Manifesto, Cookbook
○ mix of data engineering and data science skill set
○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
Data Operations workflow
Data Science workflow
CICD activities
CICD for ML
● Continuous Data Quality Assessment
○ data versioning (e.g. Pachyderm)
○ syntax (e.g. Confluent Schema Registry)
○ semantic (e.g. Apache Griffin) - accuracy, completeness, timeliness, uniqueness, validity, consistency
● Model Tuning
○ hyperparameter tuning - black-box optimization
○ autoML - model selection
○ continuous performance evaluation (wrt newer input data)
○ stakeholder/user performance (e.g. A/B testing)
● Model Deployment
○ TensorFlow Serving, Seldon
● ML workflow management
○ Amazon SageMaker, Google ML Engine, MLflow, Kubeflow, Polyaxon
CICD for ML code
See also: https://github.com/EthicalML/awesome-machine-learning-operations
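A semantic quality check along the dimensions listed above can start as simply as computing per-dataset ratios (a toy sketch of what tools like Apache Griffin do at scale; the record layout and field names are illustrative):

```python
def quality_report(rows, required_fields, unique_field):
    """Toy semantic data-quality assessment: completeness (share of rows
    with all required fields populated) and uniqueness (share of distinct
    identifiers), each reported as a ratio in [0, 1] suitable for
    continuous monitoring in a CICD pipeline."""
    total = len(rows)
    complete = sum(
        1 for r in rows if all(r.get(f) is not None for f in required_fields)
    )
    ids = [r.get(unique_field) for r in rows]
    return {
        "completeness": complete / total,
        "uniqueness": len(set(ids)) / total,
    }

rows = [
    {"id": 1, "value": 10},
    {"id": 2, "value": None},   # incomplete record
    {"id": 2, "value": 30},     # duplicate id
    {"id": 3, "value": 40},
]
report = quality_report(rows, required_fields=["id", "value"], unique_field="id")
assert report == {"completeness": 0.75, "uniqueness": 0.75}
```

Run continuously against newly ingested data, a drop in such ratios is exactly the kind of signal a DataOps pipeline can gate releases on.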
Data-Mill project
● Based on Kubernetes
○ open and scalable
○ seamless integration of bare-metal and cloud-provided clusters
● Enforcing DataOps principles
○ continuous asset monitoring (code, data, models)
○ open-source tools to reproduce and serve models
● Flavour-based organization of components
○ flavour = cluster_spec + SW_components
● Built-in exploration environments (dashboarding tools, jupyter notebooks with DS libraries)
https://data-mill-cloud.github.io/data-mill/
Thank you!
