Towards Data Operations
Dr. Andrea Monacchi
Streaming data, models and code as first-class citizens
Summary
1. About Myself
2. Big Data Architectures
3. Stream-based computing
4. Integrating the Data Science workflow
About
● BS Computer Science (2010)
● MS Computer Science (2012)
● PhD Information Technology (2016)
● Consultancy experience @ Reply DE (2016-2018)
● Independent Consultant
Big Data Architectures
Big Data: why are we here?
Workloads:
● Transactional (OLTP)
○ all operations (reads, inserts, updates, deletes)
○ ACID properties: atomicity, consistency, isolation, and durability
● Analytical (OLAP)
○ append-heavy loads
○ aggregations and explorative queries (analytics)
○ hierarchical indexing (OLAP hyper-cubes)
Scalability:
● ACID properties costly
● CAP Theorem
○ a distributed data store cannot simultaneously guarantee all 3 properties: consistency, availability, partition tolerance (tolerance to communication errors)
○ CA: classic RDBMS - vertical scaling only
○ CP (e.g. quorum-based) and AP: NoSQL DBs
○ e.g. DynamoDB (eventual consistency, AP)
● NoSQL databases
○ relax ACID properties to achieve horizontal scaling
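The quorum-based CP case mentioned above can be illustrated with the usual overlap condition: with N replicas, a write quorum W and a read quorum R, every read quorum intersects every write quorum when R + W > N. This is a minimal sketch; the function name and the sample values are assumptions chosen for illustration.

```python
def quorums_overlap(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """Return True when every read quorum shares a replica with every write quorum."""
    if not (1 <= read_quorum <= n_replicas and 1 <= write_quorum <= n_replicas):
        raise ValueError("quorum sizes must be between 1 and n_replicas")
    return read_quorum + write_quorum > n_replicas


# N=3 with majority reads and writes: quorums overlap.
print(quorums_overlap(3, 2, 2))
# N=3 with single-replica reads and writes: no guaranteed overlap.
print(quorums_overlap(3, 1, 1))
```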
Scalability
● Key/Value Storage (DHT)
○ decentralised, scalable, fault-tolerant
○ easy to partition and distribute data by key
■ e.g. p = hash(k) % num_partitions
○ replication (partition redundancy)
■ 2 replicas: error detection
■ 3 replicas: error recovery
● Parallel collections
○ e.g. Spark RDD based on underlying HDFS blocks
○ master-slave (or driver-worker) coordination
○ no shared variables (only accumulators and broadcast vars)
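The key-based partitioning rule p = hash(k) % num_partitions and the idea of partition redundancy can be sketched as follows. The MD5-based hash and the "consecutive nodes" replica placement are assumptions made for this example (Python's built-in hash() is randomized per process for strings, so a stable hash is used instead); real key/value stores use their own schemes.

```python
import hashlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition: p = hash(k) % num_partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replica_nodes(partition: int, num_nodes: int, replication_factor: int) -> List[int]:
    """Place the replicas of a partition on consecutive nodes (illustrative policy)."""
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [(partition + i) % num_nodes for i in range(replication_factor)]


p = partition_for("user-42", num_partitions=8)
print(p, replica_nodes(p, num_nodes=4, replication_factor=3))
```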
Parallel computation:
1. Split dataset into multiple partitions/shards
2. Independently process each partition
3. Combine partitions into result
● MapReduce (Functional Programming)
● Split-apply-combine
● Google’s MapReduce
● Cluster Manager
○ Yarn, Mesos, K8s
○ resource isolation and scheduling
○ security, fault-tolerance, monitoring
● Data Serialization formats
○ Text: CSV, JSON, XML, ...
○ Binary: SequenceFile, Avro, Parquet, ORC, ...
● Batch Processing
○ Hadoop MapReduce variants
■ Pig, Hive
○ Apache Spark
○ Python Dask
● Workflow management tools
○ Fault-tolerant task coordination
○ Oozie, Airflow, Argo (K8s)
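The three steps listed under "Parallel computation" (split, process each partition independently, combine) can be sketched in plain Python with a word count. This is a single-process illustration with hypothetical helper names; a framework such as Spark or Hadoop MapReduce would run the per-partition step on different workers.

```python
from collections import Counter
from functools import reduce


def split(records, num_partitions):
    """Step 1: split the dataset into partitions (round-robin by index)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process(partition):
    """Step 2: process one partition independently (map): count its words."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def combine(partials):
    """Step 3: merge the partial results into the final one (reduce)."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["a b a", "b c", "a c c"]
result = combine(map(process, split(lines, 2)))
print(dict(result))
```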
Architectures for Data Analytics
● Stages
○ Ingestion (with retention)
○ (re)-processing
○ Presentation/Serving (indexed data, OLAP)
● Lambda Vs. Kappa architecture
○ batch for correctness, streaming for speed
○ mix of technologies (CAP theorem)
○ complexity & operational costs
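One way to picture "batch for correctness, streaming for speed" in a Lambda architecture is a serving step that merges a precomputed batch view with the increments produced by the speed layer since the last batch run. The sketch below assumes both views are simple per-key counters; the function name and data are hypothetical.

```python
def serve(batch_view: dict, speed_view: dict) -> dict:
    """Merge the batch view with the more recent increments of the speed layer."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


batch_view = {"page_a": 10, "page_b": 4}   # recomputed periodically from all data
speed_view = {"page_a": 2, "page_c": 1}    # events seen since the last batch run
print(serve(batch_view, speed_view))
```

In a Kappa architecture the batch view disappears: the same result is obtained by replaying the retained log through the streaming job.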
Stream-based computing
1st phase - Ingestion: MQTT
● Pub/Sub
○ fully decoupled clients
○ messages can be explicitly retained (flag)
● QoS
○ multilevel (0: at most once, 1: at least once, 2: exactly once)
● Persistent sessions
○ broker stores client session state to speed up reconnections
○ QoS 1 and 2 messages are queued while the client is disconnected
○ all acks are queued for QoS 2
● Last Will and Testament messages (LWT)
○ kept by the broker until the client sends a proper disconnect; published if the client disconnects ungracefully
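The retained flag and the Last Will and Testament behaviour described above can be illustrated with a toy in-memory broker. This is not an MQTT client or server: ToyBroker and its methods are hypothetical, there is no network, and QoS levels and persistent sessions are left out.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for an MQTT broker (illustration only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> callbacks
        self.retained = {}                    # topic -> last retained payload
        self.wills = {}                       # client_id -> (topic, payload)

    def connect(self, client_id, will=None):
        if will is not None:
            self.wills[client_id] = will

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)
        # A new subscriber immediately receives the retained message, if any.
        if topic in self.retained:
            callback(topic, self.retained[topic])

    def publish(self, topic, payload, retain=False):
        if retain:
            self.retained[topic] = payload
        for callback in self.subscribers[topic]:
            callback(topic, payload)

    def disconnect(self, client_id, graceful=True):
        will = self.wills.pop(client_id, None)
        # The will is published only when the client goes away without a proper disconnect.
        if will is not None and not graceful:
            self.publish(*will)


received = []
broker = ToyBroker()
broker.connect("sensor-1", will=("status/sensor-1", "offline"))
broker.publish("temp/room1", "21.5", retain=True)
broker.subscribe("temp/room1", lambda t, p: received.append((t, p)))
broker.subscribe("status/sensor-1", lambda t, p: received.append((t, p)))
broker.disconnect("sensor-1", graceful=False)
print(received)
```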
1st phase - Ingestion: Kafka
● fault-tolerant pub/sub system with message retention
● exactly-once semantics (from v0.11.0) using the transactional.id producer setting (and acks)
● topic as an append-only file log
○ ordering by arrival time (offset)
○ stores changes on the data source (deltas)
○ topic/log consists of multiple partitions
■ partitioning instructed by producer
■ guaranteed message ordering within partition
■ based on message key
● hash(k) % num_partitions
● if k == null, then round robin is used
■ distribution and replication by partition
● 1 elected active (leader) and n replicas
● ZooKeeper
[Figure: a producer (P) writing to Partition0-Partition5 of a topic, distributed over Broker1, Broker2 and Broker3]
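The partitioning rules of a Kafka topic (hash of the key modulo the number of partitions, round robin for null keys, ordering by offset within a partition) can be sketched with a toy append-only log. ToyTopic is hypothetical, and zlib.crc32 is only a stand-in for the hash function used by the actual Kafka partitioner.

```python
import itertools
import zlib


class ToyTopic:
    """Toy topic: one append-only list per partition (illustration only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._round_robin = itertools.cycle(range(num_partitions))

    def _choose_partition(self, key):
        if key is None:
            # No key: spread records over the partitions in turn.
            return next(self._round_robin)
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return (partition, offset)."""
        p = self._choose_partition(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = ToyTopic(num_partitions=3)
p1, o1 = topic.append("user-1", "login")
p2, o2 = topic.append("user-1", "click")
# Same key, same partition: the two records keep their relative order.
print((p1, o1), (p2, o2))
```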
● each topic partition maps to a local directory
○ records are stored in segment files within the partition directory
○ log rolling: upon size or time limit the active segment file is rolled and a new one is created
■ log.roll.ms or log.roll.hours (default 168)
○ log retention:
■ rolled logs are kept as segment files
■ older segments are deleted according to log.retention.hours or log.retention.bytes
○ log compaction:
■ deletion of older records by key (only the latest value is kept)
■ log.cleanup.policy=compact (cleanup.policy at topic level)
● number of topic partitions (howto)
○ more data -> more partitions
○ proportional to desired throughput
○ overhead for open file handles and TCP connections
■ n partitions, each with m replicas: n*m partition replicas overall
■ total partitions on a broker < 100 * num_brokers * replication_factor (budget to be divided among all topics)
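Log compaction, as described above, keeps only the latest value for each key. The following sketch applies that rule to a list of (key, value) records and keeps the original offsets of the survivors; it is a simplification with a hypothetical function name and ignores details of the real cleaner such as tombstones and the active segment.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
print(compact(log))
```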
1st phase - Ingestion: Kafka
● APIs
○ consumer / producer (single thread)
○ Kafka connect
○ KSQL
● Producer (Stateless wrt Broker)
○ ProducerRecord: (topic : String, partition : int, timestamp : long, key : K, value : V)
○ partition id and timestamp are optional (set them for manual control)
● Kafka Connect
○ API for connectors (Source Vs Sink)
○ automatic offset management (commit and restore)
○ at least once delivery
○ exactly once only for certain connectors
○ standalone Vs. distributed modes
○ connectors configurable via REST interface
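The shape of a ProducerRecord, with its optional partition and timestamp, can be mirrored in a small Python data class. This is only an illustrative model of the fields listed above, not the Kafka client API; the field order and defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")


@dataclass(frozen=True)
class ProducerRecord(Generic[K, V]):
    """Illustrative mirror of the record fields; not the real client class."""
    topic: str
    value: V
    key: Optional[K] = None
    partition: Optional[int] = None   # None: let the partitioner decide
    timestamp: Optional[int] = None   # epoch milliseconds; None: assigned when sending


auto = ProducerRecord(topic="clicks", key="user-1", value="home")
manual = ProducerRecord(topic="clicks", value="home", partition=2, timestamp=1_700_000_000_000)
print(auto)
print(manual)
```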
1st phase - Ingestion: Kafka
● Consumer
○ stateful (keeps its own offset for each topic partition)
○ earliest vs. latest offset on recovery (auto.offset.reset)
○ offset committing: manual or periodic (auto-commit, default every 5 secs)
○ consumers can be organized in load-balanced groups (per partition)
○ ideally: consumer threads = topic partitions
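Load balancing inside a consumer group works per partition: each partition is read by exactly one consumer of the group, so consumers beyond the number of partitions stay idle. The sketch below uses a plain round-robin policy with a hypothetical function name; the assignment strategies of the real client are more elaborate.

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer of the group (round robin)."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# 4 partitions, 2 consumers: the load is split evenly.
print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# 2 partitions, 3 consumers: one consumer receives nothing.
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
```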
2nd phase - Stream processing
● streams
○ bounded - can be ingested and batch processed
○ unbounded - processed per event
● stream partitions
○ unit of parallelism
● stream and table duality
○ log changes <-> table
○ Declarative SQL APIs (e.g. KSQL, TableAPI)
● time
○ event time, ingestion time, processing time
○ late message handling (pre-buffering)
● windowing
○ bounded aggregations
○ time or data driven (e.g. count) windows
○ tumbling, sliding and session windows
● stateful operations/transformations
○ state: intermediate result, in-memory key-value store used across windows or microbatches
○ e.g. RocksDB (Flink+KafkaStreams), LSM-tree
○ e.g. changelogging back to a Kafka topic (KafkaStreams)
● checkpointing
○ save app status for failure recovery (stream replay)
● frameworks
○ Spark Streaming (micro-batching), Kafka Streams and Flink
○ Apache Storm, Apache Samza
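A tumbling window based on event time can be sketched in a few lines: each event falls into exactly one fixed-size window, determined by its own timestamp rather than by its arrival order. The function name, the integer timestamps and the sample events are assumptions for this example; stream processors add watermarks, state backends and checkpointing on top of this idea.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key); events are (event_time, key) pairs."""
    counts = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# The last event arrives late (after time 12) but still lands in window [0, 10).
events = [(1, "alice"), (4, "bob"), (9, "alice"), (12, "alice"), (3, "alice")]
window_counts = tumbling_window_counts(events, window_size=10)
print(window_counts)
```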
Code Examples
Flink
● processing webserver logs for fraud detection
● count downloaded assets per user
● https://github.com/pilillo/flink-quickstart
● Deploy to Kubernetes
SparkStreaming & Kafka
● https://github.com/pilillo/sparkstreaming-quickstart
Kafka Streams
● project skeleton
Integrating the Data Science workflow
Data Science workflow
Technical gaps potentially resulting from this process!
● Data Analytics projects
○ stream of research questions and feedback
● Data forked for exploration
○ Data versioning
○ Periodic data quality assessment
● Team misalignment
○ scientists working apart from the dev team
○ information gaps w/ team & stakeholders
○ unexpected behaviors upon changes
● Results may not be reproducible
● Releases are not frequent
○ Value misalignment (waste of resources)
● CICD only used for data preparation
Data Operations (DataOps)
● DevOps approaches
○ lean product development (continuous feedback and value delivery) using CICD approaches
○ cross-functional teams with mix of development & operations skillset
● DataOps
○ DevOps for streaming data and analytics, treated as a manufacturing process (see the DataOps Manifesto and Cookbook)
○ mix of data engineering and data science skill set
○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
Data Operations workflow
[Figure: Data Science workflow and related CICD activities]
CICD for ML
● Continuous Data Quality Assessment
○ data versioning (e.g. Pachyderm)
○ syntax (e.g. Confluent Schema Registry)
○ semantic (e.g. Apache Griffin) - accuracy, completeness, timeliness, uniqueness, validity, consistency
● Model Tuning
○ hyperparameter tuning - black-box optimization
○ autoML - model selection
○ continuous performance evaluation (wrt newer input data)
○ stakeholder/user performance (e.g. A/B testing)
● Model Deployment
○ TensorFlow Serving, Seldon
● ML workflow management
○ Amazon Sagemaker, Google ML Engine, MLflow, Kubeflow, Polyaxon
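Some of the semantic data quality dimensions listed above (completeness, uniqueness, validity) can be computed with simple ratios over a batch of records. The sketch below is a plain-Python illustration with hypothetical names and rules; it does not use the API of Apache Griffin or of any other tool mentioned here.

```python
def quality_report(rows, required, unique_key, validators):
    """Compute completeness, uniqueness and validity ratios for a batch of dict rows."""
    total = len(rows)
    if total == 0:
        raise ValueError("no rows to assess")
    # Completeness: share of rows with all required fields present.
    complete = sum(all(row.get(f) is not None for f in required) for row in rows)
    # Uniqueness: distinct key values over total rows.
    distinct = len({row.get(unique_key) for row in rows})
    # Validity: share of rows whose checked fields are present and pass their rule.
    valid = sum(
        all(row.get(f) is not None and check(row[f]) for f, check in validators.items())
        for row in rows
    )
    return {
        "completeness": complete / total,
        "uniqueness": distinct / total,
        "validity": valid / total,
    }


rows = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},    # incomplete
    {"id": 2, "temp": 300.0},   # duplicate id, out-of-range value
    {"id": 3, "temp": 19.0},
]
report = quality_report(
    rows,
    required=["id", "temp"],
    unique_key="id",
    validators={"temp": lambda t: -50.0 <= t <= 60.0},
)
print(report)
```

Run as a pipeline step on every new batch, such a report can gate a release when a ratio drops below an agreed threshold.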
CICD for ML code
See also: https://github.com/EthicalML/awesome-machine-learning-operations
Data-Mill project
● Based on Kubernetes
○ open and scalable
○ seamless integration of bare-metal and cloud-provided clusters
● Enforcing DataOps principles
○ continuous asset monitoring (code, data, models)
○ open-source tools to reproduce and serve models
● Flavour-based organization of components
○ flavour = cluster_spec + SW_components
● Built-in exploration environments (dashboarding tools, jupyter notebooks with DS libraries)
https://data-mill-cloud.github.io/data-mill/
Thank you!
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 

Towards Data Operations

  • 1. Towards Data Operations: Streaming data, models and code as first-class citizens. Dr. Andrea Monacchi
  • 2. Summary
    1. About Myself
    2. Big Data Architectures
    3. Stream-based computing
    4. Integrating the Data Science workflow
  • 3. About
    ● BS Computer Science (2010)
    ● MS Computer Science (2012)
    ● PhD Information Technology (2016)
    ● Consultancy experience @ Reply DE (2016-2018)
    ● Independent Consultant
  • 5. Big Data: why are we here?
    Loads:
    ● Transactional (OLTP)
      ○ all operations
      ○ ACID properties: atomicity, consistency, isolation, and durability
    ● Analytical (OLAP)
      ○ append-heavy loads
      ○ aggregations and explorative queries (analytics)
      ○ hierarchical indexing (OLAP hyper-cubes)
    Scalability:
    ● ACID properties are costly
    ● CAP Theorem
      ○ impossible to distribute the data load and simultaneously fulfil all 3 properties: consistency, availability, partition tolerance (tolerance to communication errors)
      ○ CA: classic RDBMS, vertical scaling only
      ○ CP (e.g. quorum-based) and AP: NoSQL DBs
      ○ e.g. DynamoDB (eventual consistency, AP)
    ● NoSQL databases
      ○ relax ACID properties to achieve horizontal scaling
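The quorum-based CP case mentioned above is commonly explained with an overlap condition: with N replicas, a read quorum R and a write quorum W, every read overlaps the latest write when R + W > N. The slides do not state this formula; the function name and the sample values below are assumptions used only to illustrate the idea.

```python
def quorums_overlap(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """Return True when every read quorum shares at least one replica with every write quorum."""
    return read_quorum + write_quorum > n_replicas


# Hypothetical configurations for a 3-replica store.
strict = quorums_overlap(n_replicas=3, read_quorum=2, write_quorum=2)   # reads see the latest write
relaxed = quorums_overlap(n_replicas=3, read_quorum=1, write_quorum=1)  # reads may be stale (AP-style)
print(strict, relaxed)
```

The relaxed configuration trades consistency for availability, which is the eventual-consistency behaviour the slide attributes to AP stores.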
  • 6. Scalability
    ● Key/Value Storage (DHT)
      ○ decentralised, scalable, fault-tolerant
      ○ easy to partition and distribute data by key
        ■ e.g. p = hash(k) % num_partitions
      ○ replication (partition redundancy)
        ■ 2 replicas: error detection
        ■ 3 replicas: error recovery
    ● Parallel collections
      ○ e.g. Spark RDD based on underlying HDFS blocks
      ○ master-slave (or driver-worker) coordination
      ○ no shared variables (only accumulators and broadcast variables)
    Parallel computation:
    1. Split dataset into multiple partitions/shards
    2. Independently process each partition
    3. Combine partitions into result
    ● MapReduce (Functional Programming)
    ● Split-apply-combine
    ● Google’s MapReduce
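The three steps of the parallel computation above can be sketched as a word count in plain Python. This is a minimal single-process illustration, not how Spark or Hadoop implement it; `zlib.crc32` is an assumed stand-in for `hash(k)` (Python's built-in `hash` is salted per process for strings), and all function names are hypothetical.

```python
import zlib
from collections import Counter
from functools import reduce


def partition_of(key: str, num_partitions: int) -> int:
    # p = hash(k) % num_partitions, with crc32 as a deterministic hash
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split(records, num_partitions):
    # Step 1: split the dataset into partitions by key
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_of(record, num_partitions)].append(record)
    return partitions


def count_words(partition):
    # Step 2: process each partition independently (the "map"/"apply" step)
    return Counter(partition)


def combine(partials):
    # Step 3: merge the partial results (the "reduce"/"combine" step)
    return reduce(lambda a, b: a + b, partials, Counter())


words = "to be or not to be".split()
parts = split(words, num_partitions=3)
result = combine(count_words(p) for p in parts)
print(result)
```

Because the partition is derived from the key, all occurrences of the same word land in the same partition, so each partial count is already complete for its keys.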
  • 7. Architectures for Data Analytics
    ● Cluster Manager
      ○ Yarn, Mesos, K8s
      ○ resource isolation and scheduling
      ○ security, fault-tolerance, monitoring
    ● Data Serialization formats
      ○ Text: CSV, JSON, XML, ...
      ○ Binary: SequenceFile, Avro, Parquet, ORC, ...
    ● Batch Processing
      ○ Hadoop MapReduce variants
        ■ Pig, Hive
      ○ Apache Spark
      ○ Python Dask
    ● Workflow management tools
      ○ Fault-tolerant task coordination
      ○ Oozie, Airflow, Argo (K8s)
    ● Stages
      ○ Ingestion (with retention)
      ○ (re)-processing
      ○ Presentation/Serving (indexed data, OLAP)
    ● Lambda vs. Kappa architecture
      ○ batch for correctness, streaming for speed
      ○ mix of technologies (CAP theorem)
      ○ complexity & operational costs
  • 9. 1st phase - Ingestion: MQTT
    ● Pub/Sub
      ○ Totally decoupled clients
      ○ messages can be explicitly retained (flag)
    ● QoS
      ○ multilevel (0: at most once, 1: at least once, 2: exactly once)
    ● Persistent sessions
      ○ broker keeps additional client state to speed up reconnections
      ○ queuing of messages in QoS 1 and 2 (during disconnections)
      ○ queuing of all acks for QoS 2
    ● Last Will and Testament messages (LWT)
      ○ kept until a proper client disconnect is sent to the broker
  • 10. 1st phase - Ingestion: Kafka
    ● fault-tolerant pub/sub system with message retention
    ● exactly-once semantics (from v0.11.0) using the transactional.id setting (and acks)
    ● topic as an append-only file log
      ○ ordering by arrival time (offset)
      ○ stores changes on the data source (deltas)
      ○ topic/log consists of multiple partitions
        ■ partitioning instructed by producer
        ■ guaranteed message ordering within partition
        ■ based on message key
          ● hash(k) % num_partitions
          ● if k == null, then round robin is used
        ■ distribution and replication by partition
          ● 1 elected active (leader) and n replicas
          ● ZooKeeper
    [Figure: a producer (P) writing to Partition0 to Partition5, distributed across Broker1, Broker2 and Broker3]
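The key-based partitioning rule above can be sketched as follows. This is a toy model, not the Kafka client: `zlib.crc32` is an assumed stand-in for the hash (the Java client uses murmur2, and newer clients may batch keyless records differently), and the class and variable names are hypothetical.

```python
import itertools
import zlib
from typing import Optional


class SketchPartitioner:
    """Chooses a partition the way the slide describes: by key hash, else round robin."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key: Optional[bytes]) -> int:
        if key is None:
            # k == null: spread records evenly over the partitions
            return next(self._round_robin)
        # hash(k) % num_partitions: same key always maps to the same partition
        return zlib.crc32(key) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
log = {p: [] for p in range(3)}  # one append-only list per partition
messages = [(b"user-1", "a"), (b"user-2", "b"), (b"user-1", "c"), (None, "d"), (None, "e")]
for key, value in messages:
    p = partitioner.partition(key)
    offset = len(log[p])  # offsets grow monotonically within a partition
    log[p].append((offset, key, value))

print(log)
```

Since both `user-1` records end up in the same partition with increasing offsets, their relative order is preserved, which is the per-partition ordering guarantee listed above.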
  • 11. 1st phase - Ingestion: Kafka
    ● topic maps to local directory
      ○ 1 file created for each topic partition
      ○ log rolling: upon reaching a size or time limit, the partition file is rolled and a new one is created
        ■ log.roll.ms or log.roll.hours = 168
      ○ log retention:
        ■ rolled logs are saved as segment files
        ■ older segments are deleted according to log.retention.hours or log.retention.bytes
      ○ log compaction:
        ■ deletion of older records by key (only the latest value is kept)
        ■ log.cleanup.policy=compact on topic
    ● number of topic partitions (how-to)
      ○ more data -> more partitions
      ○ proportional to desired throughput
      ○ overhead for open file handles and TCP connections
        ■ n partitions, each with m replicas: n*m
        ■ total partitions on a broker < 100 * num_brokers * replication_factor (to be divided among all topics)
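Retention, compaction and the partition rule of thumb above can be illustrated with a small in-memory model. This is only a sketch of the policies, not of Kafka's log cleaner; the tuple layouts and function names are assumptions made for the example.

```python
def compact(log):
    """log: list of (offset, key, value). Keep only the latest record per key."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    return sorted(latest.values())  # offsets are unique, so this restores log order


def apply_retention(segments, now_ms, retention_ms):
    """segments: list of (last_modified_ms, records). Drop segments older than the limit."""
    return [seg for seg in segments if now_ms - seg[0] <= retention_ms]


def partition_budget(num_brokers, replication_factor):
    """Rule of thumb from the slide: 100 * num_brokers * replication_factor."""
    return 100 * num_brokers * replication_factor


log = [(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c"), (3, "k3", "d"), (4, "k2", "e")]
compacted = compact(log)

segments = [(0, ["a"]), (5000, ["b"]), (9000, ["c"])]
kept = apply_retention(segments, now_ms=10000, retention_ms=6000)

print(compacted, kept, partition_budget(num_brokers=3, replication_factor=2))
```

Compaction keeps the surviving records at their original offsets, so consumers still see a consistent, ordered view of the latest value per key.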
  • 12. 1st phase - Ingestion: Kafka
    ● APIs
      ○ consumer / producer (single thread)
      ○ Kafka Connect
      ○ KSQL
    ● Producer (stateless with respect to the broker)
      ○ ProducerRecord: (topic : String, partition : int, timestamp : long, key : K, value : V)
      ○ partition id and timestamp are optional (for manual setup)
    ● Kafka Connect
      ○ API for connectors (Source vs. Sink)
      ○ automatic offset management (commit and restore)
      ○ at-least-once delivery
      ○ exactly-once only for certain connectors
      ○ standalone vs. distributed modes
      ○ connectors configurable via REST interface
    ● Consumer
      ○ stateful (tracks its own offset within the topic)
      ○ earliest vs. latest recovery
      ○ offset committing: manual or periodic (default 5 secs)
      ○ consumers can be organized in load-balanced groups (per partition)
      ○ ideally: consumer threads = topic partitions
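The consumer's offset handling above explains why delivery is at least once: records polled after the last commit are read again after a restart. The toy class below models this with a Python list as the partition log; it is not the Kafka consumer API, and every name in it is hypothetical.

```python
class SketchConsumer:
    """Tracks a read position and a committed offset over an in-memory partition log."""

    def __init__(self, log, committed=0):
        self.log = log
        self.committed = committed  # offset to resume from after a restart
        self.position = committed   # next offset to read

    def poll(self, max_records=2):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        # Manual commit: everything before `position` counts as processed
        self.committed = self.position

    def restart(self):
        # After a crash, reading resumes from the last committed offset
        self.position = self.committed


log = ["m0", "m1", "m2", "m3"]
processed = []
consumer = SketchConsumer(log)

processed += consumer.poll()   # m0, m1
consumer.commit()
processed += consumer.poll()   # m2, m3 processed, but the commit never happens
consumer.restart()             # simulated crash and recovery
processed += consumer.poll()   # m2, m3 are delivered a second time
consumer.commit()

print(processed)
```

Downstream processing therefore needs to be idempotent or deduplicate by offset if duplicates are not acceptable.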
  • 13. 2nd phase - Stream processing
    ● streams
      ○ bounded - can be ingested and batch processed
      ○ unbounded - processed per event
    ● stream partitions
      ○ unit of parallelism
    ● stream and table duality
      ○ changelog <-> table
      ○ Declarative SQL APIs (e.g. KSQL, Table API)
    ● time
      ○ event time, ingestion time, processing time
      ○ late message handling (pre-buffering)
    ● windowing
      ○ bounded aggregations
      ○ time- or data-driven (e.g. count) windows
      ○ tumbling, sliding and session windows
    ● stateful operations/transformations
      ○ state: intermediate result, in-memory key-value store used across windows or micro-batches
      ○ e.g. RocksDB (Flink + Kafka Streams), LSM-tree
      ○ e.g. changelogging back to a Kafka topic (Kafka Streams)
    ● checkpointing
      ○ save app status for failure recovery (stream replay)
    ● frameworks
      ○ Spark Streaming (micro-batching), Kafka Streams and Flink
      ○ Apache Storm, Apache Samza
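The three window types listed above can be sketched as counts over event-time timestamps. This is a batch-style illustration under simplifying assumptions (integer timestamps, no late events, no watermarks); the function names and sample events are hypothetical and unrelated to the Flink or Kafka Streams APIs.

```python
from collections import defaultdict


def tumbling(events, size):
    """Non-overlapping windows [start, start + size); returns {start: count}."""
    counts = defaultdict(int)
    for ts, _ in events:
        counts[ts - ts % size] += 1
    return dict(counts)


def sliding(events, size, slide):
    """Overlapping windows starting every `slide` units; an event may fall in several."""
    counts = defaultdict(int)
    for ts, _ in events:
        start = ts - ts % slide  # latest window start that contains ts
        while start > ts - size and start >= 0:
            counts[start] += 1
            start -= slide
    return dict(counts)


def session(events, gap):
    """Windows closed by inactivity longer than `gap`; returns [(start, end, count)]."""
    sessions = []
    for ts in sorted(ts for ts, _ in events):
        if sessions and ts - sessions[-1][1] <= gap:
            start, _, n = sessions[-1]
            sessions[-1] = (start, ts, n + 1)
        else:
            sessions.append((ts, ts, 1))
    return sessions


# (event_time, payload) pairs; the payload is ignored by the counts
events = [(1, "a"), (2, "b"), (6, "c"), (11, "d"), (12, "e"), (13, "f"), (30, "g")]
print(tumbling(events, size=10))
print(sliding(events, size=10, slide=5))
print(session(events, gap=3))
```

Tumbling and sliding windows are defined by the clock, whereas session windows are data-driven: their boundaries depend on the gaps between events.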
  • 14. Code Examples
    Flink
    ● processing webserver logs for fraud detection
    ● count downloaded assets per user
    ● https://github.com/pilillo/flink-quickstart
    ● Deploy to Kubernetes
    Spark Streaming & Kafka
    ● https://github.com/pilillo/sparkstreaming-quickstart
    Kafka Streams
    ● project skeleton
  • 16. Data Science workflow
    Technical gaps potentially resulting from this process!
    ● Data Analytics projects
      ○ stream of research questions and feedback
    ● Data forked for exploration
      ○ Data versioning
      ○ Periodic data quality assessment
    ● Team misalignment
      ○ scientists working apart from the dev team
      ○ information gaps with team & stakeholders
      ○ unexpected behaviors upon changes
    ● Results may not be reproducible
    ● Releases are not frequent
      ○ Value misalignment (waste of resources)
    ● CI/CD only used for data preparation
  • 17. Data Operations (DataOps)
    ● DevOps approaches
      ○ lean product development (continuous feedback and value delivery) using CI/CD approaches
      ○ cross-functional teams with a mix of development & operations skill sets
    ● DataOps
      ○ DevOps for streaming data and analytics, treated as a manufacturing process (see the DataOps Manifesto and Cookbook)
      ○ mix of data engineering and data science skill sets
      ○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
  • 18. [Figure: Data Operations workflow and Data Science workflow]
  • 21. CI/CD for ML code
    ● Continuous Data Quality Assessment
      ○ data versioning (e.g. Pachyderm)
      ○ syntax (e.g. Confluent Schema Registry)
      ○ semantics (e.g. Apache Griffin): accuracy, completeness, timeliness, uniqueness, validity, consistency
    ● Model Tuning
      ○ hyperparameter tuning - black-box optimization
      ○ AutoML - model selection
      ○ continuous performance evaluation (with respect to newer input data)
      ○ stakeholder/user performance (e.g. A/B testing)
    ● Model Deployment
      ○ TensorFlow Serving, Seldon
    ● ML workflow management
      ○ Amazon SageMaker, Google ML Engine, MLflow, Kubeflow, Polyaxon
    See also: https://github.com/EthicalML/awesome-machine-learning-operations
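Some of the semantic quality dimensions above (completeness, uniqueness, validity) can be expressed as simple ratios over a batch of records. The sketch below is a plain-Python illustration, not the Apache Griffin API; the record layout, the `power_w` field and the non-negativity rule are assumptions invented for the example.

```python
def completeness(records, field):
    """Share of records in which the field is present and not None."""
    return sum(r.get(field) is not None for r in records) / len(records)


def uniqueness(records, field):
    """Share of distinct values among the non-null values of the field."""
    values = [r[field] for r in records if r.get(field) is not None]
    return len(set(values)) / len(values)


def validity(records, field, predicate):
    """Share of non-null values that satisfy a domain rule."""
    values = [r[field] for r in records if r.get(field) is not None]
    return sum(predicate(v) for v in values) / len(values)


# Hypothetical batch of meter readings
records = [
    {"id": 1, "power_w": 120.0},
    {"id": 2, "power_w": None},   # missing measurement
    {"id": 2, "power_w": -5.0},   # duplicate id and out-of-range value
    {"id": 3, "power_w": 80.0},
]

report = {
    "completeness": completeness(records, "power_w"),
    "uniqueness": uniqueness(records, "id"),
    "validity": validity(records, "power_w", lambda v: v >= 0),
}
print(report)
```

Computing such a report for every incoming batch and alerting on threshold violations is one way to make the quality assessment continuous.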
  • 22. Data-Mill project
    ● Based on Kubernetes
      ○ open and scalable
      ○ seamless integration of bare-metal and cloud-provided clusters
    ● Enforcing DataOps principles
      ○ continuous asset monitoring (code, data, models)
      ○ open-source tools to reproduce and serve models
    ● Flavour-based organization of components
      ○ flavour = cluster_spec + SW_components
    ● Built-in exploration environments (dashboarding tools, Jupyter notebooks with DS libraries)
    https://data-mill-cloud.github.io/data-mill/