Towards Data Operations
Dr. Andrea Monacchi
Streaming data, models and code as first-class citizens
1. About Myself
2. Big Data Architectures
3. Stream-based computing
4. Integrating the Data Science workflow
Summary
About
● BS Computer Science (2010)
● MS Computer Science (2012)
● PhD Information Technology (2016)
● Consultancy experience @ Reply DE (2016-2018)
● Independent Consultant
Big Data
Architectures
Big Data: why are we here?
Loads:
● Transactional (OLTP)
○ all operations
○ ACID properties: atomicity, consistency, isolation,
and durability
● Analytical (OLAP)
○ append-heavy loads
○ aggregations and explorative queries (analytics)
○ hierarchical indexing (OLAP hyper-cubes)
Scalability:
● ACID properties costly
● CAP Theorem
○ a distributed data store cannot simultaneously
fulfill all 3 properties: consistency, availability,
partition tolerance (tolerance to communication
errors)
○ CA are classic RDBMS - vertical scaling only
○ CP (e.g. quorum-based) and AP are NoSQL DBs
○ e.g. Amazon DynamoDB (eventual consistency, AP)
● NoSQL databases
○ relax ACID properties to achieve horizontal scaling
Scalability
● Key/Value Storage (DHT)
○ decentralised, scalable, fault-tolerant
○ easy to partition and distribute data by key
■ e.g. p = hash(k) % num_partitions
○ replication (partition redundancy)
■ 2: error detection
■ 3: error recovery
● Parallel collections
○ e.g. Spark RDD based on underlying HDFS blocks
○ master-slave (or driver-worker) coordination
○ no shared variables (except accumulators, broadcast vars)
Parallel computation:
1. Split dataset into multiple partitions/shards
2. Independently process each partition
3. Combine partitions into result
● MapReduce (Functional Programming)
● Split-apply-combine
● Google’s MapReduce
● Cluster Manager
○ Yarn, Mesos, K8s
○ resource isolation and scheduling
○ security, fault-tolerance, monitoring
● Data Serialization formats
○ Text: CSV, JSON, XML,..
○ Binary: SeqFile, Avro, Parquet, ORC, ..
● Batch Processing
○ Hadoop MapReduce variants
■ Pig, Hive
○ Apache Spark
○ Python Dask
● Workflow management tools
○ Fault-tolerant task coordination
○ Oozie, Airflow, Argo (K8s)
Architectures for Data Analytics
● Stages
○ Ingestion (with retention)
○ (re)-processing
○ Presentation/Serving (indexed data, OLAP)
● Lambda Vs. Kappa architecture
○ batch for correctness, streaming for speed
○ mix of technologies (CAP theorem)
○ complexity & operational costs
Stream-based
computing
1st phase - Ingestion: MQTT
● Pub/Sub
○ fully decoupled clients
○ messages can be explicitly retained (flag)
● QoS
○ multilevel (0: at most once, 1: at least once, 2: exactly once)
● Persistent sessions
○ broker keeps further info of clients to speed up reconnections
○ queuing of messages in QoS 1 and 2 (disconnections)
○ queuing of all acks for QoS 2
● Last will and Testament messages (LWT)
○ stored by the broker and published on ungraceful disconnect; discarded once a proper client disconnect is sent to the broker
● fault-tolerant pub/sub system with message retention
● exactly-once semantics (since v0.11.0) using the transactional.id setting (and acks)
● topic as an append-only file log
○ ordering by arrival time (offset)
○ stores changes on the data source (deltas)
○ topic/log consists of multiple partitions
■ partitioning instructed by producer
■ guaranteed message ordering within partition
■ based on message key
● hash(k) % num_partitions
● if k == null, then round robin is used
■ distribution and replication by partition
● 1 elected active (leader) and n replicas
● Zookeeper
1st phase - Ingestion: Kafka
[Diagram: a producer P writing to Partition0-Partition5, distributed across Broker1, Broker2 and Broker3]
● topic maps to local directory
○ 1 log file per topic partition
○ log rolling: when a size or time limit is reached, the
partition file is rolled and a new one is created
■ log.roll.ms or log.roll.hours = 168
○ log retention:
■ save rolled logs as segment files
■ older segments deleted after log.retention.hours
or beyond log.retention.bytes
○ log compaction:
■ deletion of older records by key (latest value)
■ log.cleanup.policy=compact on topic
● number of topic partitions (howto)
○ more data -> more partitions
○ proportional to desired throughput
○ overhead for open file handles and TCP connections
■ n partitions, each with m replicas -> n*m replicas to maintain
■ rule of thumb: total partitions in a cluster < 100 *
num_brokers * replication_factor (shared across
all topics)
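The keep-latest-value-per-key behavior of log compaction can be simulated in a few lines (a sketch, not Kafka's actual log cleaner; here None models a tombstone record that deletes its key):

```python
def compact(log):
    """Simulate log compaction: keep only the latest value per key,
    preserving offset order of the surviving records.
    A value of None is a tombstone and removes the key entirely."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later offsets overwrite earlier ones
    survivors = sorted(
        (off, key, val) for key, (off, val) in latest.items() if val is not None
    )
    return [(key, val) for _, key, val in survivors]

log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", None)]
assert compact(log) == [("k1", "c"), ("k3", "d")]
```

The compacted log is exactly the changelog a downstream table would need to rebuild its latest state, which is why compaction pairs naturally with the stream/table duality discussed later.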
1st phase - Ingestion: Kafka
● APIs
○ consumer / producer (single thread)
○ Kafka connect
○ KSQL
● Producer (Stateless wrt Broker)
○ ProducerRecord: (topic : String, partition : int,
timestamp : long, key : K, value : V)
○ partition id and timestamp are optional (for manual override)
● Kafka Connect
○ API for connectors (Source Vs Sink)
○ automatic offset management (commit and restore)
○ at least once delivery
○ exactly once only for certain connectors
○ standalone Vs. distributed modes
○ connectors configurable via REST interface
1st phase - Ingestion: Kafka
● Consumer
○ stateful (tracks its own offset per topic partition)
○ earliest Vs. latest recovery
○ offset committing: manual or periodic (default 5 secs)
○ consumers can be organized in load-balanced groups (one consumer per partition)
○ ideally: number of consumer threads = number of topic partitions
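The one-consumer-per-partition balancing can be illustrated with a toy round-robin-style assignor (a simplification; real Kafka consumer groups negotiate assignments through the group coordinator using pluggable strategies):

```python
def assign_partitions(partitions, consumers):
    """Sketch of consumer-group balancing: each partition has exactly one
    owner within the group; with more consumers than partitions, the
    extras sit idle, which is why threads = partitions is the sweet spot."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign_partitions(partitions=[0, 1, 2, 3], consumers=["c1", "c2"])
assert a == {"c1": [0, 2], "c2": [1, 3]}

# with as many consumers as partitions, each one owns exactly one partition
b = assign_partitions([0, 1, 2], ["c1", "c2", "c3"])
assert all(len(owned) == 1 for owned in b.values())
```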
2nd phase - Stream processing
● streams
○ bounded - can be ingested and batch processed
○ unbounded - processed per event
● stream partitions
○ unit of parallelism
● stream and table duality
○ log changes <-> table
○ Declarative SQL APIs (e.g. KSQL, TableAPI)
● time
○ event time, ingestion time, processing time
○ late message handling (pre-buffering)
● windowing
○ bounded aggregations
○ time or data driven (e.g. count) windows
○ tumbling, sliding and session windows
● stateful operations/transformations
○ state: intermediate result, in-memory key-value store
used across windows or microbatches
○ e.g. RocksDB (Flink+KafkaStreams), LSM-tree
○ e.g. changeLogging back to Kafka topic (KafkaStreams)
● checkpointing
○ save app status for failure recovery (stream replay)
● frameworks
○ Spark Streaming (micro-batching), Kafka Streams and Flink
○ Apache Storm, Apache Samza
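Tumbling windows over event time can be sketched as bucketing each event by its window start (an illustrative in-memory analogue of what Flink or Kafka Streams do with their state stores; field names are made up):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Bounded aggregation over an unbounded stream: bucket each event by
    its event time into fixed, non-overlapping (tumbling) windows and
    count occurrences per (window, key)."""
    counts = defaultdict(int)
    for event_time_ms, key in events:
        window_start = (event_time_ms // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(100, "a"), (450, "a"), (999, "b"), (1000, "a"), (1700, "b")]
assert tumbling_window_counts(events, window_ms=1000) == {
    (0, "a"): 2, (0, "b"): 1, (1000, "a"): 1, (1000, "b"): 1,
}
```

Because the bucketing uses the event's own timestamp rather than arrival time, a late message still lands in its correct window, which is the behavior the pre-buffering / watermarking machinery of real engines exists to support.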
Code Examples
Flink
● processing webserver logs for frauds
● count downloaded assets per user
● https://github.com/pilillo/flink-quickstart
● Deploy to Kubernetes
SparkStreaming & Kafka
● https://github.com/pilillo/sparkstreaming-quickstart
Kafka Streams
● project skeleton
Integrating the
Data Science
workflow
Data Science workflow
Technical gaps potentially resulting from this process!
● Data Analytics projects
○ stream of research questions and feedback
● Data forked for exploration
○ Data versioning
○ Periodic data quality assessment
● Team misalignment
○ scientists working aside dev team
○ information gaps w/ team & stakeholders
○ unexpected behaviors upon changes
● Results may not be reproducible
● Releases are not frequent
○ Value misalignment (waste of resources)
● CICD only used for data preparation
Data Operations (DataOps)
● DevOps approaches
○ lean product development (continuous feedback and value delivery) using CICD approaches
○ cross-functional teams with mix of development & operations skillset
● DataOps
○ devops for streaming data and analytics as a manufacturing process - Manifesto, Cookbook
○ mix of data engineering and data science skill set
○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
Data Operations workflow
Data Science workflow
CICD activities
CICD for ML
● Continuous Data Quality Assessment
○ data versioning (e.g. Pachyderm)
○ syntax (e.g. Confluent Schema Registry)
○ semantic (e.g. Apache Griffin) - accuracy, completeness, timeliness, uniqueness, validity, consistency
● Model Tuning
○ hyperparameter tuning - black-box optimization
○ autoML - model selection
○ continuous performance evaluation (wrt newer input data)
○ stakeholder/user performance (e.g. A/B testing)
● Model Deployment
○ TensorFlow Serving, Seldon
● ML workflow management
○ Amazon SageMaker, Google ML Engine, MLflow, Kubeflow, Polyaxon
CICD for ML code
See also: https://github.com/EthicalML/awesome-machine-learning-operations
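A semantic quality check along the dimensions listed above can start as simply as computing per-dataset ratios (a toy sketch of what tools like Apache Griffin do at scale; the record layout and field names are illustrative):

```python
def quality_report(rows, required_fields, unique_field):
    """Toy semantic data-quality assessment: completeness (share of rows
    with all required fields populated) and uniqueness (share of distinct
    identifiers), each reported as a ratio in [0, 1] suitable for
    continuous monitoring in a CICD pipeline."""
    total = len(rows)
    complete = sum(
        1 for r in rows if all(r.get(f) is not None for f in required_fields)
    )
    ids = [r.get(unique_field) for r in rows]
    return {
        "completeness": complete / total,
        "uniqueness": len(set(ids)) / total,
    }

rows = [
    {"id": 1, "value": 10},
    {"id": 2, "value": None},   # incomplete record
    {"id": 2, "value": 30},     # duplicate id
    {"id": 3, "value": 40},
]
report = quality_report(rows, required_fields=["id", "value"], unique_field="id")
assert report == {"completeness": 0.75, "uniqueness": 0.75}
```

Run continuously against newly ingested data, a drop in such ratios is exactly the kind of signal a DataOps pipeline can gate releases on.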
Data-Mill project
● Based on Kubernetes
○ open and scalable
○ seamless integration of bare-metal and cloud-provided clusters
● Enforcing DataOps principles
○ continuous asset monitoring (code, data, models)
○ open-source tools to reproduce and serve models
● Flavour-based organization of components
○ flavour = cluster_spec + SW_components
● Built-in exploration environments (dashboarding tools, jupyter notebooks with DS libraries)
https://data-mill-cloud.github.io/data-mill/
Thank you!
