Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS on the order of a few minutes and chaining incremental processing in Hadoop.
This document summarizes Hoodie, an open-source incremental processing framework. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives such as upsert and incremental pull to apply mutations and consume only changed data (see the sketch after this list).
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
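To make these primitives concrete, here is a minimal PySpark sketch of an upsert followed by an incremental pull against a Hudi (Hoodie) table. The table name, paths, key/ordering fields, and the begin commit time are illustrative assumptions rather than details from the deck, and the Hudi Spark bundle is assumed to be on the classpath.

```python
# Minimal sketch of Hudi's upsert + incremental pull primitives from PySpark.
# All names below (paths, fields, commit time) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hoodie-primitives").getOrCreate()

base_path = "hdfs:///warehouse/trips"                               # hypothetical table location
updates_df = spark.read.json("hdfs:///incoming/trip_updates.json")  # hypothetical changed records

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",    # assumed record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # assumed ordering field
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert: existing keys are updated in place, new keys are inserted.
updates_df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Incremental pull: read only the records committed after a given instant,
# instead of re-scanning the whole table.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path)
)
incremental_df.show()
```

Downstream jobs can then chain on the incremental DataFrame, which is what the deck means by chaining incremental processing in Hadoop.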
I promise that understanding NoSQL is as easy as playing with LEGO bricks! Google Bigtable, presented in 2006, is the inspiration for Apache HBase: let's take a deep dive into Bigtable to better understand HBase.
YugaByte DB on Kubernetes - An Introduction (Yugabyte)
This document summarizes YugaByte DB, a distributed SQL and NoSQL database. It discusses how YugaByte DB provides ACID transactions, strong consistency, and high performance at a planet scale. It also describes how to deploy YugaByte DB and an example e-commerce application called Yugastore on Kubernetes. The document outlines the database architecture and components, and provides steps to deploy the system and run a sample workload.
Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Versio... (Maciek Jozwiak)
Spanner is a globally distributed database that provides synchronous replication across data centers. It uses atomic clocks and GPS to synchronize time at a global scale and assign commit timestamps to distributed transactions. The database has a hierarchical data model and supports SQL-like queries while automatically sharding, replicating, and rebalancing data across servers.
Data Security at Scale through Spark and Parquet Encryption (Databricks)
This presentation discusses Parquet encryption at scale with Spark. It covers the goals of Parquet modular encryption, including data privacy, integrity, and performance. It demonstrates writing and reading encrypted Parquet files in Spark and discusses the Apache community roadmap for further integration of Parquet encryption.
Spanner is Google's globally distributed database that provides ACID transactions across data centers. It uses TrueTime to assign timestamps for distributed transactions, ensuring consistency. A Spanner deployment consists of servers organized into zones across datacenters, with a universe master and placement driver coordinating. Transactions are executed across servers and committed using time-based consensus. Evaluation shows Spanner provides low-latency reads and commits at large scale for Google applications.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use cases of the second generation of ‘Hudi’, a self-contained Apache Spark library to build large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build notebooks/dashboards on top using Spark SQL.
An Overview of Spanner: Google's Globally Distributed Database (Benjamin Bengfort)
Spanner is a globally distributed database that provides external consistency between data centers and stores data in a schema based semi-relational data structure. Not only that, Spanner provides a versioned view of the data that allows for instantaneous snapshot isolation across any segment of the data. This versioned isolation allows Spanner to provide globally consistent reads of the database at a particular time allowing for lock-free read-only transactions (and therefore no communications overhead for consensus during these types of reads). Spanner also provides externally consistent reads and writes with a timestamp-based linear execution of transactions and two phase commits. Spanner is the first distributed database that provides global sharding and replication with strong consistency semantics.
This document discusses AlwaysOn availability technologies in SQL Server 2017, including Failover Cluster Instances (FCI) and Availability Groups (AG). It provides an overview of how FCI and AG work to provide high availability and disaster recovery, including capabilities like multi-subnet support. The document also summarizes new features and enhancements to FCI and AG in SQL Server 2014, 2016 and 2017.
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto's state-of-the-art low-latency evaluation with Spark's robust and fault-tolerant execution engine.
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries... (HostedbyConfluent)
In a real-time data ingestion pipeline for analytical processing, efficient and fast data loading into a columnar database such as ClickHouse favors large blocks over individual rows. Therefore, applications often rely on a buffering mechanism such as Kafka to store data temporarily, and on a message processing engine to aggregate Kafka messages into large blocks which then get loaded into the backend database. Due to various failures in this pipeline, a naive block aggregator that forms blocks without additional measures would cause data duplication or data loss. We have developed a solution to avoid these issues, thereby achieving exactly-once delivery from Kafka to ClickHouse. Our solution utilizes Kafka's metadata to keep track of blocks that we intend to send to ClickHouse, and later uses this metadata to deterministically reproduce ClickHouse blocks for retries in case of failures. The identical blocks are guaranteed to be deduplicated by ClickHouse. We have also developed a run-time verification tool that monitors Kafka's internal metadata topic and raises alerts when the required invariants for exactly-once delivery are violated. Our solution has been developed and deployed to the production clusters that span multiple datacenters at eBay.
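As a rough illustration of the idea (not eBay's actual implementation), the sketch below uses the kafka-python client to form blocks whose identity is derived from the consumed offset range, so a retry that re-reads the same offsets reproduces an identical block that the target store can deduplicate. load_block is a hypothetical stand-in for the ClickHouse insert, and the topic, broker, and block size are illustrative.

```python
# Deterministic block formation from Kafka offsets (sketch, kafka-python client).
# The block id is derived from (topic, partition, first offset, last offset), so
# re-reading the same offset range after a failure yields an identical block.
import json
from kafka import KafkaConsumer

def load_block(block_id, rows):
    # Hypothetical stand-in: insert `rows` into ClickHouse, using block_id for dedup.
    print(f"loading block {block_id} with {len(rows)} rows")

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="clickhouse-loader",
    enable_auto_commit=False,          # offsets advance only after a successful load
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

BLOCK_SIZE = 10_000

while True:
    batch = consumer.poll(timeout_ms=1000, max_records=BLOCK_SIZE)
    for tp, records in batch.items():
        if not records:
            continue
        # Same offset range -> same block id and same contents on retry.
        block_id = f"{tp.topic}-{tp.partition}-{records[0].offset}-{records[-1].offset}"
        load_block(block_id, [r.value for r in records])
        consumer.commit()              # commit only after the block is durable downstream
```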
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla... (Markus Michalewicz)
This is the latest version of the Oracle RAC 12c (12.1.0.2) Operational Best Practices presentation as shown during IOUG / Collaborate15. As best practices are a result of true collaboration this will probably be the last version before OOW 2015.
YugaByte DB Internals - Storage Engine and Transactions (Yugabyte)
This document introduces YugaByte DB, a high-performance, distributed, transactional database. It is built to scale horizontally on commodity servers across data centers for mission-critical applications. YugaByte DB uses a transactional document store based on RocksDB, Raft-based replication for resilience, and automatic sharding and rebalancing. It supports ACID transactions across documents, provides APIs compatible with Cassandra and Redis, and is open source. The architecture is designed for high performance, strong consistency, and cloud-native deployment.
Challenges in Building a Data Pipeline (Manish Kumar)
The document discusses challenges in building a data pipeline including making it highly scalable, available with low latency and zero data loss while supporting multiple data sources. It covers expectations around real-time vs batch processing and streaming vs batch data. Implementation approaches like ETL vs ELT are examined along with replication modes, challenges around schema changes and NoSQL. Effective implementations should address transformations, security, replays, monitoring and more. Reference architectures like Lambda and Kappa are briefly outlined.
Efficient Data Storage for Analytics with Apache Parquet 2.0 (Cloudera, Inc.)
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
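A short PySpark sketch (paths are illustrative) of why a columnar format like Parquet helps the scans described above: only the referenced columns are read, and row-group statistics let the reader skip data that cannot match the filter.

```python
# Write row-oriented input as Parquet, then read it back with column pruning
# and a pushed-down filter. Paths and field names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

events = spark.read.json("hdfs:///raw/events")        # hypothetical row-oriented input
events.write.mode("overwrite").parquet("hdfs:///warehouse/events_parquet")

# Only `user_id` and `ts` are scanned, and row groups whose ts statistics
# fall entirely outside the filter range can be skipped.
daily = (spark.read.parquet("hdfs:///warehouse/events_parquet")
         .where(F.col("ts") >= "2024-01-01")
         .groupBy("user_id")
         .count())
daily.show()
```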
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
Dustin Vannoy presented on using Delta Lake with Azure Databricks. He began with an introduction to Spark and Databricks, demonstrating how to set up a workspace. He then discussed limitations of Spark including lack of ACID compliance and small file problems. Delta Lake addresses these issues with transaction logs for ACID transactions, schema enforcement, automatic file compaction, and performance optimizations like time travel. The presentation included demos of Delta Lake capabilities like schema validation, merging, and querying past versions of data.
Exadata architecture and internals presentation (Sanjoy Dasgupta)
The document provides an overview of Oracle's Exadata database machine. It describes the Exadata X7-2 and X7-8 models, which feature the latest Intel Xeon processors, high-capacity flash storage, and an improved InfiniBand internal network. The document highlights how Exadata's unique smart database software optimizes performance for analytics, online transaction processing, and database consolidation workloads through techniques like smart scan query offloading to storage servers.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
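A minimal kafka-python sketch of the produce/consume flow described above; the broker address and topic name are illustrative.

```python
# Producers append JSON records to a topic; consumers read them back in
# partition order. Broker and topic are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "alice", "page": "/home"})
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```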
Introduction to Cassandra: Replication and Consistency (Benjamin Black)
A short introduction to replication and consistency in the Cassandra distributed database. Delivered April 28th, 2010 at the Seattle Scalability Meetup.
Oracle Data Guard ensures high availability, disaster recovery and data protection for enterprise data. This enables production Oracle databases to survive disasters and data corruption. Oracle 18c and 19c offer many new features that will bring many advantages to organizations.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
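To illustrate the generality requirement, the PySpark sketch below reads three very different systems through the same Data Source interface; the paths, JDBC URL, and topic are illustrative, and the JDBC and Kafka connectors must be available on the classpath.

```python
# One read interface fronting different storage systems. All connection
# details below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-api").getOrCreate()

parquet_df = spark.read.format("parquet").load("hdfs:///warehouse/events_parquet")

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://db:5432/shop")
           .option("dbtable", "orders")
           .option("user", "reader")
           .option("password", "secret")
           .load())

kafka_df = (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "page-views")
            .load())

# Each source is exposed as a DataFrame, so downstream code is identical.
parquet_df.printSchema()
```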
NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
Apache Iceberg Presentation for the St. Louis Big Data IDEA (Adam Doyle)
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with Hive and Spark.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
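As a small illustration of the word-count example mentioned above, here is a Python sketch whose map and reduce functions mirror what a Hadoop Streaming mapper/reducer would do; the tiny driver at the end simulates the shuffle locally so the logic can be run as-is.

```python
# Word count in MapReduce style: map emits (word, 1), the shuffle groups by word,
# reduce sums the counts. The sample input lines are illustrative.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum all counts grouped under the same word.
    return word, sum(counts)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    grouped = defaultdict(list)
    for line in lines:                       # map
        for word, one in map_phase(line):
            grouped[word].append(one)        # shuffle: group by key
    for word in sorted(grouped):             # reduce
        print(reduce_phase(word, grouped[word]))
```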
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspect and then correlating it to the skill sets of current Hadoop adopters.
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... (Yael Garten)
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next-generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity-360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
This document summarizes a study that compares the performance of K-Means clustering implemented in Apache Spark MLlib and MPI (Message Passing Interface). The authors applied K-Means clustering to NBA play-by-play game data to cluster teams based on their position distributions. They found that MPI ran faster for smaller cluster sizes and fewer iterations, while Spark provided more stable runtimes as parameters increased. The authors tested different numbers of machines with MPI and found that runtime increased linearly as machines were added, the opposite of their expectation that distributing the work across more machines would yield faster runtimes.
06 how to write a map reduce version of k-means clustering (Subhas Kumar Ghosh)
The document discusses how to write a MapReduce version of K-means clustering. It involves duplicating the cluster centers across nodes so each data point can be processed independently in the map phase. The map phase outputs (ClusterID, Point) pairs assigning each point to its closest cluster. The reduce phase groups by ClusterID and calculates the new centroid for each cluster, outputting (ClusterID, Centroid) pairs. Each iteration is run as a MapReduce job with the library determining if convergence is reached between iterations.
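A compact Python sketch of the iteration described above: the map step emits (ClusterID, Point) pairs for the nearest centre, and the reduce step averages each group to produce the new centroid. The data points and distance function are illustrative.

```python
# One K-Means iteration in MapReduce style: map assigns points to the nearest
# centroid, reduce averages each cluster's points into a new centroid.
import math
from collections import defaultdict

def nearest(point, centroids):
    # Index of the centroid closest to `point` (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    for p in points:
        yield nearest(p, centroids), p          # (ClusterID, Point)

def reduce_phase(cluster_points):
    # New centroid = component-wise mean of the assigned points.
    n = len(cluster_points)
    return tuple(sum(c) / n for c in zip(*cluster_points))

def kmeans_iteration(points, centroids):
    grouped = defaultdict(list)
    for cid, p in map_phase(points, centroids): # map + shuffle (group by ClusterID)
        grouped[cid].append(p)
    return [reduce_phase(grouped[cid]) if cid in grouped else centroids[cid]
            for cid in range(len(centroids))]   # reduce

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(kmeans_iteration(points, [(0, 0), (10, 10)]))
```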
Optimization for iterative queries on MapReduce (makoto onizuka)
This document discusses optimization techniques for iterative queries with convergence properties. It presents OptIQ, a framework that uses view materialization and incrementalization to remove redundant computations from iterative queries. View materialization reuses operations on unmodified attributes by decomposing tables into invariant and variant views. Incrementalization reuses operations on unmodified tuples by processing delta tables between iterations. The document evaluates OptIQ on Hive and Spark, showing it can improve performance of iterative algorithms like PageRank and k-means clustering by up to 5 times.
Seeds Affinity Propagation Based on Text Clustering (IJRES Journal)
The objective is to find, among all partitions of the data set, the best partitioning according to some quality measure. Affinity propagation is a low-error, high-speed, flexible, and remarkably simple clustering algorithm that may be used in forming teams of participants for business simulations and experiential exercises, and in organizing participants' preferences for the parameters of simulations. This paper proposes an efficient Affinity Propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of our approach is (1) to prune unnecessary message exchanges in the iterations and (2) to compute the convergence values of pruned messages after the iterations to determine clusters.
Spark Bi-Clustering - OW2 Big Data Initiative, altic (ALTIC Altic)
This document discusses the OW2 Big Data Initiative and ALTIC's tools and approach for big data, including ETL, data warehousing, reporting, analytics, and BI platforms. It also describes Biclustering, an algorithm for big data clustering using Spark and SOM, and how it can integrate with SpagoBI and Talend for big data analysis.
This document summarizes a lecture on clustering and provides a sample MapReduce implementation of K-Means clustering. It introduces clustering, discusses different clustering algorithms like hierarchical and partitional clustering, and focuses on K-Means clustering. It also describes Canopy clustering, which can be used as a preliminary step to partition large datasets and parallelize computation for K-Means clustering. The document then outlines the steps to implement K-Means clustering on large datasets using MapReduce, including selecting canopy centers, assigning points to canopies, and performing the iterative K-Means algorithm in parallel.
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL (MLconf)
The document discusses clustering algorithms like K-means and how they can be implemented using Apache Spark. It describes how Spark allows these algorithms to be highly parallelized and run on large datasets. Specifically, it covers how K-means clustering works, its limitations in choosing initial cluster centers, and how K-means++ and K-means|| algorithms aim to address this by sampling points from the dataset to select better initial centers in a parallel manner that is scalable for big data.
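For reference, here is a short PySpark MLlib sketch (with synthetic data) of K-Means using the parallel "k-means||" initialization discussed above instead of purely random centres.

```python
# K-Means in Spark MLlib with k-means|| initialization. The four 2-D points
# below are synthetic, illustrative data.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

kmeans = KMeans(k=2, initMode="k-means||", seed=42)
model = kmeans.fit(data)
print(model.clusterCenters())
```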
This document provides an overview and summary of Apache Hivemall, which is a scalable machine learning library built as a collection of Hive UDFs (user-defined functions). Some key points:
- Hivemall allows users to perform machine learning tasks like classification, regression, recommendation and anomaly detection using SQL queries in Hive, SparkSQL or Pig Latin.
- It provides a number of popular machine learning algorithms like logistic regression, decision trees, factorization machines.
- Hivemall is multi-platform, so models built in one system can be used in another. This allows ML tasks to be parallelized across clusters.
- It has been adopted by several companies for applications like click-through prediction, user
This document summarizes key Hadoop configuration parameters that affect MapReduce job performance and provides suggestions for optimizing these parameters under different conditions. It describes the MapReduce workflow and phases, defines important parameters like dfs.block.size, mapred.compress.map.output, and mapred.tasktracker.map/reduce.tasks.maximum. It explains how to configure these parameters based on factors like cluster size, data and task complexity, and available resources. The document also discusses other performance aspects like temporary space, JVM tuning, and reducing reducer initialization overhead.
Data Infused Product Design and Insights at LinkedIn (Yael Garten)
Presentation from a talk given at Boston Big Data Innovation Summit, September 2012.
Summary: The Data Science team at LinkedIn focuses on 3 main goals: (1) providing data-driven business and product insights, (2) creating data products, and (3) extracting interesting insights from our data such as analysis of the economic status of the country or identifying hot companies in a certain geographic region. In this talk I describe how we ensure that our products are data driven -- really data infused at the core -- and share interesting insights we uncover using LinkedIn's rich data. We discuss what makes a good data scientist, and what techniques and technologies LinkedIn data scientists use to convert our rich data into actionable product and business insights, to create data-driven products that truly serve our members.
A Perspective from the intersection Data Science, Mobility, and Mobile Devices (Yael Garten)
Invited talk at Stanford CSEE392I (Seminar on Trends in Computing and Communications) April 24, 2014.
Covered three topics: (1) Data science at LinkedIn. (2) Mobile data science — how is it different, challenges and opportunities. Examples of how data science impacts business and product decisions. (3) Mobile today, and LinkedIn's mobile story.
K-Means clustering is an algorithm that partitions data points into k clusters based on their distances from initial cluster center points. It is commonly used for classification applications on large datasets and can be parallelized by duplicating cluster centers and processing each data point independently. Mahout provides implementations of K-Means clustering and other algorithms that can operate on distributed datasets stored in Hadoop SequenceFiles.
SF Big Analytics meetup: Hoodie From Uber (Chester Chen)
Even after a decade, the name "Hadoop" remains synonymous with "big data", even as new options for processing/querying (stream processing, in-memory analytics, interactive SQL) and storage services (S3/Google Cloud/Azure) have emerged & unlocked new possibilities. However, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplication of data and strain on usability. In this talk, we argue that by adding some missing blocks to the existing Hadoop stack, we are able to provide similar capabilities right on top of Hadoop, at reduced cost and increased efficiency, greatly simplifying the overall architecture in the process. We will discuss the need for incremental processing primitives on Hadoop, motivating them with some real-world problems from Uber. We will then introduce "Hoodie", an open source Spark library built at Uber, to enable faster data for petabyte-scale data analytics and solve these problems. We will deep dive into the design & implementation of the system and discuss the core concepts around timeline consistency and the tradeoffs between ingest speed & query performance. We contrast Hoodie with similar systems in the space, discuss how it is deployed across the Hadoop ecosystem at Uber, and finally share the technical direction ahead for the project.
Speaker: VINOTH CHANDAR, Staff Software Engineer at Uber
Vinoth is the founding engineer/architect of the data team at Uber, as well as the author of many data processing & querying systems at Uber, including "Hoodie". He has a keen interest in unified architectures for data analytics and processing.
Previously, Vinoth was the lead on LinkedIn's Voldemort key-value store and has also worked on the Oracle Database replication engine, HPC, and stream processing.
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... (HostedbyConfluent)
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, in what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, server-less "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management, such as cleaning older file versions, compaction of delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
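As a hedged sketch of what configuring a few of those table services can look like when writing from Spark, the options below follow the names in the Hudi documentation, but the values, fields, and table are illustrative and exact keys can vary between Hudi versions.

```python
# Illustrative Hudi write options enabling cleaning, inline compaction, and
# inline clustering on a merge-on-read table. Values are examples, not tuning advice.
table_service_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Cleaning: keep a bounded number of older file versions.
    "hoodie.cleaner.commits.retained": "10",
    # Compaction: fold delta logs into base files inline after writes.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: periodically re-sort/re-size files for faster queries.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}

# Used the same way as any Hudi write, e.g.:
# updates_df.write.format("hudi").options(**table_service_options) \
#           .mode("append").save("hdfs:///warehouse/trips")
```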
Gruter TECHDAY 2014 Realtime Processing in Telco (Gruter)
Big Telco, Bigger real-time demands: Real-time processing in Telco
- Presented by Jung-ryong Lee, engineer manager at SK Telecom at Gruter TECHDAY 2014 Oct.29 Seoul, Korea
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data (Hakka Labs)
By Doug Daniels (Director of Engineering, Data Dog)
At Datadog, we collect hundreds of billions of metric data points per day from hosts, services, and customers all over the world. In addition to charting and monitoring this data in real time, we also run many large-scale offline jobs to apply algorithms and compute aggregations on the data. In the past months, we've migrated our largest data sets over to Apache Parquet, an efficient, portable columnar storage format.
Optimizing Big Data to run in the Public Cloud (Qubole)
Qubole is a cloud-based platform that allows customers to easily run Hadoop and Spark clusters on AWS for big data analytics. It optimizes performance and reduces costs through techniques like caching data in S3 for faster access, using spot instances, and directly writing query outputs to S3. The document discusses Qubole's features, capabilities, and how it provides an easier way for more users like data scientists and analysts to access and query big data compared to building and managing Hadoop clusters themselves.
IEEE International Conference on Data Engineering 2015 (Yousun Jeong)
SK Telecom developed a Hadoop data warehouse (DW) solution to address the high costs and limitations of traditional DW systems for handling big data. The Hadoop DW provides a scalable architecture using Hadoop, Tajo and Spark to cost-effectively store and analyze over 30PB of data across 1000+ nodes. It offers SQL analytics through Tajo for faster querying and easier migration from RDBMS systems. The Hadoop DW has helped SK Telecom and other customers such as semiconductor manufacturers to more affordably store and process massive volumes of both structured and unstructured data for advanced analytics.
Geek Sync | Guide to Understanding and Monitoring Tempdb (IDERA Software)
You can watch the replay for this Geek Sync webcast in the IDERA Resource Center: http://ow.ly/7OmW50A5qNs
Every SQL Server system you work with has a tempdb database. In this Geek Sync, you’ll learn how tempdb is structured, what it’s used for and the common performance problems that are tied to this shared resource.
Fluentd is an open source log collector that allows flexible collection and routing of log data. It uses JSON format for log messages and supports many input and output plugins. Fluentd can collect logs from files, network services, and applications before routing them to storage and analysis services like MongoDB, HDFS, and Treasure Data. The open source project has grown a large community contributing over 100 plugins to make log collection and processing easier.
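A minimal sketch of an application sending structured events to a local Fluentd agent with the fluent-logger Python package; the tag, host/port, and event fields are illustrative, and Fluentd's configured output plugins decide where the records are routed (MongoDB, HDFS, and so on).

```python
# Emit JSON-structured events to a local Fluentd agent (pip install fluent-logger).
# Tag, host/port, and the event payload are illustrative assumptions.
from fluent import sender

logger = sender.FluentSender("app", host="localhost", port=24224)

# Emits a record tagged app.follow with the current timestamp.
if not logger.emit("follow", {"from": "userA", "to": "userB"}):
    print(logger.last_error)
    logger.clear_last_error()

logger.close()
```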
Deploying any software can be a challenge if you don't understand how resources are used or how to plan for the capacity of your systems. Whether you need to deploy or grow a single MongoDB instance, a replica set, or tens of sharded clusters, you probably share the same challenges in trying to size that deployment.
This webinar will cover what resources MongoDB uses, and how to plan for their use in your deployment. Topics covered will include understanding how to model and plan capacity needs for new and growing deployments. The goal of this webinar will be to provide you with the tools needed to be successful in managing your MongoDB capacity planning tasks.
Building large scale transactional data lake using apache hudi (Bill Liu)
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then deep dive into improving data operations by providing features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Kafka is becoming an ever more popular choice for users to help enable fast data and streaming. Kafka provides a wide landscape of configuration to allow you to tweak its performance profile. Understanding the internals of Kafka is critical for picking your ideal configuration. Depending on your use case and data needs, different settings will perform very differently. Let's walk through the performance essentials of Kafka. Let's talk about how your consumer configuration can speed up or slow down the flow of messages to brokers. Let's talk about message keys, their implications, and their impact on partition performance. Let's talk about how to figure out how many partitions and how many brokers you should have. Let's discuss consumers and what affects their performance. How do you combine all of these choices and develop the best strategy moving forward? How do you test the performance of Kafka? I will attempt a live demo with the help of Zeppelin to show in real time how to tune for performance.
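A hedged kafka-python sketch of a few of those knobs; the broker, topic, and values are illustrative starting points rather than recommendations. Keyed sends pin related messages to one partition, the batching settings trade latency for throughput on the producer, and the fetch settings control how eagerly the consumer pulls data.

```python
# Producer batching/durability settings plus consumer fetch settings.
# All connection details and numbers below are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                     # durability over latency
    linger_ms=20,                   # wait briefly so batches fill up
    batch_size=64 * 1024,           # bytes per partition batch
    compression_type="gzip",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Same key -> same partition, so per-key ordering is preserved.
producer.send("orders", key="user-42", value={"item": "book", "qty": 1})
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
    fetch_min_bytes=1024 * 1024,    # let the broker accumulate ~1 MB per fetch
    fetch_max_wait_ms=500,          # ...but wait at most 500 ms
    max_poll_records=1000,          # records handed back per poll() call
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for records in consumer.poll(timeout_ms=1000).values():
    for r in records:
        print(r.key, r.value)
```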
The document discusses LinkedIn's data ecosystem and the challenge of bridging operational transactional data (OLTP) with analytical processing (OLAP) at scale. It describes LinkedIn's solution called Lumos, which is a scalable ETL framework that uses change data capture, delta processing, and virtual snapshots to frequently refresh petabyte-scale data from OLTP databases into Hadoop for OLAP. Lumos supports requirements like handling multiple data centers, schema evolution, and efficient change capture while ensuring data consistency and low latency refresh times.
Alluxio Day VI
October 12, 2021
https://www.alluxio.io/alluxio-day/
Speaker:
Vinoth Chandar, Apache Software Foundation
Raymond Xu, Zendesk
Make your SharePoint fly by tuning and optimizing SQL Server (serge luca)
This document summarizes a presentation on optimizing SQL Server for SharePoint. It discusses basic SharePoint database concepts, planning for long-term performance by optimizing resources like CPU, RAM, disks and network latency. It also covers optimal SQL Server configuration including installation, database settings like recovery models and file placement. Maintaining databases through tools like DBCC CheckDB and measuring performance using counters and diagnostic queries is also presented. The presentation emphasizes the importance of collaboration between SharePoint and database administrators to ensure compliance and optimize performance.
SQL Server is really the brain of SharePoint, yet the default settings of SQL Server are not optimised for SharePoint. In this session, Serge Luca (SharePoint MVP) and Isabelle Van Campenhoudt (SQL Server MVP) will give you an overview of what every SQL Server DBA needs to know about configuring, monitoring and setting up SQL Server for SharePoint 2013. After a quick description of the SharePoint architecture (sites, site collections, ...), we will describe the different types of SharePoint databases and their specific configuration settings, some do's and don'ts specific to SharePoint, and the disaster recovery options for SharePoint, including (but not only) SQL Server AlwaysOn Availability Groups for high availability and disaster recovery, in order to achieve an optimal level of business continuity.
Benefits of Attending this Session:
Tips & tricks
Lessons learned from the field
Super return on Investment
Based on the popular blog series, join me in taking a deep dive and a behind the scenes look at how SQL Server 2016 “It Just Runs Faster”, focused on scalability and performance enhancements. This talk will discuss the improvements, not only for awareness, but expose design and internal change details. The beauty behind ‘It Just Runs Faster’ is your ability to just upgrade, in place, and take advantage without lengthy and costly application or infrastructure changes. If you are looking at why SQL Server 2016 makes sense for your business you won’t want to miss this session.
This document discusses considerations for large-scale SharePoint deployments on SQL Server. It provides examples of real-world deployments handling over 10TB of content. It covers understanding SharePoint databases, SQL performance tuning, and architectural design best practices. These include separating databases onto unique volumes, optimizing TempDB, maintaining around 100GB per content database, and using RAID 10 for performance. Statistical results are presented from deployments handling over 70 million documents loaded in under 12 days with expected performance.
SharePoint and Large Scale SQL Deployments - NZSPC (guest7c2e070)
This document discusses considerations for large-scale SharePoint deployments on SQL Server. It provides examples of real-world deployments handling over 10TB of content. It discusses database types, performance issues like indexing and backups, and architectural design best practices like separating databases onto unique volumes. It also provides statistics on deployments handling over 70 million documents and 40TB of content across multiple farms and databases.
Fluentd is an open source data collector that allows flexible data collection, processing, and output. It supports streaming data from sources like logs and metrics to destinations like databases, search engines, and object stores. Fluentd's plugin-based architecture allows it to support a wide variety of use cases. Recent versions of Fluentd have added features like improved plugin APIs, nanosecond time resolution, and Windows support to make it more suitable for containerized environments and low-latency applications.
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPathCommunity
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
Perfect for developers, testers, and automation enthusiasts!
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understand the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Mobile App Development Company in Saudi ArabiaSteve Jonas
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
HCL Nomad Web – Best Practices and Administration of Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is celebrated as the next generation of the HCL Notes client and offers numerous advantages, such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed "automatically" in the background, which significantly reduces administrative effort compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser's cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Utilizing Client Clocking
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Impelsys Inc.
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxJustin Reock
Building 10x Organizations with Modern Productivity Metrics
10x developers may be a myth, but 10x organizations are very real, as proven by the influential study performed in the 1980s, ‘The Coding War Games.’
Right now, here in early 2025, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience. It seems that with every new method we invent for the delivery of products, whether physical or virtual, we reinvent productivity philosophies to go alongside them.
But which of these approaches actually work? DORA? SPACE? DevEx? What should we invest in and create urgency behind today, so that we don’t find ourselves having the same discussion again in a decade?
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB
Want to learn practical tips for designing systems that can scale efficiently without compromising speed?
Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development.
As you explore key principles of designing low-latency systems with Rust, you will learn how to:
- Create and compile a real-world app with Rust
- Connect the application to ScyllaDB (NoSQL data store)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
How Can I use the AI Hype in my Business Context?Daniel Lehner
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a Linkedin webinar for Tecnovy on 28.04.2025.
Role of Data Annotation Services in AI-Powered ManufacturingAndrew Leo
From predictive maintenance to robotic automation, AI is driving the future of manufacturing. But without high-quality annotated data, even the smartest models fall short.
Discover how data annotation services are powering accuracy, safety, and efficiency in AI-driven manufacturing systems.
Precision in data labeling = Precision on the production floor.
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...Alan Dix
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and education, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
6. We All Like A Nimble Elephant
Question: Can we get fresh data, directly on a petabyte-scale Hadoop Data Lake?
7. Previously on .. Strata (2016)
Hadoop @ Uber
“Uber, your Hadoop has arrived: Powering Intelligence for Uber’s Real-time marketplace”
8. Partitioned by day trip started
[Diagram: the trips dataset uses day-level partitions keyed by the day the trip started (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16). Every 5 min, new/updated trips arrive; late-arriving updates touch a few older partitions (updated data) via incremental update, while most partitions remain unaffected and only the latest partitions receive new data.]
Motivation
10. Exponential Growth is fun ..
Hadoop @ Uber
Also extremely hard to keep up with …
Common Pitfalls
- Long waits in the queue
- Disks running out of space
- Massive re-computations
- Batch jobs that grow too big and fail
11. Let’s go back 30 years
How did RDBMS-es solve this?
• Update existing row with new value (Transactions)
• Consume a log of changes downstream (Redo log)
• Update again downstream
[Diagram: updates are applied to MySQL (Server A); MySQL (Server B) pulls the redo log and applies the transformation downstream.]
Important Differences
• Columnar file formats
• Read-heavy workloads
• Petabytes & 1000s of servers
Concepts
12. Challenging Status Quo
[Diagram: trip rows are replicated from the database as a changelog via Kafka into HBase (new/updated trip rows). Today, the compacted trips table is rebuilt by batch snapshot recompute in 6-10 hr (500-1000), and Presto-queried derived tables arrive 12-18+ hr later, or in ~8 hr as approximations. With Hoodie.upsert() (upsert + logging), ingestion takes 1 hr (100) today and is targeted at 10 min (50) by Q2 '17; derived tables built via Hoodie.incrPull() (~2 mins to pull) complete in 1-3 hr with 10x less resources, and are accurate.]
Motivation
13. Incremental Processing
Advantages: Increased Efficiency / Leverage Hadoop SQL Engines / Simplify Architecture
Upsert (Primitive #1)
• Modify processed results
• Analogous to updating kv stores in stream processing
Incremental Pull (Primitive #2)
• Log stream of changes, avoid costly scans
• Enable chaining of processing steps into a DAG
For more, see “Case For Incremental Processing on Hadoop” (link)
Hoodie Concepts
14. Introducing Hoodie
Open Source
- https://github.com/uber/hoodie
- eng.uber.com/hoodie
Spark Library For Upserts & Incrementals
- Scales horizontally like any job
- Stores dataset directly on HDFS
Storage Abstraction to
- Apply mutations to dataset
- Pull changelog incrementally
Hoodie Concepts
[Diagram: an incoming changelog is upserted into a large HDFS dataset via Spark; the dataset is queryable as a Hive table (normal queries), and the changelog can be pulled incrementally from Hive/Spark/Presto.]
16. Hoodie: Storage Types & Views
Hoodie Concepts
Views : How is Data read?
Read Optimized View
- Parquet Query Performance
- ~30 mins latency for ~500GB
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incremental Pull
Storage Type : How is Data stored?
Copy On Write
- Purely columnar
- Simply creates new versions of files
Merge On Read
- Near-real time
- Shifts some write cost to reads
- Merges on-the-fly
17. Hoodie: Storage Types & Views
Hoodie Concepts
Storage Type: Supported Views
- Copy On Write: Read Optimized, Log View
- Merge On Read: Read Optimized, Real Time, Log View
18. Storage: Basic Idea
[Diagram: an input changelog is upserted into a Hoodie dataset laid out as day partitions (2017/02/15 - 2017/02/17). An index tags incoming records; with Copy On Write, affected parquet files are rewritten as new versions (File1.parquet → File1_v1.parquet → File1_v2.parquet), e.g. a 200 GB / 30 min batch; with Merge On Read, a smaller 10 GB / 5 min batch is instead appended to a per-file log (File1.avro.log) and compacted later.]
Example dataset and batch characteristics:
● 1825 Partitions (365 days * 5 yrs)
● 100 GB Partition Size
● 128 MB File Size
● ~800 Files Per Partition
● Skew spread - 0.5% (single batch); New Files - 0.005% (single batch)
● 20 seconds to re-write 1 File (shuffle)
● Copy On Write: 7300 Files rewritten, 100 executors, ~24 minutes to write
● Merge On Read: ~8 new Files, 10 executors, ~2 minutes to write
Deep Dive
19. Index and Storage
Index
- Tag ingested record as update or insert
- Index is immutable (record key to File mapping never changes)
- Pluggable
- Bloom Filter (see the tagging sketch after this slide)
- HBase
Storage
- HDFS Block aligned files
- ROFormat - Default is Apache Parquet
- WOFormat - Default is Apache Avro
Deep Dive
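To make the update/insert tagging above concrete (see the Bloom Filter bullet), here is a minimal, illustrative sketch of bloom-filter based tagging in Java. It does not use Hoodie's actual index classes; the BloomIndexSketch name, the per-file filter map, and the use of Guava's BloomFilter are assumptions made purely for illustration.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: tag each incoming record key as INSERT or POSSIBLE_UPDATE
// by probing per-file bloom filters built from the record keys stored in each file.
public class BloomIndexSketch {

  enum Tag { INSERT, POSSIBLE_UPDATE }

  // fileId -> bloom filter over the record keys contained in that file
  private final Map<String, BloomFilter<CharSequence>> fileBloomFilters = new HashMap<>();

  public void addFile(String fileId, List<String> recordKeysInFile) {
    BloomFilter<CharSequence> filter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), recordKeysInFile.size(), 0.01);
    recordKeysInFile.forEach(filter::put);
    fileBloomFilters.put(fileId, filter);
  }

  // No filter matches -> definite insert. A match is only a candidate update,
  // because bloom filters can return false positives; the candidate file's actual
  // keys still need to be checked before treating the record as an update.
  public Tag tag(String recordKey) {
    for (BloomFilter<CharSequence> filter : fileBloomFilters.values()) {
      if (filter.mightContain(recordKey)) {
        return Tag.POSSIBLE_UPDATE;
      }
    }
    return Tag.INSERT;
  }
}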
20. Concurrency
● Multi-row atomicity
● Strong consistency (Same as HDFS guarantees)
● Single Writer - Multiple Consumer pattern
● MVCC for isolation
○ Running queries execute concurrently with ingestion
Deep Dive
21. Data Skew
Why is skew a problem?
- Spark 2GB Remote Shuffle Block limit
- Straggler problem
Hoodie handles data skew automatically
- Index lookup skew
- Data write skew handled by auto sub-partitioning
- Allocate sub-partitions (file IDs) based on historical commit stats (see the sketch after this slide)
- Morph inserts into updates to fix small files
Deep Dive
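As referenced in the sub-partitioning bullet above, here is a rough, hedged sketch of splitting a batch of inserts into file groups using historical commit statistics, so that no single write task becomes a straggler. The InsertBucketing name, the average-record-size input, and the 128 MB target file size are illustrative assumptions, not Hoodie's actual partitioner.

// Illustrative sketch: decide how many insert buckets (future file IDs) a batch needs,
// using the average record size observed in past commits so each bucket stays close
// to the target file size.
public final class InsertBucketing {

  private InsertBucketing() {}

  public static int numInsertBuckets(long numInsertRecords,
                                     long avgRecordSizeBytesFromCommitStats,
                                     long targetFileSizeBytes) {
    long totalBytes = numInsertRecords * avgRecordSizeBytesFromCommitStats;
    // ceiling division, with at least one bucket
    return (int) Math.max(1, (totalBytes + targetFileSizeBytes - 1) / targetFileSizeBytes);
  }

  public static void main(String[] args) {
    // e.g. 10M inserts, ~1 KB per record (from commit stats), 128 MB target files
    int buckets = numInsertBuckets(10_000_000L, 1_024L, 128L * 1024 * 1024);
    System.out.println("Spread inserts over " + buckets + " file groups");
  }
}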
22. Compaction
Essential for Query performance
- Merge Write Optimized row format with Scan Optimized column format
Scheduled asynchronously to Ingestion
- Ingestion already groups updates per File Id
- Locks down versions of log files to compact
- Pluggable strategy to prioritize compactions (see the sketch after this slide)
- Base File to Log file size ratio
- Recent partitions compacted first
Deep Dive
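As referenced above, a hedged sketch of what a pluggable compaction-prioritization strategy could look like: order file groups by how large their pending log is relative to the base file, and prefer recent partitions. The CompactionCandidate descriptor and the tie-breaking rule are invented here for illustration.

import java.util.Comparator;
import java.util.List;

// Illustrative only: compact file groups with a high log-to-base size ratio first,
// breaking ties in favor of more recent (lexicographically larger) day partitions.
public class CompactionPrioritySketch {

  // Hypothetical candidate descriptor (not a Hoodie class).
  static class CompactionCandidate {
    final String partitionPath;  // e.g. "2017/04/16"
    final long baseFileBytes;    // size of the columnar base file
    final long logFileBytes;     // total size of the pending row-based log files

    CompactionCandidate(String partitionPath, long baseFileBytes, long logFileBytes) {
      this.partitionPath = partitionPath;
      this.baseFileBytes = baseFileBytes;
      this.logFileBytes = logFileBytes;
    }

    double logToBaseRatio() {
      return baseFileBytes == 0 ? Double.MAX_VALUE : (double) logFileBytes / baseFileBytes;
    }
  }

  static void prioritize(List<CompactionCandidate> candidates) {
    candidates.sort(
        Comparator.comparingDouble(CompactionCandidate::logToBaseRatio).reversed()
            .thenComparing(c -> c.partitionPath, Comparator.<String>reverseOrder()));
  }
}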
23. Failure recovery
Automatic recovery via Spark RDD
- Resilient Distributed Datasets!!
No Partial writes
- Commit is atomic
- Auto rollback last failed commit
Rollback specific commits
Savepoints/Snapshots
Deep Dive
24. Hoodie Write API
// WriteConfig contains basePath of hoodie dataset (among other configs); a usage sketch follows this listing
HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig)
// Start a commit and get a commit time to atomically upsert a batch of records
String startCommit()
// Upsert the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Bulk load the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> bulkInsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Choose to commit
boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses)
// Rollback
boolean rollback(final String commitTime) throws HoodieRollbackException
Deep Dive
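A minimal usage sketch of the client API listed above, wiring the calls together in the order the signatures suggest (start a commit, upsert, then commit or roll back). Package imports for the Hoodie classes are deliberately omitted because the slides do not show them; treat this as an illustration of the call sequence rather than a drop-in snippet.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hoodie classes (HoodieWriteClient, HoodieWriteConfig, HoodieRecord, WriteStatus)
// are referenced by simple name only; resolve them against the hoodie library
// linked earlier in the deck.
public class UpsertJobSketch {

  public static <T> void ingest(JavaSparkContext jsc,
                                HoodieWriteConfig config,
                                JavaRDD<HoodieRecord<T>> records) throws Exception {
    HoodieWriteClient client = new HoodieWriteClient(jsc, config);

    // 1. Obtain a commit time; files written in this batch are tagged with it.
    String commitTime = client.startCommit();

    // 2. Upsert the batch; each WriteStatus reports per-record success/failure.
    JavaRDD<WriteStatus> statuses = client.upsert(records, commitTime);

    // 3. Atomically publish the batch to queries, or roll it back on failure.
    if (!client.commit(commitTime, statuses)) {
      client.rollback(commitTime);
    }
  }
}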
25. Hoodie Record
HoodieRecordPayload (an example implementation is sketched after this slide)
// Combine Existing value with New incoming value and return the combined value
○ IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);
// Get the Avro IndexedRecord for the dataset schema
○ IndexedRecord getInsertValue(Schema schema);
Deep Dive
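To illustrate the payload abstraction, here is a hedged example of a "latest write wins" payload shaped after the two methods above. It deliberately does not implement Hoodie's actual interface (its package is not shown on the slide); the OverwriteLatestPayload name and the plain Avro IndexedRecord field are assumptions for the example.

import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

// Example "latest write wins" payload: the incoming Avro record simply replaces
// whatever is currently stored for the same record key.
public class OverwriteLatestPayload {

  private final IndexedRecord incoming;

  public OverwriteLatestPayload(IndexedRecord incoming) {
    this.incoming = incoming;
  }

  // Combine the existing value with the new incoming value; here the new value wins.
  public IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
    return incoming;
  }

  // Value to write when the record key has not been seen before (an insert).
  public IndexedRecord getInsertValue(Schema schema) {
    return incoming;
  }
}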
27. Hoodie Views
Hoodie Views
[Chart: trade-off between query execution time and data latency for the Read Optimized and Real Time views.]
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- ~30 mins latency for ~500GB
- Targets existing Hive tables
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Data Pipelines
28. Hoodie Views
[Diagram: the same partitioned dataset (2017/02/15 - 2017/02/17), ingesting a 10 GB changelog every 5 min, is exposed through Hive as three tables: a Read Optimized table over the compacted columnar file versions (File1.parquet, File1_v1.parquet, File1_v2.parquet), a Real Time table that also merges the pending log (File1.avro.log), and an Incremental Log table exposing the stream of changes.]
Hoodie Views
29. Read Optimized View
InputFormat picks only Compacted Columnar Files
Optimized for faster query runtime over data latency
- Plug into getSplits to filter out older versions (see the sketch after this slide)
- All optimizations for reading parquet apply (vectorized reads, etc.)
Data latency is the frequency of compaction
Works out of the box with Presto and Apache Spark
Hoodie Views
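As referenced in the getSplits bullet above, a simplified sketch of keeping only the latest committed version of each file group. The <fileId>_<commitTime>.parquet naming and the string parsing are assumptions made for illustration; the real InputFormat filters using Hoodie's commit metadata rather than file names.

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: pick the newest version per file group, assuming file names
// follow a "<fileId>_<commitTime>.parquet" pattern for the purpose of this sketch.
public class LatestVersionFilterSketch {

  public static List<String> filterLatestVersions(Collection<String> fileNames) {
    Map<String, String> latestByFileId = new HashMap<>();
    for (String name : fileNames) {
      String fileId = fileIdOf(name);
      String current = latestByFileId.get(fileId);
      // Commit times are compared as strings; newer versions replace older ones.
      if (current == null || commitTimeOf(name).compareTo(commitTimeOf(current)) > 0) {
        latestByFileId.put(fileId, name);
      }
    }
    return new ArrayList<>(latestByFileId.values());
  }

  private static String base(String fileName) {
    return fileName.substring(0, fileName.length() - ".parquet".length());
  }

  private static String fileIdOf(String fileName) {
    String b = base(fileName);
    return b.substring(0, b.lastIndexOf('_'));
  }

  private static String commitTimeOf(String fileName) {
    String b = base(fileName);
    return b.substring(b.lastIndexOf('_') + 1);
  }
}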
31. Real Time View
InputFormat merges ROFile with WOFiles at query runtime (see the merge sketch after this slide)
Custom RecordReader
- Logs are grouped per FileID
- Single split is usually a single FileID in Hoodie (Block Aligned files)
Latency is the frequency of ingestion (mini-batches)
Works out of the box with Presto and Apache Spark
- Specialized parquet read path optimizations not supported
Hoodie Views
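As a hedged illustration of the merge the Real Time view performs at query time, here is a minimal sketch that overlays the pending log records of a file group onto its base-file records, with the latest write winning per record key. The Map-based representation is an assumption for the example; the real record reader works on Avro log blocks and parquet readers.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: merge base-file records with pending log records on the fly;
// for a given record key the log (more recent) value wins.
public class RealTimeMergeSketch {

  public static <V> Map<String, V> mergeOnRead(Map<String, V> baseFileRecords,
                                               Map<String, V> logRecords) {
    Map<String, V> merged = new HashMap<>(baseFileRecords);
    merged.putAll(logRecords); // later (log) values overwrite base values per key
    return merged;
  }
}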
32. Incremental Log View
[Diagram: the same day-partitioned trips dataset as in the motivation (2010-2014 through 2017/04/16), receiving new/updated trips every 5 min; the Log View exposes just those incremental changes for Incr Pull, without touching unaffected partitions.]
Hoodie Views
33. Incremental Log View
Pull ONLY changed records in a time range using SQL (see the query sketch after this slide)
- e.g. _hoodie_commit_time > ‘startTs’ AND _hoodie_commit_time <= ‘endTs’
Avoid full table/partition scan
Do not rely on a custom sequence ID to tail
Lookback window restricted based on cleaning policy
Hoodie Views
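As referenced above, a sketch of what such a time-range pull could look like from Spark SQL over the Hive-registered log view. The table name hoodie_trips_log and the commit-time literals are placeholders; only the _hoodie_commit_time predicate comes from the slide.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: pull only the records that changed between two commit times by filtering
// on _hoodie_commit_time, instead of scanning the whole table or partition.
public class IncrementalPullSketch {

  public static Dataset<Row> pullChanges(SparkSession spark, String startTs, String endTs) {
    // hoodie_trips_log is a placeholder name for the Hive-registered log view.
    return spark.sql(
        "SELECT * FROM hoodie_trips_log "
            + "WHERE _hoodie_commit_time > '" + startTs + "' "
            + "AND _hoodie_commit_time <= '" + endTs + "'");
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hoodie-incr-pull-sketch")
        .enableHiveSupport()
        .getOrCreate();
    // Placeholder commit-time strings; substitute real commit times from the dataset's timeline.
    pullChanges(spark, "20170416103000", "20170416110000").show();
  }
}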
37. Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within a few minutes
- Offload analytics to HDFS
Incremental ETL processing
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Use Cases
39. Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within a few minutes
- Offload analytics to HDFS
Incremental ETL processing
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Unified Analytical Serving Layer
- Eliminate your specialized serving layer, if the tolerated latency is > 10 min
- Simplify serving with HDFS for the entire dataset
Use Cases
43. Adoption @ Uber
Use Cases
Powering ~1000 data ingestion feeds
- Every 30 mins today, several TBs per hour
- Towards < 10 min in the next few months
Reduced resource usage by 10x
- In production for last 6 months
- Hardened across rolling restarts, data node reboots
Incremental ETL for dimension tables
- Data warehouse at large
Future
- Self serve incremental pipelines (DeltaStreamer)
44. Comparison
Hoodie fills a big void in Hadoop land
- Upserts & Faster data
Plays well with Hadoop ecosystem & deployments
- Leverage Spark vs re-inventing yet-another storage silo
Designed for Incremental Processing
- Incremental Pull is a ‘Hoodie’ special
Comparison
45. Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem
Comparison: Analytical Storage
Hoodie Views
46. Comparison
Apache Kudu
- Targets both OLTP and OLAP
- Dedicated storage servers
- Evolving Ecosystem support*
Hoodie
- OLAP Only
- Built on top of HDFS
- Already works with Spark/Hive/Presto
Hive Transactions
- Tight integration with Hive & ORC
- No read-optimized view
- Hive based impl
Hoodie
- Hive/Spark/Presto
- Parquet/Avro today, but pluggable
- Power of Spark!
Comparison
47. Comparison
HBase/Key-Value Stores
- Write Optimized for OLTP
- Bad Scan Performance
- Scaling farm of storage servers
- Multi row atomicity is tedious
Hoodie
- Read-Optimized for OLAP
- State-of-art columnar formats
- Scales like a normal job or query
- Multi row commits!!
Stream Processing
- Row oriented processing
- Flink/Spark typically upsert results to OLTP/specialized OLAP stores
Hoodie
- Columnar queries, at higher latency
- HDFS as Sink, Presto as OLAP engine
- Integrates with Spark/Spark Streaming
Comparison
48. Future Plans
Merge On Read (Project #1)
- Active development, productionizing, shipping!
Global Index (Project #2)
- Fast, lightweight index to map key to fileID, globally (not just partitions)
Spark Datasource (Issue #7) & Presto Plugins (Issue #81)
- Native support for incremental SQL (e.g. where _hoodie_commit_time > ... )
Beam Runner (Issue #8)
- Build incremental pipelines that also port across batch or streaming modes
Future
49. Getting Involved
Engage with us on Github
- Look for “beginner-task” tagged issues
- Check out tools/utilities
Uber is hiring for “Hoodie”
- “Software Engineer - Data Processing Platform (Hoodie)”
- https://www.uber.com/careers/list/28811/
Swing by Office Hours after talk
- 2:40pm–3:20pm, Location: Table B
Contributions
52. Hoodie Views
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- ~30 mins latency for ~500GB
- Targets existing Hive tables
Hoodie Concepts
Real Time View
- Hybrid of row & columnar data
- ~1-5 mins latency
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Data Pipelines
53. Hoodie Storage Types
Define how data is written
- Indexing & Storage of data
- Impl of primitives and timeline actions
- Support 1 or more views
2 Storage Types
- Copy On Write: Purely columnar, simply creates new versions of files
- Merge On Read: Near-real time, shifts some write cost to reads, merges on-the-fly
Hoodie Concepts
Storage Type: Supported Views
- Copy On Write: Read Optimized, Log View
- Merge On Read: Read Optimized, Real Time, Log View
55. Timeline Actions
Commit
- Multi-row atomic publish of data to Queries
- Detailed metadata to facilitate log view of changes
Clean
- Remove older versions of files, to reclaim storage space
- Cleaning modes: Retain Last X file versions, Retain Last X Commits (see the sketch after this slide)
Compaction
- Compact row based log to columnar snapshot, for real-time view
Savepoint
- Roll back to a checkpoint and resume ingestion
Hoodie Concepts
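As referenced in the cleaning-modes bullet above, a small sketch of the "Retain Last X file versions" mode: for each file group, every version beyond the newest X becomes eligible for deletion. The CleanPolicySketch name and the commit-time strings are illustrative; commit-based retention would compare against the commit timeline instead.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the "retain last X file versions" cleaning mode: everything except the
// newest retainLast versions of a file group becomes eligible for deletion.
public class CleanPolicySketch {

  // Returns the versions (identified here by commit time) that can be cleaned up.
  public static List<String> versionsToClean(List<String> commitTimesForFileGroup, int retainLast) {
    List<String> sorted = new ArrayList<>(commitTimesForFileGroup);
    sorted.sort(Comparator.reverseOrder()); // newest first (assumes commit times sort lexicographically)
    return sorted.size() <= retainLast
        ? new ArrayList<>()
        : new ArrayList<>(sorted.subList(retainLast, sorted.size()));
  }
}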
56. Hoodie Terminology
● Basepath: Root of a Hoodie dataset
● Partition Path: Relative path to folder with partitions of data
● Commit: Produce files identified with fileid & commit time
● Record Key:
○ Uniquely identify a record within partition
○ Mapped consistently to a fileid
● File Id Group: Files with all versions of a group of records
● Metadata Directory: Stores a timeline of all metadata actions, which are atomically published
Deep Dive
59. Hoodie Write Path
Deep Dive
[Diagram: a Spark application uses the Hoodie Spark client to tag the incoming stream against a (persistent) index and then save the data layout and metadata in HDFS; on the query side, HoodieInputFormat gets the latest commit and then filters and merges file versions.]
#9: Talk about why updates are needed before going to the previous generation, which used HBase to solve mutations
#18: 2 storage types and 3 views
Copy on Write is the first version of storage
Provides 2 views - RO and LogView
Merge on Read is a strict superset of Copy on Write
Provides RealTime view in addition (1 liner - More recent data with cost of merge pushed on to query execution)
#19: Visualization of Storage Types
Talk about a basic parquet dataset laid out in HDFS
We want to ingest, say, 200GB of data and upsert it into this dataset
How do we support upsert primitive
First we need to tag updates and inserts - introduce index
Introduce multi version - to write out updates
Talk about how / why batch sizes matter - amortization - write amplification
Go over the numbers
30 minutes of queued data takes 30 minutes to ingest - 1 hour SLA
We wanted to take on more workloads by pushing that SLA even further down
Have a differential structure - a log of updates queued for a single file
Stream updates into the log file
compaction happens once in a while - compaction becomes similar to previous ingestion flow
Run through the change in numbers
#20: Index should be super quick - Pure Overhead
Block Aligned Files - Balance compaction and query parallelism
#21: Lets talk about some of the challenges/Features of storing the data in the above format
#22: Explain hotspotting and 2GB Limit
Skew could be during index lookup or during data write
Custom partitioning which takes statistics of commits to determine the appropriate number of subpartitions
Auto Corrections of file sizes
#24: Spark RDD has automatic recovery and retries computations
Avro Log maintains the offset to the block and a partially written block will be skipped
SavePoints to rollback and re-ingest
#25: Talk about SparkContext and Config - Index, Storage Formats, Parallelism
StartCommit - Token
#25: Talk about what a hoodie record is and the record payload abstraction
#27: Talk briefly about metadata storage.
Bring attention towards the views.
#28: A view is a inputformat - 3 different hive tables are essentially registered pointing to the same HDFS dataset
#29: Recap the storage briefly
Introduce one view after next and explain how it works
Explain about hive - query plan generation
#30: Explain InputFormats for each view
Explain how read optimized inputformat works - generate query plan - getsplits - filter
Talk about optimizing for query runtime - chosen when the compaction data latency is good enough
Talk about hive metastore registration
#57: Hoodie further partitions the HDFS directory partitions to a finer granularity
Subpartitioned as <Partition Path, File Id>
Record Key <==> <Partition Path, File Id> is immutable
Dynamic sub partition automatically handles data skew
Fundamental unit of compaction is rewriting a single File Id
Sub partitioning is used for ingestion only
Query engines only see HDFS directory partitions