Dustin Vannoy is a field data engineer at Databricks and co-founder of Data Engineering San Diego. He specializes in Azure, AWS, Spark, Kafka, Python, data lakes, cloud analytics, and streaming. The document provides an overview of various Azure data and analytics services including Azure SQL DB, Cosmos DB, Blob Storage, Data Lake Storage Gen 2, Databricks, Synapse Analytics, Data Factory, Event Hubs, Stream Analytics, and Machine Learning. It also includes a reference architecture and recommends Microsoft Learn paths and community resources for learning.
Kafka Tiered Storage separates compute and data storage into two independently scalable layers. Uber's Kafka Improvement Proposal (KIP) 405 describes two-tiered storage, a major step toward cloud-native Kafka: it keeps the most recent data locally and offloads older data to a remote storage service. Operationally, the benefit is faster routine cluster maintenance. At LinkedIn, Kafka tiered storage is strongly desired to reduce the cost of running Kafka in the Azure cloud environment. Because KIP-405 does not dictate the implementation of the remote storage substrate, LinkedIn's choice for tiering Kafka in Azure deployments is Azure Blob Storage. This presentation begins with the motivation behind LinkedIn's efforts to adopt Kafka Tiered Storage, then discusses the architecture of KIP-405, and finally presents the work-in-progress Remote Storage Manager for Azure Blobs.
Video: https://youtu.be/V5gaBE5CMwg?t=1387
The Data Lake Engine
Data Microservices in Spark using Apache Arrow Flight (Databricks)
Machine learning pipelines are a hot topic at the moment. Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of running machine learning models in production.
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed more frequently, you can end up in a seemingly endless loop: you scale down and need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren't in a serverless-like environment, preemptible or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers such as YARN and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
Redis is an in-memory key-value store that is often used as a database, cache, and message broker. It supports various data structures like strings, hashes, lists, sets, and sorted sets. While data is stored in memory for fast access, Redis can also persist data to disk. It is widely used by companies like GitHub, Craigslist, and Engine Yard to power applications with high performance needs.
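As a concrete illustration of those data structures, here is a minimal sketch using the redis-py client; the host, port, key names, and values are illustrative assumptions, not taken from the document.

```python
# A minimal sketch using the redis-py client ("pip install redis"); the host, port,
# key names, and values are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# String: a cache entry with a 60-second TTL
r.set("session:42", "alice", ex=60)

# Hash: one object with several fields
r.hset("user:42", mapping={"name": "alice", "plan": "pro"})

# List: a lightweight queue, the message-broker use case
r.lpush("jobs", "resize-image-17")

# Sorted set: a leaderboard keyed by score
r.zadd("leaderboard", {"alice": 1500, "bob": 1200})
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```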
This document discusses YARN federation, which allows multiple YARN clusters to be connected together. It summarizes:
- YARN is used at Microsoft for resource management but faces challenges of large scale and diverse workloads. Federation aims to address this.
- The federation architecture connects multiple independent YARN clusters through centralized services for routing, policies, and state. Applications are unaware and can seamlessly run across clusters.
- Federation policies determine how work is routed and scheduled across clusters, balancing objectives like load balancing, scaling, fairness, and isolation. A spectrum of policy options is discussed from full partitioning to full replication to dynamic partial replication.
- A demo is presented showing a job running across multiple clusters.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
How Netflix Tunes EC2 Instances for Performance (Brendan Gregg)
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg:
"At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes (HVM, PV, and PVHVM), and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
This document summarizes a presentation on optimizing Zabbix performance through tuning. It discusses identifying and fixing common problems like default templates and database settings. Next, it covers tuning Zabbix configuration by adjusting the number of server processes and monitoring internal stats. Additional optimizations include using proxies to distribute load, partitioning historical tables, and running Zabbix components on separate hardware. The summary emphasizes monitoring internal stats, tuning configurations and databases, disabling housekeeping, and reviewing additional reading on tuning MySQL, PostgreSQL and Zabbix internals.
This document describes a tiered compilation approach used in Java virtual machines. It has multiple compilation levels from interpretation to highly optimized native code. The goal is to improve startup and steady state performance by adapting the compilation level based on runtime feedback. Evaluation on SPECjvm98 benchmarks shows the tiered approach reduces startup time by up to 35% and settled time by up to 32% compared to always compiling at the highest level.
This presentation is about Java performance and the most effective ways to work with Java memory, including memory-saving techniques and overcoming memory barriers. It also debunks the most popular myths about speed boosting.
This presentation by Andrii Antilikatorov (Consultant, GlobalLogic) was delivered at GlobalLogic Java Conference #2 in Krakow on April 23, 2016.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... (Flink Forward)
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by Jeff Chao
Cloud DW technology trends and considerations for enterprises to apply Snowflake (SANG WON PARK)
A write-up of a talk on Cloud DW and Snowflake given at the "Korea Data Engineer Meetup", held in person for the first time this year (July 2022).
[ Talk Topic ]
Cloud DW technology trends and applying Snowflake
- The role of the Cloud DW in the Modern Data Stack
- How does it differ from the traditional Data Lake + DW?
- How should a Data Engineer use it? (pros and cons in terms of features, performance, and cost)
[ Key Points ]
- Many Data Engineers are currently trying to overcome the technical and operational limits of their existing stacks (Hadoop, Spark, DW, etc.).
- In particular, they are evaluating Cloud DWs (AWS Redshift, GCP BigQuery, Databricks, Snowflake) for their cloud advantages, operability, and performance.
- This talk shares hands-on experience applying Snowflake in real projects, along with its technical characteristics, strengths, and weaknesses.
Since last year, changes in government data policy and the accelerating shift to cloud-based technology have brought major changes to enterprise data environments, and companies are trying a variety of approaches to adapt.
At the center of this shift sits the Cloud DW (or lakehouse), around which architectures are evolving into unified data platforms. Companies have been experimenting with traditional DW products and the offerings of the major CSPs (AWS, GCP, Azure), but contrary to expectations they are going through a great deal of trial and error due to lower-than-expected performance, high usage costs, and operational complexity.
In this situation, we found that many of Snowflake's features, first evaluated last year, could easily resolve much of what companies were struggling with, and we went on to run POCs for many of them and to deliver projects that put Snowflake into production.
Based on that experience, this talk explains, from the perspective of the enterprise (and of the Data Engineers who do the actual work), how Snowflake can solve these problems, organized around the issues and solutions at each stage of adopting, using, and scaling a Cloud DW.
https://blog.naver.com/freepsw?Redirect=Update&logNo=222815591918
Jay Runkel presented a methodology for sizing MongoDB clusters to meet the requirements of an application. The key steps are: 1) Analyze data size and index size, 2) Estimate the working set based on frequently accessed data, 3) Use a simplified model to estimate IOPS and adjust for real-world factors, 4) Calculate the number of shards needed based on storage, memory and IOPS requirements. He demonstrated this process for an application that collects mobile events, requiring a cluster that can store over 200 billion documents with 50,000 IOPS.
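As a back-of-the-envelope illustration of that methodology, the sketch below computes the shard count as the maximum of the storage, memory, and IOPS requirements; every input value and per-node capacity is an assumption, not a figure from the talk.

```python
# Back-of-the-envelope shard sizing in the spirit of the methodology above.
# Every input (document count, sizes, per-node capacities, working-set ratio)
# is an illustrative assumption; a real exercise adjusts IOPS for workload shape.
import math

doc_count           = 200e9     # documents to store
avg_doc_bytes       = 1_000     # average document size
index_bytes_per_doc = 200       # rough index overhead per document
working_set_ratio   = 0.05      # fraction of data + indexes accessed frequently
required_iops       = 50_000    # target aggregate IOPS

node_storage_bytes = 2e12       # usable storage per shard (2 TB)
node_ram_bytes     = 128e9      # RAM per shard available for the working set
node_iops          = 10_000     # sustainable IOPS per shard

data_bytes  = doc_count * avg_doc_bytes
index_bytes = doc_count * index_bytes_per_doc
working_set = (data_bytes + index_bytes) * working_set_ratio

shards_for_storage = math.ceil((data_bytes + index_bytes) / node_storage_bytes)
shards_for_memory  = math.ceil(working_set / node_ram_bytes)
shards_for_iops    = math.ceil(required_iops / node_iops)

print("shards needed:", max(shards_for_storage, shards_for_memory, shards_for_iops))
```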
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
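A small PySpark sketch of those concepts follows: an RDD built from a local collection, lazy transformations, an action that triggers the DAG scheduler, and a DataFrame query; the data is made up.

```python
# A small PySpark sketch: an RDD, lazy transformations, an action that triggers
# the DAG scheduler, and a DataFrame query. The data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["a b a", "b c", "a"])        # distributed collection (RDD)
counts = (lines.flatMap(lambda line: line.split())   # transformation (lazy)
               .map(lambda w: (w, 1))                # transformation (lazy)
               .reduceByKey(lambda x, y: x + y))     # transformation (lazy)
print(counts.collect())                              # action: the DAG actually runs here

# Structured data goes through the DataFrame API instead:
df = spark.createDataFrame([("a", 1), ("b", 2)], ["word", "n"])
df.filter(df.n > 1).show()

spark.stop()
```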
Adaptive Query Execution: Speeding Up Spark SQL at Runtime (Databricks)
Over the years, there has been extensive and continuous effort on improving Spark SQL’s query optimizer and planner, in order to generate high quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions in picking the most optimal query plan.
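A hedged sketch of how these optimizer features are switched on in PySpark; the configuration keys are standard Spark SQL settings, while the app name and the `sales` table are invented for illustration.

```python
# Hedged sketch: enabling the cost-based optimizer and adaptive query execution,
# then collecting the column statistics the CBO leverages. The `sales` table is made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("optimizer-demo")
         .config("spark.sql.cbo.enabled", "true")        # cost-based optimization
         .config("spark.sql.adaptive.enabled", "true")   # adaptive query execution
         .getOrCreate())

# Create a tiny table, then gather statistics (row count, distinct values, min/max).
spark.createDataFrame([(1, 10.0), (2, 20.0)], ["customer_id", "amount"]) \
     .write.mode("overwrite").saveAsTable("sales")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```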
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting columns improved statistics and reduced data reads by 30%. For the reader, splitting the read so that filter columns are evaluated first and the remaining columns are fetched only for matching rows avoided loading unnecessary data. These changes improved Spark SQL performance at ByteDance without requiring job changes.
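As an illustrative PySpark sketch of those two ideas (not ByteDance's code): sort the filter column before writing so each row group carries tight min/max statistics, then rely on filter pushdown at read time; paths and column names are assumptions.

```python
# Illustrative sketch: sorted writes for tighter row-group statistics, then a
# pushed-down filter at read time. Paths and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pushdown").getOrCreate()

events = spark.createDataFrame(
    [(1, "click"), (2, "view"), (3, "click")], ["user_id", "event_type"])

(events.repartition("event_type")
       .sortWithinPartitions("event_type")   # tighter per-row-group statistics
       .write.mode("overwrite")
       .parquet("/tmp/events_parquet"))      # illustrative output path

clicks = spark.read.parquet("/tmp/events_parquet").filter("event_type = 'click'")
clicks.explain()                             # the scan node lists PushedFilters
```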
Parquet performance tuning: the missing guide (Ryan Blue)
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
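A hedged sketch with pyarrow (not the talk's own code) showing the knobs mentioned above: a sorted, dictionary-encoded filter column, a raised dictionary page limit, and smaller row groups, followed by a filtered read; the file name and data are illustrative.

```python
# Hedged sketch of writer-side Parquet tuning with pyarrow; file name and data are made up.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"event_type": ["click", "click", "view", "view"],
                  "user_id": [1, 2, 3, 4]})
table = table.sort_by("event_type")             # sorted column -> useful min/max stats

pq.write_table(
    table,
    "events.parquet",
    row_group_size=128_000,                     # smaller row groups -> finer skipping
    use_dictionary=["event_type"],              # keep the filter column dictionary-encoded
    dictionary_pagesize_limit=4 * 1024 * 1024,  # raise the threshold to avoid fallback
    compression="snappy",
)

# Readers can then eliminate row groups and pages via statistics and dictionary filters.
print(pq.read_table("events.parquet", filters=[("event_type", "=", "click")]))
```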
Why is My Stream Processing Job Slow? with Xavier Leaute (Databricks)
The goal of most streams processing jobs is to process data and deliver insights to the business – fast. Unfortunately, sometimes our streams processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn’t fast enough? In this talk, we’ll dive into the challenges of analyzing performance of real-time stream processing applications. We’ll share troubleshooting suggestions and some of our favorite tools. So next time someone asks “why is this taking so long?”, you’ll know what to do.
The document describes the Volcano/Cascades query optimizer. It uses dynamic programming to efficiently search the large space of possible query execution plans. The optimizer represents queries as logical and physical operators connected by transformation and implementation rules. It explores the logical plan space and then builds physical plans by applying these rules. The search is guided by estimating physical operator costs. The optimizer memoizes partial results to avoid redundant work. This approach allows finding optimal execution plans in a principled way that scales to complex queries and optimizer extensions.
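As a toy illustration of the memoized, cost-guided search described above (not the Volcano/Cascades implementation itself), the sketch below picks the cheapest join tree over a handful of relations, caching the best plan per relation subset the way an optimizer memoizes groups; the cardinalities and the cost model are invented.

```python
# Toy memoized plan search: cheapest join tree per relation subset.
from functools import lru_cache

CARD = {"orders": 1_000_000, "customers": 50_000, "items": 5_000_000}
SELECTIVITY = 0.001                     # crude join selectivity, an assumption

def _proper_subsets(rels):
    rels = list(rels)
    for mask in range(1, 2 ** len(rels) - 1):
        yield frozenset(r for i, r in enumerate(rels) if mask >> i & 1)

@lru_cache(maxsize=None)                # the "memo": best (cost, plan, cardinality) per group
def best_plan(rels: frozenset):
    if len(rels) == 1:
        (r,) = rels
        return CARD[r], r, CARD[r]      # cost, plan, output cardinality
    best = None
    for left in _proper_subsets(rels):
        right = rels - left
        lcost, lplan, lcard = best_plan(left)
        rcost, rplan, rcard = best_plan(right)
        out_card = max(1, int(lcard * rcard * SELECTIVITY))
        cost = lcost + rcost + out_card           # cost model: sum of intermediate sizes
        if best is None or cost < best[0]:
            best = (cost, f"({lplan} JOIN {rplan})", out_card)
    return best

cost, plan, _ = best_plan(frozenset(CARD))
print(plan, "estimated cost:", cost)
```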
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
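For readers who want to try this themselves, a minimal PySpark snippet for printing the plans discussed above; the DataFrame and column names are made up.

```python
# Minimal sketch of inspecting query plans in PySpark; data and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explain-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 10.0), (2, "DE", 20.0)], ["id", "country", "amount"])
agg = (orders.filter(F.col("country") == "US")
             .groupBy("country")
             .agg(F.sum("amount").alias("total")))

agg.explain()                   # physical plan only
agg.explain(mode="extended")    # parsed, analyzed, optimized, and physical plans (Spark 3.x)
```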
Slide deck presented at http://devternity.com/ around MongoDB internals. We review the usage patterns of MongoDB, the different storage engines and persistency models, as well as the definition of documents and general data structures.
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas’ new functionalities since its initial release, discuss its roadmaps, and how we think Koalas could become the standard API for large scale data science.
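As a quick, hedged taste of the API discussed above, here is a short example using the `databricks.koalas` package (the package name at the time of the talk); the data is illustrative.

```python
# A short example of the pandas-style API Koalas exposes on top of Spark; data is made up.
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"city": ["SF", "NY", "SF"], "amount": [10, 20, 30]})

kdf = ks.from_pandas(pdf)                 # same operations, now distributed on Spark
print(kdf.groupby("city")["amount"].sum().sort_index())

kdf["amount_with_tax"] = kdf["amount"] * 1.1   # familiar pandas-style column assignment
kdf.to_spark().show()                          # drop down to a Spark DataFrame when needed
```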
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Pre-register for Databricks Community Edition
Read koalas docs
Unified Data Platform, by Pauline Yeung of Cisco Systems (Altinity Ltd)
Presented on December ClickHouse Meetup. Dec 3, 2019
Our journey from using ClickHouse in an internal threat library web application, to experimenting with ClickHouse, to migrating production data from Elasticsearch, Postgres, and HBase, to trying ClickHouse for error metrics in a product under development.
This document summarizes Kevin Weil's presentation on Hadoop and Pig at Twitter. Weil discusses how Twitter uses Hadoop and Pig to analyze massive amounts of user data, including tweets. He explains how Pig allows for more concise and readable analytics jobs compared to raw MapReduce. Weil also provides examples of how Twitter builds data-driven products and services using these tools, such as their People Search feature.
Hadoop, Pig, and Twitter (NoSQL East 2009), by Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
O'Reilly Where 2.0 2011
As a result of cheap storage and computing power, society is measuring and storing increasing amounts of information.
It is now possible to efficiently crunch Petabytes of data with tools like Hadoop.
In this O'Reilly Where 2.0 tutorial, Pete Skomoroch, Sr. Data Scientist at LinkedIn, gives an overview of spatial analytics and how you can use tools like Hadoop, Python, and Mechanical Turk to process location data and derive insights about cities and people.
Topics:
* Data Science & Geo Analytics
* Useful Geo tools and Datasets
* Hadoop, Pig, and Big Data
* Cleaning Location Data with Mechanical Turk
* Spatial Tweet Analytics with Hadoop & Python
* Using Social Data to Understand Cities
The document summarizes a workshop on spatial analytics. It discusses the rise of spatial analytics, different spatial analysis techniques including spatial autocorrelation and interpolation. It covers using Hadoop and Pig for distributed analytics on big spatial data. Examples are provided of spatial analysis jobs in Pig that are much simpler than equivalent MapReduce code. The document concludes by demonstrating a simple example of counting tweets by location in Pig.
An overview of traditional spatial analysis tools, an intro to hadoop and other tools for analyzing terabytes or more of data, and then a primer with examples on combining the two with data pulled from the Twitter streaming API. Given at the O'Reilly Where 2.0 conference in March 2010.
The document summarizes Jimmy Lin's MapReduce tutorial for WWW 2013. It discusses the MapReduce algorithm design and implementation. Specifically, it covers key aspects of MapReduce like local aggregation to reduce network traffic, sequencing computations by manipulating sort order, and using appropriate data structures to accumulate results incrementally. It also provides an example of building a term co-occurrence matrix to measure semantic distance between words.
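To illustrate the local-aggregation idea mentioned above, here is a hedged sketch of a Hadoop Streaming mapper for the term co-occurrence example, using an in-mapper dictionary to combine partial counts before they reach the shuffle; the window size and output format are assumptions.

```python
# In-mapper combining for term co-occurrence pairs (Hadoop Streaming mapper sketch).
import sys
from collections import defaultdict

WINDOW = 2                                  # co-occurrence window, an assumption
counts = defaultdict(int)                   # in-mapper combiner: (term, neighbor) -> count

for line in sys.stdin:
    terms = line.split()
    for i, term in enumerate(terms):
        neighbors = terms[max(0, i - WINDOW):i] + terms[i + 1:i + 1 + WINDOW]
        for neighbor in neighbors:
            counts[(term, neighbor)] += 1

for (term, neighbor), n in counts.items():
    # key \t value; the framework sorts by key so one reducer sums each pair
    print(f"{term},{neighbor}\t{n}")
```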
Big Data Analytics with Hadoop with @techmilind (EMC)
Hadoop has rapidly emerged as the preferred solution for big data analytics across unstructured data and companies are seeking competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data. This session reviews the practices of performing analytics using unstructured data with Hadoop.
1) Graph databases can represent data and relationships in ways that are more natural than relational databases.
2) However, different graph database implementations use different internal representations and APIs, making it hard to optimize queries across systems or build applications that work with multiple databases.
3) The property graph model has emerged as a common standard that represents graphs with nodes, relationships, and key-value attributes on both nodes and relationships.
Hadoop isn't limited to running Java code; you can write your jobs in a variety of dynamic languages.
This talk is about Hadoop's Streaming API and the best way we found to run Perl jobs on Amazon's Elastic MapReduce platform.
Hadoop and Hive are used at Facebook for large scale data processing and analytics using commodity hardware and open source software. Hive provides an SQL-like interface to query large datasets stored in Hadoop and translates queries into MapReduce jobs. It is used for daily/weekly data aggregations, ad-hoc analysis, data mining, and other tasks using datasets exceeding petabytes in size stored on Hadoop clusters.
TheEdge10: Big Data is Here - Hadoop to the Rescue (Shay Sofer)
This document discusses big data and Hadoop. It begins by explaining how data is growing exponentially and defining what big data is. It then introduces Hadoop as an open-source framework for storing and processing big data across clusters of commodity hardware. The rest of the document provides details on the key components of Hadoop, including HDFS for distributed storage, MapReduce for distributed processing, and various related projects like Pig, Hive and HBase that build on Hadoop.
This document discusses infrastructure for cloud computing and Google's tools. It describes Google's MapReduce and BigTable frameworks, which were developed for large-scale data processing and storage. It also outlines Google's Academic Cloud Computing Initiative (ACCI) partnership with universities to provide cloud computing education and skills. ACCI has helped create cloud computing courses at schools like Tsinghua University in China.
This document provides an overview of Giraph, an open source framework for large-scale graph processing on Hadoop. It discusses why graph processing is important at large scales, existing solutions and their limitations, and Giraph's goals of being easily deployable on Hadoop and providing a graph-oriented programming model. The document describes Giraph's design which uses Hadoop and leverages the bulk synchronous parallel computing model, and provides examples of writing Giraph applications and how Giraph jobs interact with Hadoop.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
Kevin Weil presented on Hadoop at Twitter. He discussed Twitter's data lifecycle including data input via Scribe and Crane, storage in HDFS and HBase, analysis using Pig and Oink, and data products like Birdbrain. He described how tools like Scribe, Crane, Elephant Bird, Pig, and HBase were developed and used at Twitter to handle large volumes of log and tabular data at petabyte scale.
Cloud Computing course presentation, Tarbiat Modares University
By: Sina Ebrahimi, Mohammadreza Noei
Advisor: Sadegh Dorri Nogoorani, PhD.
Presentation Date: 1397/03/07
Video Link in Aparat: https://www.aparat.com/v/N5VbK
Video Link on TMU Cloud: http://cloud.modares.ac.ir/public.php?service=files&t=9ecb8d2dd08df6f990a3eb63f42011f7
This presentation's pptx file (some animations may be lost in SlideShare): http://cloud.modares.ac.ir/public.php?service=files&t=f62282dbd205abaa66de2512d9fdfc83
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API (UiPathCommunity)
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
Perfect for developers, testers, and automation enthusiasts!
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
Mobile App Development Company in Saudi Arabia (Steve Jonas)
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading mobile app development company in Saudi Arabia, we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
The Evolution of Meme Coins: A New Era for Digital Currency ppt.pdf (Abi John)
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
HCL Nomad Web – Best Practices and Managing Multiuser Environments (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understand the difference between single- and multi-user scenarios
- Utilizing Client Clocking
HCL Nomad Web – Best Practices and Managing Multiuser Environments (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed "automatically" in the background, which significantly reduces the administrative overhead compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web to ensure a smooth and efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser's cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Using the Client Clocking feature
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights (Andrew Marnell)
With expertise in data architecture, performance tracking, and revenue forecasting, Andrew Marnell plays a vital role in aligning business strategies with data insights. Andrew Marnell’s ability to lead cross-functional teams ensures businesses achieve sustainable growth and operational excellence.
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ... (SOFTTECHHUB)
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive (ScyllaDB)
Want to learn practical tips for designing systems that can scale efficiently without compromising speed?
Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development.
As you explore key principles of designing low-latency systems with Rust, you will learn how to:
- Create and compile a real-world app with Rust
- Connect the application to ScyllaDB (NoSQL data store)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
Linux Support for SMARC: How Toradex Empowers Embedded Developers (Toradex)
Toradex brings robust Linux support to SMARC (Smart Mobility Architecture), ensuring high performance and long-term reliability for embedded applications. Here’s how:
• Optimized Torizon OS & Yocto Support – Toradex provides Torizon OS, a Debian-based easy-to-use platform, and Yocto BSPs for customized Linux images on SMARC modules.
• Seamless Integration with i.MX 8M Plus and i.MX 95 – Toradex SMARC solutions leverage NXP’s i.MX 8M Plus and i.MX 95 SoCs, delivering power efficiency and AI-ready performance.
• Secure and Reliable – With Secure Boot, over-the-air (OTA) updates, and LTS kernel support, Toradex ensures industrial-grade security and longevity.
• Containerized Workflows for AI & IoT – Support for Docker, ROS, and real-time Linux enables scalable AI, ML, and IoT applications.
• Strong Ecosystem & Developer Support – Toradex offers comprehensive documentation, developer tools, and dedicated support, accelerating time-to-market.
With Toradex’s Linux support for SMARC, developers get a scalable, secure, and high-performance solution for industrial, medical, and AI-driven applications.
Do you have a specific project or application in mind where you're considering SMARC? We can help with a free compatibility check and support a quick time-to-market.
For more information: https://www.toradex.com/computer-on-modules/smarc-arm-family
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf (Software Company)
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
Procurement Insights Cost To Value Guide.pptx (Jon Hansen)
Procurement Insights' integrated Historic Procurement Industry Archives serve as a powerful complement, not a competitor, to other procurement industry firms. They fill critical gaps in depth, agility, and contextual insight that most traditional analyst and association models overlook.
Learn more about this value-driven proprietary service offering here.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025 (BookNet Canada)
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
2. Introduction
‣ How We Arrived at NoSQL: A Crash Course
‣ Collecting Data (Scribe)
‣ Storing and Analyzing Data (Hadoop)
‣ Rapid Learning over Big Data (Pig)
‣ And More: Cassandra, HBase, FlockDB
3. My Background
‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs of data
4. Introduction
‣ How We Arrived at NoSQL: A Crash Course
‣ Collecting Data (Scribe)
‣ Storing and Analyzing Data (Hadoop)
‣ Rapid Learning over Big Data (Pig)
‣ And More: Cassandra, HBase, FlockDB
6. Data, Data Everywhere
‣ Twitter users generate a lot of data
‣ Anybody want to guess?
‣ 7 TB/day (2+ PB/yr)
7. Data, Data Everywhere
‣ Twitter users generate a lot of data
‣ Anybody want to guess?
‣ 7 TB/day (2+ PB/yr)
‣ 10,000 CDs/day
8. Data, Data Everywhere
‣ Twitter users generate a lot of data
‣ Anybody want to guess?
‣ 7 TB/day (2+ PB/yr)
‣ 10,000 CDs/day
‣ 5 million floppy disks
9. Data, Data Everywhere
‣ Twitter users generate a lot of data
‣ Anybody want to guess?
‣ 7 TB/day (2+ PB/yr)
‣ 10,000 CDs/day
‣ 5 million floppy disks
‣ 300 GB while I give this talk
10. Data, Data Everywhere
‣ Twitter users generate a lot of data
‣ Anybody want to guess?
‣ 7 TB/day (2+ PB/yr)
‣ 10,000 CDs/day
‣ 5 million floppy disks
‣ 300 GB while I give this talk
‣ And doubling multiple times per year
11. Syslog?
‣ Started with syslog-ng
‣ As our volume grew, it didn’t scale
12. Syslog?
‣ Started with syslog-ng
‣ As our volume grew, it didn’t scale
‣ Resources overwhelmed
‣ Lost data
13. Scribe
‣ Surprise! FB had same problem, built and open-sourced Scribe
‣ Log collection framework over Thrift
‣ You write log lines, with categories
‣ It does the rest
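As a rough illustration of that workflow (not Twitter's or Facebook's own code), here is a minimal Python client that logs one line with a category to a local Scribe agent over Thrift; it assumes Python bindings generated from scribe.thrift plus the Thrift package are installed, and the host, port, and category are made up.

```python
# Hedged sketch of "write log lines, with categories" against a local Scribe agent.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe  # generated Thrift bindings; module layout is an assumption

socket = TSocket.TSocket(host="127.0.0.1", port=1463)          # local Scribe daemon
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport, strictRead=False, strictWrite=False)
client = scribe.Client(protocol)

transport.open()
entry = scribe.LogEntry(category="frontend_clicks", message="user=42 clicked=home\n")
client.Log(messages=[entry])                                   # Scribe does the rest
transport.close()
```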
14. Scribe
‣ Runs locally; reliable in network outage
[diagram: FE nodes logging locally]
15. Scribe
‣ Runs locally; reliable in network outage
‣ Nodes only know downstream writer; hierarchical, scalable
[diagram: FE nodes feeding Agg (aggregator) nodes]
16. Scribe
‣ Runs locally; reliable in network outage
‣ Nodes only know downstream writer; hierarchical, scalable
‣ Pluggable outputs
[diagram: FE nodes feeding Agg nodes, which write to File and HDFS outputs]
17. Scribe at Twitter
‣ Solved our problem, opened new vistas
‣ Currently 30 different categories logged from multiple sources
‣ FE: Javascript, Ruby on Rails
‣ Middle tier: Ruby on Rails, Scala
‣ Backend: Scala, Java, C++
18. Scribe at Twitter
‣ We’ve contributed to it as we’ve used it
‣ Improved logging, monitoring, writing to HDFS, compression
‣ Continuing to work with FB on patches
‣ GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
19. Introduction
‣ How We Arrived at NoSQL: A Crash Course
‣ Collecting Data (Scribe)
‣ Storing and Analyzing Data (Hadoop)
‣ Rapid Learning over Big Data (Pig)
‣ And More: Cassandra, HBase, FlockDB
20. How do you store 7TB/day?
‣ Single machine?
‣ What’s HD write speed?
21. How do you store 7TB/day?
‣ Single machine?
‣ What’s HD write speed?
‣ ~80 MB/s
22. How do you store 7TB/day?
‣ Single machine?
‣ What’s HD write speed?
‣ ~80 MB/s
‣ 24.3 hours to write 7 TB
23. How do you store 7TB/day?
‣ Single machine?
‣ What’s HD write speed?
‣ ~80 MB/s
‣ 24.3 hours to write 7 TB
‣ Uh oh.
24. Where do I put 7TB/day?
‣ Need a cluster of machines
25. Where do I put 7TB/day?
‣ Need a cluster of machines
‣ ... which adds new layers of complexity
27. Hadoop
‣ Distributed file system
‣ Automatic replication, fault tolerance
‣ MapReduce-based parallel computation
‣ Key-value based computation interface allows for wide applicability
28. Hadoop
‣ Open source: top-level Apache project
‣ Scalable: Y! has a 4000 node cluster
‣ Powerful: sorted 1TB of random integers in 62 seconds
‣ Easy packaging: free Cloudera RPMs
29. MapReduce Workflow
[diagram: Inputs, Map tasks, Shuffle/Sort, Reduce tasks, Outputs]
‣ Challenge: how many tweets per user, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=user_id, value=1
‣ Shuffle: sort by user_id
‣ Reduce: for each user_id, sum
‣ Output: user_id, tweet count
‣ With 2x machines, runs 2x faster
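To make the slide concrete, here is a hedged Hadoop Streaming version of the same job in Python; Twitter's own jobs were written in Java and Pig, and the tab-separated input layout with user_id in the first column is an assumption.

```python
# count_tweets.py -- a hedged Hadoop Streaming version of "tweets per user".
import sys

def mapper():
    for line in sys.stdin:                        # Input: one tweet record per line
        user_id = line.rstrip("\n").split("\t")[0]
        print(f"{user_id}\t1")                    # Map: key=user_id, value=1

def reducer():
    current, total = None, 0
    for line in sys.stdin:                        # Shuffle has already sorted by user_id
        user_id, count = line.rstrip("\n").split("\t")
        if current is not None and user_id != current:
            print(f"{current}\t{total}")          # Output: user_id, tweet count
            total = 0
        current = user_id
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Submitted with something like (jar and paths are assumptions):
#   hadoop jar hadoop-streaming.jar -input /data/tweets -output /data/tweet_counts \
#     -mapper "python count_tweets.py map" -reducer "python count_tweets.py reduce" \
#     -file count_tweets.py
```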
36. Two Analysis Challenges
‣ 1. Compute friendships in Twitter’s social graph
‣ grep, awk? No way.
‣ Data is in MySQL... self join on an n-billion row table?
‣ n,000,000,000 x n,000,000,000 = ?
37. Two Analysis Challenges
‣ 1. Compute friendships in Twitter’s social graph
‣ grep, awk? No way.
‣ Data is in MySQL... self join on an n-billion row table?
‣ n,000,000,000 x n,000,000,000 = ?
‣ I don’t know either.
38. Two Analysis Challenges
‣ 2. Large-scale grouping and counting?
‣ select count(*) from users? Maybe...
‣ select count(*) from tweets? Uh...
‣ Imagine joining them...
‣ ... and grouping...
‣ ... and sorting...
41. Back to Hadoop
‣ Didn’t we have a cluster of machines?
‣ Hadoop makes it easy to distribute the calculation
‣ Purpose-built for parallel computation
‣ Just a slight mindset adjustment
42. Back to Hadoop
‣ Didn’t we have a cluster of machines?
‣ Hadoop makes it easy to distribute the calculation
‣ Purpose-built for parallel computation
‣ Just a slight mindset adjustment
‣ But a fun and valuable one!
43. Analysis at scale
‣ Now we’re rolling
‣ Count all tweets: 12 billion, 5 minutes
‣ Hit FlockDB in parallel to assemble social graph aggregates
‣ Run pagerank across users to calculate reputations
44. But...
‣ Analysis typically in Java
‣ “I need less Java in my life, not more.”
45. But...
‣ Analysis typically in Java
‣ “I need less Java in my life, not more.”
‣ Single-input, two-stage data flow is rigid
46. But...
‣ Analysis typically in Java
‣ “I need less Java in my life, not more.”
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
47. But...
‣ Analysis typically in Java
‣ “I need less Java in my life, not more.”
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins are lengthy, error-prone
48. But...
‣ Analysis typically in Java
‣ “I need less Java in my life, not more.”
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins are lengthy, error-prone
‣ n-stage jobs hard to manage
49. But...
‣ Analysis typically in Java
‣ “I need less Java in my life, not more.”
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins are lengthy, error-prone
‣ n-stage jobs hard to manage
‣ Exploration requires compilation!
50. Introduction
‣ How We Arrived at NoSQL: A Crash Course
‣ Collecting Data (Scribe)
‣ Storing and Analyzing Data (Hadoop)
‣ Rapid Learning over Big Data (Pig)
‣ And More: Cassandra, HBase, FlockDB
51. Pig
‣ High-level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ Easier than SQL?
52. Why Pig?
‣ Because I bet you can read the following script.
58. Pig Democratizes Large-scale Data Analysis
‣ The Pig version is:
‣ 5% of the code
‣ 5% of the time
‣ Within 25% of the execution time
61. One Thing I’ve Learned
‣ It’s easy to answer questions
‣ It’s hard to ask the right questions
‣ Value the system that promotes innovation, iteration
‣ More minds contributing = more value from your data
66. The Hadoop Ecosystem at Twitter
‣ Running Cloudera’s free distro, CDH2 and Hadoop 0.20.1
‣ Heavily modified Scribe writing LZO-compressed to HDFS
‣ LZO: fast, splittable compression, ideal for HDFS*
‣ Data either as flat files (logs) or in protocol buffer format (newer logs, structured data, etc)
‣ Libs for reading/writing/more open-sourced as elephant-bird**
‣ Some Java-based MapReduce, some HBase, Hadoop streaming
‣ Most analysis, and most interesting analyses, done in Pig
‣ * http://www.github.com/kevinweil/hadoop-lzo
‣ ** http://www.github.com/kevinweil/elephant-bird
76. Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
77. Counting Big Data
‣ Where are users querying from? The API, the front page, their profile page, etc?
85. Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing
86. Correlating Big Data
‣ What is the correlation between users with registered phones and users that tweet?
92. Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation
99. Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.
100. Research on Big Data
‣ How well can we detect bots and other non-human tweeters?
101. Introduction
‣ How We Arrived at NoSQL: A Crash Course
‣ Collecting Data (Scribe)
‣ Storing and Analyzing Data (Hadoop)
‣ Rapid Learning over Big Data (Pig)
‣ And More: Cassandra, HBase, FlockDB
102. HBase
‣ BigTable clone on top of HDFS
‣ Distributed, column-oriented, no datatypes
‣ Unlike the rest of HDFS, designed for low-latency
‣ Importantly, data is mutable
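To make "mutable, low-latency, no datatypes" concrete, here is a minimal sketch using the current HBase Java client API (a newer client than the one of this deck's era). The table name ("users"), column family ("profile"), and qualifiers are hypothetical, chosen only for illustration.

// Illustrative sketch of HBase's mutable, low-latency access path.
// Table name, column family, and qualifiers are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserProfileStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table users = connection.getTable(TableName.valueOf("users"))) {

      // Writes are mutations keyed by row; no fixed schema beyond column families.
      Put put = new Put(Bytes.toBytes("user:12345"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("screen_name"),
                    Bytes.toBytes("kevinweil"));
      users.put(put);

      // Reads are low-latency point lookups by row key, and values are just bytes.
      Result row = users.get(new Get(Bytes.toBytes("user:12345")));
      byte[] screenName = row.getValue(Bytes.toBytes("profile"),
                                       Bytes.toBytes("screen_name"));
      System.out.println(Bytes.toString(screenName));
    }
  }
}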
106. HBase at Twitter
‣ We began building real products based on Hadoop
‣ People search
‣ Old version: offline process on a single node
‣ New version: complex user calculations, hit extra services in real time, custom indexing
‣ Underlying data is mutable
‣ Mutable layer on top of HDFS --> HBase
110. People Search
‣ Import user data into HBase
‣ Periodic MapReduce job reading from HBase
‣ Hits FlockDB, multiple other internal services in mapper
‣ Custom partitioning
‣ Data sucked across to sharded, replicated, horizontally scalable, in-memory, low-latency Scala service
‣ Build a trie, do case folding/normalization, suggestions, etc
‣ See http://www.slideshare.net/al3x/building-distributed-systems-in-scala for more
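As a rough illustration of the trie-building step above, here is a small, self-contained Java sketch of a prefix trie with case folding and prefix suggestions. It shows only the general idea; it is not Twitter's Scala service.

// Toy prefix trie with case folding and prefix suggestions.
// A plain-Java illustration of the indexing idea, not Twitter's Scala service.
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

public class NameTrie {
  private static final class Node {
    final TreeMap<Character, Node> children = new TreeMap<>();
    String userId;                       // set when a complete name ends here
  }

  private final Node root = new Node();

  // Normalize then insert: lower-case fold so "Kevin" and "kevin" land on the same path.
  public void add(String name, String userId) {
    Node node = root;
    for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
      node = node.children.computeIfAbsent(c, k -> new Node());
    }
    node.userId = userId;
  }

  // Walk to the prefix, then collect up to `limit` completions beneath it.
  public List<String> suggest(String prefix, int limit) {
    Node node = root;
    for (char c : prefix.toLowerCase(Locale.ROOT).toCharArray()) {
      node = node.children.get(c);
      if (node == null) return List.of();
    }
    List<String> out = new ArrayList<>();
    collect(node, new StringBuilder(prefix.toLowerCase(Locale.ROOT)), out, limit);
    return out;
  }

  private void collect(Node node, StringBuilder path, List<String> out, int limit) {
    if (out.size() >= limit) return;
    if (node.userId != null) out.add(path.toString());
    node.children.forEach((c, child) -> {
      if (out.size() < limit) {
        path.append(c);
        collect(child, path, out, limit);
        path.deleteCharAt(path.length() - 1);
      }
    });
  }
}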
111. HBase
‣ More products now being built on top of it
‣ Flexible, easy to connect to MapReduce/Pig
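"Easy to connect to MapReduce" looks roughly like the hedged sketch below, which uses an HBase table as MapReduce input via TableMapReduceUtil. The table name ("users") is hypothetical, and a real job would emit keys and values for a reducer instead of only bumping a counter.

// Hedged sketch: scan an HBase table as the input of a (map-only) MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanUsersJob {

  // Each map() call receives one HBase row: its key plus a Result holding its cells.
  public static class UserMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context) {
      // Just count rows here; a real job would emit keys/values for a reducer or Pig.
      context.getCounter("users", "rows_scanned").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-users");
    job.setJarByClass(ScanUsersJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);          // larger scanner batches for batch-style reads
    scan.setCacheBlocks(false);    // don't pollute the block cache with a full scan

    TableMapReduceUtil.initTableMapperJob(
        "users", scan, UserMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);      // map-only for this sketch
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}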
117. HBase vs Cassandra
‣ “Their origins reveal their strengths and weaknesses”
‣ HBase built on top of batch-oriented system, not low latency
‣ Cassandra built from ground up for low latency
‣ HBase easy to connect to batch jobs as input and output
‣ Cassandra not so much (but we’re working on it)
‣ HBase has SPOF in the namenode
119. HBase vs Cassandra
‣ Your mileage may vary
‣ At Twitter: HBase for analytics, analysis, dataset generation
‣ Cassandra for online systems
‣ As with all NoSQL systems: strengths in different situations
120. FlockDB
‣ Realtime, distributed social graph store
‣ NOT optimized for data mining
‣ Note: the following slides largely come from @nk’s more complete talk at http://www.slideshare.net/nkallen/q-con-3770885
121. FlockDB
‣ Realtime, distributed social graph store
‣ NOT optimized for data mining
‣ Who follows who (nearly 8 orders of magnitude!)
‣ Intersection/set operations
‣ Cardinality
‣ Temporal index
122. Set operations?
‣ This tweet needs to be delivered to people who follow both @aplusk (4.7M followers) and @foursquare (53K followers)
123. Original solution
‣ MySQL table of (source_id, destination_id) rows
‣ Indices on source_id and destination_id
‣ Couldn’t handle write throughput
‣ Indices too large for RAM
124. Next Try
‣ MySQL still
‣ Denormalized
‣ Byte-packed
‣ Chunked
‣ Still temporally ordered
125. Next Try
‣ Problems
‣ O(n) deletes
‣ Data consistency challenges
‣ Inefficient intersections
‣ All of these manifested strongly for huge users like @aplusk or @lancearmstrong
126. FlockDB
‣ MySQL underneath still (like PNUTS from Y!)
‣ Partitioned by user_id, gizzard handles sharding/partitioning
‣ Edges stored in both directions, indexed by (src, dest)
‣ Denormalized counts stored
Forward table (source_id, destination_id, updated_at, x):
  20, 12, 20:50:14, x
  20, 13, 20:51:32
  20, 16
Backward table (destination_id, source_id, updated_at, x):
  12, 20, 20:50:14, x
  12, 32, 20:51:32
  12, 16
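Here is a small, self-contained Java sketch of the forward/backward idea above: store every edge in both directions and answer "who follows both A and B" by intersecting the two follower sets. It only illustrates the data model; it is not FlockDB or gizzard code.

// Toy illustration of FlockDB's forward/backward edge layout.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class EdgeStore {
  // forward:  source_id      -> destination_ids (who this user follows)
  // backward: destination_id -> source_ids      (who follows this user)
  private final Map<Long, TreeSet<Long>> forward = new HashMap<>();
  private final Map<Long, TreeSet<Long>> backward = new HashMap<>();

  public void addEdge(long sourceId, long destinationId) {
    forward.computeIfAbsent(sourceId, k -> new TreeSet<>()).add(destinationId);
    backward.computeIfAbsent(destinationId, k -> new TreeSet<>()).add(sourceId);
  }

  // The denormalized counts in FlockDB correspond to these set sizes.
  public int followerCount(long userId) {
    return backward.getOrDefault(userId, new TreeSet<>()).size();
  }

  // "Deliver this tweet to people who follow both a and b":
  // intersect the two backward (follower) sets, iterating the smaller one.
  public List<Long> followersOfBoth(long a, long b) {
    TreeSet<Long> fa = backward.getOrDefault(a, new TreeSet<>());
    TreeSet<Long> fb = backward.getOrDefault(b, new TreeSet<>());
    TreeSet<Long> small = fa.size() <= fb.size() ? fa : fb;
    TreeSet<Long> large = fa.size() <= fb.size() ? fb : fa;
    List<Long> result = new ArrayList<>();
    for (long id : small) {
      if (large.contains(id)) result.add(id);
    }
    return result;
  }
}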
130. FlockDB Timings
‣ Counts: 1ms
‣ Temporal Query: 2ms
‣ Writes: 1ms for journal, 16ms for durability
‣ Full walks: 100 edges/ms
131. FlockDB is Open Source
‣ We will maintain a community at
‣ http://www.github.com/twitter/flockdb
‣ http://www.github.com/twitter/gizzard
‣ See Nick Kallen’s QCon talk for more
‣ http://www.slideshare.net/nkallen/q-con-3770885
138. Cassandra
‣ Why Cassandra, for Twitter?
‣ Old/current: vertically, horizontally partitioned MySQL
‣ All kinds of caching layers, all application managed
‣ Alter table impossible, leads to bitfields, piggyback tables
‣ Hardware intensive, error prone, etc
‣ Not to mention, we hit MySQL write limits sometimes
‣ First goal: move all tweets to Cassandra
143. Cassandra
‣ Why Cassandra, for Twitter?
‣ Decentralized, fault-tolerant
‣ Flexible schema
‣ Elastic
‣ High write throughput
‣ First goal: move all tweets to Cassandra
148. Eventually Consistent?
‣ Twitter is already eventually consistent
‣ Your system may be even worse
‣ Ryan’s new term: “potential consistency”
‣ Do you have write-through caching?
‣ Do you ever have MySQL replication failures?
‣ There is no automatic consistency repair there, unlike Cassandra
‣ http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra
153. Rolling out Cassandra
‣ 1. Integrate Cassandra alongside MySQL
‣ 100% reads/writes to MySQL
‣ Dynamic switches for % dark reads/writes to Cassandra
‣ 2. Turn up traffic to Cassandra
‣ 3. Find something that’s broken, set switch to 0%
‣ 4. Fix it
‣ 5. GOTO 2
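The "dynamic switch for % dark reads/writes" step can be sketched in plain Java. The store interface and class names below are hypothetical, meant only to illustrate the rollout mechanics described above: MySQL stays authoritative while a dial sends shadow traffic to Cassandra so breakage can be found and the switch set back to 0%.

// Hypothetical sketch of a dark-read/dark-write switch for a staged rollout.
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class DarkLaunchTweetStore {
  /** Minimal store abstraction; both backends would implement it (hypothetical). */
  public interface TweetStore {
    void write(long tweetId, String body);
    String read(long tweetId);
  }

  private final TweetStore mysql;       // serves 100% of production traffic
  private final TweetStore cassandra;   // receives dark traffic only
  private final AtomicInteger darkPercent = new AtomicInteger(0);  // the "switch"

  public DarkLaunchTweetStore(TweetStore mysql, TweetStore cassandra) {
    this.mysql = mysql;
    this.cassandra = cassandra;
  }

  /** Steps 2 and 3 of the rollout: turn traffic up, or set back to 0% when broken. */
  public void setDarkPercent(int percent) {
    darkPercent.set(Math.max(0, Math.min(100, percent)));
  }

  public void write(long tweetId, String body) {
    mysql.write(tweetId, body);                        // source of truth
    if (sampled()) {
      try {
        cassandra.write(tweetId, body);                // shadow write, result ignored
      } catch (RuntimeException e) {
        // log and move on; dark traffic must never affect users
      }
    }
  }

  public String read(long tweetId) {
    String primary = mysql.read(tweetId);
    if (sampled()) {
      try {
        String shadow = cassandra.read(tweetId);       // compare, don't serve
        if (shadow == null || !shadow.equals(primary)) {
          // count a mismatch; this is what "find something that's broken" looks like
        }
      } catch (RuntimeException e) {
        // swallow: dark reads are best-effort
      }
    }
    return primary;                                    // users always get MySQL's answer
  }

  private boolean sampled() {
    return ThreadLocalRandom.current().nextInt(100) < darkPercent.get();
  }
}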
154. Cassandra for Realtime Analytics
‣ Starting a project around realtime analytics
‣ Cassandra as the backing store
‣ Using, developing, testing Digg’s atomic incr patches
‣ More soon.
155. That was a lot of slides
‣ Thanks for sticking with me.
156. Questions? Follow me at twitter.com/kevinweil