Jay Kreps on Project Voldemort: Scaling Simple Storage at LinkedIn. This presentation was given at QCon 2009 and is embedded on LinkedIn's blog: http://blog.linkedin.com/
RocksDB is an embedded key-value store that is optimized for fast storage. It uses a log-structured merge-tree to organize data on storage. Optimizing RocksDB for open-channel SSDs would allow controlling data placement to exploit flash parallelism and minimize overhead. This could be done by mapping RocksDB files like SSTables and logs to virtual blocks that map to physical flash blocks in a way that considers data access patterns and flash characteristics. This would improve performance by reducing writes and garbage collection.
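The log-structured merge idea mentioned above can be illustrated with a toy sketch (illustrative only, not RocksDB's actual code): writes go to an in-memory memtable, which is flushed as an immutable sorted run (an "SSTable") when full; reads check the memtable first, then the runs from newest to oldest.

```python
class ToyLSM:
    """Toy sketch of an LSM-tree write/read path."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable in-memory buffer
        self.sstables = []            # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # In a real LSM store this run would be persisted to disk;
        # here we just keep the sorted, immutable run in memory.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None
```

Because all writes are sequential appends of sorted runs, the write path avoids random I/O; the cost is that reads may consult several runs, which is what background compaction (and, on open-channel SSDs, controlled data placement) mitigates.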
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Kubernetes has evolved from Borg at Google to provide an open source platform for automating deployment, scaling, and management of containerized applications. The presentation discusses how to use Jenkins, Fabric8, and other tools to achieve continuous integration and delivery (CI/CD) with Kubernetes. It provides examples of configuring Jenkins and Fabric8 to build, test, and deploy container images to a Kubernetes cluster, illustrating an end-to-end CI/CD workflow on Kubernetes.
Everything you ever needed to know about Kafka on Kubernetes but were afraid ... (HostedbyConfluent)
Kubernetes has become the de facto standard for running cloud-native applications, and many users also turn to it to run stateful applications such as Apache Kafka. You can use different tools to deploy Kafka on Kubernetes - write your own YAML files, use Helm Charts, or go for one of the available operators. But all of these have one thing in common: you still need very good knowledge of Kubernetes to make sure your Kafka cluster works properly in all situations. This talk will cover different Kubernetes features such as resources, affinity, tolerations, pod disruption budgets, topology spread constraints and more. It will explain why they are important for Apache Kafka and how to use them. If you are interested in running Kafka on Kubernetes and do not know all of these features, this talk is for you.
A Kafka Client's Request: There and Back Again with Danica Fine (HostedbyConfluent)
Do you know how your data moves into and out of your Apache Kafka® instance? From the programmer's point of view, it's relatively simple. But under the hood, writing to and reading from Kafka is a complex process with a fascinating life cycle that's worth understanding.
When you call producer.send() or consumer.poll(), those calls are translated into low-level requests which are sent along to the brokers for processing. In this session, we'll dive into the world of Kafka producers and consumers to follow a request from an initial call to send() or poll(), all the way to disk, and back to the client via the broker's final response. Along the way, we'll explore a number of client and broker configurations that affect how these requests are handled and discuss the metrics that you can monitor to keep track of every stage of the request life cycle.
By the end of this session, you'll know the ins and outs of the read and write requests that your Kafka clients make, making your next debugging or performance analysis session a breeze.
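The send()/poll() buffering behavior described in that abstract can be sketched as a toy in-memory analogue (the class and method names are illustrative, not the real Kafka client API): producer.send() only appends to a client-side batch, a flush ships the batch to the broker's append-only partition log, and consumer.poll() reads forward from a tracked offset.

```python
class ToyBroker:
    def __init__(self):
        self.log = []                 # one partition's append-only log

    def append(self, batch):
        base = len(self.log)          # base offset assigned to this batch
        self.log.extend(batch)
        return base

class ToyProducer:
    def __init__(self, broker, batch_size=3):
        self.broker, self.batch, self.batch_size = broker, [], batch_size

    def send(self, record):
        self.batch.append(record)     # buffered client-side, not yet on the broker
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.broker.append(self.batch)
            self.batch = []

class ToyConsumer:
    def __init__(self, broker):
        self.broker, self.offset = broker, 0

    def poll(self, max_records=10):
        records = self.broker.log[self.offset:self.offset + max_records]
        self.offset += len(records)   # advance the consumed position
        return records
```

Even this toy version shows why batching-related client settings matter: a record handed to send() is not visible to consumers until the batch is actually shipped.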
Why is My Stream Processing Job Slow? with Xavier Leaute (Databricks)
The goal of most stream processing jobs is to process data and deliver insights to the business, fast. Unfortunately, sometimes our stream processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn't fast enough? In this talk, we'll dive into the challenges of analyzing the performance of real-time stream processing applications. We'll share troubleshooting suggestions and some of our favorite tools. So next time someone asks "why is this taking so long?", you'll know what to do.
When it comes to large-scale data processing and Machine Learning, Apache Spark is no doubt one of the top battle-tested frameworks out there for handling batched or streaming workloads. The ease of use, built-in Machine Learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams without leveraging a Spark cluster that is already pre-provisioned and provided as a managed service in the cloud. While this is a very attractive choice to get going, in the long run it can be a very expensive option if it's not well managed.
As an alternative to this approach, our team has been exploring running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us and, as a result, improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major cloud or on-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to run a production deployment of our Machine Learning workloads and pipelines on Kubernetes, which seamlessly allows us to port our implementation from a local Kubernetes setup on a laptop during development to either an on-prem or cloud Kubernetes environment.
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z... (HostedbyConfluent)
1. Zillow transitioned from using multiple messaging systems and data pipelines to using Kafka as their single streaming platform to unify their data infrastructure.
2. They took a bottom-up approach to gain trust from teams by publishing service level objectives, onboarding non-critical streams quickly, and meeting developers where they were with tools like Terraform.
3. An important lesson was to treat the platform as a product by providing documentation, libraries, and blog posts to make it easy for developers to use.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka's internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
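One of the fundamentals listed above, the mapping of a keyed record to a partition, can be sketched in a few lines (the real Kafka default partitioner uses a murmur2 hash; CRC32 stands in for it in this illustrative sketch). The key property is that the same key always lands in the same partition, which is what gives per-key ordering.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    crc32 is used here as a stand-in for Kafka's murmur2 hash;
    the modulo over the partition count is the essential step.
    """
    return zlib.crc32(key) % num_partitions
```

This is also why increasing the partition count of an existing topic reshuffles which partition a key maps to, breaking per-key ordering across the resize.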
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Optimizing Servers for High-Throughput and Low-Latency at Dropbox (ScyllaDB)
I'm going to discuss the efficiency/performance optimizations of different layers of the system. Starting from the lowest levels like hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we'll move to the Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we'll discuss library and application-level tunings, which are mostly applicable to HTTP servers in general and nginx/envoy specifically.
For each potential area of optimization I'll try to give some background on latency/throughput tradeoffs (if any), monitoring guidelines, and, finally, suggest tunings for different workloads.
Also, I'll cover more theoretical approaches to performance analysis and the newly developed tooling like `bpftrace` and new `perf` features.
Restoring Restoration's Reputation in Kafka Streams with Bruno Cadonna & Luca... (HostedbyConfluent)
The document discusses challenges with restoration in Kafka Streams applications and how the state updater improves restoration. It introduces the state updater, which runs restoration in parallel to processing to avoid blocking processing. This allows restoration checkpoints to be taken and avoids falling out of the consumer group if restoration is slow. Experiments show the state updater approach reduces restoration time and CPU usage compared to blocking restoration. The broader vision is for the state updater to support exactly-once semantics and multi-core scenarios.
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ... (Flink Forward)
Moving from Lambda and Kappa Architectures to Kappa+ at Uber
Kappa+ is a new approach developed at Uber to overcome the limitations of the Lambda and Kappa architectures. Whether your realtime infrastructure processes data at Uber scale (well over a trillion messages daily) or only a fraction of that, chances are you will need to reprocess old data at some point.
There can be many reasons for this. Perhaps a bug fix in the realtime code needs to be retroactively applied (aka backfill), or there is a need to train realtime machine learning models on the last few months of data before bringing the models online. Kafka's data retention is limited in practice and generally insufficient for such needs. So data must be processed from archives. Aside from addressing such situations, enabling efficient stream processing on archived as well as realtime data also broadens the applicability of stream processing.
This talk introduces the Kappa+ architecture, which enables the reuse of streaming realtime logic (stateful and stateless) to efficiently process any amount of historic data without requiring it to be in Kafka. We shall discuss the complexities involved in this kind of processing and the specific techniques employed in Kappa+ to tackle them.
Wangda Tan and Mayank Bansal presented on YARN Node Labels. Node labels allow grouping nodes with similar hardware or software profiles. This allows applications to request specific nodes and improves cluster partitioning and resource management. Key features include exclusive and non-exclusive node partitions, centralized and distributed configuration, and support in projects like Spark, MapReduce, Slider, and Ambari. Future work includes adding node constraints and supporting node labels in other schedulers like FairScheduler. Node labels help optimize cluster resource utilization and isolate workloads.
This document discusses scheduling work using Kanban rather than Scrum at a Finnish telecom company developing a self-service channel for corporate customers. It had previously used Scrum for 3 years and transitioned to Kanban for the past 9 months. Key aspects of Kanban discussed include not having estimations, sprints, fixed teams or domain areas. The document emphasizes collaboration, transparency and pessimism as keys to success with Kanban scheduling.
Serverless integration with Knative and Apache Camel on Kubernetes (Claus Ibsen)
This presentation will introduce Knative, an open source project that adds serverless capabilities on top of Kubernetes, and present Camel K, a lightweight platform that brings Apache Camel integrations into the serverless world. Camel K allows running Camel routes on top of any Kubernetes cluster, leveraging Knative serverless capabilities such as "scaling to zero".
We will demo how Camel K can connect cloud services or enterprise applications using its 250+ components and how it can intelligently route events within the Knative environment via enterprise integration patterns (EIP).
Target Group: Developers, architects and other technical people - a basic understanding of Kubernetes is an advantage
Apache Kafka is a distributed publish-subscribe messaging system that allows for high-throughput, persistent storage of messages. It provides decoupling of data pipelines by allowing producers to write messages to topics that can then be read from by multiple consumer applications in a scalable, fault-tolerant way. Key aspects of Kafka include topics for categorizing messages, partitions for scaling and parallelism, replication for redundancy, and producers and consumers for writing and reading messages.
This document discusses Zero touch on-premise storage infrastructure with OpenStack Cinder. It describes Viettel's IT infrastructure with mixed storage resources and the challenges of managing it. The solution presented uses OpenStack Cinder and additional tools to automate the management and provisioning of block storage for bare metal servers and OpenStack instances. This removes manual configuration steps and improves performance by pre-zoning storage connections. The goal is to make volume management simpler and allow adding new storage resources without additional configuration through the unified management solution.
This document discusses running MySQL on Kubernetes with Percona Kubernetes Operators. It provides an introduction to cloud native applications and Kubernetes. It then discusses the benefits and challenges of running MySQL on Kubernetes compared to database-as-a-service options. It introduces Percona Kubernetes Operators for MySQL, which help manage and configure MySQL deployments on Kubernetes. Finally, it discusses how to deploy MySQL with the Percona Kubernetes Operators, including prerequisites, connectivity, architecture, high availability, and monitoring.
The document discusses data loss and duplication in Apache Kafka. It begins with an overview of Kafka and how it works as a distributed commit log. It then covers sources of data loss, such as failures at the producer or cluster level. Data duplication can occur when producers retry messages or consumers restart from an unclean shutdown. The document provides configurations and techniques to minimize data loss and duplication, such as using producer acknowledgments and storing metadata to validate messages. It also discusses monitoring Kafka using JMX metrics to detect issues.
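The producer-side mitigations mentioned above (acknowledgments, safe retries) typically boil down to a handful of standard client settings. A sketch of such a configuration, expressed as the kind of config dict a Kafka producer client accepts (the client itself is not instantiated here, and exact defaults vary by client version):

```python
# Producer settings commonly used to minimize data loss and duplication.
# Key names follow the standard Kafka producer configuration.
durable_producer_config = {
    "acks": "all",                   # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,      # broker de-duplicates retried batches
    "retries": 2147483647,           # retry transient failures indefinitely
    "max.in.flight.requests.per.connection": 5,  # ordering-safe with idempotence
    "delivery.timeout.ms": 120000,   # upper bound on total delivery time
}
```

With idempotence enabled, a retried batch carries a producer ID and sequence number, so a broker that already appended it simply acknowledges again instead of writing a duplicate.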
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp (HostedbyConfluent)
During development and automated tests, it is common to create Kafka clusters from scratch and run workloads against those short-lived clusters. Starting a Kafka broker typically takes several seconds, and those seconds add up to precious time and resources.
How about spinning up a Kafka broker in less than 0.2 seconds, with less memory overhead? In this session, we will talk about kafka-native, which leverages GraalVM native image to compile the Kafka broker to a native executable using the Quarkus framework. After going through some implementation details, we will focus on how it can be used in a Docker container with Testcontainers to speed up integration testing of Kafka applications. We will finally discuss some current caveats and future opportunities of a native-compiled Kafka for cloud-native production clusters.
The Nextcloud Roadmap for Secure Team Collaboration (Univention GmbH)
Nextcloud is an open source content collaboration platform that provides file sync and sharing, file server capabilities, and groupware functionality as an alternative to proprietary services like Dropbox, Google Suite, and Office 365. It allows for decentralization of data storage across trusted servers using open cloud mesh federation with end-to-end encryption and optional key recovery. Nextcloud supports collaboration across iOS, Android, Mac, Windows, Linux, and through CalDAV/CardDAV integration with email clients and Outlook/Thunderbird plugins.
Introduction to JIRA & Agile Project Management (Dan Chuparkoff)
This document provides an introduction to using JIRA for agile project management. It discusses key concepts like defining tasks, estimating task effort in story points, and using JIRA's agile tools like boards and burndowns. Screenshots show how to create and manage tasks in JIRA's different modes for Scrum and Kanban workflows.
GitHub Actions enables you to create custom software development lifecycle workflows directly in your GitHub repository. These workflows are made up of individual tasks, so-called actions, that can be run automatically on certain events.
When I needed to present Scrum to executives and students, I started to look for existing presentations. Most of those I found were very good for detailed presentations or training, but what I was looking for was a presentation I could give in less than 15 minutes (or more if I wanted). Most of them also contained outdated content. For example, the latest changes in the Scrum framework were not present, and what had been removed was still there.
UPDATED VERSION: https://www.slideshare.net/pmengal/scrum-in-ten-slides-v20-2018
The document discusses Project Voldemort, a distributed key-value storage system developed at LinkedIn. It provides an overview of Voldemort's motivation and features, including high availability, horizontal scalability, and consistency guarantees. It also describes LinkedIn's use of Voldemort and Hadoop for applications like event logging, online lookups, and batch processing of large datasets.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban (Data Con LA)
Jay Kreps, open source visionary and co-founder of Confluent and several open source projects, will be visiting LA. I have asked him to come present at our group. He will present his vision and answer questions regarding Kafka and other projects.
Bio:
Jay is the co-founder and CEO at Confluent, a company built around realtime data streams and the open source messaging system Apache Kafka. He is the original author of several open source projects including Apache Kafka, Apache Samza, Voldemort, and Azkaban.
1) Uber uses Spark and Hadoop to process large amounts of transportation data in real-time and batch. This includes building pipelines to ingest trip data from databases into a data warehouse within 1-2 hours.
2) Paricon is Uber's first Spark application which infers schemas from raw JSON data, converts it to Parquet format for faster querying, and validates the results. It processes over 15TB of data daily.
3) Future work includes building a SQL-based ETL platform on Spark, open sourcing SQL-on-Hadoop, and creating a machine learning platform with Spark and a real-time analytics system called Apollo using Spark Streaming.
This document offers an introduction to NoSQL databases and Apache Cassandra. It explains that NoSQL databases differ from traditional relational databases in that they do not require fixed schemas, avoid JOIN operations, and scale horizontally. It then describes Cassandra's key characteristics, including its column-oriented data model and its high availability and fault tolerance through distributed data replication. Finally, it contrasts Cassandra's data model and queries with those of relational databases.
This document discusses strategies for scaling a Ruby on Rails application from a small startup to an enterprise-level application. It recommends starting with a small, highly productive team using Rails for rapid development. As the application and user base grow, it suggests adding caching, load balancing, and splitting the application across multiple servers. It also discusses personalizing pages with AJAX to improve caching. The goal is to scale the application efficiently while keeping development agile and in Rails.
We'd like to share with you the announcement related to our Q3 2011 earnings. We'll also be live sharing the earnings call from both LinkedIn's Company Page and our @linkedin account, starting at 2PM Pacific Time later today.
Please see Disclaimer and Safe Harbor statement in the blog post.
http://stocktwits.com/LinkedIn/message/5682633
LinkedIn's First Earnings Announcement Deck, Q2 2011 (LinkedIn)
We'd like to share with you the earnings deck related to our first earnings (Q2 2011).
In addition, we'll also be live sharing the earnings call through our @linkedin Twitter account, starting at 2PM Pacific Time later today.
The Student African American Brotherhood (SAAB) is looking for a volunteer marketing strategist to help create a strategic marketing plan that respects their budgetary limitations and identifies the most effective channels to reach key audiences. SAAB's mission is to increase graduation rates of African American and Latino males through creating a supportive peer community and improving college readiness. They currently have a consultant working on a report and an active strategic plan to provide guidance to the volunteer.
This document provides ratings and reviews for a book. It includes 3 reviews from readers that give the book high ratings of 4.5/5, 4.5/7, and 4.5/1. The reviews praise the book for its thoughtful exploration of important topics and engaging writing style.
Project Voldemort is a distributed key-value store inspired by Amazon Dynamo and Memcached. It was originally developed at LinkedIn to handle high volumes of data and queries in a scalable way across multiple servers. Voldemort uses consistent hashing to partition and replicate data, vector clocks to resolve concurrent write conflicts, and a layered architecture to provide flexibility. It prioritizes performance, availability, and simplicity over more complex consistency guarantees. LinkedIn uses multiple Voldemort clusters to power various real-time services and applications.
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
Large-scale project development (scaling LAMP) (Alexey Rybak)
This 8-hour tutorial was given at various conferences including the Percona conference (London), DevConf (Moscow), and Highload++ (Moscow).
ABSTRACT
During this tutorial we will cover various topics related to high scalability for the LAMP stack. This workshop is divided into three sections.
The first section covers basic principles of shared-nothing architectures and horizontal scaling for the app/cache/database tiers.
Section two of this tutorial is devoted to MySQL sharding techniques, queues and a few performance-related tips and tricks.
In section three we will cover a practical approach to measuring site performance and quality, providing a "lean" support philosophy, and connecting business and technology metrics.
In addition we will cover the very useful Pinba real-time statistics server, its features, and various use cases. All of the sections are based on real-world examples built at Badoo, one of the biggest dating sites on the Internet.
Fixing Twitter: Improving the Performance and Scalability of the World's Most ... (smallerror)
Twitter's operations team manages software performance, availability, capacity planning, and configuration management for Twitter. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, and optimizing databases to reduce replication delay and locks. The team also created several open source projects like CacheMoney for caching and Kestrel for asynchronous messaging.
Fixing Twitter: Improving the Performance and Scalability of the World's Most ... (xlight)
The "Fixing Twitter and Finding Your Own Fail Whale" document discusses Twitter operations. The operations team manages software performance, availability, capacity planning, and configuration management, using metrics, logs, and data-driven analysis to find weak points and take corrective action. They use managed services for infrastructure to focus on computer science problems. The document outlines Twitter's rapid growth and the challenges of maintaining performance as traffic increases. It provides recommendations around caching, databases, asynchronous processing, and other techniques Twitter uses to optimize performance under heavy load.
Twitter's operations team manages software performance, availability, capacity planning, and configuration management. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, optimizing databases, and instrumenting all systems. Their goal is to process requests asynchronously when possible and avoid overloading relational databases.
The "Fixing Twitter and Finding Your Own Fail Whale" document discusses Twitter operations. The Twitter operations team focuses on software performance, availability, capacity planning, and configuration management using metrics, logs, and science. They use a dedicated managed services team and run their own servers instead of cloud services. The document outlines Twitter's rapid growth and challenges in maintaining performance. It discusses strategies for monitoring, analyzing metrics to find weak points, deploying changes, and improving processes through configuration management and peer reviews.
This document provides an introduction to big data and NoSQL databases. It begins with an introduction of the presenter. It then discusses how the era of big data came to be due to limitations of traditional relational databases and scaling approaches. The document introduces different NoSQL data models including document, key-value, graph and column-oriented databases. It provides examples of NoSQL databases that use each data model. The document discusses how NoSQL databases are better suited than relational databases for big data problems and provides a real-world example of Twitter's use of FlockDB. It concludes by discussing approaches for working with big data using MapReduce and provides examples of using MongoDB and Azure for big data.
This document provides a summary of a presentation on Big Data and NoSQL databases. It introduces the presenters, Melissa Demsak and Don Demsak, and their backgrounds. It then discusses how data storage needs have changed with the rise of Big Data, including the problems created by large volumes of data. The presentation contrasts traditional relational database implementations with NoSQL data stores, identifying five categories of NoSQL data models: document, key-value, graph, and column family. It provides examples of databases that fall under each category. The presentation concludes with a comparison of real-world scenarios and which data storage solutions might be best suited to each scenario.
This document provides an overview of NoSQL databases and summarizes key information about several NoSQL databases, including HBase, Redis, Cassandra, MongoDB, and Memcached. It discusses concepts like horizontal scalability, the CAP theorem, eventual consistency, and data models used by different NoSQL databases like key-value, document, columnar, and graph structures.
The document discusses the evolution of database technologies from relational databases to NoSQL databases. It argues that NoSQL databases better fit the needs of modern software development by supporting iterative development, fast feedback, and frequent releases. While early NoSQL technologies faced criticisms regarding lack of features like transactions and integrity checks, they proved useful for scaling applications to large data volumes. The document also advocates for an approach that balances flexibility with complexity by using schemaless stores at the front-end and more rigid structures at the back-end.
Slides from the second meeting of the Toronto High Scalability Meetup @ http://www.meetup.com/toronto-high-scalability/
-Basics of High Scalability and High Availability
-Using a CDN to Achieve 99% Offload
-Caching at the Code Layer
SpringPeople - Introduction to Cloud Computing (SpringPeople)
Cloud computing is no longer a passing fad; it is for real and is perhaps the most talked-about subject. Various players in the cloud ecosystem have provided definitions closely aligned to their sweet spots, be it infrastructure, platforms, or applications.
This presentation will expose participants to a variety of cloud computing techniques, architectures, and technology options, and will cover cloud fundamentals in a holistic manner spanning dimensions such as cost, operations, and technology.
A visual story of how LinkedIn is transforming how companies hire, market and sell. Learn more below -
Talent Solutions: http://business.linkedin.com/talent-solutions
Marketing Solutions: http://marketing.linkedin.com/
Sales Solutions: http://sales.linkedin.com/
Download the LinkedIn for Business Playbook: http://lnkd.in/LinkedInForBusinessPlaybook
Designed by Brett Wallace of Why is LinkedIn So Cool? fame: http://www.slideshare.net/brettalexwallace/why-is-linkedin-so-cool-16101604
Discover your career, build your brand and find a job you love. Learn more at https://blog.linkedin.com/2017/february/23/launching-your-career-getting-started-on-your-internship-search-linkedin.
The Top Skills That Can Get You Hired in 2017 (LinkedIn)
We analyzed all the recruiting activity on LinkedIn this year and identified the Top Skills employers seek. Starting Oct 24, learn these skills and much more for free during the Week of Learning.
#AlwaysBeLearning https://learning.linkedin.com/week-of-learning
Accelerating LinkedIn's Vision Through Innovation (LinkedIn)
See what's next for LinkedIn - from a complete redesign of the desktop experience, to smarter messaging and content discovery features, to the future of professional learning. Read more: https://blog.linkedin.com/2016/09/22/accelerating-LinkedIn-vision
40% of professionals admit they find it hard to describe what they do for a living. We're here to help. Find out how to tell your #workstory: http://lnkd.in/LIworkstory
Presentation given by CEO Jeff Weiner, and CFO Steve Sordello, at LinkedIn Q1 2016 Earnings Call. For more information, check out http://investors.linkedin.com/
The LinkedIn Job Search Guide is your tactical toolkit for getting a job you love.
The LinkedIn Job Search Guide can be read one page at a time, one chapter at a time, or in entirety. The recommended tactics and tools were developed with U.S. job seekers in mind, however many of the strategies may be applied internationally.
Good luck with your job search and we hope that the following guide will put you in the driver's seat as you develop your career.
Presentation given by CEO Jeff Weiner, and CFO Steve Sordello, at LinkedIn Q4 2015 Earnings Call. For more information, check out http://investors.linkedin.com/.
The document discusses avoiding vague buzzwords in LinkedIn profiles. It provides a list of the top 10 most common buzzwords of 2016 such as "leadership", "passionate", and "successful". The document recommends standing out by showing experiences and results through examples rather than just stating buzzwords. It also suggests uploading projects, sharing views in posts and groups, and writing recommendations to demonstrate qualities like creativity and expertise in a more meaningful way.
LinkedIn Bring In Your Parents Day 2015 - Your Parents' Best Career Advice (LinkedIn)
The 3rd Annual LinkedIn Bring In Your Parents Day took place on November 5, 2015. As part of the celebration, we asked people to share the best pieces of career advice their parents ever gave them. Here's what they had to say...
Presentation given by CEO Jeff Weiner, and CFO Steve Sordello, at LinkedIn Q3 2015 Earnings Call. For more information, check out http://investors.linkedin.com/.
1. Over 1.9 million LinkedIn members are located in the Greater Toronto Area, with 214,000 (11%) having technology skills.
2. Toronto has a higher proportion of members with technology skills (10.8%) compared to similar cities globally.
3. Nearly 100,000 companies in the Greater Toronto Area are represented on LinkedIn, with over 10,000 currently employing members with technology skills, especially in early career roles.
Top Industries for Freelancers on LinkedIn [Infographic] (LinkedIn)
According to a LinkedIn survey, the top industries for freelancers are arts and design (46%), media and communication (34%), and business consulting (6%). Within arts and design, the most common roles for freelancers are graphic designer (29%), photographer (20%), and artist (19%). The information technology and program/project management industries each make up only 1% of freelancers.
LinkedIn Quiz: Which Parent Are You When It Comes to Helping Guide Your Child... (LinkedIn)
Lighthouse, Helicopter or Free-range? Take this quiz to find out what your parenting style is when your children have flown the nest and started their career.
Join LinkedIn's Bring In Your Parents Day on November 5 -- learn more at biyp.linkedin.com or join the social conversation using #BIYP.
LinkedIn Connect to Opportunity™ -- Stories of Discovery (LinkedIn)
Every minute of every day, opportunity is within reach on LinkedIn. See how four members use LinkedIn to unlock opportunity and how it can work for you.
LinkedIn Connect to Opportunity. Learn more at https://lnkd.in/b5Xr3nN
Presentation given by CEO Jeff Weiner, and CFO Steve Sordello, at LinkedIn Q2 2015 Earnings Call. For more information, check out http://investors.linkedin.com/.
Quantum Computing Quick Research Guide (Arthur Morgan)
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
AI and Data Privacy in 2025: Global Trends (InData Labs)
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in today's world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx (Justin Reock)
Building 10x Organizations with Modern Productivity Metrics
10x developers may be a myth, but 10x organizations are very real, as proven by the influential study performed in the 1980s, "The Coding War Games."
Right now, here in early 2025, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience. It seems that with every new method we invent for the delivery of products, whether physical or virtual, we reinvent productivity philosophies to go alongside them.
But which of these approaches actually work? DORA? SPACE? DevEx? What should we invest in and create urgency behind today, so that we don't find ourselves having the same discussion again in a decade?
Generative Artificial Intelligence (GenAI) in Business (Dr. Tathagat Varma)
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business: benefits, opportunities, and limitations. I also discussed how my research on the Theory of Cognitive Chasms helps address some of these issues.
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx (shyamraj55)
We're bringing the TDX energy to our community with 2 power-packed sessions:
Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API (UiPathCommunity)
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
Procurement Insights Cost To Value Guide.pptx (Jon Hansen)
Procurement Insights, with its integrated Historic Procurement Industry Archives, serves as a powerful complement, not a competitor, to other procurement industry firms. It fills critical gaps in depth, agility, and contextual insight that most traditional analyst and association models overlook.
Learn more about this value-driven proprietary service offering here.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025 (BookNet Canada)
Book industry standards are evolving rapidly. In the first part of this session, we'll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet's resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what's next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
Increasing Retail Store Efficiency: How Can Planograms Save Time and Money.pptx (Anoop Ashok)
In today's fast-paced retail environment, efficiency is key. Every minute counts, and every penny matters. One tool that can significantly boost your store's efficiency is a well-executed planogram. These visual merchandising blueprints not only enhance store layouts but also save time and money in the process.
HCL Nomad Web: Best Practices and Administration of Multiuser Environments (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is celebrated as the next generation of the HCL Notes client and offers numerous advantages, such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed "automatically" in the background, which significantly reduces administrative overhead compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Finding and interpreting log files
- Accessing the data folder in the browser's cache (using OPFS)
- Understanding the differences between single-user and multi-user scenarios
- Using the client clocking feature
Semantic Cultivators: The Critical Future Role to Enable AI (artmondano)
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value.
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark!
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien... (Noah Loul)
Artificial intelligence is changing how businesses operate. Companies are using AI agents to automate tasks, reduce time spent on repetitive work, and focus more on high-value activities. Noah Loul, an AI strategist and entrepreneur, has helped dozens of companies streamline their operations using smart automation. He believes AI agents aren't just tools; they're workers that take on repeatable tasks so your human team can focus on what matters. If you want to reduce time waste and increase output, AI agents are the next move.
Linux Support for SMARC: How Toradex Empowers Embedded Developers (Toradex)
Toradex brings robust Linux support to SMARC (Smart Mobility Architecture), ensuring high performance and long-term reliability for embedded applications. Here's how:
• Optimized Torizon OS & Yocto Support: Toradex provides Torizon OS, a Debian-based easy-to-use platform, and Yocto BSPs for customized Linux images on SMARC modules.
• Seamless Integration with i.MX 8M Plus and i.MX 95: Toradex SMARC solutions leverage NXP's i.MX 8M Plus and i.MX 95 SoCs, delivering power efficiency and AI-ready performance.
• Secure and Reliable: With Secure Boot, over-the-air (OTA) updates, and LTS kernel support, Toradex ensures industrial-grade security and longevity.
• Containerized Workflows for AI & IoT: Support for Docker, ROS, and real-time Linux enables scalable AI, ML, and IoT applications.
• Strong Ecosystem & Developer Support: Toradex offers comprehensive documentation, developer tools, and dedicated support, accelerating time-to-market.
With Toradex's Linux support for SMARC, developers get a scalable, secure, and high-performance solution for industrial, medical, and AI-driven applications.
Do you have a specific project or application in mind where you're considering SMARC? We can help with a free compatibility check and quick time-to-market.
For more information: https://www.toradex.com/computer-on-modules/smarc-arm-family
4. The Team
• LinkedIn's Search, Network, and Analytics Team
• Project Voldemort
• Search Infrastructure: Zoie, Bobo, etc.
• LinkedIn's Hadoop system
• Recommendation Engine
• Data intensive features
• People you may know
• Who's viewed my profile
• User history service
7. Why did this happen?
• The internet centralizes computation
• Specialized systems are efficient (10-100x)
• Search: inverted index
• Offline: Hadoop, Teradata, Oracle DWH
• Memcached
• In-memory systems (social graph)
• Specialized systems are scalable
• New data and problems
• Graphs, sequences, and text
8. Services and Scale Break Relational DBs
• No joins
• Lots of denormalization
• ORM is less helpful
• No constraints, triggers, etc.
• Caching => key/value model
• Latency is key
9. Two Cheers For Relational Databases
• The relational model is a triumph of computer science:
• General
• Concise
• Well understood
• But then again:
• SQL is a pain
• Hard to build reusable data structures
• Don't hide the memory hierarchy!
Good: Filesystem API
Bad: SQL, some RPCs
10. Other Considerations
• Who is responsible for performance (engineers? DBAs? site operations?)
• Can you do capacity planning?
• Can you simulate the problem early in the design phase?
• How do you do upgrades?
• Can you mock your database?
11. Some motivating factors
• This is a latency-oriented system
• Data set is large and persistent
• Cannot be all in memory
• Performance considerations
• Partition data
• Delay writes
• Eliminate network hops
• 80% of caching tiers are fixing problems that shouldn't exist
• Need control over system availability and data durability
• Must replicate data on multiple machines
• Cost of scalability can't be too high
12. Inspired By Amazon Dynamo & Memcached
• Amazon's Dynamo storage system
– Works across data centers
– Eventual consistency
– Commodity hardware
– Not too hard to build
• Memcached
– Actually works
– Really fast
– Really simple
• Decisions:
– Multiple reads/writes
– Consistent hashing for data distribution
– Key-value model
– Data versioning
13. Priorities
1.⯠Performance and scalability
2.⯠Actually works
3.⯠Community
4.⯠Data consistency
5.⯠Flexible & Extensible
6.⯠Everything else
14. Why Is This Hard?
• Failures in a distributed system are much more complicated
• A can talk to B does not imply B can talk to A
• A can talk to B does not imply C can talk to B
• Getting a consistent view of the cluster is as hard as getting a consistent view of the data
• Nodes will fail and come back to life with stale data
• I/O has high request latency variance
• I/O on commodity disks is even worse
• Intermittent failures are common
• The user must be isolated from these problems
• There are fundamental trade-offs between availability and consistency
16. Core Concepts - I
• ACID
– Great for a single centralized server
• CAP Theorem
– Consistency (strict), Availability, Partition tolerance
– Impossible to achieve all three at the same time in a distributed platform
– Can choose 2 out of 3
– Dynamo chooses high availability and partition tolerance, sacrificing strict consistency for eventual consistency
• Consistency Models
– Strict consistency: 2-phase commits; PAXOS, a distributed algorithm to ensure quorum for consistency
– Eventual consistency: different nodes can have different views of a value; in a steady state the system will return the last written value, but it can offer much stronger guarantees
Proprietary & Confidential 19/11/09 16
17. Core Concepts - II
• Consistent Hashing
• Key space is partitioned
– Many small partitions
• Partitions never change
– Partition ownership can change
• Replication
– Each partition is stored by "N" nodes
• Node Failures
– Transient (short term)
– Long term (needs faster bootstrapping)
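The partitioning scheme on this slide can be sketched in a few lines. This is an illustrative toy only (the node names, partition count, and MD5-based hash are my assumptions, not Voldemort's actual implementation): keys hash to fixed partitions that never change, and only the partition-to-node ownership map is updated when the cluster is rebalanced.

```python
# Toy sketch of fixed-partition consistent hashing (assumptions, not
# Voldemort's code): the key space is hashed into many small partitions;
# each partition is owned by N nodes; rebalancing reassigns ownership
# without ever re-hashing keys.
import hashlib

NUM_PARTITIONS = 16                      # "many small partitions"

def partition_of(key: str) -> int:
    """Deterministically map a key to one of the fixed partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# partition -> list of N owning nodes; only this map changes on rebalance
ownership = {p: [f"node{(p + i) % 4}" for i in range(2)]   # N = 2 replicas
             for p in range(NUM_PARTITIONS)}

def replicas_for(key: str):
    """Return the nodes responsible for a key's partition."""
    return ownership[partition_of(key)]

# Moving a partition to new nodes touches the map, not the hash:
ownership[3] = ["node4", "node5"]
```

Because the key-to-partition mapping is fixed, rebalancing only moves whole partitions between nodes, which is the property the slide highlights.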
18. Core Concept - III
• N - the replication factor
• R - the number of blocking reads
• W - the number of blocking writes
• If R + W > N
• then we have a quorum-like algorithm
• Guarantees that we will read the latest writes OR fail
• R, W, N can be tuned for different use cases
• W = 1: highly available writes
• R = 1: read-intensive workloads
• Knobs to tune performance, durability, and availability
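The quorum condition is simple arithmetic, sketched below (helper names are illustrative): when R + W > N, any set of R replicas read must overlap any set of W replicas written, so at least one replica in every read quorum holds the latest write.

```python
def is_quorum_consistent(n: int, r: int, w: int) -> bool:
    """True if every read quorum overlaps every write quorum."""
    return r + w > n

def min_overlap(n: int, r: int, w: int) -> int:
    """Minimum number of replicas guaranteed to be in both quorums."""
    return r + w - n
```

For example, N=3, R=2, W=2 guarantees an overlap of one up-to-date replica per read, while N=3, R=1, W=1 is fast and highly available but may return stale data.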
19. Core Concepts - IV
• A vector clock [Lamport] provides a way to order events in a distributed system.
• A vector clock is a tuple {t1, t2, ..., tn} of counters.
• Each value update has a master node
• When data is written with master node i, it increments ti.
• All the replicas will receive the same version
• Helps resolve consistency between writes on multiple replicas
• If you get network partitions
• You can have a case where two vector clocks are not comparable.
• In this case Voldemort returns both values to clients for conflict resolution
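A minimal sketch of the mechanics, assuming dict-based clocks (node-id to counter) rather than Voldemort's actual representation: a write through master node i bumps entry i, and two clocks are ordered only if one dominates the other entry-by-entry; otherwise they are concurrent and both values go back to the client.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with the master node's counter bumped."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # not comparable: hand both values to the client
```

Two writes routed through different masters during a partition (e.g. {A: 2} vs. {A: 1, B: 1}) come out 'concurrent', which is exactly the case where conflict resolution is pushed to the application.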
22. Client API
• Data is organized into "stores", i.e. tables
• Key-value only
• But values can be arbitrarily rich or complex
• Maps, lists, nested combinations …
• Four operations
• PUT (K, V)
• GET (K)
• MULTI-GET (Keys)
• DELETE (K, Version)
• No range scans
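The four operations can be illustrated against an in-memory stand-in for a store. This is a sketch, not Voldemort's actual client API (the real client is networked and Java-based; the class and method names here are assumed); note how DELETE takes a version so it only removes the version the caller actually saw.

```python
class Store:
    """Toy in-memory stand-in for one Voldemort 'store' (table)."""

    def __init__(self):
        self._data = {}      # key -> (version, value)
        self._version = 0    # simplistic global version counter

    def put(self, key, value):
        self._version += 1
        self._data[key] = (self._version, value)

    def get(self, key):
        return self._data.get(key)          # (version, value) or None

    def multi_get(self, keys):
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key, version):
        # Versioned delete: a concurrent newer write survives
        stored = self._data.get(key)
        if stored and stored[0] == version:
            del self._data[key]
```

Values can be arbitrarily rich: `store.put("member:1", {"name": "Jay", "connections": [2, 3]})` is a single key-value pair, and without range scans the application models its own access paths in the keys.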
23. Versioning & Conflict Resolution
• Eventual consistency allows multiple versions of a value
• Need a way to understand which value is latest
• Need a way to say values are not comparable
• Solutions
• Timestamps
• Vector clocks
• Provide a partial ordering (some versions are not comparable)
• No locking or blocking necessary
24. Serialization
• Really important
• A few considerations
• Schema-free?
• Backward/forward compatible
• Real-life data structures
• Bytes <=> objects <=> strings?
• Size (no XML)
• Many ways to do it -- we allow anything
• Compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization
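One concrete instance of the "compressed JSON" option, sketched with the standard library (function names are illustrative, not Voldemort's serializer): objects become JSON text, then gzipped bytes on the wire or disk, and the round trip is schema-free and far more compact than XML.

```python
import gzip
import json

def serialize(value) -> bytes:
    """Object -> JSON text -> gzipped bytes."""
    return gzip.compress(json.dumps(value).encode("utf-8"))

def deserialize(blob: bytes):
    """Gzipped bytes -> JSON text -> object."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

The trade-off versus Protocol Buffers or Thrift is schema freedom and simplicity against size and explicit backward/forward-compatibility guarantees.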
25. Routing
• The routing layer hides a lot of complexity
• Hashing scheme
• Replication (N, R, W)
• Failures
• Read repair (online repair mechanism)
• Hinted handoff (long-term recovery mechanism)
• Easy to add domain-specific strategies
• E.g. only do synchronous operations on nodes in the local data center
• Client side / server side / hybrid
27. Routing With Failures
• Failure detection
• Requirements
• Needs to be very, very fast
• View of server state may be inconsistent
• A can talk to B but C cannot
• A can talk to C, B can talk to A but not to C
• Currently done by the routing layer (request timeouts)
• Periodically retries failed nodes
• All requests must have hard SLAs
• Other possible solutions
• Central server
• Gossip protocol
• Need to look more into this
28. Repair Mechanism
▪ Read Repair
  – Online repair mechanism
▪ Routing client receives values from multiple nodes
▪ Notify a node if you see an old value
▪ Only works for keys which are read after failures
▪ Hinted Handoff
  – If a write fails, write it to any random node
  – Just mark the write as a special write
  – Each node periodically tries to get rid of all special entries
▪ Bootstrapping mechanism (we don't have it yet)
  – If a node was down for a long time
▪ Hinted handoff can generate a ton of traffic
▪ Need a better way to bootstrap and clear hinted handoff tables
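The hinted handoff flow above can be sketched as follows (a toy model with assumed data structures, not Voldemort's implementation): a failed write is parked on a random live node tagged with its intended destination, and a periodic job re-delivers those special entries once the destination recovers.

```python
import random

def write_with_handoff(key, value, dest, live_nodes, hints):
    """Try the destination; on failure, park a hinted write elsewhere."""
    if dest in live_nodes:
        live_nodes[dest][key] = value
        return "delivered"
    fallback = random.choice(list(live_nodes))   # any random live node
    hints.setdefault(fallback, []).append((dest, key, value))
    return "hinted"

def flush_hints(node, live_nodes, hints):
    """Periodic job: re-deliver this node's special entries if possible."""
    remaining = []
    for dest, key, value in hints.get(node, []):
        if dest in live_nodes:
            live_nodes[dest][key] = value
        else:
            remaining.append((dest, key, value))
    hints[node] = remaining
```

The slide's caveat shows up directly in this model: after a long outage the `hints` lists grow without bound, which is why a real bootstrapping mechanism is needed.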
29. Network Layer
• The network is the major bottleneck in many uses
• Client performance turns out to be harder than server (the client must wait!)
• Lots of issues with socket buffer sizes / socket pools
• The server is also a client
• Two implementations
• HTTP + servlet container
• Simple socket protocol + custom server
• The HTTP server is great, but the HTTP client is 5-10x slower
• The socket protocol is what we use in production
• Recently added a non-blocking version of the server
30. Persistence
• Single-machine key-value storage is a commodity
• Plugins are better than tying yourself to a single strategy
• Different use cases
• Optimize reads
• Optimize writes
• Large vs. small values
• SSDs may completely change this layer
• Better filesystems may completely change this layer
• A couple of different options
• BDB, MySQL, and mmap'd file implementations
• Berkeley DB is the most popular
• In-memory plugin for testing
• B-trees are still the best all-purpose structure
• No flush on write is a huge, huge win
32. LinkedIn problems we wanted to solve
• Application examples
• People You May Know
• Item-item recommendations
• Member and company derived data
• User's network statistics
• Who Viewed My Profile?
• Abuse detection
• User's history service
• Relevance data
• Crawler detection
• Many others have come up since
• Some data is batch computed and served as read-only
• Some data has a very high write load
• Latency is key
33. Key-Value Design Example
▪ How to build a fast, scalable comment system?
▪ One approach
  – (post_id, page) => [comment_id_1, comment_id_2, …]
  – comment_id => comment_body
▪ GET comment_ids by post and page
▪ MULTI-GET comment bodies
▪ Threaded, paginated comments left as an exercise :)
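The two-store layout above can be sketched with dicts standing in for Voldemort stores (names and helper functions here are illustrative): one store maps (post_id, page) to an ordered list of comment ids, the other maps each comment id to its body, so a page render is one GET followed by one MULTI-GET.

```python
comment_ids_store = {}   # (post_id, page) -> [comment_id, ...]
comment_body_store = {}  # comment_id -> body

def add_comment(post_id, page, comment_id, body):
    comment_body_store[comment_id] = body
    comment_ids_store.setdefault((post_id, page), []).append(comment_id)

def get_comments(post_id, page):
    # GET the id list for the page, then MULTI-GET the bodies
    ids = comment_ids_store.get((post_id, page), [])
    return [comment_body_store[c] for c in ids]
```

Because there are no range scans, the page number is baked into the key; fetching page 2 of a post never touches the other pages' data.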
34. Hadoop and Voldemort sitting in a tree…
▪ Hadoop can generate a lot of data
▪ Bottleneck 1: getting the data out of Hadoop
▪ Bottleneck 2: transfer to the DB
▪ Bottleneck 3: index building
▪ We had a critical process where this took a DBA a week to run!
▪ Index building is a batch operation
36. Read-only storage engine
▪ Throughput vs. latency
▪ Index building done in Hadoop
▪ Fully parallel transfer
▪ Very efficient on-disk structure
▪ Heavy reliance on the OS page cache
▪ Rollback!
37. Voldemort At LinkedIn
• 4 clusters, 4 teams
• Wide variety of data sizes, clients, and needs
• My team:
• 12 machines
• Nice servers
• 500M operations/day
• ~4 billion events in 10 stores (one per event type)
• Peak load > 10k operations/second
• Other teams: news article data, email-related data, UI settings
39. Some performance numbers
• Production stats
• Median: 0.1 ms
• 99.9th percentile GET: 3 ms
• Single-node max throughput (1 client node, 1 server node):
• 19,384 reads/sec
• 16,559 writes/sec
• These numbers are for mostly in-memory problems
40. Glaring Weaknesses
• Not nearly enough documentation
• No online cluster expansion (without reduced guarantees)
• Need more clients in other languages (Java, Python, Ruby, and C++ currently)
• Better tools for cluster-wide control and monitoring
41. State of the Project
• Active mailing list
• 4-5 regular committers outside LinkedIn
• Lots of contributors
• Equal contribution from inside and outside LinkedIn
• Project basics
• IRC
• Some documentation
• Lots more to do
• > 300 unit tests that run on every checkin (and pass)
• Pretty clean code
• Moved to GitHub (by popular demand)
• Production usage at a half-dozen companies
• Not just a LinkedIn project anymore
• But LinkedIn is really committed to it (and we are hiring to work on it)
42. Some new & upcoming things
• New
• Python and Ruby clients
• Non-blocking socket server
• Alpha round of online cluster expansion
• Read-only store and Hadoop integration
• Improved monitoring stats
• Distributed testing infrastructure
• Compression
• Future
• Publish/subscribe model to track changes
• Improved failure detection
44. Testing and releases
▪ Testing "in the cloud"
▪ Distributed systems have complex failure scenarios
▪ A storage system, above all, must be stable
▪ Automated testing allows rapid iteration while maintaining confidence in the system's correctness and stability
▪ EC2-based testing framework
▪ Tests are invoked programmatically
▪ Contributed by Kirk True
▪ Adaptable to other cloud hosting providers
▪ Regular releases for new features and bug fixes
▪ Trunk stays stable
45. Shameless promotion
• Check it out: project-voldemort.com
• We love getting patches.
• We kind of love getting bug reports.
• LinkedIn is hiring, so you can work on this full time.
• Email me if interested
• [email protected]