Apache Flume is a system for collecting streaming data from various sources and transporting it to destinations such as HDFS or HBase. It has a configurable architecture that allows data to flow from clients to sinks via channels. Sources produce events that are sent to channels, which then deliver the events to sinks. Flume agents run sources and sinks in a configurable topology to provide reliable data transport.
Henry Robinson works at Cloudera on distributed data collection tools like Flume and ZooKeeper. Cloudera provides support for Hadoop and open source projects like Flume. Flume is a scalable and configurable system for collecting large amounts of log and event data into Hadoop from diverse sources. It allows defining flexible data flows that can reliably move data between collection agents and storage systems.
Flume is an Apache project for log aggregation and movement, optimized for Hadoop ecosystems. It uses a push model with agents and channels. Kafka is a distributed publish-subscribe messaging system optimized for high throughput and availability. It uses a pull model and supports multiple consumers. Kafka generally has higher throughput than Flume. Flume and Kafka can be combined, with Flume using Kafka as a channel or source/sink, to take advantage of both systems.
Introduction to streaming and messaging flume,kafka,SQS,kinesis Omid Vahdaty
Big data makes you a bit Confused ? messaging? batch processing? data streaming? in flight analytics? Cloud? open source? Flume? kafka? flafka (both)? SQS? kinesis? firehose?
Apache Flume is a simple yet robust data collection and aggregation framework which allows easy declarative configuration of components to pipeline data from upstream source to backend services such as Hadoop HDFS, HBase and others.
Flume is a system for collecting, aggregating, and moving large amounts of streaming data into Hadoop. It has reliable, customizable components like sources that generate or collect event data, channels that buffer events, and sinks that ship events to destinations. Sources put events into channels, which decouple sources from sinks and provide reliability. Sinks remove events from channels and transmit them to their final destination. Flume ensures reliable event delivery through transactional channel operations and persistence. It also provides load balancing, failover, and contextual routing capabilities through interceptors, channel selectors, and sink processors.
This document discusses Apache Flume and using its HBase sink. It provides an overview of Flume's architecture, components, and sources. It describes how the HBase sink works, its configuration options, and provides an example configuration for collecting data from a sequential source and storing it in an HBase table.
Arvind Prabhakar presented on Apache Flume. He discussed that Flume is an open-source system for aggregating large amounts of log and streaming data from many sources and efficiently transporting it to data stores and processing systems. It is designed to handle high volumes of continuously arriving data from distributed servers or devices. Flume uses a pipeline-based architecture that allows for reliable, scalable, and customizable data ingestion.
Apache Flume is a distributed system for efficiently collecting large streams of log data into Hadoop. It has a simple architecture based on streaming data flows between sources, sinks, and channels. An agent contains a source that collects data, a channel that buffers the data, and a sink that stores it. This document demonstrates how to install Flume, configure it to collect tweets from Twitter using the Twitter streaming API, and save the tweets to HDFS.
This document provides an overview of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It describes the core concepts in Flume including events, clients, agents, sources, channels, and sinks. Sources are components that read data and pass it to channels. Channels buffer events and sinks remove events from channels and transmit them to their destination. The document discusses commonly used source, channel and sink types and provides examples of Flume flows.
Flume NG is a tool for collecting and moving large amounts of log data from distributed servers to a Hadoop cluster. It uses agents that collect data through sources like netcat, store data temporarily in channels like memory, and then write data to sinks like HDFS. Flume provides reliable data transport through its use of transactions and flexible configuration through sources, channels, and sinks.
Design data pipeline to gather log events and transform it to queryable data with HIVE ddl.
This covers Java applications with log4j and non-java unix applications using rsyslog.
Apache Flume is a tool for collecting large amounts of streaming data from various sources and transporting it to a centralized data store like HDFS. It reliably delivers events from multiple data sources to destinations such as HDFS or HBase. Flume uses a simple and flexible architecture based on streaming data flows, with reliable delivery of events guaranteed through a system of agents, channels, and sinks.
The document describes using Apache Flume to collect log data from machines in a manufacturing process. It proposes setting up Flume agents on each machine that generate log files and forwarding the data to a central HDFS server. The author tests a sample Flume configuration with two virtual machines generating logs and an agent transferring the data to an HDFS directory. Next steps discussed are analyzing the log data using tools like MapReduce, Hive, and Mahout and visualizing it to improve quality control and production processes.
How Orange Financial combat financial frauds over 50M transactions a day usin...JinfengHuang3
You will learn how Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar. The presentation is shared at Strata Data Conference at New York, US, 2019/09.
This document summarizes a presentation about optimizing for low latency in HBase. It discusses how to measure latency, the write and read paths in HBase, sources of latency like garbage collection and compactions, and techniques for reducing latency like streaming puts, block caching, and timeline consistency. The key points are that single puts can achieve millisecond latency while garbage collection and machine failures can cause pauses of 10s of milliseconds to seconds, and optimizing for the "magical 1%" of requests after the 99th percentile is important to improve average latency.
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...Cloudera, Inc.
HBase Coprocessors allow user code to be run on region servers within each region of an HBase table. Coprocessors are loaded dynamically and scale automatically as regions are split or merged. They provide hooks into various HBase operations via observer classes and define an interface for custom endpoint calls between clients and servers. Examples of use cases include secondary indexes, filters, and replacing MapReduce jobs with server-side processing.
Pulsar is a distributed pub/sub messaging platform developed by Yahoo. It provides scalable messaging with persistence, ordering and delivery guarantees. Pulsar is used extensively at Yahoo, handling 100 billion messages per day across 80+ applications. It provides common use cases like messaging queues, notifications and feedback systems. Pulsar's architecture uses brokers for client interactions, Apache BookKeeper for durable storage, and Zookeeper for coordination. Future work includes adding encryption, globally consistent topics, and C++ client support.
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageCloudera, Inc.
Pinterest uses Apache HBase to store data for users' personalized "following feeds" at scale. This involves storing billions of pins and updates per day. Some key challenges addressed are handling high throughput writes from fanouts, providing low latency reads, and resolving potential data inconsistencies from race conditions. Optimizations to HBase include increased memstore size, block cache tuning, and prefix compression. Maintaining high availability involves writing to dual clusters, tight Zookeeper timeouts, and automated repairs.
Kafka meetup JP #3 - Engineering Apache Kafka at LINEkawamuray
This document summarizes a presentation about engineering Apache Kafka at LINE. Some key points:
- LINE uses Apache Kafka as a central data hub to pass data between services, handling over 140 billion messages per day.
- Data stored in Kafka includes application logs, data mutations, and task requests. This data is used for tasks like data replication, analytics, and asynchronous processing.
- Performance optimizations have led to target latencies below 1ms for 50% of produces and below 10ms for 99% of produces.
- SystemTap, a Linux tracing tool, helped identify slow disk reads causing delayed Kafka responses, improving performance.
- Having a single Kafka cluster as a data hub makes inter-service
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
This document discusses using Apache Flume to stream data from various sources to Hadoop for telecommunications operators. It introduces Flume, describing its key components like agents, sources, channels, and sinks. It provides an end-to-end architecture example showing data flowing from external sources through Flume into Hadoop and then into an EDW for analysis and user reports. Finally, it discusses next generation architectures using technologies like Spark, machine learning, and real-time analytics.
HBaseCon 2015: HBase at Scale in an Online and High-Demand EnvironmentHBaseCon
Pinterest runs 38 different HBase clusters in production, doing a lot of different types of work—with some doing up to 5 million operations per second. In this talk, you'll get details about how we do capacity planning, maintenance tasks such as online automated rolling compaction, configuration management, and monitoring.
In this session, you will learn the work Xiaomi has done to improve the availability and stability of our HBase clusters, including cross-site data and service backup and a coordinated compaction framework. You'll also learn about the Themis framework, which supports cross-row transactions on HBase based on Google's percolator algorithm, and its usage in Xiaomi's applications.
The document summarizes the HBase 1.0 release which introduces major new features and interfaces including a new client API, region replicas for high availability, online configuration changes, and semantic versioning. It describes goals of laying a stable foundation, stabilizing clusters and clients, and making versioning explicit. Compatibility with earlier versions is discussed and the new interfaces like ConnectionFactory, Connection, Table and BufferedMutator are introduced along with examples of using them.
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.
This document discusses filesystems, RPC, HDFS, and I/O schedulers. It provides an overview of Linux kernel I/O schedulers and how they optimize disk access. It then discusses the I/O stack in Linux, including the virtual filesystem (VFS) layer. It describes the NFS client-server model using RPC over TCP/IP and how HDFS uses a similar model with its own APIs. Finally, it outlines the write process in HDFS from the client to data nodes.
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...Cloudera, Inc.
This document discusses using Apache Flume to stream data into Apache HBase. It describes how Flume provides a scalable and flexible way to collect and transport log and event data to HBase. Specifically, it covers the HBase sink plugin for Flume, which allows routing Flume events to HBase tables. It notes that while the initial HBase sink had limitations, the asynchronous HBase sink improved performance by fully utilizing the HBase cluster. Overall, the document presents Flume as a viable alternative to directly writing to HBase and provides flexibility to change schemas without code changes.
Building Apache Cassandra clusters for massive scaleAlex Thompson
Covering theory and operational aspects of bring up Apache Cassandra clusters - this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
With the advent of Hadoop, there comes the need for professionals skilled in Hadoop Administration making it imperative to be skilled as a Hadoop Admin for better career, salary and job opportunities.
Know how to setup a Hadoop Cluster With HDFS High Availability here : www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
Apache Flume is a distributed system for efficiently collecting large streams of log data into Hadoop. It has a simple architecture based on streaming data flows between sources, sinks, and channels. An agent contains a source that collects data, a channel that buffers the data, and a sink that stores it. This document demonstrates how to install Flume, configure it to collect tweets from Twitter using the Twitter streaming API, and save the tweets to HDFS.
This document provides an overview of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It describes the core concepts in Flume including events, clients, agents, sources, channels, and sinks. Sources are components that read data and pass it to channels. Channels buffer events and sinks remove events from channels and transmit them to their destination. The document discusses commonly used source, channel and sink types and provides examples of Flume flows.
Flume NG is a tool for collecting and moving large amounts of log data from distributed servers to a Hadoop cluster. It uses agents that collect data through sources like netcat, store data temporarily in channels like memory, and then write data to sinks like HDFS. Flume provides reliable data transport through its use of transactions and flexible configuration through sources, channels, and sinks.
Design data pipeline to gather log events and transform it to queryable data with HIVE ddl.
This covers Java applications with log4j and non-java unix applications using rsyslog.
Apache Flume is a tool for collecting large amounts of streaming data from various sources and transporting it to a centralized data store like HDFS. It reliably delivers events from multiple data sources to destinations such as HDFS or HBase. Flume uses a simple and flexible architecture based on streaming data flows, with reliable delivery of events guaranteed through a system of agents, channels, and sinks.
The document describes using Apache Flume to collect log data from machines in a manufacturing process. It proposes setting up Flume agents on each machine that generate log files and forwarding the data to a central HDFS server. The author tests a sample Flume configuration with two virtual machines generating logs and an agent transferring the data to an HDFS directory. Next steps discussed are analyzing the log data using tools like MapReduce, Hive, and Mahout and visualizing it to improve quality control and production processes.
How Orange Financial combat financial frauds over 50M transactions a day usin...JinfengHuang3
You will learn how Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar. The presentation is shared at Strata Data Conference at New York, US, 2019/09.
This document summarizes a presentation about optimizing for low latency in HBase. It discusses how to measure latency, the write and read paths in HBase, sources of latency like garbage collection and compactions, and techniques for reducing latency like streaming puts, block caching, and timeline consistency. The key points are that single puts can achieve millisecond latency while garbage collection and machine failures can cause pauses of 10s of milliseconds to seconds, and optimizing for the "magical 1%" of requests after the 99th percentile is important to improve average latency.
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...Cloudera, Inc.
HBase Coprocessors allow user code to be run on region servers within each region of an HBase table. Coprocessors are loaded dynamically and scale automatically as regions are split or merged. They provide hooks into various HBase operations via observer classes and define an interface for custom endpoint calls between clients and servers. Examples of use cases include secondary indexes, filters, and replacing MapReduce jobs with server-side processing.
Pulsar is a distributed pub/sub messaging platform developed by Yahoo. It provides scalable messaging with persistence, ordering and delivery guarantees. Pulsar is used extensively at Yahoo, handling 100 billion messages per day across 80+ applications. It provides common use cases like messaging queues, notifications and feedback systems. Pulsar's architecture uses brokers for client interactions, Apache BookKeeper for durable storage, and Zookeeper for coordination. Future work includes adding encryption, globally consistent topics, and C++ client support.
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageCloudera, Inc.
Pinterest uses Apache HBase to store data for users' personalized "following feeds" at scale. This involves storing billions of pins and updates per day. Some key challenges addressed are handling high throughput writes from fanouts, providing low latency reads, and resolving potential data inconsistencies from race conditions. Optimizations to HBase include increased memstore size, block cache tuning, and prefix compression. Maintaining high availability involves writing to dual clusters, tight Zookeeper timeouts, and automated repairs.
Kafka meetup JP #3 - Engineering Apache Kafka at LINEkawamuray
This document summarizes a presentation about engineering Apache Kafka at LINE. Some key points:
- LINE uses Apache Kafka as a central data hub to pass data between services, handling over 140 billion messages per day.
- Data stored in Kafka includes application logs, data mutations, and task requests. This data is used for tasks like data replication, analytics, and asynchronous processing.
- Performance optimizations have led to target latencies below 1ms for 50% of produces and below 10ms for 99% of produces.
- SystemTap, a Linux tracing tool, helped identify slow disk reads causing delayed Kafka responses, improving performance.
- Having a single Kafka cluster as a data hub makes inter-service
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
This document discusses using Apache Flume to stream data from various sources to Hadoop for telecommunications operators. It introduces Flume, describing its key components like agents, sources, channels, and sinks. It provides an end-to-end architecture example showing data flowing from external sources through Flume into Hadoop and then into an EDW for analysis and user reports. Finally, it discusses next generation architectures using technologies like Spark, machine learning, and real-time analytics.
HBaseCon 2015: HBase at Scale in an Online and High-Demand EnvironmentHBaseCon
Pinterest runs 38 different HBase clusters in production, doing a lot of different types of work—with some doing up to 5 million operations per second. In this talk, you'll get details about how we do capacity planning, maintenance tasks such as online automated rolling compaction, configuration management, and monitoring.
In this session, you will learn the work Xiaomi has done to improve the availability and stability of our HBase clusters, including cross-site data and service backup and a coordinated compaction framework. You'll also learn about the Themis framework, which supports cross-row transactions on HBase based on Google's percolator algorithm, and its usage in Xiaomi's applications.
The document summarizes the HBase 1.0 release which introduces major new features and interfaces including a new client API, region replicas for high availability, online configuration changes, and semantic versioning. It describes goals of laying a stable foundation, stabilizing clusters and clients, and making versioning explicit. Compatibility with earlier versions is discussed and the new interfaces like ConnectionFactory, Connection, Table and BufferedMutator are introduced along with examples of using them.
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.
This document discusses filesystems, RPC, HDFS, and I/O schedulers. It provides an overview of Linux kernel I/O schedulers and how they optimize disk access. It then discusses the I/O stack in Linux, including the virtual filesystem (VFS) layer. It describes the NFS client-server model using RPC over TCP/IP and how HDFS uses a similar model with its own APIs. Finally, it outlines the write process in HDFS from the client to data nodes.
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...Cloudera, Inc.
This document discusses using Apache Flume to stream data into Apache HBase. It describes how Flume provides a scalable and flexible way to collect and transport log and event data to HBase. Specifically, it covers the HBase sink plugin for Flume, which allows routing Flume events to HBase tables. It notes that while the initial HBase sink had limitations, the asynchronous HBase sink improved performance by fully utilizing the HBase cluster. Overall, the document presents Flume as a viable alternative to directly writing to HBase and provides flexibility to change schemas without code changes.
Building Apache Cassandra clusters for massive scaleAlex Thompson
Covering theory and operational aspects of bring up Apache Cassandra clusters - this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
With the advent of Hadoop, there comes the need for professionals skilled in Hadoop Administration making it imperative to be skilled as a Hadoop Admin for better career, salary and job opportunities.
Know how to setup a Hadoop Cluster With HDFS High Availability here : www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
Slides presented at Percona Live Europe Open Source Database Conference 2019, Amsterdam, 2019-10-01.
Imagine a world where all Wikipedia articles disappear due to a human error or software bug. Sounds unreal? According to some estimations, it would take an excess of hundreds of million person-hours to be written again. To prevent that scenario from ever happening, our SRE team at Wikimedia recently refactored the relational database recovery system.
In this session, we will discuss how we backup 550TB of MariaDB data without impacting the 15 billion page views per month we get. We will cover what were our initial plans to replace the old infrastructure, how we achieved recovering 2TB databases in less than 30 minutes while maintaining per-table granularity, as well as the different types of backups we implemented. Lastly, we will talk about lessons learned, what went well, how our original plans changed and future work.
Optimizing Servers for High-Throughput and Low-Latency at DropboxScyllaDB
I'm going to discuss the efficiency/performance optimizations of different layers of the system. Starting from the lowest levels like hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we’ll move to Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we’ll discuss library and application-level tunings, which are mostly applicable to HTTP servers in general and nginx/envoy specifically.
For each potential area of optimization I’ll try to give some background on latency/throughput tradeoffs (if any), monitoring guidelines, and, finally, suggest tunings for different workloads.
Also, I'll cover more theoretical approaches to performance analysis and the newly developed tooling like `bpftrace` and new `perf` features.
The document summarizes performance tests of distributed file systems GlusterFS, XtremeFS, FhgFS, Tahoe-LAFS and PlasmaFS. It finds that FhgFS performs best for large sequential file operations while GlusterFS works best as a general purpose distributed file system. Tests included sequential reads/writes, local to distributed copies, distributed to distributed copies, and Joomla application deployment which found GlusterFS and FhgFS had the fastest speeds. The document provides detailed setup instructions for each file system tested.
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...Puppet
This document provides an overview of measuring and tuning performance for the Puppet Enterprise (PE) platform. It discusses gathering data from PE services like Puppet Server and PuppetDB through JVM logging, metrics, and configurations. Important metrics for Puppet Server include JRuby usage and catalog compilation times. Tuning options involve adjusting JRuby capacity and rebalancing agent checkins. The document also covers monitoring PuppetDB for storage usage and command processing, as well as optimizing PostgreSQL query performance.
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...DataStax
Successfully running Apache Cassandra in production often means knowing what configuration settings to change and which ones to leave as default. Over the years the cassandra.yaml file has grown to provide a number of settings that can improve stability and performance. While the file contains plenty of helpful comments, there is more to be said about the settings and when to change them.
In this talk Edward Capriolo, Consultant at The Last Pickle, will break down the parameters in the configuration files. Looking at those that are essential to getting started, those that impact performance, those that improve availability, the exotic ones, and the ones that should not be played with. This talk is ideal for someone someone setting up Cassandra for the first time up to people with deployments in productions and wondering what the more exotic configuration options do.
About the Speaker
Edward Capriolo Consultant, The Last Pickle
Long time Apache Cassandra user, big data enthusiast.
Shak larry-jeder-perf-and-tuning-summit14-part2-finalTommy Lee
This document provides an overview of performance analysis and tuning techniques in Red Hat Enterprise Linux (RHEL). It discusses the tuned profile packages and how they optimize systems for different workloads. Specific topics covered include disk I/O tuning, memory tuning, network performance tuning, and power management techniques. A variety of Linux performance analysis tools are also introduced, including tuned, turbostat, netsniff-ng, and Performance Co-Pilot.
hbaseconasia2017: Large scale data near-line loading method and architectureHBaseCon
This document proposes a read-write split near-line data loading method and architecture to:
- Increase data loading performance by separating write operations from read operations. A WriteServer handles write requests and loads data to HDFS to be read from by RegionServers.
- Control resources used by write operations to ensure read operations are not starved of resources like CPU, network, disk I/O, and handlers.
- Provide an architecture corresponding to Kafka and HDFS for streaming data from Kafka to HDFS to be loaded into HBase in a delayed manner.
- Include optimizations like task balancing across WriteServer slaves, prioritized compaction of small files, and customizable storage engines.
- Report test results showing one Write
The document discusses setting up a Squid proxy server on a Linux system to improve network security and performance for a home network. It recommends using an old Pentium II computer with at least 80-100MB of RAM as the proxy server. The document provides instructions for installing Squid and configuring the Squid.conf file to optimize disk usage, caching, and logging. It also explains how to set up the Squid proxy server to work with an iptables firewall for access control and protection from intruders.
This document summarizes Marian Marinov's testing and experience with different distributed filesystems at his company SiteGround. He tested CephFS, GlusterFS, MooseFS, OrangeFS, and BeeGFS. CephFS required a lot of resources but lacked redundancy. GlusterFS was relatively easy to set up but had high CPU usage. MooseFS and OrangeFS were also easy to set up. Ultimately, they settled on Ceph RBD with NFS and caching for performance and simplicity. File creation performance tests showed MooseFS and NFS+Ceph RBD outperformed OrangeFS and GlusterFS. Tuning settings like MTU, congestion control, and caching helped optimize performance.
The document summarizes the hardware, software configuration, and management of a large Hadoop cluster at Facebook. The cluster consists of 320 nodes arranged in 8 racks. The nodes are configured for different purposes like running the distributed file system, MapReduce jobs, and testing. Software like Hypershell and Cfengine are used for administration. Common issues and performance optimization techniques are also discussed.
The document discusses changes in z/VM 6.3 to support large logical partition (LPAR) workloads. Key changes include implementing HiperDispatch to improve processor efficiency through affinity-aware dispatching and vertical CPU management. Memory support was increased from 256GB to 1TB per z/VM system. Other improvements include enhanced dump support for larger environments and tools for studying monitor data to understand workload behavior.
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Redis Labs
Dynomite is a
thin, distributed dynamo layer for different storage engines and protocols. Currently at Netflix, we are focusing on using
Redis as the storage engine. Dynomite supports multi-datacenter replication and is designed for high availability. In the age of high scalability and big data, Dynomite’s design goal is to turn single-server datastore solutions into peer-to-peer, linearly
scalable, clustered systems while still preserving the native client/server protocols of the datastores, e.g., Redis protocol. In this talk, we are going to present Dynomite recent features, and the Dyno client. Both projects are open source and available to the community.
Storage and performance- Batch processing, WhiptailInternet World
Batch processing allows jobs to run without manual intervention by shifting processing to less busy times. It avoids idling computing resources and allows higher overall utilization. Batch processing provides benefits like prioritizing batch and interactive work. The document then discusses different approaches to batch processing like dedicating all resources to it or sharing resources. It outlines challenges like systems being unavailable during batch processing. The rest of the document summarizes Whiptail's flash storage solutions for accelerating workloads and reducing costs and resources compared to HDDs.
Challenges with Gluster and Persistent Memory with Dan LambrightGluster.org
This document discusses challenges in using persistent memory (SCM) with distributed storage systems like Gluster. It notes that SCM provides faster access than SSDs but must address latency throughout the storage stack, including network transfer times and CPU overhead. The document examines how Gluster's design amplifies lookup operations and proposes caching file metadata at clients to reduce overhead. It also suggests using SCM as a tiered cache layer and optimizing replication strategies to fully leverage the speed of SCM.
This document discusses the Hadoop framework. It provides an overview of Hadoop and its core components, MapReduce and HDFS. It describes how Hadoop is suitable for processing large datasets in distributed environments using commodity hardware. It also summarizes some of Hadoop's limitations and how additional tools like HBase, Pig Latin, and Hive can expand its capabilities.
The document discusses the features and capabilities of the QNAP TS-832PX and TS-932PX network attached storage (NAS) devices. Both NAS devices come with dual 10GbE SFP+ and dual 2.5GbE RJ45 ports to provide faster network speeds. They are suitable for small and medium sized business environments that have an increasing number of connected devices and larger file sizes. The document provides details on the specifications, performance tests results, and software applications that come with the NAS devices.
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...VMware Tanzu
The document describes how Manulife improved the efficiency and reliability of their Pivotal Cloud Foundry (PCF) platforms while saving over $730,000. Key changes included implementing a scheduler to stop non-critical apps on weekends, switching from internal to external blob storage, changing Diego cell VM types to more optimized models, and tuning various foundation configurations. These changes resulted in estimated annual savings of $40,000 from scheduling, $21,500 from external blob storage, and over $1 million from Diego cell and foundation changes, for a total of over $1 million in savings.
This document discusses the need for observability in data pipelines. It notes that real data pipelines often fail or take a long time to rerun without providing any insight into what went wrong. This is because of frequent code, data, dependency, and infrastructure changes. The document recommends taking a production engineering approach to observability using metrics, logging, and alerting tools. It also suggests experiment management and encapsulating reporting in notebooks. Most importantly, it stresses measuring everything through metrics at all stages of data ingestion and processing to better understand where issues occur.
Couchbase Data Platform | Big Data DemystifiedOmid Vahdaty
Couchbase is a popular open source NoSQL platform used by giants like Apple, LinkedIn, Walmart, Visa and many others and runs on-premise or in a public/hybrid/multi cloud.
Couchbase has a sub-millisecond K/V cache integrated with a document based DB, a unique and many more services and features.
In this session we will talk about the unique architecture of Couchbase, its unique N1QL language - a SQL-Like language that is ANSI compliant, the services and features Couchbase offers and demonstrate some of them live.
We will also discuss what makes Couchbase different than other popular NoSQL platforms like MongoDB, Cassandra, Redis, DynamoDB etc.
At the end we will talk about the next version of Couchbase (6.5) that will be released later this year and about Couchbase 7.0 that will be released next year.
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
The document provides an overview of machine learning concepts including linear regression, artificial neural networks, and convolutional neural networks. It discusses how artificial neural networks are inspired by biological neurons and can learn relationships in data. The document uses the MNIST dataset example to demonstrate how a neural network can be trained to classify images of handwritten digits using backpropagation to adjust weights to minimize error. TensorFlow is introduced as a popular Python library for building machine learning models, enabling flexible creation and training of neural networks.
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks and explain how they work (without too much math) and demonstrate DL model with Python.
The target audience are developers, data engineers and DBAs that do not have prior experience with ML and want to know how it actually works.
The technology of fake news between a new front and a new frontier | Big Dat...Omid Vahdaty
קוראים לי ניצן אור קדראי ואני עומדת בצומת המעניינת שבין טכנולוגיה, מדיה ואקטיביזם.
בארבע וחצי השנים האחרונות אני עובדת בידיעות אחרונות, בהתחלה כמנהלת המוצר של אפליקציית ynet וכיום כמנהלת החדשנות.
הייתי שותפה בהקמת עמותת סטארט-אח, עמותה המספקת שירותי פיתוח ומוצר עבור עמותות אחרות, ולאחרונה מתעסקת בהקמת קהילה שמטרתה לחקור את ההיבטים הטכנולוגיים של תופעת הפייק ניוז ובניית כלים אפליקטיביים לצורך ניהול חכם של המלחמה בתופעה.
ההרצאה תדבר על תופעת הפייק ניוז. נתמקד בטכנולוגיה שמאפשרת את הפצת הפייק ניוז ונראה דוגמאות לשימוש בטכנולוגיה זו.
נבחן את היקף התופעה ברשתות החברתיות ונלמד איך ענקיות הטכנולוגיה מנסות להילחם בה.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more, The technological challenges in this area are too great to master by one person. Just look at the picture in this articles, it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
how to transform data (TXT, CSV, TSV, JSON) into Parquet, ORCwhich technology should we use to model the data ? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?how to handle streaming?how to manage costs?Performance tips?Security tip?Cloud best practices tips?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Making your analytics talk business | Big Data DemystifiedOmid Vahdaty
MAKING YOUR ANALYTICS TALK BUSINESS
Aligning your analysis to the business is fundamental for all types of analytics (digital or product analytics, business intelligence, etc) and is vertical- and tool agnostic. In this talk we will build on the discussion that was started in the previous meetup, and will discuss how analysts can learn to derive their stakeholders' expectations, how to shift from metrics to "real" KPIs, and how to approach an analysis in order to create real impact.
This session is primarily geared towards those starting out into analytics, practitioners who feel that they are still struggling to prove their value in the organization or simply folks who want to power up their reporting and recommendation skills. If you are already a master at aligning your analysis to the business, you're most welcome as well: join us to share your experiences so that we can all learn from each other and improve!
Bios:
Eliza Savov - Eliza is the team lead of the Customer Experience and Analytics team at Clicktale, the worldwide leader in behavioral analytics. She has extensive experience working with data analytics, having previously worked at Clicktale as a senior customer experience analyst, and as a product analyst at Seeking Alpha.
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...Omid Vahdaty
In the talk we will discuss how to break down the company’s overall goals all the way to your BI team’s daily activities in 3 simple stages:
1. Understanding the path to success - Creating a revenue model
2. Gathering support and strategizing - Structuring a team
3. Executing - Tracking KPIs
Bios:
Omri Halak -Omri is the director of business operations at Logz.io, an intelligent and scalable machine data analytics platform built on ELK & Grafana that empowers engineers to monitor, troubleshoot, and secure mission-critical applications more effectively. In this position, Omri combines actionable business insights from the BI side with fast and effective delivery on the Operations side. Omri has ample experience connecting data with business, with previous positions at SimilarWeb as a business analyst, at Woobi as finance director, and as Head of State Guarantees at Israel Ministry of Finance.
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...Omid Vahdaty
Lecturer has Deep experience defining Cloud computing, security models for IaaS, PaaS, and SaaS architectures specifically as the architecture relates to IAM. Deep Experience Defining Privacy protection Policy, a big fan of GDPR interpretation.
DeelExperience in Information security, Defining Healthcare security best practices including AI and Big Data, IT Security and ICS security and privacy controls in the industrial environments.
Deep knowledge of security frameworks such as Cloud Security Alliance (CSA), International Organization for Standardization (ISO), National Institute of Standards and Technology (NIST), IBM ITCS104 etc.
What Will You learn:
Every day, the website collects a huge amount of data. The data allows to analyze the behavior of Internet users, their interests, their purchasing behavior and the conversion rates. In order to increase business, big data offers the tools to analyze and process data in order to reveal competitive advantages from the data.
What Healthcare has to do with Big Data
How AI can assist in patient care?
Why some are afraid? Are there any dangers?
Aerospike meetup july 2019 | Big Data DemystifiedOmid Vahdaty
Building a low latency (sub millisecond), high throughput database that can handle big data AND linearly scale is not easy - but we did it anyway...
In this session we will get to know Aerospike, an enterprise distributed primary key database solution.
- We will do an introduction to Aerospike - basic terms, how it works and why is it widely used in mission critical systems deployments.
- We will understand the 'magic' behind Aerospike ability to handle small, medium and even Petabyte scale data, and still guarantee predictable performance of sub-millisecond latency
- We will learn how Aerospike devops is different than other solutions in the market, and see how easy it is to run it on cloud environments as well as on premise.
We will also run a demo - showing a live example of the performance and self-healing technologies the database have to offer.
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...Omid Vahdaty
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS
-Learn how to connect BI and product management to solve business problems
-Discover how to lead clients to ask the right questions to get the data and insight they really want
-Get pointers on saving your time and your company's resources by understanding what your customers need, not what they ask for
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more, The technological challenges in this area are too great to master by one person. Just look at the picture in this articles, it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS & GCP and Data Center infrastructure to answer the basic questions of anyone starting their way in the big data world.
how to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC,AVRO which technology should we use to model the data ? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? Big Query? Data flow? Data Lab? tensor flow? how to handle streaming? how to manage costs? Performance tips? Security tip? Cloud best practices tips?
In this meetup we shall present lecturers working on several cloud vendors, various big data platforms such hadoop, Data warehourses , startups working on big data products. basically - if it is related to big data - this is THE meetup.
Some of our online materials (mixed content from several cloud vendor):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
You tube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
AWS Big Data Demystified #4 data governance demystified [security, networ...Omid Vahdaty
The document provides an overview of data governance on AWS. It discusses key aspects of data governance including availability, usability, consistency, integrity and security. It covers considerations for data access, regulations, and architecture implications. Specific AWS services for data storage, processing and security are outlined, such as S3, VPC, IAM, encryption options. The document emphasizes account segregation, identity-based access policies, and designing networks and data flows with big data applications in mind.
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
AWS Big Data Demystified is all about knowledge sharing b/c knowledge should be given for free. in this lecture we will dicusss the advantages of working with Zeppelin + spark sql, jdbc + thrift, ganglia, r+ spark r + livy, and a litte bit about ganglia on EMR.\
subscribe to you youtube channel to see the video of this lecture:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
This document provides an overview of various AWS big data services including Athena, Redshift Spectrum, EMR, and Hive. It discusses how Athena allows users to run SQL queries directly on data stored in S3 using Presto. Redshift Spectrum enables querying data in S3 using standard SQL from Amazon Redshift. EMR is a managed Hadoop framework that can run Hive, Spark, and other big data applications. Hive provides a SQL-like interface to query data stored in various formats like Parquet and ORC on distributed storage systems. The document demonstrates features and provides best practices for working with these AWS big data services.
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
This document provides an overview of streaming data and messaging concepts including batch processing, streaming, streaming vs messaging, challenges with streaming data, and AWS services for streaming and messaging like Kinesis, Kinesis Firehose, SQS, and Kafka. It discusses use cases and comparisons for these different services. For example, Kinesis is suitable for complex analytics on streaming data while SQS focuses on per-event messaging. Firehose automatically loads streaming data into AWS services like S3 and Redshift without custom coding.
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned . a quick overview of a big data techonoligies, which were selected and disregard in our company
The video: https://youtu.be/l5KmaZNQxaU
dont forget to subcribe to the youtube channel
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
EMR Spark tuning involves configuring Spark and YARN parameters like executor memory and cores to optimize performance. The default Spark configurations depend on the deployment method (Thrift, Zeppelin etc). YARN is used for resource management in cluster mode, and allocates resources to containers based on minimum and maximum thresholds. When tuning, factors like available cluster resources, executor instances and cores should be considered to avoid overcommitting resources.
Zeppelin is an open-source web-based notebook that enables data ingestion, exploration, visualization, and collaboration on Apache Spark. It has built-in support for languages like SQL, Python, Scala and R. Zeppelin notebooks can be stored in S3 for persistence and sharing. Apache Livy is a REST API that allows managing Spark jobs and provides a way to securely run and share notebooks across multiple users.
We introduce the Gaussian process (GP) modeling module developed within the UQLab software framework. The novel design of the GP-module aims at providing seamless integration of GP modeling into any uncertainty quantification workflow, as well as a standalone surrogate modeling tool. We first briefly present the key mathematical tools on the basis of GP modeling (a.k.a. Kriging), as well as the associated theoretical and computational framework. We then provide an extensive overview of the available features of the software and demonstrate its flexibility and user-friendliness. Finally, we showcase the usage and the performance of the software on several applications borrowed from different fields of engineering. These include a basic surrogate of a well-known analytical benchmark function; a hierarchical Kriging example applied to wind turbine aero-servo-elastic simulations and a more complex geotechnical example that requires a non-stationary, user-defined correlation function. The GP-module, like the rest of the scientific code that is shipped with UQLab, is open source (BSD license).
π0.5: a Vision-Language-Action Model with Open-World GeneralizationNABLAS株式会社
今回の資料「Transfusion / π0 / π0.5」は、画像・言語・アクションを統合するロボット基盤モデルについて紹介しています。
拡散×自己回帰を融合したTransformerをベースに、π0.5ではオープンワールドでの推論・計画も可能に。
This presentation introduces robot foundation models that integrate vision, language, and action.
Built on a Transformer combining diffusion and autoregression, π0.5 enables reasoning and planning in open-world settings.
Fluid mechanics is the branch of physics concerned with the mechanics of fluids (liquids, gases, and plasmas) and the forces on them. Originally applied to water (hydromechanics), it found applications in a wide range of disciplines, including mechanical, aerospace, civil, chemical, and biomedical engineering, as well as geophysics, oceanography, meteorology, astrophysics, and biology.
It can be divided into fluid statics, the study of various fluids at rest, and fluid dynamics.
Fluid statics, also known as hydrostatics, is the study of fluids at rest, specifically when there's no relative motion between fluid particles. It focuses on the conditions under which fluids are in stable equilibrium and doesn't involve fluid motion.
Fluid kinematics is the branch of fluid mechanics that focuses on describing and analyzing the motion of fluids, such as liquids and gases, without considering the forces that cause the motion. It deals with the geometrical and temporal aspects of fluid flow, including velocity and acceleration. Fluid dynamics, on the other hand, considers the forces acting on the fluid.
Fluid dynamics is the study of the effect of forces on fluid motion. It is a branch of continuum mechanics, a subject which models matter without using the information that it is made out of atoms; that is, it models matter from a macroscopic viewpoint rather than from microscopic.
Fluid mechanics, especially fluid dynamics, is an active field of research, typically mathematically complex. Many problems are partly or wholly unsolved and are best addressed by numerical methods, typically using computers. A modern discipline, called computational fluid dynamics (CFD), is devoted to this approach. Particle image velocimetry, an experimental method for visualizing and analyzing fluid flow, also takes advantage of the highly visual nature of fluid flow.
Fundamentally, every fluid mechanical system is assumed to obey the basic laws :
Conservation of mass
Conservation of energy
Conservation of momentum
The continuum assumption
For example, the assumption that mass is conserved means that for any fixed control volume (for example, a spherical volume)—enclosed by a control surface—the rate of change of the mass contained in that volume is equal to the rate at which mass is passing through the surface from outside to inside, minus the rate at which mass is passing from inside to outside. This can be expressed as an equation in integral form over the control volume.
The continuum assumption is an idealization of continuum mechanics under which fluids can be treated as continuous, even though, on a microscopic scale, they are composed of molecules. Under the continuum assumption, macroscopic (observed/measurable) properties such as density, pressure, temperature, and bulk velocity are taken to be well-defined at "infinitesimal" volume elements—small in comparison to the characteristic length scale of the system, but large in comparison to molecular length scale
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfMohamedAbdelkader115
Glad to be one of only 14 members inside Kuwait to hold this credential.
Please check the members inside kuwait from this link:
https://www.rics.org/networking/find-a-member.html?firstname=&lastname=&town=&country=Kuwait&member_grade=(AssocRICS)&expert_witness=&accrediation=&page=1
Raish Khanji GTU 8th sem Internship Report.pdfRaishKhanji
This report details the practical experiences gained during an internship at Indo German Tool
Room, Ahmedabad. The internship provided hands-on training in various manufacturing technologies, encompassing both conventional and advanced techniques. Significant emphasis was placed on machining processes, including operation and fundamental
understanding of lathe and milling machines. Furthermore, the internship incorporated
modern welding technology, notably through the application of an Augmented Reality (AR)
simulator, offering a safe and effective environment for skill development. Exposure to
industrial automation was achieved through practical exercises in Programmable Logic Controllers (PLCs) using Siemens TIA software and direct operation of industrial robots
utilizing teach pendants. The principles and practical aspects of Computer Numerical Control
(CNC) technology were also explored. Complementing these manufacturing processes, the
internship included extensive application of SolidWorks software for design and modeling tasks. This comprehensive practical training has provided a foundational understanding of
key aspects of modern manufacturing and design, enhancing the technical proficiency and readiness for future engineering endeavors.
International Journal of Distributed and Parallel systems (IJDPS)samueljackson3773
The growth of Internet and other web technologies requires the development of new
algorithms and architectures for parallel and distributed computing. International journal of
Distributed and parallel systems is a bimonthly open access peer-reviewed journal aims to
publish high quality scientific papers arising from original research and development from
the international community in the areas of parallel and distributed systems. IJDPS serves
as a platform for engineers and researchers to present new ideas and system technology,
with an interactive and friendly, but strongly professional atmosphere.
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYijscai
With the increased use of Artificial Intelligence (AI) in malware analysis there is also an increased need to
understand the decisions models make when identifying malicious artifacts. Explainable AI (XAI) becomes
the answer to interpreting the decision-making process that AI malware analysis models use to determine
malicious benign samples to gain trust that in a production environment, the system is able to catch
malware. With any cyber innovation brings a new set of challenges and literature soon came out about XAI
as a new attack vector. Adversarial XAI (AdvXAI) is a relatively new concept but with AI applications in
many sectors, it is crucial to quickly respond to the attack surface that it creates. This paper seeks to
conceptualize a theoretical framework focused on addressing AdvXAI in malware analysis in an effort to
balance explainability with security. Following this framework, designing a machine with an AI malware
detection and analysis model will ensure that it can effectively analyze malware, explain how it came to its
decision, and be built securely to avoid adversarial attacks and manipulations. The framework focuses on
choosing malware datasets to train the model, choosing the AI model, choosing an XAI technique,
implementing AdvXAI defensive measures, and continually evaluating the model. This framework will
significantly contribute to automated malware detection and XAI efforts allowing for secure systems that
are resilient to adversarial attacks.
Passenger car unit (PCU) of a vehicle type depends on vehicular characteristics, stream characteristics, roadway characteristics, environmental factors, climate conditions and control conditions. Keeping in view various factors affecting PCU, a model was developed taking a volume to capacity ratio and percentage share of particular vehicle type as independent parameters. A microscopic traffic simulation model VISSIM has been used in present study for generating traffic flow data which some time very difficult to obtain from field survey. A comparison study was carried out with the purpose of verifying when the adaptive neuro-fuzzy inference system (ANFIS), artificial neural network (ANN) and multiple linear regression (MLR) models are appropriate for prediction of PCUs of different vehicle types. From the results observed that ANFIS model estimates were closer to the corresponding simulated PCU values compared to MLR and ANN models. It is concluded that the ANFIS model showed greater potential in predicting PCUs from v/c ratio and proportional share for all type of vehicles whereas MLR and ANN models did not perform well.
Value Stream Mapping Worskshops for Intelligent Continuous SecurityMarc Hornbeek
This presentation provides detailed guidance and tools for conducting Current State and Future State Value Stream Mapping workshops for Intelligent Continuous Security.
2. Test Details
● Fan In architecture
● Collector sending data, Input 100GB of pcap files, rate in ingress: 2MB/s.
● MiniFlume : Thrift source , mem channel, AVRO sink.
● MegaFlume: AVRO source, mem channel, HDFS SInk
● Hadoop
○ 4 DataNode
○ 1 NameNode
3. Mini Flume config
i. Virtual Machine: 8 CPU core, 16GB RAM. Ubuntu 14.04.
i. Environment: export JAVA_OPTS="-Xms100m -Xmx12000m Dcom.sun.management.jmxremote
i. Config
a. 2X thrift sources. thread count to 8.
b. 2X avro sinks. roll size 10000.
c. memory-channel.capacity = 100100
d. memory-channel.transactionCapacity=10010
5. Mega Flume config
i. Virtual Machine: 16 CPU core, 16GB RAM.
ii. Environment: export JAVA_OPTS="-Xms100m -Xmx12000m -Dcom.sun.management.jmxremote
iii. mem channel
a. memory-channel.capacity = 6200000
b. memory-channel.transactionCapacity=620000
iv. 2 avro sources 8 threads
v. 2 HDFS sinks
1. agent.sinks.hdfs-sink.hdfs.path = hdfs://master1:54310/data/EthernetCollector
2. agent.sinks.hdfs-sink.hdfs.fileType = DataStream
3. agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
4. agent.sinks.hdfs-sink.hdfs.filePrefix = ethernet
5. agent.sinks.hdfs-sink.hdfs.fileSuffix = .avro
6. agent.sinks.hdfs-sink.hdfs.inUsePrefix = part.
7. agent.sinks.hdfs-sink.serializer = avro_event
8. agent.sinks.hdfs-sink.hdfs.minBlockReplicas= 1
9. agent.sinks.hdfs-sink.hdfs.threadsPoolSize= 16
10. agent.sinks.hdfs-sink.hdfs.rollCount = 250030
11. agent.sinks.hdfs-sink.hdfs.rollSize = 0
12. agent.sinks.hdfs-sink.hdfs.batchSize = 620000
6. Hadoop Cluster
i. 4 hadoop data nodes , 12 Disks per node, 4 cores. 8GB RAM.
ii. Single Namenode, not HA , no Yarn. no ZooKeeper.
7. Flume Performance insights
i. thrift source has threads!!!
ii. VERY consuming in terms of resources
iii. file channel is MUCH slower than mem channel
iv. sink HDFS , stabilizes after long time (~ 30 min), each change takes time.
1. buffering into RAM – the RAM graphs goes up.
2. Stable ingress rate - RAM consumption is parallel line to the time axis.
3. small roll size - may crash the cluster
4. too much "pressure" on hadoop crushes the cluster, causes data-node loss. it even cause the
Name-node to enter safe mode, or even loss ALL data nodes.
5. Rule of thunk- for each sink - at least 2 data nodes with at least 2 data disks.
6. each batch request is divided to several threads. i.e 2MB/s is write speed on hadoop per sink
per node in parallel. read a little about flume performance metrics:
https://cwiki.apache.org/confluence/display/FLUME/Performance+Measurements+-+round+2
7. notice the above article: 20 data-node, 8 disks, 8 sinks.
8. Flume Performance insights
i. each flume should be considered for tuning of ingress and egress on same node. monitor the RAM
metrics via data-dog or another monitoring tool , and line should be parallel to time axis at all time.
ii. each source/sink may have threads - significant impact on performance.
iii. when increase batch size - all other parameters should increase with similar ratio.
iv. be sure to understand difference of batch vs. roll.
v. use unique numbers for debugging.
vi. each event size changes the buffering in the flume drastically. as long as it bound by min/max
values, your config is ok
vii. Consider giving unique prefix names per sink - multiply your tmp files in parallel.
9. File channel insights
● Consider 2 partitions per file channel : data and checkpoint
● Don't Consider file channels per port - unless you have separate partitions per
file channel.
● There is a limit in max capacity the flume can handle. - if you need to buffer a
week of down time, consider scaling out VIA fan in
● Consider Max file capacity written to disk - to be below RAM size, to utilize OS
caching.very efficient
● Consider increasing the transaction Capacity to 1M events for fast recovery
from filechannel reply.
10. File channel insights
● The File Channel takes a comma-separated list of data directories as the value of the
dataDirs parameter
● If the channel is stopped while it is checkpointing, the checkpoint may be incomplete
or corrupt. A corrupt or incomplete checkpoint could make the restart of the channel
extremely slow, as the channel would need to read and replay all data files.To avoid
this problem, it is recommended that the useDualCheckpointsparameter be set to
true and that backupCheckpointDir be set
● It will always retain two files per data directory, even if the files do not have any events to be taken
and committed. The channel will also not delete the file currently being written to.
● Using NFS-mounted disks with the File Channel is not a good idea
● Read: https://www.safaribooksonline.com/library/view/using-flume/9781491905326/ch04.html
● Increase checkpoint interval if you use dual backup. (checkpoint time was doubled if you use dual
11. Hadoop Performance Insights
Hadoop
i. more data node disk increases performance
ii. hadoop was designed for parallelism.but flume sinks - are not very
powerful. Yes you can add more sinks - but - you need stronger cluster - 1
sink = 2 data nodes with at least 2 data partions.
iii. you can add nodes dynamically easily while the cluster is running , so no
need to restart cluster.
iv. Increase IO buffer to 64KB or 128KB (assuming large block size)
v. Increase NN handlers to 20 X number datanode
12. General Performance Insights
Generally speaking:
i. very HARD to simulate the situation with 1 server. I over committed resources, causing failures
in the HDFS.
ii. the amount of thread is ENORMOUS! but very light, and short spanned. not CPU intensive.
iii. no Egress to engine was tested in this scenario
iv. not data correctness was tested.
v. very hard to fine tune flume - each change on file based sinks - take about 10 min to
reflect/stabilize in monitoring (unless it crushed first).