This document provides an overview of the ELK stack, including Logstash for collecting and parsing logs, Elasticsearch for indexing logs, and Kibana for visualizing logs. It discusses using the open source ELK stack as an alternative to Splunk and provides instructions for getting started with a basic ELK implementation.
The document discusses Netflix's use of Elasticsearch for querying log events. It describes how Netflix evolved from storing logs in files to using Elasticsearch to enable interactive exploration of billions of log events. It also summarizes some of Netflix's best practices for running Elasticsearch at scale, such as automatic sharding and replication, flexible schemas, and extensive monitoring.
Delivering: from Kafka to WebSockets | Adam Warski, SoftwareMill | Hosted by Confluent
Here's the challenge: we've got a Kafka topic, where services publish messages to be delivered to browser-based clients through web sockets.
Sounds simple? It might, but we're faced with an increasing number of messages, as well as a growing count of web socket clients. How do we scale our solution? As our system grows to a larger number of servers, failures become more frequent. How do we ensure fault tolerance?
There are a couple of possible architectures. Each websocket node might consume all messages. Alternatively, we need an intermediary, which redistributes the messages to the proper web socket nodes.
Here, we might either use a Kafka topic, or a streaming forwarding service. However, we still need a feedback loop so that the intermediary knows where to distribute messages.
We’ll take a look at the strengths and weaknesses of each solution, as well as limitations created by the chosen technologies (Kafka and web sockets).
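The intermediary-with-feedback-loop idea can be sketched in plain Python (a toy model: the names ClientRegistry and ws-node-1 are hypothetical, and the real Kafka and websocket plumbing is omitted):

```python
# Hypothetical registry implementing the "feedback loop": each websocket node
# reports which clients are connected to it, so the intermediary knows where
# to forward each message consumed from the Kafka topic.
class ClientRegistry:
    def __init__(self):
        self.client_to_node = {}

    def register(self, client_id, node):
        self.client_to_node[client_id] = node

    def node_for(self, client_id):
        return self.client_to_node.get(client_id)

def route(registry, message):
    """Pick the websocket node holding the message's target client."""
    node = registry.node_for(message["client_id"])
    if node is None:
        return None  # client not connected anywhere; drop or buffer
    return node

registry = ClientRegistry()
registry.register("alice", "ws-node-1")
registry.register("bob", "ws-node-2")

assert route(registry, {"client_id": "alice", "body": "hi"}) == "ws-node-1"
assert route(registry, {"client_id": "carol", "body": "hi"}) is None
```

The alternative (every node consumes all messages) needs no registry, but each node then discards most of what it reads, which is exactly the scaling problem the talk examines.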
Apache Kafka is becoming the message bus used to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs. We will also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control over who can read and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, ...) | Confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
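The changelog-replay restore that the talk describes can be illustrated with a dict standing in for RocksDB (a sketch of the idea only, not Kafka Streams' actual implementation, which bulk-loads into RocksDB using its SST-ingestion features):

```python
def restore_state_store(changelog_records):
    """Rebuild a key-value state store by replaying changelog records in order.
    A record with value None is a tombstone and deletes the key; this mirrors
    how Kafka Streams restores a state store from its changelog topic."""
    store = {}
    for key, value in changelog_records:
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value  # later records overwrite earlier ones
    return store

# Replaying the changelog reproduces the latest value per key.
changelog = [("a", 1), ("b", 2), ("a", 3), ("b", None)]
assert restore_state_store(changelog) == {"a": 3}
```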
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Fan-in Flames: Scaling Kafka to Millions of Producers With Ryanne Dolan | Current 2022
At supermassive scale, a perennial problem with Kafka is "high fan-in" -- a large number of producers sending records to a small number of brokers. Even a relatively modest amount of data can overwhelm a broker when there are hundreds of thousands of concurrent producer requests.
This talk discusses a few real-world applications where high fan-in becomes a problem, and presents a few strategies for dealing with it. These include: fronting Kafka with an ingestion layer; separating brokers into read-only and write-only subsets; implementing specialized partitioning strategies; and scaling across clusters with "smart clients".
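The "fronting Kafka with an ingestion layer" strategy can be sketched as follows (a toy model: IngestionLayer is hypothetical, and Kafka's real default partitioner hashes keys with murmur2, not md5):

```python
import hashlib

def partition_for(key, num_partitions):
    """Deterministic key-to-partition mapping, similar in spirit to Kafka's
    default partitioner (sketch only; Kafka itself uses murmur2, not md5)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

class IngestionLayer:
    """Fronting layer: many producers send here; records are batched per
    partition and forwarded to the brokers in far fewer, larger requests."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.batches = {p: [] for p in range(num_partitions)}

    def accept(self, key, value):
        self.batches[partition_for(key, self.num_partitions)].append((key, value))

    def flush(self):
        out = {p: b for p, b in self.batches.items() if b}
        self.batches = {p: [] for p in range(self.num_partitions)}
        return out

layer = IngestionLayer(num_partitions=4)
for i in range(1000):            # a thousand "producers", one record each
    layer.accept(f"sensor-{i}", i)
batches = layer.flush()
assert sum(len(b) for b in batches.values()) == 1000
assert len(batches) <= 4         # brokers see at most 4 batched requests
```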
Kibana + Timelion: time series with the Elastic Stack | Sylvain Wallez
The document discusses Kibana and Timelion, which are tools for visualizing and analyzing time series data in the Elastic Stack. It provides an overview of Kibana's evolution and capabilities for creating dashboards. Timelion is introduced as a scripting language that allows users to transform, aggregate, and calculate on time series data from multiple sources to create visualizations. The document demonstrates Timelion's expression language, which includes functions, combinations, filtering, and attributes to process and render time series graphs.
The document provides an introduction to the ELK stack, which is a collection of three open source products: Elasticsearch, Logstash, and Kibana. It describes each component, including that Elasticsearch is a search and analytics engine, Logstash is used to collect, parse, and store logs, and Kibana is used to visualize data with charts and graphs. It also provides examples of how each component works together in processing and analyzing log data.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Iceberg: A modern table format for big data (Strata NY 2018) | Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
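The snapshot-isolation and atomic-commit properties above can be illustrated with a toy metadata model (a sketch of the idea only; Iceberg's real metadata is a persisted tree of manifest files, not in-memory lists):

```python
class TableMetadata:
    """Toy model of Iceberg-style snapshots: each snapshot is an immutable
    list of data files, and a commit atomically swaps the current-snapshot
    pointer, so readers never see a partial write and never list directories."""
    def __init__(self):
        self.snapshots = [[]]          # snapshot 0: empty table
        self.current = 0

    def commit(self, added=(), removed=()):
        files = [f for f in self.snapshots[self.current] if f not in removed]
        files.extend(added)
        self.snapshots.append(files)
        self.current = len(self.snapshots) - 1   # the atomic pointer swap

    def scan(self, snapshot_id=None):
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

t = TableMetadata()
t.commit(added=["f1.parquet", "f2.parquet"])
reader_snapshot = t.current                 # a reader pins snapshot 1
t.commit(added=["f3.parquet"], removed=["f1.parquet"])
assert t.scan(reader_snapshot) == ["f1.parquet", "f2.parquet"]  # isolated read
assert t.scan() == ["f2.parquet", "f3.parquet"]                 # latest state
```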
Scaling your Data Pipelines with Apache Spark on Kubernetes | Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing in Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
PromQL Deep Dive - The Prometheus Query Language | Weaveworks
- What is PromQL
- PromQL operators
- PromQL functions
- Hands on: Building queries in PromQL
- Hands on: Visualizing PromQL in Grafana
- Prometheus alerts in PromQL
- Hands on: Creating an alert in Prometheus with PromQL
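As a taste of what a core PromQL function computes, here is a simplified reimplementation of rate() over raw counter samples (a sketch under simplifying assumptions: real rate() also extrapolates the increase to the window boundaries):

```python
def counter_rate(samples):
    """Per-second rate over (timestamp, value) counter samples, handling
    counter resets the way PromQL's rate() does: a drop in value is treated
    as a reset, and the increase continues from zero after it."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a reset (cur < prev), the post-reset value is the increase so far.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Counter resets between t=30 and t=60; the rate still comes out positive.
samples = [(0, 100), (30, 130), (60, 10)]
assert counter_rate(samples) == (30 + 10) / 60
```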
How to build a streaming Lakehouse with Flink, Kafka, and Hudi | Flink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
So, what is the ELK Stack? "ELK" is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch.
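That collect-parse-index-visualize flow can be miniaturized in a few lines of Python (TinyIndex is a hypothetical stand-in for Elasticsearch, and the regex covers only a fragment of the Apache access-log format that Logstash's grok filters parse):

```python
import re

# Matches the start of a combined access-log line: client IP, timestamp,
# HTTP method, and request path.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)

def parse(line):
    """Logstash-style parsing step: raw log line -> structured document."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

class TinyIndex:
    """Stand-in for Elasticsearch: stores docs and answers field:value queries."""
    def __init__(self):
        self.docs = []

    def index(self, doc):
        self.docs.append(doc)

    def search(self, field, value):
        return [d for d in self.docs if d.get(field) == value]

idx = TinyIndex()
line = '10.0.0.1 - - [18/Nov/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
idx.index(parse(line))
assert idx.search("method", "GET")[0]["path"] == "/index.html"
```

Kibana's role would be the final step: aggregating over documents like these and rendering the counts as charts.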
Introduction to Elasticsearch with basics of Lucene | Rahul Jain
Rahul Jain gives an introduction to Elasticsearch and its basic concepts like term frequency, inverse document frequency, and boosting. He describes Lucene as a fast, scalable search library that uses inverted indexes. Elasticsearch is introduced as an open source search platform built on Lucene that provides distributed indexing, replication, and load balancing. Logstash and Kibana are also briefly described as tools for collecting, parsing, and visualizing logs in Elasticsearch.
ksqlDB: Building Consciousness on Real Time Events | Confluent
This document discusses ksqlDB, a streaming SQL engine for Apache Kafka. It allows users to write streaming applications using familiar SQL queries against Kafka topic data. Some key points made include:
- ksqlDB allows users to create, select, and join streaming data in Kafka topics using SQL queries without the need for Java or other code
- It provides a simpler way to build streaming applications compared to Kafka Streams by using SQL
- Examples show how ksqlDB can be used for real-time monitoring, anomaly detection, streaming ETL, and data transformations.
ELK Stack workshop covers real-world use cases and works with the participants to implement them. This includes Elastic overview, Logstash configuration, creation of dashboards in Kibana, guidelines and tips on processing custom log formats, designing a system to scale, choosing hardware, and managing the lifecycle of your logs.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
Log Management
Log Monitoring
Log Analysis
Need for Log Analysis
Problem with Log Analysis
Some Log Management Tools
What is ELK Stack
ELK Stack Working
Beats
Different Types of Server Logs
Example of Winlogbeat, Packetbeat, Apache2 and Nginx server log analysis
Mimikatz
Malicious File Detection using ELK
Practical Setup
Conclusion
Apache Flink 101 - the rise of stream processing and beyond | Bowen Li
This document provides an overview and summary of Apache Flink. It discusses how Flink enables stateful stream processing and beyond. Key points include that Flink allows for stateful computations over event streams in an expressive, scalable, fault-tolerant way through layered APIs. It also supports batch processing, machine learning, and serving as a stream processor that unifies streaming and batch. The document highlights many use cases of Flink at Alibaba and how it powers critical systems like real-time analytics and recommendations.
Elasticsearch is a distributed, open source search and analytics engine built on Apache Lucene. It allows storing and searching of documents of any schema in JSON format. Documents are organized into indexes which can have multiple shards and replicas for scalability and high availability. Elasticsearch provides a RESTful API and can be easily extended with plugins. It is widely used for full-text search, structured search, analytics and more in applications requiring real-time search and analytics of large volumes of data.
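The shard routing mentioned above works by hashing each document's routing key; a sketch of the idea (Elasticsearch actually hashes the _id with murmur3; crc32 here is a stand-in for illustration):

```python
import zlib

def shard_for(doc_id, num_primary_shards):
    """Route a document to a primary shard by hashing its routing key
    (by default the document _id) modulo the shard count. This dependence
    on the shard count is why it is fixed at index creation time.
    Sketch only: Elasticsearch uses murmur3, not crc32."""
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# The same document always lands on the same shard.
assert shard_for("doc-1", 5) == shard_for("doc-1", 5)
assert 0 <= shard_for("doc-2", 5) < 5
```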
Introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from perspective of MySQL/PHP users. Given for 2nd year students of professional bachelor in ICT at Kaho St. Lieven, Gent.
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Training | Edureka!
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka Elasticsearch Tutorial will help you understand the fundamentals of Elasticsearch along with its practical usage and help you build a strong foundation in the ELK Stack. This video helps you learn the following topics:
1. What Is Elasticsearch?
2. Why Elasticsearch?
3. Elasticsearch Advantages
4. Elasticsearch Installation
5. API Conventions
6. Elasticsearch Query DSL
7. Mapping
8. Analysis
9. Modules
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... | Hosted by Confluent
Apache Hudi is a data lake platform that provides streaming primitives (upserts/deletes/change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood and other companies, while being pre-installed on four major cloud platforms.
Hudi supports exactly-once, near real-time data ingestion from Apache Kafka to cloud storage, and is typically used in place of an S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini batches, leading to lower data freshness. In this talk, we introduce a Kafka Connect Sink Connector for Apache Hudi, which writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services like indexing, compaction, and clustering work behind the scenes to further re-organize for better query performance.
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers who are used to building php/mysql apps to broaden their horizon when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
- Prashant Agrawal has over 5 years of experience as a Big Data Analyst with expertise in log analytics, search engine solutions, and ETL using tools like Spark, Elasticsearch, Logstash, and Kibana.
- He has strong skills in distributed computing systems like Hadoop, Spark, and working with Hortonworks Data Platform clusters.
- His projects include log analytics and visualization using ELK, data lake modules in Spark, Spark ETL, and developing a big data platform for predictive analysis of system logs.
Getting Started with Elastic Stack.
Detailed blog for the same
http://vikshinde.blogspot.co.uk/2017/08/elastic-stack-introduction.html
This document discusses Apache Spark, an open-source cluster computing framework. It describes how Spark allows for faster iterative algorithms and interactive data mining by keeping working sets in memory. The document also provides an overview of Spark's ease of use in Scala and Python, built-in modules for SQL, streaming, machine learning, and graph processing, and compares Spark's machine learning library MLlib to other frameworks.
The document provides an introduction to the ELK stack for log analysis and visualization. It discusses why large data tools are needed for network traffic and log analysis. It then describes the components of the ELK stack - Elasticsearch for storage and search, Logstash for data collection and parsing, and Kibana for visualization. Several use cases are presented, including how Cisco and Yale use the ELK stack for security monitoring and analyzing biomedical research data.
Centralized Logging System Using ELK StackRohit Sharma
Centralized Logging System using ELK Stack
The document discusses setting up a centralized logging system (CLS) using the ELK stack. The ELK stack consists of Logstash to capture and filter logs, Elasticsearch to index and store logs, and Kibana to visualize logs. Logstash agents on each server ship logs to Logstash, which filters and sends logs to Elasticsearch for indexing. Kibana queries Elasticsearch and presents logs through interactive dashboards. A CLS provides benefits like log analysis, auditing, compliance, and a single point of control. The ELK stack is an open-source solution that is scalable, customizable, and integrates with other tools.
This slide deck talks about Elasticsearch and its features.
When you talk about the ELK stack, you mean Elasticsearch, Logstash, and Kibana. When you talk about the Elastic Stack, other components such as Beats and X-Pack are also included.
what is the ELK Stack?
ELK vs Elastic stack
What is Elasticsearch used for?
How does Elasticsearch work?
What is an Elasticsearch index?
Shards
Replicas
Nodes
Clusters
What programming languages does Elasticsearch support?
Amazon Elasticsearch, its use cases and benefits
The world has changed, and having one huge server won't do the job anymore. When you're dealing with vast amounts of data that keep growing, the ability to scale out is your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
This lecture will be about the basics of Apache Spark and distributed computing and the development tools needed to have a functional environment.
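Spark's model of per-partition transformations followed by a merge step can be sketched locally in plain Python (partitions here are plain lists; in real Spark they live on different machines and the merge involves a network shuffle):

```python
from collections import Counter
from functools import reduce

def word_count(partitions):
    """Spark-style word count, sketched locally: the map phase produces
    per-partition counts, and the reduce phase merges them into one result."""
    mapped = [Counter(word for line in part for word in line.split())
              for part in partitions]
    return reduce(lambda a, b: a + b, mapped, Counter())

# Two "partitions" of a small text dataset.
partitions = [["to be or", "not to be"], ["to be is to do"]]
counts = word_count(partitions)
assert counts["to"] == 4 and counts["be"] == 3
```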
Presented on Tuesday, August 7, at the 2018 LRCN (Librarians' Registration Council of Nigeria) National Workshop on Electronic Resource Management Systems in Libraries, held at the University of Nigeria, Nsukka, Enugu State, Nigeria
This document provides an overview of DevOps and the Elastic Stack. It defines IT operations (Ops) and describes the historical separation between development and operations teams. This led to bottlenecks and slow delivery. DevOps aims to break down barriers through practices like continuous integration, monitoring, and automation. The Elastic Stack is a suite of open source tools for logging, monitoring, and security including Elasticsearch, Kibana, Logstash, Beats, and X-Pack. It allows users to collect and analyze logs and metrics from multiple sources in real-time.
Deep Dive on ElasticSearch Meetup event on 23rd May '15 at www.meetup.com/abctalks
Agenda:
1) Introduction to NOSQL
2) What is ElasticSearch and why is it required
3) ElasticSearch architecture
4) Installation of ElasticSearch
5) Hands on session on ElasticSearch
The State of Apache HBase - Michael Stack (hdhappy001)
The document provides an overview of Apache HBase, an open source, distributed, scalable, big data non-relational database. It discusses that HBase is modeled after Google's Bigtable and built on Hadoop for storage. It also summarizes that HBase is used by many large companies for applications such as messaging, real-time analytics, and search indexing. The project is led by an active community of committers and sees steady improvements and new features with each monthly release.
Streaming Solutions for Real-time Problems - Abhishek Gupta
The document is a presentation on streaming solutions for real-time problems using Apache Kafka, Kafka Streams, and Redis. It begins with an introduction and overview of the technologies. It then presents a sample monitoring application using metrics from multiple machines as a use case. The presentation demonstrates how to implement this application using Kafka as the event store, Kafka Streams for processing, and Redis as the state store. It also shows how to deploy the application components on Oracle Cloud.
The document discusses operations in the cloud and best practices. It describes how companies like Zynga and others were able to scale games and applications using AWS services like EC2, S3, EBS, ELB, and RDS. It discusses high availability, making applications stateless, monitoring, and open source alternatives. Best practices include infrastructure as code, automated provisioning, eliminating single points of failure, caching, and following the sun for development.
2. Introduction
Tin Le ([email protected])
Senior Site Reliability Engineer
Formerly part of the Mobile SRE team, responsible for servers
handling traffic from mobile apps (iOS, Android, Windows, RIM,
etc.).
Now responsible for guiding ELK @ LinkedIn as a whole
3. Problems
● Multiple data centers, tens of thousands of servers,
hundreds of billions of log records
● Logging, indexing, searching, storing, visualizing and
analyzing all of those logs, all day, every day
● Security (access control, storage, transport)
● Scaling to more DCs, more servers, and even more
logs…
● ARRGGGGHHH!!!!!
4. Solutions
● Commercial
o Splunk, Sumo Logic, HP ArcSight Logger, Tibco,
XpoLog, Loggly, etc.
● Open Source
o Syslog + Grep
o Graylog
o Elasticsearch
o etc.
5. Criteria
● Scalable - horizontally, by adding more nodes
● Fast - as close to real time as possible
● Inexpensive
● Flexible
● Large user community (support)
● Open source
7. ELK at LinkedIn
● 100+ ELK clusters across 20+ teams and 6
data centers
● Some of our larger clusters have:
o More than 32 billion docs (30+ TB)
o Daily indices average 3.0 billion docs (~3TB)
8. ELK + Kafka
Summary: ELK is a popular open-source application stack for
visualizing and analyzing logs. ELK is currently being used across
many teams within LinkedIn. The architecture we use is made up of
four components: Elasticsearch, Logstash, Kibana and Kafka.
● Elasticsearch: Distributed real-time search and analytics engine
● Logstash: Collect and parse all data sources into an easy-to-read
JSON format
● Kibana: Elasticsearch data visualization engine
● Kafka: Data transport, queue, buffer and short term storage
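As a rough illustration of how these four components fit together, a minimal Logstash pipeline reading from Kafka and writing daily indices into Elasticsearch might look like the sketch below. This is not the deck's actual configuration (the later notes explain that LinkedIn used the pipe input with the Kafka Console Consumer for throughput), and the broker, topic, host, and index names are hypothetical:

```
input {
  kafka {
    # Hypothetical broker and topic names (recent logstash-input-kafka syntax)
    bootstrap_servers => "kafka01:9092"
    topics => ["log.myapp"]
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["es01:9200"]
    # Daily indices, matching the "daily indices" sizing on slide 7
    index => "logs-myapp-%{+YYYY.MM.dd}"
  }
}
```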
9. What is Kafka?
● Apache Kafka is a high-throughput distributed
messaging system
o Invented at LinkedIn and Open Sourced in 2011
o Fast, Scalable, Durable, and Distributed by Design
o Links for more:
http://kafka.apache.org
http://data.linkedin.com/opensource/kafka
10. Kafka at LinkedIn
● Common data transport
● Available and supported by dedicated team
o 875 Billion messages per day
o 200 TB/day In
o 700 TB/day Out
o Peak Load
10.5 Million messages/s
18.5 Gigabits/s Inbound
70.5 Gigabits/s Outbound
11. Logging using Kafka at LinkedIn
● Dedicated cluster for logs in each data center
● Individual topics per application
● Defaults to 4 days of transport level retention
● Not currently replicating between data centers
● Common logging transport for all services, languages
and frameworks
12. ELK Architectural Concerns
● Network Concerns
o Bandwidth
o Network partitioning
o Latency
● Security Concerns
o Firewalls and ACLs
o Encrypting data in transit
● Resource Concerns
o A misbehaving application can swamp production resources
15. Operational Challenges
● Data, lots of it.
o Transporting, queueing, storing, securing,
reliability…
o Ingesting & Indexing fast enough
o Scaling infrastructure
o Which data? (right data needed?)
o Formats, mapping, transformation
Data from many sources: Java, Scala, Python, Node.js, Go
16. Operational Challenges...
● Centralized vs Siloed Cluster Management
● Aggregated views of data across the entire
infrastructure
● Consistent view (trace up/down app stack)
● Scaling - horizontally or vertically?
● Monitoring, alerting, auto-remediating
17. The future of ELK at LinkedIn
● More ELK clusters being used by even more teams
● Clusters with 300+ billion docs (300+TB)
● Daily indices average 10+ billion docs, 10TB - move to
hourly indices
● ~5,000 shards per cluster
18. Extra slides
Next two slides contain example logstash
configs to show how we use input pipe plugin
with Kafka Console Consumer, and how to
monitor logstash using metrics filter.
#4: Many applications. Mobile frontend logs: average 2.4TB size (3+ billion docs).
#5: Evaluated, used by some teams. Some Acquisitions use commercial solutions.
Commercial solutions cost prohibitive.
#8: Expect storage size to increase as we migrate to using doc_values
#9: Leveraging open source stack. Large community.
Leveraging common data transport. Rock solid, proven, dedicated support team.
#10: Fast : a single Kafka broker can handle hundreds of megabytes of reads and writes per sec from thousands of clients.
Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages w/o performance impact.
Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
#11: These numbers are from LinkedIn Kafka presentation at Apache Con 2015.
Over 1100 brokers over 50+ clusters
Over 32000 topics
Over 350 thousand partitions, not including replication.
#12: Log retention is long enough to cover a 3-day weekend, so we can re-index from Kafka if we encounter issues.
Dedicated log clusters to isolate traffic from other clusters.
Two years ago, we used our own Kafka input plugin. We then switched to using KCC via the pipe input for performance reasons.
Monitoring:
It’s important to monitor logstash nodes. LS has a bug where an error in any of its inputs, filters or outputs will stop the entire LS process. See the metrics filter config file at the end of these slides.
#13: We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns.
The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring.
There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now.
The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
#14: For security reasons, data/logs generated in each DC stays there. Indexed by local ELK cluster. Aggregated views via Tribe nodes.
All logstash use common filters to catch most common data leakage.
How services log to Kafka: we imposed a common logging library. All services use the common library, which automatically logs to Kafka at WARN or above.
#15: General architecture for each ELK cluster.
Dedicated masters.
Tribe client node (HTTP services).
#16: Data - reliable transport, storing, queueing, consuming, indexing.
Some data (Java service logs, for example) is not in the right format.
Solutions to Data
Kafka as transport, storage queue, backbone.
More logstash instances, more Kafka partitions.
Using KCC we can consume faster than ES can index.
To increase indexing speed
More ES nodes (horizontal).
More shards (distribute work)
Customized templates
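The note above mentions customized templates as one way to speed up indexing. The actual templates are not included in this transcript; a hedged sketch of what one might look like on an Elasticsearch 1.x cluster of that era follows, where the shard count, refresh interval, and field names are illustrative assumptions (the doc_values setting relates to the migration mentioned in note #8):

```
PUT /_template/logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "_default_": {
      "properties": {
        "message": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}
```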
#17: Using local Kafka log clusters instead of aggregated metrics.
Tribe to aggregate clusters.
Use internal tool call Nurse to monitor and auto-remediate (restarts) hung/dead instances of LS and ES.
#18: These numbers are estimate based on growth rate and plans.
Besides logs, we have other application use cases internally.
#20: This is how we use the Logstash pipe input plugin to call out to Kafka Console Consumer. This currently gives us the highest ingestion throughput.
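The referenced config slide is not reproduced in this transcript; a minimal sketch of a pipe input wrapping the Kafka Console Consumer might look like the following, where the install path, ZooKeeper address, and topic name are hypothetical placeholders rather than LinkedIn's actual values:

```
input {
  pipe {
    # Logstash reads each line the console consumer prints to stdout;
    # paths and addresses here are illustrative only.
    command => "/opt/kafka/bin/kafka-console-consumer.sh --zookeeper zk01:2181 --topic log.myapp"
    type => "kafka"
    codec => "json"
  }
}
```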
#21: It’s important to monitor logstash nodes. LS has a bug where an error in any of its inputs, filters or outputs will stop the entire LS process. You can use the Logstash metrics filter to make sure that LS is still processing data. Sometimes LS runs but no data goes through; this will let you know when that happens.
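The metrics-filter config itself is not included in this transcript; a minimal sketch of the technique described above might look like the following (meter name and output format are illustrative assumptions):

```
filter {
  metrics {
    # Emit a synthetic event with rate counters at each flush interval
    meter => "events"
    add_tag => "metrics"
  }
}
output {
  # If these rate lines stop appearing, Logstash has stalled even though
  # the process may still be running.
  if "metrics" in [tags] {
    stdout {
      codec => line { format => "1m event rate: %{[events][rate_1m]}" }
    }
  }
}
```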