Log Data Analysis Platform is a fully automated system for ingesting, processing, and storing huge amounts of log data, built on Flume, Spark, Hadoop, Impala, Hive, Elasticsearch, and Kibana.
4. Demo Lab: Why we started this project
1) Increase internal experience
2) Create a reference solution w/o NDA limitations
3) Get a playground for tests
4) Provide a demo environment for customers (using their data)
5) Decrease time to market (by introducing automation)
6. Log Data Analysis Platform Details
Key Facts:
• ~270-300 web servers
• Log types: HTTPD access logs, error logs, application server servlet logs, OS service logs
• ~500K events per minute
• 150GB of data per day
Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana 3
• Tableau analytics platform
• Puppet + Vagrant
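A quick back-of-envelope check of those key facts (a minimal Python sketch, illustrative only; the HDFS replication factor and retention window below are assumptions, not numbers from the slides):

events_per_min = 500_000                                  # ~500K events per minute (slide 6)
raw_gb_per_day = 150                                      # 150GB of data per day (slide 6)

events_per_day = events_per_min * 60 * 24                 # ~720 million events/day
avg_event_bytes = raw_gb_per_day * 1e9 / events_per_day   # ~208 bytes/event on average

hdfs_replication = 3                                      # assumed HDFS default replication
retention_days = 90                                       # assumed retention window
hdfs_tb_needed = raw_gb_per_day * hdfs_replication * retention_days / 1000

print(f"{events_per_day:,} events/day, ~{avg_event_bytes:.0f} bytes/event")
print(f"~{hdfs_tb_needed:.1f} TB of HDFS capacity for {retention_days} days of raw data")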
7. Log Data Examples
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0
iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    0.52    2.03    0.00   91.76
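The access-log sample above uses the standard Apache common log format, so each line can be parsed with a single regular expression. Below is a minimal, illustrative Python sketch (not part of the platform's actual pipeline code):

import re

# Apache common log format: host, ident, user, timestamp, request, status, bytes.
ACCESS_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Parse one access-log line into a dict, or return None if it does not match."""
    m = ACCESS_LOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_access_line(sample))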
16. Solution Architecture
The platform follows the Lambda architecture, shown on the slide as a diagram of layer boundaries, data flows, and query flows across three layers:
• Batch layer: raw data storage plus precomputing of static batch views and ad-hoc batch views
• Speed layer: real-time processing and aggregations over the incoming data stream, producing real-time views
• Serving layer: static views and real-time views exposed to the dashboard/search UI and the corporate BI tool
Data arrives as a stream from the Apache HTTP servers, landing in raw data storage (batch layer) and in real-time processing (speed layer); queries are answered from the serving-layer views.
Implementation choices:
• Avro as the raw data storage file format
• Parquet as the batch views file format
• Star schema as the batch views data model
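As an illustration of the batch layer's precomputing step, the sketch below shows how a Spark job could read the raw Avro events and materialize an aggregated Parquet batch view. This is a hedged example, not the platform's actual code: the paths, column names, and use of the spark-avro package are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Requires the spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 precompute_views.py
spark = SparkSession.builder.appName("precompute-batch-views").getOrCreate()

# Raw access-log events written as Avro by the ingestion layer (path and columns are illustrative).
raw = spark.read.format("avro").load("hdfs:///data/raw/access_logs/")

# Precompute a simple batch view: request and server-error counts per host per day.
daily_view = (
    raw.withColumn("day", F.to_date("timestamp"))
       .groupBy("day", "host")
       .agg(
           F.count("*").alias("requests"),
           F.sum(F.when(F.col("status").cast("int") >= 500, 1).otherwise(0)).alias("server_errors"),
       )
)

# Batch views are stored as Parquet, partitioned so Hive/Impala and the BI tool can query them.
daily_view.write.mode("overwrite").partitionBy("day").parquet("hdfs:///data/views/daily_host_stats/")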
30. Outcome
1) Demo lab, playground, testing platform (in 1 hour)
2) Sizing calculator
3) Helped to get 3 new customers (one is really, really huge)
4) Strategic partnership with Cloudera
5) Tons of experience and fun
Plans
1) Add support for other Hadoop distributions (Hortonworks, MapR)
2) Make the project open source
31. Thank You!
SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700, Austin, TX 78701
Tel: 512.516.8880
Contacts
Valentyn Kropov
[email protected]
Tel: 866.687.3588 x4341