By Tom White, Software Engineer at Cloudera
Video at: https://www.youtube.com/watch?v=ibgoMdca5mQ&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR&index=1
Fast real-time approximations using Spark Streaming (huguk)
By Kevin Schmidt (Head of Data Science at Mind Candy)
Luis Vicente (Senior Data Engineer at Mind Candy)
For mobile games, constant tweaks are the difference between success and failure. Data and analytics have to be available in real-time, but calculating, for example, uniqueness or newness of a data point requires a list of seen data points - both memory intensive and tricky when using real-time stream processing like Spark Streaming. Probabilistic data structures allow approximation of these properties with a fixed memory representation, and are very well suited for this kind of stream processing. Getting from the theory of approximation to a useful metric at a low error rate even for many millions of users is another story. In our talk we will look at practical ways of achieving this: which approximation we used for selection of useful metrics, why we picked a specific probabilistic data structure, how we stored it in Cassandra as a time series and how we implemented it in Spark Streaming.
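As a rough illustration of the fixed-memory idea this abstract describes (not the talk's actual implementation; the sizing parameters below are made up), a Bloom filter can answer "have we seen this data point before?" in constant memory, at the cost of occasional false positives:

```python
import hashlib

class BloomFilter:
    """Fixed-memory approximate set membership (illustrative sizing only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely new; True means probably seen before.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
for user_id in ["u1", "u2", "u1"]:
    is_new = not seen.might_contain(user_id)
    seen.add(user_id)
    print(user_id, "new" if is_new else "seen before")
```

The same mergeability property that makes sketches like this (or HyperLogLog for distinct counts) attractive is what makes them fit naturally into stream processing and time-series storage.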
This document summarizes Gareth Llewellyn's experience redesigning the network architecture at DataSift to improve performance and scalability. The initial Cisco-based design suffered from issues like buffering, head of line blocking, and oversubscription of uplinks. Gareth considered moving to an Arista leaf-spine architecture with Arista 7050 core switches and 7048 top-of-rack switches, which would provide better redundancy, scalability, and throughput while reducing complexity compared to the mesh design. Questions are welcomed about the new design.
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ... (Michael Stack)
This document summarizes a presentation on scaling a 30 TB data lake using Apache HBase and Scala. It introduces Apache HBase and Spark as technologies for building fast data platforms. It then describes a case study where they were used to architect a retail analytics platform capable of processing 4.6 billion events weekly from 30 TB of historical data. Key aspects included using HBase for data deduplication and as a master data management system, and connecting Spark to HBase using a Scala DSL for efficient querying and updates at scale. Performance was improved 5x by reengineering the data pipeline to be highly concurrent and asynchronous.
Netflix running Presto in the AWS Cloud (Zhenxiao Luo)
Netflix runs Presto in its AWS cloud environment to enable low-latency ad-hoc queries on petabyte-scale data stored in S3. Some key things Netflix did include optimizing Presto to read from and write directly to S3, fixing bugs, integrating Presto with its EMR and Ganglia monitoring, and deploying a 100+ node Presto cluster that handles over 1000 queries per day. Performance testing showed Presto was often 10x faster than Hive for various queries and joins. Netflix continues optimizing Presto for its needs like supporting Parquet, ODBC/JDBC drivers, and looking to address current limitations.
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez (DataWorks Summit)
Last year at Yahoo, we spent a great deal of effort scaling, stabilizing and making Pig on Tez production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworking DAG scheduling order, serialization changes, etc.
We will also cover exciting new features that were added to Pig for performance such as bloom join and byte code generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs. It vastly improved performance and reduced disk and network utilization for our large joins. Byte code generation for projection and filtering of records is another big feature that we are targeting for Pig 0.17 which will speed up processing by reducing the virtual function calls.
Hoodie: How (And Why) We built an analytical datastore on Spark (Vinoth Chandar)
This talk explores a specific problem, ingesting petabytes of data at Uber, and why they ended up building an analytical datastore from scratch using Spark. It then discusses design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
Data Analytics Service Company and Its Ruby Usage (Satoshi Tagomori)
This document summarizes Satoshi Tagomori's presentation on Treasure Data, a data analytics service company. It discusses Treasure Data's use of Ruby for various components of its platform including its logging (Fluentd), ETL (Embulk), scheduling (PerfectSched), and storage (PlazmaDB) technologies. The document also provides an overview of Treasure Data's architecture including how it collects, stores, processes, and visualizes customer data using open source tools integrated with services like Hadoop and Presto.
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C... (Databricks)
It is common for consumer Internet companies to start off with popular third-party tools for analytics needs. Then, when the user base and the company grows, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data enrichment and reporting needs and achieve high quality of their data. That’s exactly the path that was taken at Grammarly, the popular online proofreading service.
In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov will touch upon several Spark tweaks and gotchas that they experienced along the way:
– Outputting data to several storages in a single Spark job
– Dealing with Spark memory model, building a custom spillable data-structure for your data traversal
– Implementing a custom query language with parser combinators on top of the Spark SQL parser
– A custom query optimizer and analyzer for when you want something that is not exactly SQL
– Flexible-schema storage and query against multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL (see the sketch after this list)
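As one hedged illustration of that last point (a generic PySpark example, not Grammarly's code; the table, column names, and percentile metric are invented), a custom aggregation can be registered as a grouped pandas UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("custom-agg-sketch").getOrCreate()

# Hypothetical event data: (user_id, session_seconds)
events = spark.createDataFrame(
    [("u1", 30.0), ("u1", 90.0), ("u2", 10.0)],
    ["user_id", "session_seconds"],
)

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def p90(v):
    # Custom aggregate: 90th percentile of a numeric column.
    return float(v.quantile(0.9))

events.groupBy("user_id").agg(p90("session_seconds").alias("p90_session")).show()
```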
HBaseConAsia2018 Track2-4: HTAP DB-System: ApsaraDB HBase, Phoenix, and Spark (Michael Stack)
This document discusses using Phoenix and Spark with ApsaraDB HBase. It covers the architecture of Phoenix as a service over HBase, use cases like log and internet company scenarios, best practices for table properties and queries, challenges around availability and stability, and improvements being made. It also discusses how Spark can be used for analysis, bulk loading, real-time ETL, and to provide elastic compute resources. Example architectures show Spark SQL analyzing HBase and structured streaming incrementally loading data. Scenarios discussed include online reporting, complex analysis, log indexing and querying, and time series monitoring.
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time (Michael Stack)
CCSMap is a new data structure introduced by Alibaba to improve the performance of HBase. It aims to reduce the overhead of the default Java ConcurrentSkipListMap (CSLM) data structure and improve young garbage collection times. CCSMap chunks data into fixed size blocks for better memory management and uses direct pointers between nodes for faster access. It also provides various configuration options. Alibaba has achieved significant performance gains using CCSMap in HBase, including reduced young GC times, and it continues working to integrate CCSMap further and add new features.
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... (Databricks)
Building data products requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It proved Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
This presentation is from a seminar held jointly by Cathay Bank and the AWS User Group in Taiwan. It offers an overview of Amazon EMR and AWS Glue and presents CDK management of those services through practical scenarios.
The Data Pipeline team at Demonware (Activision) has to deal with routing large amounts of data from various sources to many destinations every day.
Our team always wanted to be able to query processed data for debugging and analytical purposes, but creating large data warehouses was never our priority, since it usually happens downstream.
AWS Athena is a completely serverless query service that doesn't require any infrastructure setup or complex provisioning. We just needed to save some of our data streams to AWS S3 and define a schema. Just a few simple steps, but in the end we were able to write complex SQL queries against gigabytes of data and get results in seconds.
In this presentation I want to show multiple ways to stream your data to AWS S3, explain some underlying tech, show how to define a schema and finally share some of the best practices we applied.
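A minimal sketch of that workflow with boto3 (the bucket, database, and table names below are placeholders, not Demonware's; it assumes the data already lands in S3 and a table has been defined, e.g. via a CREATE EXTERNAL TABLE statement):

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off a SQL query against data that already sits in S3.
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM events GROUP BY status",
    QueryExecutionContext={"Database": "pipeline_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```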
We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud (Michael Stack)
New Journey of HBase in Alibaba and Cloud discusses Alibaba's use of HBase over 8 years and improvements made. Key points discussed include:
- Alibaba began using HBase in 2010 and has since contributed to the open source community while developing internal improvements.
- Challenges addressed include JVM garbage collection pauses, separating computing and storage, and adding cold/hot data tiering. A diagnostic system was also created.
- Alibaba uses HBase across many core scenarios and has integrated it with other databases in a multi-model approach to support different workloads.
- Benefits of running HBase on cloud include flexibility, cost savings, and making it
This document provides steps for performing big data analytics using Amazon Redshift, EC2, and S3. It outlines how to 1) plan and launch a Redshift cluster, 2) connect a client and load data from S3, and 3) query the Redshift database from an external client. Key points are that Redshift is a fast, managed data warehouse service, optimized for processing large datasets ranging from GB to PB for low cost compared to other solutions. It also takes on administrative tasks so users can focus on analytics.
HBaseConAsia2018 Track3-2: HBase at China Telecom (Michael Stack)
HBase is used at China Telecom for various applications including persistence for streaming jobs, online reading and writing, and as a data store for their core system. They operate several HBase clusters storing over 500 TB of data ingesting 1 TB per day. They monitor HBase using Ganglia for basic metrics and Zabbix for critical alerts. When issues arise, such as a system hang, they investigate debug cases and perform optimizations like changing the garbage collector from CMS to G1 and implementing read/write splitting.
Amazon Redshift is a fully managed petabyte-scale data warehouse service in the cloud. It provides fast query performance at a very low cost. Updates since re:Invent 2013 include new features like distributed tables, remote data loading, approximate count distinct, and workload queue memory management. Customers have seen query performance improvements of 20-100x compared to Hive and cost reductions of 50-80%. Amazon Redshift makes it easy to setup, operate, and scale a data warehouse without having to worry about provisioning and managing hardware.
Rethinking the database for the cloud (iJAWS) (Rasmus Ekman)
The document discusses rethinking database architecture for cloud environments. Traditional on-premises architectures with a single relational database can have problems with scalability, management difficulty, cost and performance. The cloud allows for distributing data across several specialized services matched to data type and access patterns. Examples show mapping services like DynamoDB, S3, RDS and Redshift to use cases like social gaming and e-commerce based on factors like data temperature, latency and cost. Choosing the right architecture is important for performance, reliability and scalability in the cloud.
This document provides an agenda and overview for a workshop on building a data lake on AWS. The agenda includes reviewing data lakes, modernizing data warehouses with Amazon Redshift, data processing with Amazon EMR, and event-driven processing with AWS Lambda. It discusses how data lakes extend traditional data warehousing approaches and how services like Redshift, EMR, and Lambda can be used for analytics in a data lake on AWS.
- Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service in the cloud. It uses massively parallel processing and columnar storage to enable fast queries on large data sets for a fraction of the cost of traditional data warehousing.
- Some key features include automatic scaling, continuous backups, integrated security and access controls, integration with other AWS services like S3 and DynamoDB, and simple point-and-click management.
- Customers are seeing significant improvements in performance, often 50-100x faster than alternatives like Hive, as well as large cost reductions of up to 80% compared to on-premises data warehousing.
The concept of big data has become familiar by now, but we still need to think about how to apply it to business and get the most out of it. Easily storing, analyzing, and visualizing valuable data is an important step toward gaining business insight.
This talk introduces how to build simpler and faster big data analytics services using the various data analysis tools AWS provides, such as AWS Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
This talk gives a brief introduction to Amazon DynamoDB, a NoSQL database service, and covers the newly announced time-based (TTL) data management feature and the new in-memory cache feature (Amazon DynamoDB Accelerator).
Speaker: Pranav Nambiar, Head Product Manager for Amazon DynamoDB, Amazon Web Services
Scaling on AWS for the First 10 Million Users at Websummit Dublin (Ian Massingham)
In this talk from the Dublin Websummit 2014 AWS Technical Evangelist Ian Massingham discusses the techniques that AWS customers can use to create highly scalable infrastructure to support the operation of large scale applications on the AWS cloud.
Includes a walk-through of how you can evolve your architecture as your application becomes more popular and you need to scale up your infrastructure to support increased demand.
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta (huguk)
As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline, and is often considered to be its crux, taking as much as 80% of an analyst's time. In this presentation we will discuss how data wrangling solutions can be leveraged to streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has 7 years experience in analytics with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Stephen Taylor is the community manager for Ether Camp. They provide an analysis tool for the Ethereum blockchain, 'Block Explorer', and also an 'Integrated Development Environment' (IDE) that empowers developers to build, test and deploy applications in a sandbox environment. This November they are launching their second annual hackathon, hack.ether.camp, which aims to deliver a more sustained approach to the hackathon ideology by utilising blockchain technology.
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop (huguk)
At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox... (huguk)
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap(OSM). OSM is an “open-source” map of the world, growing at a large rate, currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, but also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Extracting maximum value from data while protecting consumer privacy. Jason ... (huguk)
Big organisations have a wealth of rich customer data which opens up huge new opportunities. However, they have the challenge of how to extract value from this data while protecting the privacy of their individual customers. He will talk about the risks organisations face, and what they should do about it. He will survey the techniques which can be used to make data safe for analysis, and talk briefly about how they are solving this problem at Privitar.
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson (huguk)
IBM is developing the Watson Ecosystem to leverage its Developer Cloud, APIs, Content Store and Talent Hub. This is part of IBM's recent announcement of the $1B investment in Watson as a new business unit including Silicon Alley NYC headquarters. For the first time, IBM will open up Watson as a development platform in the Cloud to spur innovation and fuel a new ecosystem of entrepreneurial software app providers who will bring forward a new generation of applications infused with Watson's cognitive computing intelligence.
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Speaker: Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
Lambda architecture on Spark, Kafka for real-time large scale ML (huguk)
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop, to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Today’s reality Hadoop with Spark - How to select the best Data Science approa... (huguk)
Martin Oberhuber and Eliano Marques, Senior Data Scientists @Think Big International
In this talk Think Big International Lead Data Scientists will discuss the options that exist today for engineering and data science teams aiming to use big data patterns to solve new business problems. With the enterprise adoption of the Hadoop ecosystem and the emerging momentum of open source projects like Spark it is becoming mandatory to have an approach that solves for business results but remains flexible to adapt and change with the open source market.
This document discusses venture capital, funding, and pitching. It provides an overview of venture capital, including how venture capital funds work with startups and limited partners. It then discusses how the rise of cloud computing, open source software, and public cloud infrastructure have significantly lowered costs and increased innovation for startups, leading to changes in typical venture funding amounts and models over time. The document concludes with tips for an effective pitch, emphasizing the importance of clearly communicating your business model, metrics, strategy, and execution plan in addition to product details and forecasts.
Signal Media: Real-Time Media & News Monitoring (huguk)
Startup pitch presented by CTO Wesley Hall. Signal Media is a real-time media and news monitoring platform that tracks media outlets. News items are analysed for brand & media monitoring as well as market intelligence.
Digital Catapult is a UK nonprofit organization that aims to advance digital ideas and technologies to create new jobs, services, and economic growth. It works in four challenge areas - closed organizational data, personal data, creative content, and internet of things. Digital Catapult establishes centers and platforms to enable collaboration between large organizations and startups to unlock proprietary data through pilot projects. Its goal is to contribute £365 million to the UK economy and help 10,000 organizations by 2018 by convening open innovation across sectors.
Startup pitch presented by Aeneas Wiener. Cytora is a real-time geopolitical risk analysis platform that extracts events from open-source intelligence and evaluates these events on their geopolitical impact.
The document introduces Cubitic, a startup providing a predictive analytics platform for IoT applications. It summarizes the founders' backgrounds and experience. Jaco Els is the CEO with a degree in IT and experience at major companies. Ryan Topping is the Chief Scientist with degrees in mathematics and bioinformatics. Renjith Nair is the CTO with a master's degree in networking and experience developing scalable systems. The founders met working at King and saw an opportunity to build their own predictive analytics solution for IoT, launching initial prototypes in 2015.
Startup pitch presented by co-founder and CEO Corentin Guillo. Bird.i is building a platform for up-to-date earth observation data that will bring satellite imagery to the mass market. Providing fresh imagery together with analytics around the forecast of localised demand opens up innovative opportunities in sectors like construction, tourism, real-estate and remote facility monitoring.
Startup pitch presented by co-founders Laure Andrieux and Nic Greenway. Aiseedo applies real-time machine learning, where the model of the world is constantly updated, to build adaptive systems which can be applied to robotics, the Internet of Things and healthcare.
Secrets of Spark's success - Deenar Toraskar, Think Reactive (huguk)
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over competing cluster computing frameworks. It will delve into the whitepaper behind Spark and cover the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal... (huguk)
Technical developments in the area of data warehousing have allowed companies to push their analysis a step further and, therefore, allowed data scientists to deliver more value to business areas. In that session, we will focus on the case of performance marketing at King and demonstrate how we use Hadoop capabilities to exploit user-level data efficiently. That approach results in obtaining a more holistic view in a return-on-investment analysis of TV advertisement.
Hadoop - Looking to the Future By Arun Murthy (huguk)
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive or use SSDs/Memory effectively in HDFS or manage Metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use-cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
6. Utility computing: on demand, pay as you go, uniform, available. Building blocks: Compute, Storage, Security, Scaling, Database, Networking, Monitoring, Messaging, Workflow, DNS, Load Balancing, Backup, CDN.
7. Cloud computing benefits: no up-front capital expense, pay only for what you use, self-service infrastructure, easily scale up and down, improve agility & time-to-market, low cost, deploy.
12. (Chart: number of EC2 instances, 4/12/2008 to 4/20/2008) From 40 servers to 5,000 in 3 days: EC2 scaled to a peak of 5,000 instances after being "Techcrunched" and the launch of a Facebook modification, then returned to a steady state of ~40 instances.
13. Global Infrastructure: the AWS platform stacks Compute, Storage, Database, Networking, App Services, and Deployment & Administration on top of the AWS Global Infrastructure.
14. Global Infrastructure regions: US-EAST (Virginia), US-WEST (N. California), US-WEST (Oregon), GovCloud, EU-WEST (Ireland), SOUTH AMERICA (Sao Paulo), ASIA PAC (Tokyo), ASIA PAC (Singapore), ASIA PAC (Sydney).
16. Customer needs:
• Store any amount of data – without capacity planning
• Perform complex analysis on any data – scale on demand
• Store data securely
• Decrease time to market – build environments quickly
• Reduce costs – reduce capital expenditure
• Enable global reach
18. Elastic Block Store: high-performance block storage device; 1 GB to 1 TB in size; mount as drives to instances, with snapshot/cloning functionalities.
Simple Storage Service (S3): highly scalable object storage for the internet; 1 byte to 5 TB in size; 99.999999999% durability.
S3 at a glance: Availability 99.99%; Durability 99.999999999%; it is a web store, not a file system; no single points of failure; eventually consistent. Paradigm: object store. Performance: very fast. Redundancy: across Availability Zones. Security: public key / private key. Pricing: $0.095/GB/month. Typical use case: write once, read many. Limits: 100 buckets, unlimited storage, 5 TB objects.
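As a small illustration of the write-once / read-many object-store model described above (the bucket and key names are placeholders), storing and retrieving an S3 object with boto3 looks like this:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "example-analytics-bucket", "events/2014-01-01/events.json"

# Write once: objects are immutable blobs addressed by bucket + key, not files on a filesystem.
s3.put_object(Bucket=bucket, Key=key, Body=b'{"user": "u1", "action": "login"}')

# Read many: fetch the object back over HTTP.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
print(body.decode())
```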
19. (Chart: objects in S3, Q4 2006 through Q4 2012) Peak requests: 830,000+ per second. Total number of objects stored in Amazon S3 grew from 14 billion through 40 billion, 102 billion, 262 billion and 762 billion to 1.3 trillion.
20. Glacier: long-term object archive; extremely low cost per gigabyte; 99.999999999% durability. (Shown alongside Elastic Block Store: high-performance block storage device, 1 GB to 1 TB in size, mounted as drives to instances with snapshot/cloning functionalities.)
Glacier at a glance: Durability 99.999999999%; designed for archival, not a file system; vaults & archives; 3-5 hour retrieval time. Paradigm: archive store. Performance: configurable - low. Redundancy: across Availability Zones. Security: public key / private key. Pricing: $0.011/GB/month. Typical use case: write once, read infrequently (< 10% per month).
21. Simple Storage Service: highly scalable object storage; 1 byte to 5 TB in size; 99.999999999% durability. Glacier: long-term object archive; extremely low cost per gigabyte; 99.999999999% durability. Storage lifecycle integration.
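A hedged sketch of the S3-to-Glacier lifecycle integration the slide refers to, using boto3 (the bucket name, prefix, and day thresholds are made up):

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the "logs/" prefix to Glacier after 30 days,
# and expire them entirely after one year.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```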
23. Database: Relational Database Service (managed Oracle, MySQL & SQL Server); DynamoDB (managed NoSQL database); Amazon Redshift (massively parallel, petabyte-scale data warehouse).
24. Relational Database Service: Database-as-a-Service; no need to install or manage database instances; scalable and fault-tolerant configurations; integration with Data Pipeline.
25. DynamoDB: provisioned-throughput NoSQL database; fast, predictable, configurable performance; fully distributed, fault-tolerant HA architecture; integration with EMR & Hive.
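A minimal boto3 sketch of the provisioned-throughput model the slide refers to (the table name, key schema, and capacity numbers are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Capacity is declared up front; DynamoDB delivers predictable performance at that throughput.
dynamodb.create_table(
    TableName="user_events",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "event_time", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "event_time", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)
```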
26. Redshift: managed, massively parallel, petabyte-scale data warehouse; streaming backup/restore to S3; extensive security; scales from 2 TB to 1.6 PB.
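One way to stand up such a warehouse programmatically, as a hedged boto3 sketch (the identifiers, node type, and credentials are placeholders; node count and type determine where you sit on the 2 TB to 1.6 PB range):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# A small multi-node cluster; resize later as data grows.
redshift.create_cluster(
    ClusterIdentifier="example-analytics-dw",
    NodeType="dw2.large",          # node naming from the era of this deck; newer types exist
    ClusterType="multi-node",
    NumberOfNodes=4,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="ChangeMe-1234",
)
```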
32. Amazon Data Pipeline. Input Datanode: this could be an S3 bucket, RDS table, EMR Hive table, etc. Activity: a data aggregation, manipulation, or copy that runs on a user-configured schedule. Output Datanode: supports all the same data sources as the input datanode, but they don't have to be the same type.
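A hedged sketch of how those three pieces (input datanode, activity, output datanode) might be wired together with boto3 and the Data Pipeline API; all names, schedules, and S3 paths are invented, and a real definition may need additional fields to validate:

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(name="s3-copy-example", uniqueId="s3-copy-example-1")["pipelineId"]

# Input S3 datanode -> CopyActivity -> output S3 datanode, on a daily schedule.
objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startDateTime", "stringValue": "2014-01-01T00:00:00"},
    ]},
    {"id": "InputData", "name": "InputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/raw/"},
    ]},
    {"id": "OutputData", "name": "OutputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/processed/"},
    ]},
    {"id": "CopyRaw", "name": "CopyRaw", "fields": [
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "InputData"},
        {"key": "output", "refValue": "OutputData"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```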
35. Benefits only possible in the cloud: pay as you go; lower overall costs; stop guessing capacity; agility / speed / innovation; avoid undifferentiated heavy lifting; go global in minutes. The cloud delivers all six (✔ ✔ ✔ ✔ ✔ ✔); "private cloud" / on-premises delivers none (X X X X X X).
37. Ease of operation: compute infrastructure, Hadoop configuration, local disk, operating system config, HDFS, networking, Hive, Pig, HBase, user-defined software installation.
38. Ease of operation (continued): the same stack (compute infrastructure, Hadoop configuration, local disk, operating system config, HDFS, networking, Hive, Pig, HBase, user-defined software installation), alongside: multiple Hadoop distributions - open source & MapR; clusters launched with 1 command; up in 5 minutes; hard-partitioned per customer on CPU, memory and disk; dynamic cluster resizing; in any of 8 regions around the globe.
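The "clusters launched with 1 command" point maps naturally onto a single EMR API call; a hedged boto3 sketch (the release label, instance types, roles, and log path are placeholders, and the API shown is the current one rather than the 2013-era interface this deck predates):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# One call provisions the whole Hadoop stack: instances, HDFS, Hive, Pig, networking, monitoring.
resp = emr.run_job_flow(
    Name="example-analytics-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://example-emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(resp["JobFlowId"])
```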
40. Lower TCO. June 2013 study by Accenture Technology Labs, not sponsored or funded by Amazon: "Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…" http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
41. Spot 101 - What are Spot Instances?
• Spot allows customers to bid on unused EC2 capacity
• The Spot price is based on supply/demand of instance types in an Availability Zone
• Customers are fulfilled when their bid price is higher than the Spot price
• Instances will be interrupted when the Spot price exceeds the bid price
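A minimal sketch of placing such a bid with boto3 (the AMI ID, instance type, and bid price are placeholders; newer EC2 APIs favor launching Spot capacity through RunInstances, but the classic request shown here matches the bidding model on this slide):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Bid $0.05/hour for one m5.large; the request is fulfilled while the Spot price stays below the bid.
resp = ec2.request_spot_instances(
    SpotPrice="0.05",
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # placeholder AMI
        "InstanceType": "m5.large",
    },
)
print(resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```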