This document discusses data and analytics at Wix, including details about Presto and Quix. Wix is a large company with over 150 million users, 2600 employees, and 1000 microservices. It uses Presto for analytics with over 400,000 weekly queries across 34,000 tables and 11 catalogs. Presto runs on AWS with custom plugins and handles a variety of data sources. Quix is a self-service IDE tool developed at Wix with over 1300 employees using it to run over 8,000 daily queries across 34,000 tables in notes, notebooks, and scheduled jobs. Quix is now being developed as an open source project.
Presto talk @ Global AI Conference 2018 Boston - kbajda
Presented at Global AI Conference in Boston 2018:
http://www.globalbigdataconference.com/boston/global-artificial-intelligence-conference-106/speaker-details/kamil-bajda-pawlikowski-62952.html
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, LinkedIn, Bloomberg, and FINRA, Presto experienced unprecedented growth in popularity in both on-premises and cloud deployments in the last few years. Presto is really a SQL-on-Anything engine: a single query can access data from Hadoop, S3-compatible object stores, RDBMS, NoSQL and custom data stores. This talk will cover some of the best use cases for Presto and recent advancements in the project such as the Cost-Based Optimizer and geospatial functions, as well as discuss the roadmap going forward.
The Evolution of the Fashion Retail Industry in the Age of AI with Kshitij Ku... - Databricks
AI is fundamentally transforming how we live and work.
Zalando is a data driven company. We deliver an optimal customer experience that drives engagement. We continue to improve this experience by leveraging the latest technologies and machine learning techniques — such as building a cutting edge cloud based infrastructure to support our operations at scale.
We provide our data scientists across Zalando with the means to implement artificial intelligence use cases, leveraging data from all parts of our company and the best machine learning techniques from across the industry. Apache Spark delivered through Databricks is at the core of this strategy.
In this keynote, I’ll share our AI journey thus far, and share how we are exploring ways to unify data through A.I. with Spark and Databricks.
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko... - Databricks
Spark has established itself as the most popular platform for advanced scale-out analytical applications. It is deeply integrated with the Hadoop ecosystem, offers a set of powerful libraries and supports both Python and R. Because of these reasons Data Scientists have started to adopt Spark to train and deploy their models. When Spark 1.4 was released back in 2015, it included the new SparkR library: this API gave R users the exciting new option to run R code on Spark.
And while the initial promise to provide a full R environment in Spark has been kept, it takes a deeper understanding of SparkR’s inner workings to make optimal use of its capabilities. This talk will give a comprehensive update on where we stand with Data Science applications in R based on the latest Spark releases. We will share insights from both a startup solution and a Fortune 100 company where SparkR does Machine Learning in the Cloud on a scale that would not have been feasible previously: its parallel execution model runs in minutes and hours, whereas conventional sequential approaches would take days and months.
Suggested Topics:
• An update on the SparkR architecture in the latest Spark release: using R with SparkSQL, MLlib and Spark’s Structured Streaming
• How to handle practical challenges, e.g. running R on the cluster without a local installation, storing non-tabular results, such as Data Science models or plots, mixing Scala and R.
• Scaling Big Compute Applications with SparkR: Parallelizing SparkR applications with User-Defined Functions (UDFs) and elastic scaling of resources in the Cloud
• An Outlook on Machine Learning with SparkR and its ecosystem, frameworks and tools.
• Plus: “Do I need to learn Python?”
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ... - Databricks
In rapidly changing conditions, many companies build ETL pipelines using an ad-hoc strategy. Such an approach makes automated testing for data reliability almost impossible and leads to ineffective and time-consuming manual ETL monitoring.
Detecting Mobile Malware with Apache Spark with David Pryce - Databricks
“The ability to detect malware has needed to change drastically in the past few years, away from traditional signature- or list-based techniques. Couple this with the rise of mobile-device-based attacks, where the scale of the data is predicted to be 60% of the internet in 2018*, and our online lives will need Machine Learning (ML) and Data Science to ensure their security. At Wandera we have successfully implemented a malware detection (and classification) ML model at scale with the use of Apache Spark (MLlib) and the PMML via OpenScoring paradigm.
In this talk we will touch on the training data and why we use Spark at all, the features we extract from mobile phone applications and how we then obtain our high accuracy scores in the cloud. *https://blog.cloudflare.com/our-predictions-for-2018/”
Big Data Meets Learning Science: Keynote by Al Essa - Spark Summit
How do we learn and how can we learn better? Educational technology is undergoing a revolution fueled by learning science and data science. The promise is to make a high-quality personalized education accessible and affordable by all. In this presentation Alfred will describe how Apache Spark and Databricks are at the center of the innovation pipeline at McGraw Hill for developing next-generation learner models and algorithms in support of millions of learners and instructors worldwide.
Lambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich - Databricks
The term “Lambda Architecture” stands for a generic, scalable and fault-tolerant data processing architecture. As hyperscale clouds now offer various PaaS services for data ingestion, storage and processing, the need for a revised, cloud-native implementation of the lambda architecture is arising.
In this talk we demonstrate the blueprint for such an implementation in Microsoft Azure, with Azure Databricks — a PaaS Spark offering – as a key component. We go back to some core principles of functional programming and link them to the capabilities of Apache Spark for various end-to-end big data analytics scenarios.
We also illustrate the “Lambda architecture in use” and the associated trade-offs using a real customer scenario – Rijksmuseum in Amsterdam – where a terabyte-scale Azure-based data platform handles data from 2,500,000 visitors per year.
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli - Spark Summit
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
This document summarizes Oliver Busse's presentation on using Graph databases in XPages applications. It introduces Graph databases and their terminology like vertices and edges. It describes how GraphNSF, a Graph database implementation in Domino, stores vertices and edges as documents in NSF files. It provides examples of creating vertices and edges programmatically and adding metadata to existing data using a graph structure. It concludes with a live demo of a GraphNSF implementation in an XPages application.
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent) - Albert Wong
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us in one hour for Tableau in the Cloud: A Netflix Original where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
- Elastic provides a search and analytics platform called the Elastic Stack that includes Elasticsearch, Beats data shippers, and Kibana analytics and visualization tools.
- The presentation discussed updates to Elastic's products including performance improvements to search, new features for distributed search across data centers, and enhanced security options for authentication and authorization.
- Elastic aims to provide customizable and extensible solutions for users to ingest, store, search, analyze and visualize large volumes of data from various sources.
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit... - Databricks
Quby, an Amsterdam-based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI-powered products that are used by hundreds of thousands of users on a daily basis. Delta Lake ensures the quality of incoming records through schema enforcement and evolution, but it is the data engineers’ role to check that the expected data is ingested into the Delta Lake at the right time and with the expected metrics, so that downstream processes can do their job. Re-training and serving models on the fly can also go wrong unless the right monitoring infrastructure is in place.
How Apache Spark Changed the Way We Hire People with Tomasz Magdanski - Databricks
As big data technology matures, you’d think there would be more talent available to hire. Although the number of people interested and engaged in the big data world has dramatically increased, job demand is far ahead.
Companies like Google or Facebook have access to the best talent — thousands of engineers with PhDs from the best schools, which is why they are able to innovate. How can a company close the skills gap while innovating and creating product advantage?
This talk highlights how the right technology can allow you to compete without having an army of PhDs at your disposal. At iPass, we’ve created an environment where our engineers can be empowered to create value without getting bogged down by big data and Ops challenges. As a result, we have been able to more easily recruit internal engineers to our big data team, leveraging their current expertise, while bringing them up to speed on big data projects much faster. Join this talk to learn how you can do the same for your organization.
Frank van der Linden presented on connecting XPages applications to Cloudant. He began with an introduction to Cloudant, describing it as the cloud version of CouchDB that stores data as JSON documents. He then covered how to connect to Cloudant directly via REST or through an OSGi plugin, and described storing and retrieving data from Cloudant using a Java connector. Finally, he demonstrated integrating Cloudant with an XPages application to store and search job documents, attachments, and rich text.
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re... - Databricks
"In the oil and gas industry, utilizing vast amounts of data has long been identified as an important indicator of operational performance. The measurement of key performance indicators is a routine practice in well construction, but a systematic way of statistically analyzing performance against a large data bank of offset wells is not a common practice. The performance of statistical analysis in real-time is even less common. With the adoption of distributed computing platforms, like Apache Spark, new analysis opportunities become available to leverage large-scale time-series data sets to optimize performance. Two case studies are presented in this talk: the rate of penetration (ROP) and the amount of vibration per run.
By collecting real-time, telemetry data and comparing it with historic sample datasets within the Databricks Unified Analytics Platform, the optimization team was able to quickly determine whether the performance being delivered matched or exceeded past performance with statistical certainty. This is extremely important while trying new techniques with data that is highly variable. By substituting anecdotal evidence with statistical analysis, decision making is more precise and better informed. In this talk we'll share how we accomplished this and the lessons learned along the way."
Bridging the Completeness of Big Data on Databricks - Databricks
Data completeness is key for building any machine learning and deep learning model. The reality is that outliers and nulls widely exist in the data. The traditional methods of using fixed values or statistical metrics (min, max and mean) do not consider the relationships and patterns within the data. Most of the time they offer poor accuracy and can introduce additional outliers. Also, given our large data size, the computation is an extremely time-consuming process and is often constrained by the limited resources of a local computer.
To address those issues, we have developed a new approach that first leverages the similarity between our data points, based on the nature of the data source, and then uses a collaborative AI model to fill null values and correct outliers.
In this talk, we will walk through the way we use a distributed framework to partition data by KDB tree for neighbor discovery and a collaborative filtering AI technique to fill the missing values and correct outliers. In addition, we will demonstrate how we rely on Delta Lake and MLflow for data and model management.
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data... - Databricks
How did Devon move from a traditional reporting and data warehouse approach to a modern data lake? What did it take to go from a slow and brittle technical landscape to a flexible, scalable, and agile platform? In the past, Devon addressed data solutions in dozens of ways depending on the user and the requirements. Through a visionary program, driven by Databricks, Devon has begun a transformation of how it consumes data and enables engineers, analysts, and IT developers to deliver data-driven solutions along all levels of the data analytics spectrum. We will share the vision, technical architecture, influential decisions, and lessons learned from our journey. Join us to hear the unique Databricks success story at Devon.
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than... - Databricks
Upwork has the biggest closed-loop online dataset of jobs and job seekers in labor history (>10M Profiles, >100M Job Posts, Job Proposals and Hiring Decisions, >10B of Messages, Transaction and Feedback Data). Besides sheer quantity, our data is also contextually very rich. We have client and contractor data for the entire job-funnel – from finding jobs to getting the job done.
For various machine learning applications including search and recommendations and labor marketplace optimization (rate, supply and demand), we heavily relied on a Greenplum-based data warehouse solution for data processing and ad-hoc ML pipelines (weka, scikit-learn, R) for offline model development and online model scoring.
In this talk, we present our modernization efforts in moving towards 1) a holistic data processing infrastructure for batch and stream data processing using S3, Kinesis, Spark and Spark Structured Streaming, 2) model development using Spark MLlib and other ML libraries for Spark, 3) model serving using Databricks Model Scoring, scoring over structured streams and microservices, and 4) how we orchestrate and streamline all these processes using Apache Airflow and a CI/CD workflow customized to our Data Science product engineering needs. The focus of this talk is on how we were able to leverage the Databricks service offering to reduce DevOps overhead and costs, complete the entire modernization with moderate effort and adopt a collaborative notebook-based solution for all our data scientists to develop models, reuse features and share results. We will share the core lessons learned and pitfalls we encountered during this journey.
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor... - Databricks
How did eBay move their ETL computation from a conventional RDBMS environment over to Spark? What did it take to go from a strategic vision to a viable solution? This paper will take you through a journey which led to an implementation of a 1000+ node Spark cluster running 10,000+ ETL jobs daily, all done in a span of less than 6 months, by a team with limited Spark experience. We will share the vision, technical architecture, critical management decisions, challenges and the road ahead. This will be a unique opportunity to look into this awesome Spark success story at eBay!
The document discusses a company's migration from their in-house computation engine to Apache Spark. It describes five key issues encountered during the migration process: 1) difficulty adapting to Spark's low-level RDD API, 2) limitations of DataSource predicates, 3) incomplete Spark SQL functionality, 4) performance issues with round trips between Spark and other systems, and 5) OutOfMemory errors due to large result sizes. Lessons learned include being aware of new Spark features and data formats, and designing architectures and data structures to minimize data movement between systems.
How R Developers Can Build and Share Data and AI Applications that Scale with... - Databricks
This document discusses how R developers can build and share scalable data and AI applications using RStudio and Databricks. It outlines how RStudio and Databricks can be used together to overcome challenges of processing large amounts of data in R, including limited server memory and performance issues. Developers can use hosted RStudio servers on Databricks clusters, connect to Spark from RStudio using Databricks Connect, and share scalable Shiny apps deployed with RStudio Connect. The ODBC toolchain provides a performant way to connect R to Spark without issues encountered when using sparklyr directly.
Customer migration to Azure SQL Database from on-premises SQL, for a SaaS app... - George Walters
Why would someone take a working on-premises SaaS infrastructure and migrate it to Azure? We review the technology decisions behind this conversion and the business choices behind migrating to Azure. The SQL 2012 infrastructure and application was migrated to PaaS services. Finally, we look at how we would do this architecture in 2019.
Platform Requirements for CI/CD Success—and the Enterprises Leading the Way - VMware Tanzu
All enterprises want to increase the speed of software delivery to get new products to market faster. The means for achieving this is often through the practice of continuous integration/continuous delivery. But speed alone isn’t enough—teams also require the ability to pivot when conditions change. They must ensure their software is stable and reliable, and be able to roll out patches and other security measures quickly and at scale.
A cloud-native platform coupled with test-driven development and CI/CD practices can help make this a reality. In this webinar, 451 Research’s Jay Lyman presents the results of his research into cloud-native platform requirements for enterprise CI/CD and DevOps success. Pivotal’s James Ma joins Lyman to discuss best practices from DevOps teams charged with running and managing cloud-native platforms, including applying CI/CD to the platform itself.
Speakers: James Ma, Pivotal and Jay Lyman, 451 Research
Accelerate Self-Service Analytics with Data Virtualization and Visualization - Denodo
Watch full webinar here: https://bit.ly/39AhUB7
Enterprise organizations are shifting to self-service analytics as business users need real-time access to holistic and consistent views of data regardless of its location, source or type for arriving at critical decisions.
Data Virtualization and Data Visualization work together through a universal semantic layer. Learn how they enable self-service data discovery and improve performance of your reports and dashboards.
In this session, you will learn:
- Challenges faced by business users
- How data virtualization enables self-service analytics
- Use case and lessons from customer success
- Overview of the highlight features in Tableau
Industry leading
Build mission-critical, intelligent apps with breakthrough scalability, performance, and availability.
Security + performance
Protect data at rest and in motion. SQL Server is the most secure database for six years running in the NIST vulnerabilities database.
End-to-end mobile BI
Transform data into actionable insights. Deliver visual reports on any device—online or offline—at one-fifth the cost of other self-service solutions.
In-database advanced analytics
Analyze data directly within your SQL Server database using R, the popular statistics language.
Consistent experiences
Whether data is in your datacenter, in your private cloud, or on Microsoft Azure, you’ll get a consistent experience.
Zakir Hussain has over 7 years of experience in telecommunications and IT, including roles as a lead engineer and software engineering analyst. He has strong skills in programming languages like PL/SQL, PHP, C/C++, Java, and databases like Oracle and SQL Server. He has extensive experience designing and developing ETL processes, data warehouses, reports, and portals using tools such as Informatica, Oracle Data Integrator, Oracle Discoverer, and Crystal Reports. He has worked on projects for companies like Grameenphone, Accenture, Summit Power, BRAC, and N-Tier Solutions.
Big data. Small data. All data. You have access to an ever-expanding volume of data inside the walls of your business and out across the web. The potential in data is endless – from predicting election results to preventing the spread of epidemics. But how can you use it to your advantage to help move your business forward?
Drive a Data Culture within your organisation
Keynote include Ric Howe & Anthony Saxby
Architecting an Open Source AI Platform 2018 edition - David Talby
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
Customer migration to Azure SQL Database, December 2019 - George Walters
This is a real life story on how a software as a service application moved to the cloud, to azure, over a period of two years. We discuss migration, business drivers, technology, and how it got done. We talk through more modern ways to refactor or change code to get into the cloud nowadays.
This document provides an overview of Microsoft R and its capabilities for advanced analytics. It discusses how Microsoft R can enable businesses to analyze large volumes of data across multiple environments including locally, on Azure, and with SQL Server and HDInsight. The presentation includes a demonstration of R used with SQL Server, HDInsight, Azure Machine Learning, and Power BI. It highlights how Microsoft R provides a unified platform for data science and analytics that allows users to write code once and deploy models anywhere.
RedisGraph A Low Latency Graph DB: Pieter Cailliau - Redis Labs
This document summarizes a presentation about RedisGraph, a graph database that runs on Redis. The presentation discusses RedisGraph's capabilities, use cases where graph databases are useful, and what new features are upcoming for RedisGraph. Specific points mentioned include RedisGraph's support for the Cypher query language, improvements in performance and functionality since its general availability, and how the graph database can power features for IBM's Multicloud Manager product.
Data Amp South Africa - SQL Server 2017 - Travis Wright
Roadmap deck showing the newest capabilities of SQL Server 2017 including SQL Server on Linux, R/Python services, graph, adaptive query processing as well as new Azure services like Cosmos DB and Azure Database for PostgreSQL and MySQL.
Digital transformation with Microsoft data and AI - Michael Roenker
This document discusses how data, AI, and digital transformation can drive business value. It promotes Microsoft's data and AI platform for enabling organizations to optimize their data platforms, transform with analytics and AI, and differentiate with new applications. The platform provides security, flexibility to use any data from anywhere, and capabilities for the cloud, hybrid scenarios, and on-premises. Case studies show how various companies achieved benefits like improved performance, reduced costs, accelerated innovation, and new revenue through use of Microsoft's data and AI tools.
Turn Data Into Actionable Insights - StampedeCon 2016 - StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will be sharing our journey of transformation showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development, provisioning data through APIs, streams and deploying models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale, integrating analytics with our core product platforms to turn data into actionable insights.
TestGuild and QuerySurge Presentation - DevOps for Data Testing - RTTS
This slide deck is from one of our 4 webinars in our half-day series in conjunction with Test Guild.
Chris Thompson and Mike Calabrese, Senior Solution Architects and QuerySurge experts, provide great information, a demo and lots of humor in this webinar on how to implement DevOps for Data in your DataOps pipeline.
This webinar was performed in conjunction with Test Guild.
To watch the video, go to:
https://youtu.be/1ihuRPgY_rs
SQL Server 2017 provides flexibility to deploy on Linux, Windows, and Docker containers. It features advanced machine learning with R and Python, graph query support to analyze complex relationships, and adaptive query processing for optimized performance. Security is also improved with features like Always Encrypted and Dynamic Data Masking.
SQL Server 2017 provides more flexibility by allowing users to run it on Linux, Docker, and Windows. It features support for graph queries, machine learning with R and Python, and adaptive query processing. SQL Server 2017 also provides enhanced security, performance, and analytics capabilities including in-database machine learning and data insights from diverse sources. It allows businesses to deploy, manage and analyze their data on the platform of their choice.
Bringing your data to life using Power BI - SPS London 2016 - Chirag Patel
This document provides an overview of Power BI, including its key components and how to use them. Power BI Desktop allows users to connect to various data sources, transform the data using Power Query, build reports using Power Pivot and Power View, and publish dashboards to the Power BI service. The Power BI service allows users to build dashboards with tiles linked to reports, ask questions of the data using natural language, and access reports and dashboards on mobile devices. Office 365 Groups integration allows content to be shared and collaborated on within groups. The presenter provides demonstrations of connecting to data, building reports and dashboards, asking questions of the data, and using the mobile app.
This document summarizes how businesses can transform through business intelligence (BI) and advanced analytics using Microsoft's modern BI platform. It outlines the Power BI and Azure Analysis Services tools for visualization, data modeling, and analytics. It also discusses how Collective Intelligence and Microsoft can help customers accelerate their move to a data-driven culture and realize benefits like increased productivity and cost savings by implementing BI and advanced analytics solutions in the cloud. The presentation includes demonstrations of Power BI and Azure Analysis Services.
5. ironSource Overview
ESTABLISHED: Sep. 2010
ACQUISITIONS TO DATE: 8
EMPLOYEES: 779
R&D EMPLOYEES: 395
Offices: San Francisco and New York (United States); London (United Kingdom); Berlin (Germany); Kiev (Ukraine); Tel Aviv (Israel); Bangalore (India); Hong Kong, Beijing, Shanghai and Shenzhen (China); Tokyo (Japan); Seoul (South Korea)
8. Our old data architecture
● 10 redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work
● Limited data
● Expensive
13. Data Lake: Hive
Apache Hive is a data warehouse software project which was built on top of Apache Hadoop in order to provide data query and analysis.
● One place to rule them all
● Hadoop ecosystem: Presto, Spark, Athena
14. Presto & Qubole
Qubole delivers a self-service platform for big data analytics built on the Amazon Web Services, Microsoft and Google clouds.
● Scalable clusters: Qubole scales clusters up and down, for every query, by looking at the query execution plan.
● Spot instances
● Maintenance & versions: Qubole takes care of new versions & 24/7 support.
18. Our data architecture
Our old data architecture:
● 10 redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work
● Limited data
● Expensive
Our new data architecture:
● 1 redshift cluster
● 0 RDS clusters
● 300 ETLs
● 1 Tableau & 1 Re/dash
● Reduce costs by 50%
● Agile to the business
20. The new ETL is ELT: Extract, Load, Transform
● Replace 90% of our ETLs with ELTs
● Help our data science team by being clearer about the logic, reducing their work time by 80%
● Keep raw data without any manipulation
● Reduce ML model deployment time by 50%
● No ETL time - no schedule
22. Key notes to take home
● Data lake: keep all your raw data in one place. It will help you in the future with costs, research, resource usage and ML models.
● Qubole: enjoy the benefits of third-party companies and continue to work on your business.
● Scale: reach endless data with big clusters that scale per query.
● ELT: move 90% of your ETLs to ELTs, to reduce lags and costs.
● Agile: promote your business with quick insights.
● Free to learn: take 10% of your time and learn! Try and play with the data :)
#2: Good morning everyone!!!
I am Or, and today I will show you how we use Presto and a data lake at PB scale.
#3: Before we start I want to tell you a bit about myself and my team. This picture was taken at last Purim's costume party nearby, at Hangar 11 (eleven).
For those that have noticed, I hurt my knee 4 weeks ago skiing in Valmorel, France.
So I had to sit in the sun, drink and relax…
I am married, 35, and live in Tel Aviv. Coding has been my life since I was eleven…
#4: I will show you a bit about ironSource.
Then I will take you on a journey through time, from 2016 (before Presto) until today (with Presto), and what we are going to use in the future.
#6: ironSource was created 10 years ago.
We are almost 800 employees & more than 50% of us are R&D.
Our headquarters & R&D center is located in Tel-Aviv & we have 9 more offices around the world.
#7: ironSource has a few different business divisions:
Developer solutions - This division focuses on providing tools and technology to mobile app developers - specifically game developers.
We offer an SDK which essentially enables the developer to run ads in his app to make more money.
We are very strong with rewarded video - so if any of you are gamers, you may be familiar with the moment in a game when you run out of lives and you are offered a rewarded video to watch in order to continue playing. That’s an example of what we do
Enterprise solutions - Focusing on helping mobile device manufacturers and mobile carriers to engage with their customers.
Instead of having 20 different applications pre-installed on your device, users have the power to set up their device the way they want to, with the apps they really want and need.
Digital solutions - This is my division, we are focusing on the desktop world, (Mac, PC).
We help software developers with technologies that help monetize their software and distribute it to new users.
#9: Let's have a look back at our -- AR-KI-TECH-TURE
We had 10 different Redshift clusters
One for BI
One for Researchers
One for R&D
One for Data science
One for Realtime data
One for Historic Data
One for DWH
One for QA
One for Critical ETLs
One for Backups
As you can imagine, it was really hard to work with.
We had 5 RDS clusters - Mainly for our Applications (Like OLAP)
We had more than 1000 ETLs...
We had 1 Tableau Server
And it was really hard.
Hard to scale - Redshift scales very slowly: resizing takes anywhere from a few hours to days...
Hard to maintain - We had to vacuum tables, delete old data, and move data from one Redshift to another.
Hard to work with - from two aspects:
Not all the tables were on the same cluster.
30% of our clusters' power went just to inserting the data.
Limited data - We could not insert all the data into one cluster.
Very Expensive
#10: This is what our data scientist looked like at the time.
Or even like that.
#11: So, we stopped and thought about where we wanted to be in the future.
First of all, we wanted lifetime data, which is very important to our business
Fast SQL - We wanted SQL that is fast enough for our dashboard usage
We wanted the ability to scale very fast
We focused on our data science team, as we knew we were going to grow our data science team and our ML models
Open source - we did not want to be tied to a particular company
#13: So we started to create our data lake. We chose to use Parquet files, an open-source, column-oriented format.
We keep all of our data in S3, converting it from JSON into Parquet in batch operations in near real time.
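To make that conversion step concrete, here is a minimal PySpark sketch of this kind of JSON-to-Parquet batch job; the bucket names, paths and the event_date partition column are hypothetical examples, not taken from the deck.

```python
# Minimal sketch (PySpark) of a JSON-to-Parquet batch conversion into an S3 data lake.
# Bucket names, paths and the partition column are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("json-to-parquet-batch")
    .getOrCreate()
)

# Read the raw JSON events that producers landed in S3.
raw = spark.read.json("s3a://example-raw-bucket/events/2019/03/01/")

# Keep the data as raw as possible; just derive a date column so the Parquet
# output can be partitioned for cheap scans by Presto, Spark or Athena.
events = raw.withColumn("event_date", F.to_date(F.col("event_time")))

# Write column-oriented Parquet back to the lake, partitioned by date.
(
    events.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3a://example-datalake-bucket/events_parquet/")
)
```

Partitioning by date keeps later scans cheap, since query engines only read the relevant partitions.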
#14: Hive & Hive Metastore - we have one source of truth for our table definitions.
It works perfectly with any Hadoop-ecosystem tool,
Such as:
Presto
Spark
Athena
And more.
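As a hedged illustration of the "one source of truth" idea, the sketch below registers the Parquet data as an external table in the Hive Metastore from Spark; the database, columns and S3 location are made up, and the assumption is that Spark, Presto and Athena are all pointed at the same Metastore.

```python
# Hypothetical sketch: register the Parquet data as an external table in the shared
# Hive Metastore so Presto, Spark and Athena all see the same table definition.
# Database, column and S3 location names are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("register-datalake-table")
    .enableHiveSupport()  # use the shared Hive Metastore as the catalog
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        user_id     STRING,
        app_id      STRING,
        event_name  STRING,
        event_time  TIMESTAMP,
        revenue_usd DOUBLE
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3a://example-datalake-bucket/events_parquet/'
""")

# Discover the date partitions written by the batch conversion job.
spark.sql("MSCK REPAIR TABLE analytics.events")
```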
#15: Presto and Qubole.
We use Presto to query our data lake via Qubole.
Qubole is a self-service platform that lets us configure Presto clusters that scale easily, use spot instances, and come with maintenance, new versions & 24/7 support taken care of.
Once you have configured your cluster, it can grow itself from, say, 3 to 50 nodes within seconds…
And that happens for every query you run.
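A small example of what querying the data lake through Presto can look like from Python; it assumes the PyHive client and a reachable Presto coordinator, and the host, catalog, schema and table names are placeholders rather than the actual setup.

```python
# Small example (PyHive) of querying the data lake through a Presto endpoint.
# Host, catalog, schema and table names are placeholders.
from pyhive import presto

conn = presto.connect(
    host="presto.example.internal",  # e.g. the managed Presto cluster endpoint
    port=8080,
    catalog="hive",
    schema="analytics",
)

cur = conn.cursor()
cur.execute("""
    SELECT event_date, count(*) AS events
    FROM events
    WHERE event_date >= date '2019-03-01'
    GROUP BY event_date
    ORDER BY event_date
""")

for event_date, n_events in cur.fetchall():
    print(event_date, n_events)
```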
#16: Let's see an example of auto-scaling with Presto.
I ran around 50 different dashboards that use Presto and saved a Presto UI snapshot every few seconds.
As you can see, at the start there are 3 nodes and 4 queries.
As I run the dashboards, the number of nodes increases along with the number of queries.
After all the queries have finished, the cluster shrinks back to normal.
#17: A bit about our volume.
We have around 70 thousand queries running via Presto every day.
We have 200 users,
500 dashboards, and increasing,
And half a petabyte scanned per day, just from S3, not counting Presto's caching.
#18: Remember our data scientist?
Well, I think this is the best picture to explain how he feels.
#19: Let's see what our -- AR-KI-TECH-TURE ---- looks like today.
We eliminated 9 of our Redshift clusters, keeping only one for Finance/DWH.
We eliminated all our RDS clusters - all the data is stored in the data lake.
We reduced our ETLs by 70%, as we no longer need to move data from one place to another.
We added a Re/dash server alongside our Tableau Server. Re/dash is an open-source BI tool; we use Re/dash for short-term solutions and Tableau for long-term solutions.
By adding the data lake, all of our problems disappeared!
In addition, we have reduced our costs by 50%!
And most important, we became much more agile for the business: instead of delivering the first insights for a new project in 2 to 8 weeks, we deliver them on the first day or even within the first hour!
#21: First of all: ELT.
The new ETL is ELT.
If you don't know what ELT is: Extract, Load, Transform.
It means you create your business logic in a big query (or a view).
We are going to move around 90% of our ETLs to ELT.
Why ELT?
1. Data science - we see that ELT reduces the data science work by 80%! The main reason is that they can create a dataset within minutes by cloning the ELT of a specific business unit
and adding more features.
2. Deployment - ELT helps data engineers deploy the ML model, since all the raw data is in one place, and the model was created on this data and not on aggregated data.
3. No lag - no scheduler - you become more real-time.
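To illustrate the ELT idea in code, here is a toy sketch in which the transform is just a view defined over the raw events through Presto; the view name, columns and business logic are invented for the example, and PyHive is assumed as the client.

```python
# Toy ELT sketch: the "transform" is a view over raw events, created through Presto.
# View name, columns and business logic are invented for the example (PyHive assumed).
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8080,
                      catalog="hive", schema="analytics")
cur = conn.cursor()

# The business logic is just a big query; a data scientist can clone this view
# for another business unit and add more features without building a new pipeline.
cur.execute("""
    CREATE OR REPLACE VIEW daily_revenue AS
    SELECT
        event_date,
        app_id,
        sum(revenue_usd)        AS revenue_usd,
        count(DISTINCT user_id) AS paying_users
    FROM events
    WHERE event_name = 'purchase'
    GROUP BY event_date, app_id
""")
```

Because the logic lives in the view rather than in a scheduled job, there is nothing to schedule and no lag between raw data landing and the "transformed" result being queryable.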
#22: We are going to increase our usage of Presto connectors.
Kafka - we are going to move our alerts system (for business KPIs) from the data lake to Kafka, to ensure faster findings in real time!
ScyllaDB - increase the insights we write into ScyllaDB for our ML models.
Elasticsearch - we use Elasticsearch via Kibana to monitor server logs and R&D logs, and we see a strong need to be able to join those logs with business KPIs.
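As a hedged sketch of why these connectors matter, the query below joins log data exposed through a hypothetical Elasticsearch catalog with business KPIs in the Hive data lake in a single Presto statement; the catalog, schema, index and field names are assumptions, not the actual configuration.

```python
# Hedged sketch of a federated Presto query: joining R&D error logs exposed through
# an Elasticsearch catalog with business KPIs stored in the Hive data lake.
# Catalog, schema, index and field names are assumptions, not the actual configuration.
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8080, catalog="hive")
cur = conn.cursor()

cur.execute("""
    SELECT
        k.event_date,
        k.app_id,
        k.revenue_usd,
        count(l.message) AS error_count
    FROM hive.analytics.daily_revenue AS k
    LEFT JOIN elasticsearch.default.service_errors AS l
           ON l.app_id = k.app_id
          AND date(l."@timestamp") = k.event_date
    GROUP BY k.event_date, k.app_id, k.revenue_usd
    ORDER BY k.event_date
""")

for row in cur.fetchall():
    print(row)
```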
#23: A few notes to take home
Data lake - keep all your data in one place; it will save you time, effort & money.
Qubole - use big data services like Qubole so you can focus on your business and not on maintenance.
Scale - Presto's scaling just works perfectly.
ELT - don't do ETLs; you don't need them anymore.
With Presto, you can be much more agile for your business.
Free to learn - as I always encourage my team to do, take 10% of your time, learn & play with the data.