Architecting and productionising data science applications at scale - samthemonad
This document covers architecting and productionising data science applications at scale. It covers parallel processing with Spark, streaming platforms such as Kafka, and scalable machine learning approaches. It also discusses architectures for data pipelines and productionising models, with a focus on automation, avoiding SQL databases, and using Kafka Streams and Spark for batch and streaming workloads.
6. Data Lake - Definition - Martin Kleppmann
"Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further. By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data into the database's proprietary storage format.
From a purist's point of view, it may seem that this careful modeling and import is desirable, because it means users of the database have better-quality data to work with. However, in practice, it appears that simply making data available quickly - even if it is in a quirky, difficult-to-use, raw format [/schema] - is often more valuable than trying to decide on the ideal data model up front.
The idea is similar to a data warehouse: simply bringing data from various parts of a large organization together in one place is valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP database slows down that centralised data collection; collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a "data lake" or "enterprise data hub").
... the interpretation of the data becomes the consumer's problem (the schema-on-read approach). ... There may not even be one ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better".
Pages 415-416, Martin Kleppmann - Designing Data-Intensive Applications
7. Data Lake - Definition - Martin Fowler
> The idea [Data Lake] is to have a single store for all of the raw data that anyone in an organization might need to analyze.
...
> But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw data, in whatever form the data source provides. There are no assumptions about the schema of the data; each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of that data for their own purposes.
> Data put into the lake is immutable.
> The data lake is "schemaless" [or Schema-on-Read].
> Storage is oriented around the notion of a large schemaless structure ... HDFS.
https://martinfowler.com/bliki/DataLake.html ← MUST READ!
11. Trends (Sample = 11 Data Lakes)
Good vs Bad:
- Good: Cost < £1m, business value in weeks/months. Bad: Cost many millions, years before business value.
- Good: Schema on Read; documentation as code - Internal Open Source. Bad: Schema on Write; metastores, data dictionaries, Confluence.
- Good: Cloud, PAAS (e.g. EMR, Dataproc); S3 for long term storage. Bad: On prem, Cloudera/Hortonworks; HDFS for long term storage.
- Good: Scala / Java apps, Jenkins, CircleCI, etc with Bash for deployment and lightweight scheduling. Bad: Hive / Pig scripts, manual releases, heavyweight scheduling (Oozie, Airflow, Control-M, Luigi, etc).
- Good: High ratio (80%+) of committed developers/engineers that write code; small in house teams, highly skilled. Bad: High ratio of involved people who do not commit code; large low skilled offshore teams.
- Good: Flat structure, cross-functional teams, Agile. Bad: Hierarchical, authoritarian, Waterfall.
12. Trends (Sample = 11 Data Lakes)
Success vs Failure:
- Success: XP, KISS, YAGNI. Failure: BDUF, tools, processes, governance, complexity, documentation.
- Success: Cross functional individuals (who can architect, code & do analysis) form a team that can deliver end to end business value right from source to consumption. Failure: Co-dependent component teams; no one team can deliver an end to end solution.
- Success: Clear focus on 1 business problem: solve it, then solve the 2nd business problem, then deduplicate (DRY). Failure: No clear business focus, too many goals, lofty overly ambitious ideas, silver bullets, big hammers.
- Success: Motivation (the WHY): satisfaction from solving problems & automation. Failure: Motivation (the WHY): deskilling & centralisation of power.
14. Silver Bullets & Big Hammers
- Often built to demo/pitch to Architects (that don't code) & non-technical/non-engineers
- Consequently have well polished UIs but are often lacking quality under the hood
- Generally only handle happy cases
- Tend to assume all use cases are the same. Your use case will probably invalidate their assumptions
- The devil is in the details, and they obscure those details
- Generally make performance problems more complicated due to inherited and obscured complexity
- Often commercially motivated
- Few engineers/data scientists would recommend them, as they know what it's really like to build a Data Lake and know that most of these tools won't work
- Often aim at deskilling, literally claiming that Engineers/Data Scientists are not necessary. They are necessary, but now you have to pay a vendor/consultancy for those skills:
  - at very high markup
  - with a lot of lost-in-translation issues and communication bottlenecks
  - with long delays in implementation
- Generally appeal to non-technical people that want centralisation and power, some tools literally referring to users as “power users”
Note that there are exceptions; for example Uber’s Hudi seems to have been built to solve real internal PB data problems, then later Open Sourced. There may be other exceptions.
15. Data Lake Principles
1. Immutability & Reproducibility - Datasets should be immutable; any queries/jobs run on the Data Lake should be reproducible.
2. A Dataset corresponds to a directory and all the files in that directory, not individual files - Big Data is too big to fit into single files. Avoid appending to a directory, as this is just like mutating it, thus violating 1.
3. An easy way to identify when new data has arrived - no scanning, no joining, and no complex event notification systems should be necessary. Simply partition by landed date and let consumers keep track of their own offsets (like in Kafka); see the sketch below.
4. Schema On Read - Parquet headers plus directory structure form self describing metadata (more next!).
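A minimal sketch of principle 3, assuming a local filesystem and a hypothetical `landed_date=` naming convention (on S3 or HDFS the listing call differs, but the idea is identical): the consumer derives "what's new" purely from directory names and its own stored offset.

```scala
import java.io.File

object NewPartitions {
  // Hypothetical dataset root, e.g. /data/lake/events/landed_date=2021-01-01/part-*.parquet
  val datasetRoot = "/data/lake/events"

  /** Landed-date partitions strictly newer than the consumer's last processed
    * offset. ISO date strings mean lexicographic order is chronological order. */
  def newPartitions(lastOffset: String): Seq[String] =
    new File(datasetRoot)
      .listFiles()                       // one entry per partition directory
      .toSeq
      .map(_.getName)                    // e.g. "landed_date=2021-01-02"
      .filter(_.startsWith("landed_date="))
      .map(_.stripPrefix("landed_date="))
      .filter(_ > lastOffset)            // only names are inspected, no file contents scanned
      .sorted

  def main(args: Array[String]): Unit =
    newPartitions(lastOffset = "2021-01-01").foreach(println)
}
```

The consumer then processes those partitions and records the latest date it has seen, exactly like committing a Kafka offset.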
16. Metadata
- Schema-on-read - the Parquet header has the schema
- Add lineage fields to data at each stage of a pipeline, especially later stages (see the sketch below)
- Internal Open Source via Monorepo
- Code is unambiguous
- Invest in high quality code control - Stop here!
- Analogy:
  - An enterprise investing large amounts in meta-data services is like a restaurant investing large amounts in menus
  - In the best restaurants chefs write the specials of the day on a blackboard
  - In the best enterprises innovation and code is created every day
  - etc
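A sketch of the lineage-fields idea (the column names are invented, not from the deck): each stage of a Spark pipeline stamps the rows it emits, so provenance travels with the data rather than living in a separate metadata service.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{current_timestamp, lit}

object Lineage {
  /** Stamp every row with which job/stage produced it, from what input, and when.
    * Column names are hypothetical; use whatever convention your lake prefers. */
  def withLineage(df: DataFrame, jobName: String, stage: String, inputPath: String): DataFrame =
    df.withColumn("lineage_job", lit(jobName))
      .withColumn("lineage_stage", lit(stage))
      .withColumn("lineage_input_path", lit(inputPath))
      .withColumn("lineage_processed_at", current_timestamp())
}
```

A stage would then call, for example, `Lineage.withLineage(cleaned, "events-enrichment", "enrich", inputPath).write.parquet(outputPath)` before writing its partition.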
17. Technology Choices - Analytics
Requirement → Recommendation
- SQL interface to Parquet & databases (via JDBC) → Apache Zeppelin
- Spark, Scala, Java, Python, R, CLI, and more → Apache Zeppelin
- Charts, graphs & visualisations (inc JS, D3 etc) → Apache Zeppelin
- Free and open source → Apache Zeppelin
- Integrated into infra (EMR, HDInsights) out-of-box - NoOps → Apache Zeppelin
- Lightweight scheduling and job management → Apache Zeppelin
- Basic source control & JSON exports → Apache Zeppelin
- In memory compute (via Spark) → Apache Zeppelin
- Quickly implement dashboards & reports via WYSIWYG or JS → Apache Zeppelin
- Active Directory & ACL integration → Apache Zeppelin
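For flavour, a minimal sketch of a Zeppelin notebook paragraph, assuming the Spark interpreter where `spark` and the ZeppelinContext `z` are predefined (the path and column name below are invented):

```scala
// A single Zeppelin %spark paragraph: read a Parquet partition, aggregate it,
// and let Zeppelin render the result with its built-in table/bar/pie toggles.
import org.apache.spark.sql.functions.desc

val events = spark.read.parquet("/data/lake/events/landed_date=2021-01-02")

val byAction = events.groupBy("action").count().orderBy(desc("count"))

// z is Zeppelin's ZeppelinContext; z.show displays a DataFrame as an interactive chart.
z.show(byAction)
```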
18. Technology Choices - Software
Requirement → Recommendation
- Parquet Format → Spark
- Production quality stable Spark APIs → Scala/Java
- Streaming Architecture on Kafka → Scala/Java
- Quick development & dev cycle → Statically typed languages
- Production quality software / low bug density → Statically typed languages
- Huge market of low skilled cheap resource, where speed of delivery, software quality and data quality is not important (please read The Mythical Man-Month!) → Python
https://insights.stackoverflow.com/survey/2019?utm_source=Iterable&utm_medium=email&utm_campaign=dev-survey-2019#top-paying-technologies
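To illustrate the Scala-plus-Parquet recommendation, a sketch of a complete Spark job (class, path and field names are invented) that writes a typed Dataset to a single immutable landed-date partition:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// A typed record: the compiler, not a runtime schema check, catches field typos.
final case class Event(userId: String, action: String, epochMillis: Long)

object WriteEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-events").getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("u1", "click", 1561000000000L),
      Event("u2", "view",  1561000000001L)
    ).toDS()

    // One immutable landed_date partition, written once, never appended to.
    events.write
      .mode(SaveMode.ErrorIfExists)
      .parquet("/data/lake/events/landed_date=2021-01-02")

    spark.stop()
  }
}
```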
21. Conclusion
Yes, you can build a Data Lake in an Agile way.
● Code first
● Upskill, not deskill
● Do not trust all vendor marketing literature & blogs
● Avoid most big tools, especially proprietary ones
23. Brief History of Spark
Version & Date → Notes
- Up to 0.5, 2010 - 2012: Spark created as a better Hadoop MapReduce.
  - Awesome functional typed Scala API (RDD)
  - In memory caching
  - Broadcast variables
  - Mesos support
- 0.6, 14/10/2012: Java API (anticipating Java 8 & Scala 2.12 interop!)
- 0.7, 27/02/2013: Python API: PySpark; Spark "Streaming" Alpha
- 0.8, 25/09/2013: Apache Incubator in June 2013; September 2013, Databricks raises $13.9 million; MLlib (nice idea, poor API), see https://github.com/samthebest/sceval/blob/master/README.md
- 0.9 - 1.6, 02/02/2014 - 04/01/2016: The hype years!
  - February 2014, Spark becomes a Top-Level Apache Project
  - SparkSQL, and more Shiny (GraphX, Dataframe API, SparkR)
  - Covariant RDDs requested: https://issues.apache.org/jira/browse/SPARK-1296
Key: Good idea - technically motivated; Not so good idea (probably commercially motivated?)
24. Brief History of Spark
Version & Date → Notes
- 2.0, 26/07/2016:
  - Datasets API (nice idea, poor design): typed, semi declarative class based API; improved serialisation; no way to inject custom serialisation
  - StructuredStreaming API: uses the same API as Datasets (so what are .cache, .filter, .mapPartitions supposed to do? How do we branch? How to access a microbatch? How to control microbatch sizes? etc)
- 2.3, 28/02/2018: StructuredStreaming trying to play catch up with Kafka Streams, Akka Streams, etc
- ???, 2500?:
  - Increase parallelism without shuffling: https://issues.apache.org/jira/browse/SPARK-5997
  - Num partitions no longer respects num files: https://issues.apache.org/jira/browse/SPARK-24425
  - Multiple SparkContexts: https://issues.apache.org/jira/browse/SPARK-2243
  - Closure cleaner bugs: https://issues.apache.org/jira/browse/SPARK-26534
  - Spores: https://docs.scala-lang.org/sips/spores.html
  - RDD covariance: https://issues.apache.org/jira/browse/SPARK-1296
  - Frameless to become native? https://github.com/typelevel/frameless
  - Datasets to offer injectable custom serialisation based on https://typelevel.org/frameless/Injection.html
Key: Good idea - technically motivated; Not so good idea (probably commercially motivated?)
25. Spark APIs - RDD
- Motivated by true Open Source & Unix philosophy - solve a specific real problem well, simply and flexibly
- Oldest, most stable API; has very few bugs
- Boils down to two functions that neatly correspond to the MapReduce paradigm: `mapPartitions` and `combineByKey`
- Simple, flexible API design
- Can customise serialisation (using `mapPartitions` and byte arrays)
- Can customise reading and writing (e.g. `binaryFiles`)
- Fairly functional, but does mutate state (e.g. `.cache()`)
- Advised API for experienced developers / data engineers, especially in the Big Data space
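A sketch of those two primitives in action: a word count written only with `mapPartitions` and `combineByKey` (the input path is invented):

```scala
import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-word-count").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("/data/lake/docs/landed_date=2021-01-02")      // hypothetical path
      // map side: turn each partition's lines into (word, 1) pairs
      .mapPartitions(_.flatMap(_.split("\\s+")).map(word => (word, 1L)))
      // reduce side: create a combiner, merge within a partition, merge across partitions
      .combineByKey[Long](
        (v: Long) => v,                  // createCombiner
        (c: Long, v: Long) => c + v,     // mergeValue
        (c1: Long, c2: Long) => c1 + c2  // mergeCombiners
      )

    counts.take(20).foreach { case (word, n) => println(s"$word\t$n") }
    spark.stop()
  }
}
```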
26. Spark APIs - Dataset / Dataframe
- Motivated by increasing the market size for vendors by targeting non-developers, e.g. Analysts, Data Scientists and Architects
- Very buggy, e.g. (bugs I found in the last couple of months):
  - Reading: nulling entire rows, parsing timestamps incorrectly, not handling escaping properly, etc
  - Non-optional reference types are treated as nullable
  - Closure cleaner seems more buggy with Datasets (unnecessarily serialises)
- API design inflexible:
  - cannot inject custom serialisation
  - no functional `combineByKey` API counterpart, have to instantiate an Aggregator (see the sketch below)
- Declarative API breaks MapReduce semantics, e.g. a call to `groupBy` may not actually cause a group-by operation
- Advised API for those new to Big Data and generally trying to solve "little/middle data" problems (i.e. extensive optimisations are not necessary), and where data quality and application stability are less important (e.g. POCs)
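For reference, a sketch of the Aggregator boilerplate being alluded to, a simple sum over a hypothetical typed Dataset; note it needs zero/reduce/merge/finish plus two encoders where `combineByKey` takes three lambdas:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

final case class Purchase(userId: String, amount: Long)

// Sum of purchase amounts, expressed as a typed Aggregator.
object SumAmount extends Aggregator[Purchase, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, p: Purchase): Long = acc + p.amount
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregator-example").getOrCreate()
    import spark.implicits._

    val purchases = Seq(Purchase("u1", 10L), Purchase("u1", 5L), Purchase("u2", 7L)).toDS()

    // (userId, total amount) per user
    val totals = purchases.groupByKey(_.userId).agg(SumAmount.toColumn)
    totals.show()
    spark.stop()
  }
}
```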
27. Spark APIs - SparkSQL
- Buggy, unstable, unpredictable
- The SQL optimiser is quite immature
- MapReduce is a functional paradigm while SQL is declarative; consequently these two don't get along very well
- All the usual problems with SQL: hard to test, no compiler, not Turing complete, etc
- Advised API for interactive analytical use only - never use it for production applications!
28. Frameless - Awesome!
- All of the benefits of Datasets without string literals:
  scala> fds.filter(fds('i) === 10).select(fds('x))
  <console>:24: error: No column Symbol with shapeless.tag.Tagged[String("x")] of type A in Foo
         fds.filter(fds('i) === 10).select(fds('x))
                                               ^
- Custom serialisation: https://typelevel.org/frameless/Injection.html
- Cats integration, e.g. can join RDDs using `|+|`
- Advised API for both experienced Big Data Engineers and people new to Big Data
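A sketch of the custom-serialisation point, following the pattern from the Injection page linked above (the domain type here is invented): an `Injection` teaches frameless how to encode a type it has no built-in encoder for.

```scala
import frameless.Injection

// A domain type that Spark's built-in encoders don't understand
sealed trait Status
case object Active   extends Status
case object Inactive extends Status

object Status {
  // Store Status as a plain String column; invert recovers the domain value on read
  implicit val statusInjection: Injection[Status, String] = new Injection[Status, String] {
    def apply(s: Status): String = s match {
      case Active   => "active"
      case Inactive => "inactive"
    }
    def invert(s: String): Status = s match {
      case "active" => Active
      case _        => Inactive
    }
  }
}

final case class Account(id: String, status: Status)

// With the Injection in scope (and an implicit SparkSession), a typed dataset
// can be built as usual, e.g. frameless.TypedDataset.create(Seq(Account("a1", Active)))
```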