Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered as well, along with the biggest impact: the benefit of having a modern analytics platform that can handle all of your use cases. This session covers lessons learned from assisting hundreds of companies in migrating from Hadoop to Databricks.
As part of this session, I will give an introduction to Data Engineering and Big Data, covering up-to-date trends:
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss it live - you can use the link below to go through the video after the scheduled time.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
O'Reilly ebook: Operationalizing the Data Lake – Vasu S
Best practices for building a cloud data lake operation—from people and tools to processes
https://www.qubole.com/resources/ebooks/ebook-operationalizing-the-data-lake
The document summarizes a presentation about data vault automation at a Dutch department store chain called de Bijenkorf. It discusses the project objectives of having a single source of reports and integrating with production systems. An architectural overview is provided, including the use of AWS services, a Snowplow event tracker, and Vertica data warehouse. Automation was implemented for loading data from over 250 source tables into the data vault and then into information marts. This reduced ETL development time and improved auditability. The data vault supports customer analysis, personalization, and business intelligence uses at de Bijenkorf. Drivers of the project's success included the AWS infrastructure, automation approach, and Pentaho ETL framework.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... – Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics... – Amr Awadallah
Apache Hadoop is revolutionizing business intelligence and data analytics by providing a scalable and fault-tolerant distributed system for data storage and processing. It allows businesses to explore raw data at scale, perform complex analytics, and keep data alive for long-term analysis. Hadoop provides agility through flexible schemas and the ability to store any data and run any analysis. It offers scalability from terabytes to petabytes and consolidation by enabling data sharing across silos.
Cloud Storage Spring Cleaning: A Treasure Hunt – Steven Moy
1) The document discusses how Yelp analyzed their S3 access logs stored in AWS to optimize their cloud storage costs.
2) They used Spark to convert the log files to Parquet format for easier analysis. AWS Athena was then used to run SQL queries on the Parquet files to understand access patterns and age of data.
3) This analysis found that 20% of their data was rarely accessed after 90-400 days and could be moved to the cheaper Infrequent Access storage tier, while 50% of their data was over 400 days old and could be archived to Glacier, reducing ongoing costs by around 25%.
Should I move my database to the cloud? – James Serra
So you have been running on-prem SQL Server for a while now. Maybe you have taken the step to move it from bare metal to a VM, and have seen some nice benefits. Ready to see a TON more benefits? If you said “YES!”, then this is the session for you as I will go over the many benefits gained by moving your on-prem SQL Server to an Azure VM (IaaS). Then I will really blow your mind by showing you even more benefits by moving to Azure SQL Database (PaaS/DBaaS). And for those of you with a large data warehouse, I've also got you covered with Azure SQL Data Warehouse. Along the way I will talk about the many hybrid approaches so you can take a gradual approach to moving to the cloud. If you are interested in cost savings, additional features, ease of use, quick scaling, improved reliability and ending the days of upgrading hardware, this is the session for you!
Microsoft Data Platform - What's included – James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
Choosing technologies for a big data solution in the cloud – James Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It holds large quantities of data to improve analytic performance and native integration.
A data lake is like a large container, much like a real lake fed by rivers. Just as a lake has multiple tributaries coming in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing through in real time.
In this presentation, we:
1. Look at the challenges and opportunities of the data era
2. Look at key challenges of the legacy data warehouses such as data diversity, complexity, cost, scalability, performance, management, ...
3. Look at how modern data warehouses in the cloud not only overcome most of these challenges but also how some of them bring additional technical innovations and capabilities such as pay as you go cloud-based services, decoupling of storage and compute, scaling up or down, effortless management, native support of semi-structured data ...
4. Show how capabilities brought by modern data warehouses in the cloud, help businesses, either new or existing ones, during the phases of their lifecycle such as launch, growth, maturity and renewal/decline.
5. Share a Near-Real-Time Data Warehousing use case built on Snowflake and give a live demo to showcase ease of use, fast provisioning, continuous data ingestion, support of JSON data ...
Making Data Timelier and More Reliable with Lakehouse Technology – Matei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
A talk about Azure Synapse aimed at helping people who are not data experts understand what Synapse is and how it can be integrated with other technologies.
This is a run-through at a 200 level of the Microsoft Azure Big Data Analytics for the Cloud data platform based on the Cortana Intelligence Suite offerings.
Big data architectures and the data lake – James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
Data Warehousing Trends, Best Practices, and Future Outlook – James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But, that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use-cases and discussion on commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Design Principles for a Modern Data Warehouse – Rob Winters
This document discusses design principles for a modern data warehouse based on case studies from de Bijenkorf and Travelbird. It advocates for a scalable cloud-based architecture using a bus, lambda architecture to process both real-time and batch data, a federated data model to handle structured and unstructured data, massively parallel processing databases, an agile data model like Data Vault, code automation, and using ELT rather than ETL. Specific technologies used by de Bijenkorf include AWS services, Snowplow, Rundeck, Jenkins, Pentaho, Vertica, Tableau, and automated Data Vault loading. Travelbird additionally uses Hadoop for initial data processing before loading into Redshift
The document discusses evolving data warehousing strategies and architecture options for implementing a modern data warehousing environment. It begins by describing traditional data warehouses and their limitations, such as lack of timeliness, flexibility, quality, and findability of data. It then discusses how data warehouses are evolving to be more modern by handling all types and sources of data, providing real-time access and self-service capabilities for users, and utilizing technologies like Hadoop and the cloud. Key aspects of a modern data warehouse architecture include the integration of data lakes, machine learning, streaming data, and offering a variety of deployment options. The document also covers data lake objectives, challenges, and implementation options for storing and analyzing large amounts of diverse data sources.
Achieving Lakehouse Models with Spark 3.0 – Databricks
It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star Schemas & Kimball is one of those things that isn’t going anywhere, but as we move towards the “Data Lakehouse” paradigm – how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise its performance?
The document discusses trends in data modeling for analytics. It outlines weaknesses in traditional enterprise data architectures that rely on ETL processes and large centralized data warehouses. A modern approach uses a data lake to store raw data files and enable just-in-time analytics using data virtualization. Key aspects of the data lake include storing data in folders by level of processing (raw, staging, ODS, aggregated), using file formats like Parquet, and creating star schemas and aggregations on top of the stored data.
Richard Vermillion, CEO of After, Inc. and Fulcrum Analytics, Inc. discusses data lakes and their value in supporting the warranty and extended service plan chain.
Dustin Vannoy presented on using Delta Lake with Azure Databricks. He began with an introduction to Spark and Databricks, demonstrating how to set up a workspace. He then discussed limitations of Spark including lack of ACID compliance and small file problems. Delta Lake addresses these issues with transaction logs for ACID transactions, schema enforcement, automatic file compaction, and performance optimizations like time travel. The presentation included demos of Delta Lake capabilities like schema validation, merging, and querying past versions of data.
Log Analytics and Application Insights can help with monitoring and managing integration solutions built with Microsoft technologies. They provide performance monitoring of APIs, functions, logic apps and other components. While end-to-end tracing has some limitations, the tools allow for custom logging, out-of-box views of data, and testing the availability of key applications and services.
Integrated Data Warehouse with Hadoop and Oracle Database – Gwen (Chen) Shapira
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It provides an overview of big data and why data warehouses need Hadoop. It also gives examples of how Hadoop can be integrated into a data warehouse, including using Sqoop to import and export data between Hadoop and Oracle. Finally, it discusses best practices for using Hadoop efficiently and avoiding common pitfalls when integrating Hadoop with a data warehouse.
The main idea of a Data Lake is to expose company data to the people within the company in an agile and flexible way, while preserving the safeguarding and auditing features required for the company’s critical data. The way most projects in this direction start out is by depositing all of the data in Hadoop, inferring a schema on top of the data, and then using the data for analytics via Hive or Spark. The described stack is a very good approach for many use cases, as it provides cheap storage of data in files and rich analytics on top. But many pitfalls and problems can show up along this road, and they can be addressed by extending the toolset. The potential bottlenecks will surface as soon as users arrive and start exploiting the Lake. For all of these reasons, planning and building a Data Lake within an organization requires a strategic approach, in order to build an architecture that can support it.
Presentation from Data Science Conference 2.0 held in Belgrade, Serbia. The focus of the talk was to address the challenges of deploying a Data Lake infrastructure within the organization.
Data Lakehouse, Data Mesh, and Data Fabric (r2) – James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Data Lakehouse, Data Mesh, and Data Fabric (r1) – James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what Big Data and a Data Lake are, and learn about the most popular technologies used in the Big Data world. We will also talk about Hadoop and Spark, how they integrate with traditional systems, and their benefits.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture – DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Prague data management meetup 2018-03-27 – Martin Bém
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
Is the traditional data warehouse dead? – James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
Big data represents a real challenge that is at once technical, business, and societal: exploiting massive volumes of data opens up possibilities for radical transformation of companies and of how data is used. At least, provided you are technically capable of it... because acquiring, storing, and exploiting massive quantities of data pose real technical challenges.
A big data architecture enables the creation and administration of all the technical systems that allow the data to be exploited properly.
There are a great many different tools for handling massive quantities of data: for storage, analysis, or distribution, for example. But how do you assemble these different tools into an architecture that can scale, tolerate failures, and be easily extended, all without letting costs explode?
The success of a big data operation depends on its architecture, on a correct infrastructure, and on the use that is made of it: "Data into Information into Value".
A big data architecture is composed of four main parts: Integration, Data Processing & Storage, Security, and Operations.
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra – Caserta
Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. Needed is a scalable Big Data infrastructure that processes and parses extremely high volume in real-time and calculates aggregations and statistics. Banking trade data where volumes can exceed billions of messages a day is a perfect example.
Firms are fast approaching 'the wall' in terms of scalability with relational databases, and must stop imposing relational structure on analytics data and map raw trade data to a data model in low latency, preserve the mapped data to disk, and handle ad-hoc data requests for data analytics.
Joe discusses and introduces NoSQL databases, describing how they are capable of scaling far beyond relational databases while maintaining performance, and shares a real-world case study that details the architecture and technologies needed to ingest high-volume data for real-time analytics.
For more information, visit www.casertaconcepts.com
Big Data Architecture Workshop - Vahid Amiri – datastack
Big Data Architecture Workshop
This slide deck covers big data tools, technologies, and layers that can be used in enterprise solutions.
TopHPC Conference
2019
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
Building Data Intensive Analytic Application on Top of Delta Lakes – Databricks
Why build your own analytics application on top of Delta Lake: – Every enterprise is building a data lake. However, these data lakes are plagued by low user adoption and poor data quality, and result in lower ROI. – BI tools may not be enough for your use case, especially when you want to build a data-driven analytical web application such as paysa. – Delta’s ACID guarantees allow you to build a real-time reporting app that displays consistent and reliable data.
In this talk we will learn:
how to build your own analytics app on top of delta lake.
how Delta Lake helps you build pristine data lake with several ways to expose data to end-users
how analytics web application can be backed by custom Query layer that executes Spark SQL in remote Databricks cluster.
We’ll explore various options to build an analytics application using various backend technologies.
Various Architecture pattern/components/frameworks can be used to build custom analytics platform in no time.
How to leverage machine learning to build advanced analytics applications. Demo: an analytics application built on the Play Framework (back end), React (front end), and Structured Streaming for ingesting data from a Delta table, with live query analytics on real-time data and ML predictions based on the analytics data.
Sa introduction to big data pipelining with cassandra & spark west mins... – Simon Ambridge
This document provides an overview and outline of a 1-hour introduction to building a big data pipeline using Docker, Cassandra, Spark, Spark-Notebook and Akka. The introduction is presented as a half-day workshop at Devoxx November 2015. It uses a data pipeline environment from Data Fellas and demonstrates how to use scalable distributed technologies like Docker, Spark, Spark-Notebook and Cassandra to build a reactive, repeatable big data pipeline. The key takeaway is understanding how to construct such a pipeline.
ADV Slides: Building and Growing Organizational Analytics with Data Lakes – DATAVERSITY
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
20160331 sa introduction to big data pipelining berlin meetup 0.3 – Simon Ambridge
This document discusses building data pipelines with Apache Spark and DataStax Enterprise (DSE) for both static and real-time data. It describes how DSE provides a scalable, fault-tolerant platform for distributed data storage with Cassandra and real-time analytics with Spark. It also discusses using Kafka as a messaging queue for streaming data and processing it with Spark. The document provides examples of using notebooks, Parquet, and Akka for building pipelines to handle both large static datasets and fast, real-time streaming data sources.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... – DATAVERSITY
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
So You Want to Build a Data Lake?
1. So, you want to build a
Data Lake?
The Basics of Data Lakes, Key Considerations, and Lessons Learned
David P. Moore
12/15/2020
2. Agenda
• Introduction
• What is a Data Lake?
• Architecture and Design
• Governance and Support
• Lessons Learned
• What’s Next?
3. About Me…
• Sr. Software Developer at CarMax since 2019,
Consultant with CapTech for 3+ years
Before that worked at Capital One in a variety of
roles including Developer, Data Modeler, Tech
Lead
• Have worked on 3 data lake implementations
at 3 different companies using 3 different
technologies
• 20+ years in data and software dev, with a
passion for continuous improvement
• Two Fun facts:
I have a black belt in Silkisondan Karate
I love to play guitar and listen to music
5. First a little data history lesson…
Data warehouse and proprietary ETL and database tools
• 1990’s to mid 2000’s – Data Warehouse Popularized
Ralph Kimball – Star Schema, Data Marts
Bill Inmon - EDW
• SMP Database Systems (Oracle, SQL Server, Sybase)
• ETL Tools (Informatica, Ab Initio, Talend, etc)
• MPP Database Systems (Teradata, Netezza, Greenplum, etc)
ELT, 3NF
6. Open-source, big data and the
cloud…
• 2003, 2004 – Google File System, and Google MapReduce Papers
published
• 2006 – Hadoop started by Doug Cutting and Mike Cafarella
• 2008 - Companies like Cloudera, Hortonworks, MapR form to
package and distribute open-source Hadoop
• 2006 – AWS launched, followed by Google in 2008 and Azure in 2010
• 2010 – Apache Spark started by Matei Zaharia
• 2013 – Databricks launched offering Spark as a Service
• 2019 – Delta Lake released by Databricks
7. What is Big Data?
• Big Data is a term used to describe massive volumes of data that can
flood a business daily
• This data can be either structured or unstructured, but ultimately
the datasets are so large that they cannot be processed on a single
machine in a reasonable amount of time
• 3 V’s, popularized by Doug Laney from Gartner:
Volume Variety Velocity
8. What is a Data Lake?
• “A data lake is a system or repository of data stored in its natural/raw
format, usually object blobs or files.”
“A data lake is usually a single store of all enterprise data including raw copies
of source system data and transformed data used for tasks such as reporting,
visualization, advanced analytics and machine learning. A data lake can
include structured data from relational databases (rows and columns), semi-
structured data (CSV, logs, XML, JSON), unstructured data (emails,
documents, PDFs) and binary data (images, audio, video).”
Source: https://en.wikipedia.org/wiki/Data_lake
James Dixon of Pentaho:
“If you think of a datamart as a store of bottled water –
cleansed and packaged and structured for easy consumption –
the data lake is a large body of water in a more natural state.
The contents of the data lake stream in from a source to fill the
lake, and various users of the lake can come to examine, dive
in, or take samples.”
9. Data Warehouse vs. Data Lake
• Data Format: Data Warehouse – Structured; Data Lake – Structured, Semi-structured, Unstructured
• Data Schema / Modeling: Data Warehouse – Schema-on-Write; Data Lake – Schema-on-Read
• Relative Cost: Data Warehouse – $$$; Data Lake – $
• Flexibility: Data Warehouse – Less agile; Data Lake – Highly agile
• Performance: Data Warehouse – Tuned for fast query response; Data Lake – General-purpose access, slower responses
• Data Quality: Data Warehouse – High-quality, curated data; Data Lake – Lower-quality, raw data
• Target Users: Data Warehouse – Business Analysts; Data Lake – Data Scientists
• Typical Use Cases: Data Warehouse – Reporting, Visualizations; Data Lake – Predictive Analytics, Machine Learning
10. What is Delta Lake?
“Delta Lake is an open source storage layer that brings reliability to data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies
streaming and batch data processing. Delta Lake runs on top of your existing data
lake and is fully compatible with Apache Spark APIs.”
https://docs.delta.io/latest/delta-faq.html
Created by Databricks, and open sourced and contributed to the Linux Foundation as an
open standard, Delta Lake is a technology layer compatible with Apache Spark that
adds some database-like features to a data lake.
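As a rough illustration of how Delta Lake is used from Spark, the sketch below writes a small DataFrame as a Delta table, reads it back, and uses time travel. It assumes a Spark session that already has Delta Lake configured (for example, a Databricks cluster or a local session started with the delta-spark package); the path and column names are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Write a small DataFrame as a Delta table (Parquet files plus a _delta_log transaction log)
orders = spark.createDataFrame([(1, "open"), (2, "shipped")], ["order_id", "status"])
orders.write.format("delta").mode("overwrite").save("/tmp/lake/refined/orders")

# Reads see a consistent snapshot, and each write is an ACID transaction
spark.read.format("delta").load("/tmp/lake/refined/orders").show()

# Time travel: read the table as of an earlier version recorded in the transaction log
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/refined/orders").show()
```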
11. The cloud has enabled a massive
transformation in data capabilities
• Going from on-premises data centers, where provisioning new
hardware took weeks or months, to being able to scale up within
minutes
• Decoupling of compute from storage allows for flexible scalability and
optimizing costs
14. Cloud vs. On-Premises?
Cloud:
• Flexibility & Agility
• Scalability
• Op-ex cost model
• No data center
• Lack of control of data
• Depending on workload, costs can be higher
On-Premises:
• Slower Time to Market
• Limit of Scalability
• Cap-ex cost model
• Full control over data
• Depending on workload, costs could be lower
16. Data Lake Environments
DEV
TEST
PRODUCTION
• As in any traditional systems development, having multiple environments for
developing and testing code is necessary.
• Changes to each subsequent environment should be made via automation
• Pre-prod environments need to be kept in sync with prod
(Diagram: a Refresh Process keeps the pre-prod environments in sync with production)
17. Data Lake Zones
Landing
Raw (Bronze)
Clean/Valid (Silver)
Refined (Gold)
Secure
Sandbox
Data Lakes typically are divided into separate zones with data going through a
refining process as it progresses from one zone to the next.
Progression
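To make the zone progression concrete, here is a minimal sketch of promoting data from the Raw (Bronze) zone to the Clean/Valid (Silver) zone. The paths, column names, and validation rules are hypothetical; the point is simply that each zone is a folder and each hop is a job that refines the data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Raw zone: data kept as it arrived (here, JSON files landed by an ingestion job)
raw = spark.read.json("/lake/raw/clickstream/2020/12/15/")

# Clean/valid zone: enforce types, drop records that fail basic validation, remove duplicates
clean = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .filter(F.col("user_id").isNotNull())
       .dropDuplicates(["event_id"])
)

clean.write.mode("append").parquet("/lake/clean/clickstream/")
```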
18. Data Lake Storage paradigms
The Data Lake has two primary storage paradigms for accessing and dealing
with its data:
Hierarchical File System
Typically based on HDFS
Data organized into Files and Folders
N-levels deep
Based on Posix file system standard
Database
Typically based on Hive
Data is organized into Databases and Tables
2-levels deep
Compatible with SQL-based access
Most Data Lake systems use both at the same time, where the Database layer sits
on top of the File System. This can cause confusion for users.
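The two paradigms usually point at the same bytes: the files live in folders, and a Hive-style database/table is registered on top of a folder so the same data can also be queried with SQL. A small sketch continuing the previous example (database, table, and path names are hypothetical):

```python
# Expose the clickstream folder through the database paradigm as well
spark.sql("CREATE DATABASE IF NOT EXISTS clean")
spark.sql("""
    CREATE TABLE IF NOT EXISTS clean.clickstream
    USING PARQUET
    LOCATION '/lake/clean/clickstream/'
""")

# File-system access and SQL access now read the same underlying files
files_df = spark.read.parquet("/lake/clean/clickstream/")
sql_df = spark.sql("SELECT user_id, COUNT(*) AS events FROM clean.clickstream GROUP BY user_id")
```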
19. Storage Design Decisions
• Datasets in a data lake are typically defined at a folder level
instead of at the file level.
• At the top level there is typically a folder structure that aligns
with the zones
• There are two primary types of data to consider:
Event/Fact data (Clicks, Transactions, Sensor readings, etc)
Reference/Master/Dimension data (Customer, Product, etc)
• Reference/Dimension data requires thinking about how to store
history of changes:
1. Snapshots
2. Deltas
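For reference/dimension history, the snapshot approach writes a full copy of the source table each day into a dated folder, so any point in time can be reconstructed by reading one partition; the deltas approach appends only changed rows. A minimal sketch of the snapshot option, assuming `customers` is today's full extract already loaded into a DataFrame and the path is hypothetical:

```python
import datetime

snapshot_date = datetime.date.today().isoformat()

# Snapshot approach: one complete copy per day, addressable by folder/partition
customers.write.mode("overwrite").parquet(f"/lake/raw/customers/snapshot_date={snapshot_date}/")

# Deltas approach (alternative): append only rows changed since the last extract,
# e.g. filtered on a source-system updated_at column, and reconstruct state downstream.
```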
20. File formats and compression
An important design choice is what file format to use in the Lake as
well as whether to compress the data
For the Landing/Raw zone, the convention is to preserve the data in
whatever format it arrived in.
For subsequent zones, it makes sense to conform to a standard
format that is designed for data lakes that includes schema
information
Parquet is popular for analytics (Columnar) with Snappy
Compression
Delta Lake uses Parquet with additional metadata
ORC is an alternative columnar format popular on Hadoop
Avro is row-based popular for streaming (Kafka)
Avoid CSV or plain text formats where possible
Consider whether the format is splittable for parallel processing
CSV and Gzip may not be splittable formats
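A small illustration of the format choice, assuming a DataFrame named `clean` like the one in the earlier zone sketch: the same data written as gzip-compressed CSV versus Snappy-compressed Parquet. Parquet keeps the schema with the data, is columnar, and its files remain splittable for parallel reads, while a large gzip-compressed CSV file generally is not.

```python
# Plain-text CSV compressed with gzip: no embedded schema, and gzip files are not splittable
clean.write.mode("overwrite").option("compression", "gzip").csv("/lake/scratch/clickstream_csv/")

# Columnar Parquet with Snappy compression: schema travels with the files and reads stay parallel
clean.write.mode("overwrite").option("compression", "snappy").parquet("/lake/clean/clickstream_parquet/")
```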
21. Data Ingestion Choices
Ingestion is the process of getting data into the lake. When designing ingestion systems, there are many options and choices that need to be made, such as:
ETL frameworks:
• GUI-based
• Code-based
• Notebooks
• Metadata-driven
Frequency:
• Batch (weekly, daily, hourly)
• Micro batch (every N minutes)
• Streaming / Real time
Push vs. Pull:
• Push – systems send their data to the lake
• Pull – the lake initiates extracts
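As one concrete example of a metadata-driven, pull-style batch design, the sketch below loops over a small table list and lands each source table in the raw zone. All connection details, table names, and paths are placeholders, and it assumes an existing `spark` session plus the source database's JDBC driver on the cluster.

```python
# Hypothetical ingestion metadata: which source tables to pull and where to land them
sources = [
    {"table": "sales.orders",    "target": "/lake/raw/orders/"},
    {"table": "sales.customers", "target": "/lake/raw/customers/"},
]

for src in sources:
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://source-db:5432/sales")   # placeholder connection
        .option("dbtable", src["table"])
        .option("user", "reader")
        .option("password", "<from-secret-store>")
        .load()
    )
    # Land the extract as-is in the raw zone; cleaning happens in later zones
    df.write.mode("overwrite").parquet(src["target"])
```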
22. Data Catalog
The data catalog is a central part of managing the lake
and should have features such as:
• Dataset definitions
• Fields/column definitions
• Tags: Owner, Classification, PII
• Subject Matter Experts (SMEs)
Modern catalog tools also provide features such as:
• Crowdsourcing of metadata and gamification
• Automated annotation
Some examples:
Alation
Lumada Data Catalog
IBM Watson Knowledge
Catalog
AWS Glue
Azure Data Catalog /
Purview
23. Hive Metastore
• Most data lakes that are Hadoop-based or Spark-based rely on a
metadata catalog called the Hive Metastore
• It is important to consider how this should be provisioned and
managed
• The metastore is a relational database and supports a variety of
DBMS types including both open source (PostgreSQL, MySQL) and
closed (Oracle, MS SQL Server)
• Some configurations allow for an external metastore that can be
shared by workspaces (i.e. Databricks)
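For illustration, this is roughly what pointing Spark at an externally managed Hive Metastore database can look like. The JDBC URL, driver, and credentials are placeholders, and the exact configuration keys and mechanism vary by platform (Databricks documents its own set), so treat this as a sketch rather than a recipe.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("external-metastore-example")
    # Hive Metastore backed by an external relational database (placeholder values)
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://metastore-host:3306/metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.mariadb.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<from-secret-store>")
    .enableHiveSupport()
    .getOrCreate()
)
```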
24. Data Lake Consuming Systems
The lake will most likely host
multiple consuming systems
including:
• Data Warehouses
• Data Marts
• Operational Data Stores
• Feature Stores
• Data products or applications
Dashboards
Alerts/Notifications
Automated Actions
Datasets
Designing and architecting for data
consumption will require answering
questions such as:
• Will systems pull data from the
lake, or will data be pushed?
• How will these systems access the
data?
• How will systems be notified that
data is available?
• What environments will these
systems use for developing and
testing?
• What APIs will be used? (JDBC,
ODBC, REST, SFTP)
25. Example: Modern Data Warehouse
in Azure
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
28. Keeping the Lake Secure
• Network security controls
• Role-Based Access Controls (RBAC)
• Encryption
Transparent Data Encryption
Explicit Encryption
• Row level and column level access
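Row- and column-level restrictions are often implemented as views over the underlying tables, with access granted to the view rather than to the raw data. A hedged sketch, assuming a platform that supports SQL GRANTs on tables and views (for example, Databricks table access control); the table, view, and group names are made up.

```python
# Expose only non-sensitive columns through a view and grant access to that view
spark.sql("""
    CREATE OR REPLACE VIEW clean.customers_masked AS
    SELECT customer_id, state, loyalty_tier   -- PII columns such as email are intentionally omitted
    FROM clean.customers
""")

# GRANT syntax and enforcement depend on the platform's access-control implementation
spark.sql("GRANT SELECT ON VIEW clean.customers_masked TO `data-analysts`")
```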
29. Keeping the Lake Available
• Service Level Agreements
RPO and RTO
• Backups
Data
Configuration
Secrets
• Version Control
• Resource Locks
• Geo-Redundancy
• Automation
What’s your disaster recovery plan?
30. Access Patterns and Roles
The Lake needs to support several different types of access patterns:
1. System Access
Platform systems
Applications
2. Business User Access
Data Analysts
Data Scientists
3. Technology User Access
Support Access
Developer Access
Each of these groups need to have different access rights appropriate to the
role.
31. Regulations and Policies
impacting the Lake
External Regulations
• GDPR
• CCPA
• HIPAA
• PCI
Internal Policies
• PII and Privacy
• Information Classification
Some regulations such as GDPR and CCPA require customer data to
be disclosed and/or deleted. This requires careful design.
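A right-to-be-forgotten request usually needs more than dropping files: the record has to be deleted from the table and the old data files physically removed. With Delta Lake that can look roughly like the sketch below; the path, column, and retention values are illustrative, and the VACUUM retention window has to respect your time-travel and recovery requirements.

```python
from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/lake/refined/customers")

# Logically delete the customer's rows in an ACID transaction
customers.delete("customer_id = '12345'")

# Physically remove data files no longer referenced by the table, after the retention window
spark.sql("VACUUM delta.`/lake/refined/customers` RETAIN 168 HOURS")
```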
32. User Support
• Data Catalog
• Access to Data and Tools
• Training
• Sandbox Provisioning
• Help & Support
33. Technical Exploration and Tool
Selection
• Explore and select tools and technologies
• Minimize number of tools
• Choose best of breed
• Consider Total Cost of Ownership (TCO)
• Select compatible technologies
36. 1. Managing Environments is Hard
2. Automate Everything
3. Don’t rush to fill the lake, you might wind up with a swamp
4. Know your data
5. Pick a high value use case and demonstrate value quickly
6. Minimize complexity
7. Make sure you have backups
8. Enable self-service
9. But set limits and controls on user space
10. Try out different options, but settle on a single solution
38. Machine Learning and AI
The Data Lake should not be an end in itself, but instead
should be an enabler of new ways of using data for the benefit
of the business and its customers.
Machine Learning and Artificial intelligence hold much
promise and potential to leverage big data to create
innovative data products.
Some newer capabilities that are critical to this include:
• Feature Stores – Systems for storing and managing
“features” used by machine learning pipelines or models
• Model Registries – Systems for storing, managing and
operationalizing predictive models
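As a small illustration of the model-registry idea, the sketch below trains a toy scikit-learn model and registers it with MLflow. It assumes an MLflow tracking server with the Model Registry enabled (Databricks provides one); the model name is arbitrary.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    # Logs the model as a run artifact and registers a new version in the Model Registry
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="iris_classifier")
```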
40. The Event Streaming Platform
Championed by Confluent (creators of Kafka)
this enterprise architecture pattern uses a
hub-and-spoke model where systems stream
events to a hub, which can be read by other
systems.
• Enables real-time event driven systems
• Simplifies point to point dependencies
• Complements Data Lakes, Data Warehouses
and other systems
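Data lakes commonly plug into such a hub by consuming topics with Spark Structured Streaming and landing them in the raw zone. A hedged sketch, assuming a cluster with the Kafka connector and Delta Lake available; the broker address, topic, and paths are placeholders.

```python
# Read a Kafka topic as a stream and land it in the lake's raw zone as a Delta table
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "customer-events")
    .option("startingOffsets", "earliest")
    .load()
)

query = (
    events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")
    .writeStream.format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/customer_events")
    .outputMode("append")
    .start("/lake/raw/customer_events")
)
```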