Dustin Vannoy presented on using Delta Lake with Azure Databricks. He began with an introduction to Spark and Databricks, demonstrating how to set up a workspace. He then discussed limitations of Spark, including lack of ACID compliance and small-file problems. Delta Lake addresses these issues with a transaction log for ACID transactions, schema enforcement, automatic file compaction and other performance optimizations, and time travel. The presentation included demos of Delta Lake capabilities such as schema validation, merging, and querying past versions of data.
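A minimal PySpark sketch of the merge and schema-enforcement behavior described above, assuming the delta-spark package is available (it is preinstalled on Databricks; locally you would also configure the Delta SQL extensions); the path and column names are illustrative:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# On Databricks a SparkSession named `spark` already exists with Delta configured.
spark = SparkSession.builder.getOrCreate()

# Write an initial Delta table; its schema is recorded in the transaction log,
# so later appends with an incompatible schema are rejected (schema enforcement).
spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "tier"]) \
    .write.format("delta").mode("overwrite").save("/tmp/demo/customers")

updates = spark.createDataFrame([(2, "gold"), (3, "silver")], ["id", "tier"])

# MERGE (upsert): update rows whose id matches, insert the rest, as one ACID transaction.
target = DeltaTable.forPath(spark, "/tmp/demo/customers")
(target.alias("t")
       .merge(updates.alias("u"), "t.id = u.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```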
Modern DW Architecture
- The document discusses modern data warehouse architectures using Azure cloud services like Azure Data Lake, Azure Databricks, and Azure Synapse. It covers storage options like ADLS Gen 1 and Gen 2 and data processing tools like Databricks and Synapse. It highlights how to optimize architectures for cost and performance using features like auto-scaling, shutdown, and lifecycle management policies. Finally, it provides a demo of a sample end-to-end data pipeline.
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat... (Microsoft Tech Community)
In this session you will learn how to develop data pipelines in Azure Data Factory and build a cloud-based analytical solution, adopting modern data warehouse approaches with Azure SQL Data Warehouse and implementing incremental ETL orchestration at scale. With the multiple sources and types of data available in an enterprise today, Azure Data Factory enables full integration of data and direct storage in Azure SQL Data Warehouse for powerful, high-performance query workloads, which drive a majority of enterprise applications and business intelligence applications.
IBM Cloud Day January 2021 - A well architected data lake (Torsten Steinbach)
- The document discusses an IBM Cloud Day 2021 event focused on well-architected data lakes. It provides an overview of two sessions on data lake architecture and building a cloud native data lake on IBM Cloud.
- It also summarizes the key capabilities organizations need from a data lake, including visualizing data, flexibility/accessibility, governance, and gaining insights. Cloud data lakes can address these needs for various roles.
At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles such as architecture, application logic, and user experience. We will look at how security, cluster configuration, resource consumption, and workflow changed by using Databricks clusters, as well as how using Delta tables simplified our application logic and data operations.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data, developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control, and indexing to your data lakes. We uncover Delta Lake's benefits and why it matters to you. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which helps concurrent read/write operations and enables efficient insert, update, delete, and rollback capabilities. It allows background file optimization through compaction and z-order partitioning, achieving better performance. In this presentation, we will learn the Delta Lake benefits, how it solves common data lake challenges, and, most importantly, the new Delta Time Travel capability.
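As a rough illustration of the time travel capability called out above (the table path, version, and timestamp are placeholders):

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` is already defined with Delta support.
spark = SparkSession.builder.getOrCreate()

# Read the table as of an earlier version or timestamp recorded in the transaction log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/customers")
as_of = (spark.read.format("delta")
         .option("timestampAsOf", "2020-06-01")
         .load("/tmp/demo/customers"))

# The full change history of the table is queryable as well.
spark.sql("DESCRIBE HISTORY delta.`/tmp/demo/customers`").show(truncate=False)
```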
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Spark as a Service with Azure Databricks (Lace Lofranco)
Presented at: Global Azure Bootcamp (Melbourne)
Participants will get a deep dive into one of Azure’s newest offerings: Azure Databricks, a fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure. In this session, we will go through Azure Databricks key collaboration features, cluster management, and tight data integration with Azure data sources. We’ll also walk through an end-to-end Recommendation System Data Pipeline built using Spark on Azure Databricks.
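For flavor, a condensed sketch of the kind of recommender such a pipeline might train with Spark MLlib's ALS; the ratings file, schema, and parameters are assumptions rather than the presenter's actual code:

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed input: explicit ratings with user, item, and rating columns.
ratings = spark.read.csv("/mnt/data/ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(train)

rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(model.transform(test))
print(f"Test RMSE: {rmse:.3f}")

# Top-10 recommendations per user, e.g. to land back in the data lake for serving.
model.recommendForAllUsers(10).show(5, truncate=False)
```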
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara... (Databricks)
At Lennox International, we have thousands of IoT-connected devices streaming data into the Azure platform with a minute-level polling interval. The challenge was to use these data sets, combine them with external data sources such as weather, and predict equipment failure with high levels of accuracy along with their influencing patterns and parameters. Previously the team was using a combination of on-premises and desktop tools to run algorithms on a sample set of devices. The result was low accuracy levels (around 65%) on a process that took more than 6 hours.
The team had to work through several data orchestration challenges and identify a machine learning platform that enabled collaboration between our engineering SMEs, data engineers, and data scientists. The team decided to use Azure Databricks to build the data engineering pipelines and appropriate machine learning models and to extract predictions using PySpark. To enhance the sophistication of the learning, the team worked with a variety of Spark ML models such as Gradient Boosted Trees and Random Forest. The team also implemented stacking and ensemble methods using H2O Driverless AI and Sparkling Water on Azure Databricks clusters, which can scale up to 1,000 cores.
Join us in this session and see how this resulted in models that run in 40 minutes with minimal tuning and predict failures with accuracy of about 90%.
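A minimal sketch of a gradient-boosted-tree failure classifier along those lines; the feature and label columns are invented for illustration and are not Lennox's actual model:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed telemetry: sensor readings joined with weather, plus a binary failure label.
telemetry = spark.read.format("delta").load("/mnt/iot/telemetry_features")

assembler = VectorAssembler(
    inputCols=["compressor_temp", "suction_pressure", "runtime_minutes", "outdoor_temp"],
    outputCol="features")
gbt = GBTClassifier(labelCol="failed_within_7d", featuresCol="features", maxIter=50)

train, test = telemetry.randomSplit([0.8, 0.2], seed=7)
model = Pipeline(stages=[assembler, gbt]).fit(train)

auc = BinaryClassificationEvaluator(labelCol="failed_within_7d").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```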
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
Apache Spark is a fast and general engine for large-scale data processing. It was created at UC Berkeley and is now the dominant framework in big data. Spark can run programs over 100x faster than Hadoop MapReduce in memory, or more than 10x faster on disk. It supports Scala, Java, Python, and R. Databricks provides a Spark platform on Azure that is optimized for performance and integrates tightly with other Azure services. Key benefits of Databricks on Azure include security, ease of use, data access, high performance, and the ability to solve complex analytics problems.
In this session we will delve into the world of Azure Databricks and analyze why, in conjunction with Azure services, it is becoming a fundamental tool for data scientists and data engineers.
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole (Vasu S)
This ebook deep dives into Apache Spark optimizations that improve performance, reduce costs and deliver unmatched scale
https://www.qubole.com/resources/ebooks/accelerating-time-to-value-of-big-data-of-apache-spark
This document summarizes an IBM Cloud Day 2021 presentation on IBM Cloud Data Lakes. It describes the architecture of IBM Cloud Data Lakes including data skipping capabilities, serverless analytics, and metadata management. It then discusses an example COVID-19 data lake built on IBM Cloud to provide trusted COVID-19 data to analytics applications. Key aspects included landing, preparation, and integration zones; serverless pipelines for data ingestion and transformation; and a data mart for querying and reporting.
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign... (Michael Rys)
Presentation by James Baker and myself on Running cost effective big data workloads with Azure Synapse and Azure Data Lake Storage (ADLS) at Microsoft Ignite 2020. Covers the modern data warehouse architecture supported by Azure Synapse, integration benefits with ADLS, and some features that reduce cost, such as Query Acceleration, integration of Spark and SQL processing with integrated metadata, and .NET for Apache Spark support.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard (Paris Data Engineers !)
Delta Lake is an open source framework living on top of parquet in your data lake to provide reliability and performance. It has been open-sourced by Databricks this year and is gaining traction to become the de facto data lake table format.
We’ll see all the good Delta Lake can do for your data: ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
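A small sketch of the batch and stream symmetry mentioned above: the same Delta table can be written from a batch job and appended to by a Structured Streaming query. Paths and the checkpoint location are placeholders.

```python
# `spark` is the SparkSession provided in a Databricks notebook (Delta preconfigured).

# Batch: create or overwrite a Delta table from files already in the lake.
(spark.read.json("/mnt/raw/events/")
    .write.format("delta").mode("overwrite").save("/mnt/delta/events"))

# Stream: continuously append new files from the same source directory to the same table.
stream = (spark.readStream
    .schema(spark.read.json("/mnt/raw/events/").schema)
    .json("/mnt/raw/events/"))

query = (stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/events")
    .outputMode("append")
    .start("/mnt/delta/events"))
```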
The Developer Data Scientist – Creating New Analytics Driven Applications usi... (Microsoft Tech Community)
The developer world is changing as we create and generate new data patterns and handling processes within our applications. Additionally, with the massive interest in machine learning and advanced analytics, how can we as developers build intelligence directly into our applications that can integrate with the data and data paths we are creating? The answer is Azure Databricks, and by attending this session you will be able to confidently develop smarter and more intelligent applications and solutions which can be continuously built upon and that can scale with the growing demands of a modern application estate.
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
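To give a sense of the querying side of that session, a hedged sketch of reading curated data from the lake in a Databricks notebook; the path, table name, and columns are illustrative:

```python
# `spark` is the SparkSession provided in a Databricks notebook.

# Register the curated zone of the lake as a table and query it with Spark SQL.
spark.read.format("delta").load("/mnt/datalake/curated/trips") \
    .createOrReplaceTempView("trips")

spark.sql("""
    SELECT vendor_id,
           date_trunc('day', pickup_time) AS day,
           avg(trip_distance)             AS avg_distance
    FROM trips
    GROUP BY vendor_id, date_trunc('day', pickup_time)
    ORDER BY day
""").show()
```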
Using Redash for SQL Analytics on Databricks (Databricks)
This talk gives a brief overview with a demo performing SQL analytics with Redash and Databricks. We will introduce some of the new features coming as part of our integration with Databricks following the acquisition earlier this year, along with a demo of the other Redash features that enable a productive SQL experience on top of Delta Lake.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business-systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Azure Data Lake and Azure Data Lake Analytics (Waqas Idrees)
This document provides an overview and introduction to Azure Data Lake Analytics. It begins with defining big data and its characteristics. It then discusses the history and origins of Azure Data Lake in addressing massive data needs. Key components of Azure Data Lake are introduced, including Azure Data Lake Store for storing vast amounts of data and Azure Data Lake Analytics for performing analytics. U-SQL is covered as the query language for Azure Data Lake Analytics. The document also touches on related Azure services like Azure Data Factory for data movement. Overall it aims to give attendees an understanding of Azure Data Lake and how it can be used to store and analyze large, diverse datasets.
Big Data Advanced Analytics on Microsoft Azure (Mark Tabladillo)
This presentation provides a survey of the advanced analytics strengths of Microsoft Azure from an enterprise perspective (with these organizations being the bulk of big data users) based on the Team Data Science Process. The talk also covers the range of analytics and advanced analytics solutions available for developers using data science and artificial intelligence from Microsoft Azure.
Azure Databricks—Apache Spark as a Service with Sascha Dittmann (Databricks)
The driving force behind Apache Spark (Databricks Inc.) and Microsoft have designed a joint service to quickly and easily create Big Data and Advanced Analytics solutions. The combination of the comprehensive Databricks Unified Analytics platform and the powerful capabilities of Microsoft Azure makes it easy to analyse data streams or large amounts of data, as well as to train AI models. Sascha Dittmann shows in this session how the new Azure service can be set up and used in various real-world scenarios. He also shows how to connect the various Azure services to the Azure Databricks service.
Azure Databricks - An Introduction 2019 Roadshow.pptx (pascalsegoul)
Proposed PowerPoint structure
1. Introduction to the context
Business objective
Why Snowflake?
Why Data Vault?
2. Target architecture
Simplified diagram: RAW zone → Data Vault → Data Marts
Description of the schemas: RAW, DV, DM
3. Source data
Example: CSV file of orders (customer, product, date, amount, etc.)
File structure
4. Staging zone (RAW)
CREATE STAGE
COPY INTO → into the RAW table
Screenshot of the SQL script + result
5. Creating the HUBs
HUB_CLIENT, HUB_PRODUIT…
Business definition
SQL script with INSERT DISTINCT (see the sketch after this outline)
6. Creating the LINKs
LINK_COMMANDE (Customer ↔ Product ↔ Date)
Structure with technical keys
SQL script + business logic
7. Creating the SATELLITEs
SAT_CLIENT_DETAILS, SAT_PRODUIT_DETAILS…
Historization with LOAD_DATE, END_DATE, HASH_DIFF
SQL script (MERGE or conditional INSERT)
8. Orchestration
Example flow via dbt or Airflow (or simply a SQL sequence)
Screenshot of a dbt YAML model or an Airflow DAG
9. Creating the business views (DM)
Aggregated view of monthly sales
Complex SELECT over HUB + LINK + SAT
Screenshot or example result
10. Visualization
Connection to Power BI / Tableau
Screenshot of a simple chart based on a DM view
11. Conclusion and benefits
Reliability, auditability, versioning, history
Suited to production environments
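To make the hub-load step concrete, a rough sketch of what such a script could look like when issued from Python with the Snowflake connector; the connection parameters, table, and column names follow the outline but are illustrative only:

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="DV_DEMO", schema="DV")

# Load HUB_CLIENT from the RAW orders table: one row per distinct business key,
# with a hashed key, load date, and record source, as Data Vault prescribes.
hub_client_sql = """
INSERT INTO HUB_CLIENT (CLIENT_HK, CLIENT_ID, LOAD_DATE, RECORD_SOURCE)
SELECT DISTINCT MD5(r.CLIENT_ID), r.CLIENT_ID, CURRENT_TIMESTAMP(), 'orders_csv'
FROM RAW.ORDERS r
WHERE r.CLIENT_ID IS NOT NULL
  AND r.CLIENT_ID NOT IN (SELECT CLIENT_ID FROM HUB_CLIENT)
"""

cur = conn.cursor()
cur.execute(hub_client_sql)
cur.close()
conn.close()
```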
Azure satpn19 time series analytics with Azure ADX (Riccardo Zamana)
The document discusses Azure Data Explorer (ADX), a fully managed data analytics service for real-time analysis on large volumes of data. It provides an overview of ADX, describing its key features such as fast query performance, optimized ingestion for streaming data, and its ability to enable data exploration. Examples of typical use cases for ADX including telemetry analytics and providing a backend for multi-tenant SaaS solutions are also presented. The document then dives into various ADX concepts like clusters, databases, ingestion techniques, supported data formats, and language examples to help users get started with the service.
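To get a feel for the query experience, a hedged sketch of running a KQL query against an ADX cluster with the azure-kusto-data Python client; the cluster URL, database, and table are placeholders:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://mycluster.westeurope.kusto.windows.net"  # placeholder cluster URL
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# A simple telemetry-analytics style KQL query: event counts per device over the last hour.
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize Events = count() by DeviceId, bin(Timestamp, 5m)
| order by Timestamp asc
"""

response = client.execute("iot_db", query)
for row in response.primary_results[0]:
    print(row["DeviceId"], row["Timestamp"], row["Events"])
```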
How should you approach the architecture of a solution in the cloud? How does it differ from traditional hosting?
We will illustrate the main principles of cloud development using the example of a typical web application. We will build the architecture step by step to make it scalable and let it benefit from the advantages of the cloud.
We will then look at the different types of implementations and technology choices possible for this architecture on the Microsoft Azure cloud. We will consider infrastructure services (VMs, containers, ...) as well as higher-level platform services, serverless, and managed databases.
We will then zoom in on data acquisition and processing in a Big Data context and look at the characteristics of a lambda architecture and its possible implementations on Azure (Hadoop, ...). We will finish with the different ways to add intelligence to a solution: from the simplest for a developer to implement, via pre-packaged APIs, to the most elaborate and customizable for the data scientist. We will also see how to make it more easily accessible to users via a Skype, Facebook, Slack, email, or SMS bot...
Slides for the meetup https://www.meetup.com/fr-FR/Duchess-France-Meetup/events/238437772/
This document provides an overview of a course on implementing a modern data platform architecture using Azure services. The course objectives are to understand cloud and big data concepts, the role of Azure data services in a modern data platform, and how to implement a reference architecture using Azure data services. The course will provide an ARM template for a data platform solution that can address most data challenges.
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job?
This session explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.
No matter if you think about open source Apache Kafka, a cloud service like Confluent Cloud, or another technology using the Kafka protocol like Redpanda or Pulsar, check out this slide deck.
A detailed article about this topic:
https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022 (HostedbyConfluent)
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This session explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.
Azure Data Explorer deep dive - review 04.2020 (Riccardo Zamana)
Modern Data Science Lifecycle with ADX & Azure
This document discusses using Azure Data Explorer (ADX) for data science workflows. ADX is a fully managed analytics service for real-time analysis of streaming data. It allows for ad-hoc querying of data using Kusto Query Language (KQL) and integrates with various Azure data ingestion sources. The document provides an overview of the ADX architecture and compares it to other time series databases. It also covers best practices for ingesting data, visualizing results, and automating workflows using tools like Azure Data Factory.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... (Michael Rys)
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Spark + AI Summit 2020 had over 35,000 attendees from 125 countries. The majority of participants were data engineers and data scientists. Apache Spark is now widely used with Python and SQL. Spark 3.0 includes improvements like adaptive query execution that accelerate queries by 2-18x. Delta Engine is a new high performance query engine for data lakes built on Spark 3.0.
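For reference, adaptive query execution is controlled by a session configuration in Spark 3.0; a minimal sketch (on Databricks the flag may already be enabled by default):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable adaptive query execution so Spark re-optimizes join strategies and
# shuffle partition counts at runtime based on observed statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```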
Differentiate Big Data vs Data Warehouse use cases for a cloud solution (James Serra)
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
AIDEVDAY_ Data-in-Motion to Supercharge AI (Timothy Spann)
AIDEVDAY_ Data-in-Motion to Supercharge AI
https://www.meetup.com/futureofdata-newyork/events/295376737/
Lightning Talk 2: Data-in-Motion to Supercharge AI
Speaker: Timothy Spann @Cloudera
Abstract: A quick look at the current state of real-time streaming for powering both data ingest and transformation to provide training and enhancement data to models. Also how to use streaming to feed a pipeline of data against your models, or models hosted at HuggingFace or elsewhere.
https://huggingface.co/bigscience/bloom
https://www.aicamp.ai/event/eventdetails/W2023082314
Apache NiFi, Apache Kafka, Apache Flink, HuggingFace, WatsonX.AI, REST API, Cloudera Machine Learning (CML), Bloom, Deep Learning, AI
JConWorld_ Continuous SQL with Kafka and Flink (Timothy Spann)
JConWorld: Continuous SQL with Kafka and Flink
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html
https://www.youtube.com/channel/UCDIDMDfje6jAvNE8DGkJ3_w?view_as=subscriber
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
The document summarizes upcoming presentations and news from the Brisbane Azure User Group (BAUG) for October 2018. Key points include:
- Upcoming presentations in October will be by Todd Whitehead on training and deploying custom AI to the edge.
- Microsoft is extending the retirement deadline of the Access Control Service to February 4, 2019 to allow more customers to complete their migration.
- New previews of Azure SQL Database Hyperscale supporting up to 100TB databases and Azure Sphere for hardware-based security on IoT devices.
- General availability releases of reserved capacity for Azure Cosmos DB, Azure SignalR service, and Tomcat/Java support on App Service on Linux.
Let's make a brief introduction to Azure Data eXplorer, with many examples using Kusto dialect and C# client.
With a particular focus on IIoT contexts and process control data, let's discover how to implement time series analysis in terms of pattern recognition and trend correlation.
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store (DataStax Academy)
We will present our Office 365 use case scenarios, why we chose Cassandra + Spark, and walk through the architecture we chose for running DSE on Azure.
The presentation will feature demos on how you too can build similar applications.
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ... (Michael Rys)
This document introduces .NET for Apache Spark, which allows .NET developers to use the Apache Spark analytics engine for big data and machine learning. It discusses why .NET support is needed for Apache Spark given that much business logic is written in .NET. It provides an overview of .NET for Apache Spark's capabilities including Spark DataFrames, machine learning, and performance that is on par or faster than PySpark. Examples and demos are shown. Future plans are discussed to improve the tooling, expand programming experiences, and provide out-of-box experiences on platforms like Azure HDInsight and Azure Databricks. Readers are encouraged to engage with the open source project and provide feedback.
Building a Real-Time IoT monitoring application with Azure (Davide Mauri)
Being able to analyze data in real time is already a very hot topic, and it will only become more so. From product recommendations to fraud detection alarms, a lot of things would be better if they could happen in real time. In this session a sample solution using the serverless capabilities of Azure will be developed, right from the ingestion of sensor data to its analysis and recommendation using AI in real time. Come see how you could do the same in your environment, moving your application capabilities to the next level.
A talk shared at a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
Stargate, the gateway for some multi-models data API (Data Con LA)
Cedrick Lunven presents on the gateway for multi-model Data APIs. The presentation discusses why data gateways are rising in popularity, the architecture and implementations of gateways like Stargate, how Apache Cassandra can be used as a multi-model database, and demos Astra which is a Cassandra-as-a-Service. The presentation aims to explain the benefits of data gateways for both developers and database administrators.
2. Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
[email protected]
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
9. Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
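To ground this, the same few lines of PySpark run unchanged on a laptop or on a many-node cluster; the file path and columns here are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("why-spark").getOrCreate()

# The same DataFrame code scales from a local run to hundreds of executors.
trips = spark.read.csv("/data/taxi_trips.csv", header=True, inferSchema=True)

(trips.groupBy("vendor_id")
      .agg(F.avg("trip_distance").alias("avg_distance"),
           F.count("*").alias("trip_count"))
      .show())
```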
18. Why Kafka?
Streaming data directly from one system to another is often problematic.
Kafka serves as the scalable broker, keeping up with producers and persisting data for all consumers.
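A rough sketch of the producer side, assuming the kafka-python package and a broker on localhost; the topic name and payload are made up:

```python
import json
from kafka import KafkaProducer

# Producers write events to a topic; consumers read at their own pace,
# so the broker absorbs the speed difference between systems.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

producer.send("taxi-rides", {"vendor_id": 2, "trip_distance": 3.4})
producer.flush()
```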
23. Why Event Hubs?
Same core capability as Kafka, using PaaS instead of IaaS.
Choose between the Kafka or Event Hubs APIs; avoid the operational overhead of managing Kafka.
25. Event Hubs Namespace Setup
Standard pricing tier to enable Kafka
Each throughput unit: 1 MB/s ingress, 2 MB/s egress
Auto-Inflate to allow autoscaling
26. Event Hub Setup
Partition count: maximum number of parallel consumers
Message retention: more days = more $
Capture: save to Azure Storage
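A hedged sketch of reading from an event hub with Spark Structured Streaming through the namespace's Kafka-compatible endpoint; the namespace, event hub name, and connection string are placeholders:

```python
# `spark` is the SparkSession provided in a Databricks notebook.

namespace = "my-namespace"                # Event Hubs namespace (placeholder)
eventhub = "taxi-rides"                   # event hub (topic) name (placeholder)
connection_string = "Endpoint=sb://..."   # namespace connection string (placeholder)

# Event Hubs exposes a Kafka endpoint on port 9093 secured with SASL_SSL, where the
# literal user "$ConnectionString" plus the connection string act as the credentials.
# On open-source Spark, drop the "kafkashaded." prefix from the login module class.
eh_sasl = (
    'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
    f'username="$ConnectionString" password="{connection_string}";')

raw_stream = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", f"{namespace}.servicebus.windows.net:9093")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", eh_sasl)
    .option("subscribe", eventhub)
    .option("startingOffsets", "latest")
    .load())

# The payload arrives as bytes in the `value` column.
events = raw_stream.selectExpr("CAST(value AS STRING) AS json")
```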
#2: In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough anymore. Our customers and business leaders see information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size. In this session we will talk about why we need to shift some of our workloads from batch data jobs to streaming in real time. We'll dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
#4: Shifting to Streaming:
We don’t have to convince our stakeholders that they don’t really need streaming; instead, understand the needs, find the right uses for streaming, and make it happen.
Discuss pros and cons, considerations before going to production, and general use cases in AI/ML
Spark, Event Hubs, and Kafka
Define the systems we will be using for this session, including some of the reasons we choose them
Talk about some of the options for using these together
Getting Hands On
Review dependencies that are not covered
Walk through basic setup of the most important pieces
Demo of use case code, highlighting some important Structured Streaming components
Best Practices
Cover things to consider when working with Spark Structured Streaming and Kafka or Event Hubs
#7: In the world of data science, those of us who develop ETL pipelines have determined that everything can be processed in nightly or hourly batches, but that only makes sense to data engineers. Our customers and business leaders see information is being created all the time and realize it should be available much sooner.
#8: Dealing with a large set of data at once brings its own challenges (a lot of resources needed at once, large table joins, running out of memory, etc.)
Process data as it comes in for cleaner logic (rather than seeing the latest state, we see events as they happen and update state downstream)
Even if you are not doing real-time analytics yet, prepare for when you will - the times they are a-changin'
#11: A fast and general engine for large-scale data processing that uses memory to provide a performance benefit
Often replaces MapReduce as the parallel programming API on Hadoop; the way it handles data (RDDs) provides one performance benefit, and use of memory when possible provides another large performance benefit
Can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits…I find it's still a useful API though
Run Java, Scala, Python, or R. If you don't already know one of those languages really well, I recommend trying it in Python and Scala and picking whichever is easiest for you.
Several modules for different use cases with a similar API, so you can swap between modes relatively easily.
For example, we have both streaming and batch sources of some data and we reuse the rest of the Spark processing transformations.
#12: A fast and general engine for large-scale data processing that uses memory to provide a performance benefit
Often replaces MapReduce as the parallel programming API on Hadoop; the way it handles data (RDDs) provides one performance benefit, and use of memory when possible provides another large performance benefit
Can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits…I find it's still a useful API though
Run Java, Scala, Python, or R. If you don't already know one of those languages really well, I recommend trying it in Python and Scala and picking whichever is easiest for you.
Several modules for different use cases with a similar API, so you can swap between modes relatively easily.
For example, we have both streaming and batch sources of some data and we reuse the rest of the Spark processing transformations.
#13: A fast and general engine for large-scale data processing that uses memory to provide a performance benefit
Often replaces MapReduce as the parallel programming API on Hadoop; the way it handles data (RDDs) provides one performance benefit, and use of memory when possible provides another large performance benefit
Can run on Hadoop (using YARN) but also as a separate Spark cluster. Local mode is possible as well but reduces the performance benefits…I find it's still a useful API though
Run Java, Scala, Python, or R. If you don't already know one of those languages really well, I recommend trying it in Python and Scala and picking whichever is easiest for you.
Several modules for different use cases with a similar API, so you can swap between modes relatively easily.
For example, we have both streaming and batch sources of some data and we reuse the rest of the Spark processing transformations.
#16: Window is essentially like grouping.
Continuously compute the average distance for each vendor over the last 10 minutes
#17: Window is essentially like grouping.
Continuously compute the average distance for each vendor over the last 10 minutes
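A compact sketch of that computation in Structured Streaming, assuming a streaming DataFrame named trips_stream with pickup_time, vendor_id, and trip_distance columns (these names are placeholders):

```python
from pyspark.sql import functions as F

# Continuously compute the average distance per vendor over 10-minute windows,
# tolerating events that arrive up to 5 minutes late.
avg_by_vendor = (trips_stream
    .withWatermark("pickup_time", "5 minutes")
    .groupBy(F.window("pickup_time", "10 minutes"), "vendor_id")
    .agg(F.avg("trip_distance").alias("avg_distance")))

query = (avg_by_vendor.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("avg_distance_by_vendor")
    .start())
```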
#30: Quick overview of important Databricks workspace segments – Clusters, Tables, Notebooks
Open the create_parquet_tables notebook and run the first few commands as examples of working without Delta
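A guess at the shape of those first commands, writing plain parquet tables before Delta enters the picture; the file path and table name are illustrative, not the actual notebook:

```python
# `spark` is the SparkSession provided in a Databricks notebook.

# Working without Delta: write the raw data as parquet and register it as a table.
trips = spark.read.csv("/mnt/raw/taxi_trips.csv", header=True, inferSchema=True)

(trips.write.format("parquet")
      .mode("overwrite")
      .saveAsTable("taxi_trips_parquet"))

# Plain parquet has no transaction log, so there is no MERGE, time travel,
# or schema enforcement here -- those are the gaps Delta fills later in the demo.
spark.sql("SELECT count(*) FROM taxi_trips_parquet").show()
```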