Improving Data Literacy Around Data Architecture (DATAVERSITY)
Data Literacy is an increasing concern as organizations look to become more data-driven. As citizen data scientists and self-service data analytics become increasingly common, the need for business users to understand core Data Management fundamentals is more important than ever. At the same time, technical roles need a strong foundation in Data Architecture principles and best practices. Join this webinar to understand the key components of Data Literacy and practical ways to implement a Data Literacy program in your organization.
AWS Lambda is a serverless compute service that runs code in response to events. It allows uploading code that can be run without having to manage infrastructure. Lambda manages capacity, scaling, monitoring, logging, and security patching. Events from over 15 AWS services can trigger Lambda functions; examples include S3 bucket uploads, DynamoDB changes, and API Gateway requests. Lambda functions support Node.js, Java, Python, and C# and can be used to build automated workflows like resizing images or integrating apps. Each invocation can run for up to 300 seconds, and Lambda includes a monthly free tier of requests and compute time.
Did you think serverless meant only Lambda and Athena? Serverless technology improves agility and optimizes costs with automatic scaling, built-in high availability, and a pay-as-you-go billing model. You can now build a more efficient analytics environment using serverless options across AWS's data analytics services. Learn how serverless can be used with Amazon Redshift, EMR, and OpenSearch, along with use cases.
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
202201 AWS Black Belt Online Seminar Apache Spark Performance Tuning for AWS ... (Amazon Web Services Japan)
Latest AWS Black Belt Online Seminar content: https://aws.amazon.com/jp/aws-jp-introduction/#new
List of content from past online seminars: https://aws.amazon.com/jp/aws-jp-introduction/aws-jp-webinar-service-cut/
CloudStack is an open source cloud computing platform that allows users to manage their infrastructure as an automated system. It provides self-service access to computing resources like servers, storage, and networking via a web interface. CloudStack supports multiple hypervisors and public/private cloud deployment strategies. The core components include hosts, primary storage, clusters, pods, networks, secondary storage, and zones which are managed by CloudStack servers.
This document provides an overview of Salesforce CPQ (Configure, Price, Quote) including:
- The CPQ data model and required post-installation steps
- How to create quotes, contracts, and amendments using CPQ
- How products can be bundled with options and features in CPQ
- The process for contract renewals and generating orders from finalized quotes
- Details are given on key CPQ functionality like optional constraints, configuration attributes, and subscription vs. non-subscription products.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
On-demand replay: https://www.youtube.com/watch?v=LMBSWl9Uo-4
AWS Control Tower, scheduled to launch in the Seoul region in the first quarter of 2021, automatically sets up a customer's multi-account AWS environment based on best practices. This session covers how to use AWS Control Tower to design and implement the multi-account structure your organization needs, define and create the baseline guardrails for each account, and put a governance framework in place.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
The document discusses various AWS services for monitoring, logging, and security. It provides examples of AWS CloudTrail logs and best practices for CloudTrail such as enabling in all regions, log file validation, encryption, and integration with CloudWatch Logs. It also summarizes VPC flow logs, CloudWatch metrics and logs, and tools for automating compliance like Config rules, CloudWatch events, and Inspector.
Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. In this session we will learn how to create data integration solutions using the Data Factory service and ingest data from various data stores, transform/process the data, and publish the result data to the data stores.
The document discusses Azure Data Factory v2. It provides an agenda that includes topics like triggers, control flow, and executing SSIS packages in ADFv2. It then introduces the speaker, Stefan Kirner, who has over 15 years of experience with Microsoft BI tools. The rest of the document consists of slides on ADFv2 topics like the pipeline model, triggers, activities, integration runtimes, scaling SSIS packages, and notes from the field on using SSIS packages in ADFv2.
Watch the presentation again: https://youtu.be/eQjkwhyOOmI
Building and managing a large-scale data lake is complex and time-consuming. AWS Lake Formation is a fully managed service that lets you set up a secure data lake in a matter of days. This session shows how to easily configure analytics tools such as Amazon S3, EMR, Redshift, and Athena with AWS Lake Formation for data ingestion, classification, cleansing, transformation, and security. (Launched in the Seoul region in November 2019)
Introduction to Amazon SageMaker Model Deployment :: Daekeun Kim, AI/ML Specialist Solutions Architect, AWS :: AWS AIML Special Webinar (Amazon Web Services Korea)
This session explains how Amazon SageMaker deployment works and provides a practical guide for those getting started. Beginning with SageMaker's four built-in serving patterns (real-time inference, batch inference, asynchronous inference, and serverless inference), it introduces the core features needed for production use and ways to reduce cost.
The document discusses AWS Glue, a fully managed ETL service. It provides an overview of Glue's programming environment and data processing model. It then gives several examples of optimizing Glue job performance, including processing many small files, a few large files, optimizing parallelism with JDBC partitions, Python performance, and using the new Python shell job type.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
Amazon Kinesis Data Analytics is a serverless service for processing and analyzing streaming data in real time. With Kinesis Data Analytics you can quickly and flexibly build applications that process large-scale streams for log analytics, clickstream analytics, the Internet of Things (IoT), ad tech, gaming, and more, free from the burden of maintenance. This session explains how Kinesis Data Analytics works, its features, and operational best practices, and demonstrates how to develop streaming applications and use Studio notebooks.
Database Migration Service Through Customer Cases: A Tool for Database and Data Migration, Consolidation, Separation, and Analysis - Presenter: ... (Amazon Web Services Korea)
Database Migration Service (DMS) supports migrating a wide range of databases beyond RDBMS. Through real customer cases, we look at how DMS is used for database migration, consolidation, and separation, and at the role it plays in data ingest for analytics.
New Analytics Services for Data Analysts - Kiyoung Kim, AWS Analytics Solutions Architect / Kyuhyun Byun, Software Engineer at Danggeun Market :: AWS r... (Amazon Web Services Korea)
AWS re:Invent brought a large wave of new analytics and serverless services to meet diverse customer needs. This talk takes an in-depth look at the newly released core analytics capabilities, along with AWS's serverless and on-demand analytics features that anyone can use with ease.
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it easy for customers to prepare and load data for analytics. You can create and run ETL jobs with a few clicks in the AWS Management Console, and there is no need to manage separate data-processing servers or infrastructure when preprocessing data from various sources for big data analytics. This session introduces the Glue service, launched in the Seoul region last May, along with practical tips and a demo.
CloudWatch: Monitoring your Services with Metrics and Alarms (Felipe)
CloudWatch is AWS's monitoring and metrics service that collects data from AWS services and allows users to set alarms and view metrics. It collects both built-in metrics provided by AWS services as well as custom metrics defined by users. CloudWatch allows viewing metrics and setting alarms in the console, through APIs, and via integration with other AWS services. It provides visibility into applications and infrastructure to help with decisions around capacity planning and troubleshooting.
Azure Cosmos DB is Microsoft's globally distributed, multi-model database service that supports multiple APIs such as SQL, Cassandra, MongoDB, Gremlin and Azure Table. It allows storing entities with automatic partitioning and provides automatic online backups every 4 hours with the latest 2 backups stored. The Azure Cosmos DB change feed and Data Migration Tool allow importing and exporting data for backups. An emulator is also available for trying Cosmos DB locally without an Azure account.
Building Event Driven (Micro)services with Apache Kafka (Guido Schmutz)
What is a Microservices architecture and how does it differ from a Service-Oriented Architecture? Should you use traditional REST APIs to bind services together? Or is it better to use a richer, more loosely-coupled protocol? This talk starts with a quick recap of how we have created systems over the past 20 years and how different architectures evolved from that. It then shows how we piece services together in event-driven systems, how we use a distributed log (event hub) to create a central, persistent history of events, and what benefits we achieve from doing so.
Apache Kafka is a perfect match for building such an asynchronous, loosely-coupled event-driven backbone. Events trigger processing logic, which can be implemented in a more traditional as well as in a stream processing fashion. The talk will show the difference between a request-driven and event-driven communication and show when to use which. It highlights how the modern stream processing systems can be used to hold state both internally as well as in a database and how this state can be used to further increase independence of other services, the primary goal of a Microservices architecture.
Demystifying Data Warehousing as a Service - DFW (Kent Graziano)
This document provides an overview and introduction to Snowflake's cloud data warehousing capabilities. It begins with the speaker's background and credentials. It then discusses common data challenges organizations face today around data silos, inflexibility, and complexity. The document defines what a cloud data warehouse as a service (DWaaS) is and explains how it can help address these challenges. It provides an agenda for the topics to be covered, including features of Snowflake's cloud DWaaS and how it enables use cases like data mart consolidation and integrated data analytics. The document highlights key aspects of Snowflake's architecture and technology.
Introduction to SQL Analytics on Lakehouse Architecture (Databricks)
This document provides an introduction and overview of SQL Analytics on Lakehouse Architecture. It discusses the instructor Doug Bateman's background and experience. The course goals are outlined as describing key features of a data Lakehouse, explaining how Delta Lake enables a Lakehouse architecture, and defining features of the Databricks SQL Analytics user interface. The course agenda is then presented, covering topics on Lakehouse Architecture, Delta Lake, and a Databricks SQL Analytics demo. Background is also provided on Lakehouse architecture, how it combines the benefits of data warehouses and data lakes, and its key features.
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ... (Rustem Feyzkhanov)
Cloud native orchestrators like AWS Step Functions and Amazon SageMaker Pipelines can be used to construct scalable end-to-end deep learning pipelines in the cloud. These orchestrators provide centralized monitoring, logging, and scaling capabilities. AWS Step Functions is useful for integrating pipelines with production infrastructure, while SageMaker Pipelines is good for research workflows that require validation. Serverless architectures using services like AWS Lambda, Batch, and Fargate can build scalable and flexible pipelines at a low cost.
Running Presto and Spark on the Netflix Big Data Platform (Eva Tse)
This document summarizes Netflix's big data platform, which uses Presto and Spark on Amazon EMR and S3. Key points:
- Netflix processes over 50 billion hours of streaming per quarter from 65+ million members across over 1000 devices.
- Their data warehouse contains over 25PB stored on S3. They read 10% daily and write 10% of reads.
- They use Presto for interactive queries and Spark for both batch and iterative jobs.
- They have customized Presto and Spark for better performance on S3 and Parquet, and contributed code back to open source projects.
- Their architecture leverages dynamic EMR clusters with Presto and Spark deployed via bootstrap actions for scalability.
Azure Databricks - An Introduction 2019 Roadshow.pptx (pascalsegoul)
Proposed PowerPoint structure
1. Introduction and context
Business objective
Why Snowflake?
Why Data Vault?
2. Target architecture
Simplified diagram: RAW zone → Data Vault → Data Marts
Description of the schemas: RAW, DV, DM
3. Source data
Example: CSV file of orders (customer, product, date, amount, etc.)
File structure
4. Staging zone (RAW)
CREATE STAGE
COPY INTO → into a RAW table
Screenshot of the SQL script + result
5. Creating the HUBs
HUB_CLIENT, HUB_PRODUIT…
Business definition
SQL script with INSERT DISTINCT
6. Creating the LINKs
LINK_COMMANDE (Customer ↔ Product ↔ Date)
Structure with technical keys
SQL script + business logic
7. Creating the SATELLITEs
SAT_CLIENT_DETAILS, SAT_PRODUIT_DETAILS…
Historization with LOAD_DATE, END_DATE, HASH_DIFF
SQL script (MERGE or conditional INSERT)
8. Orchestration
Example flow via dbt or Airflow (or simply a SQL sequence)
Screenshot of a dbt YAML model or an Airflow DAG
9. Creating the business views (DM)
Aggregated view of monthly sales
Complex SELECT over HUB + LINK + SAT
Screenshot or sample result
10. Visualization
Connection to Power BI / Tableau
Screenshot of a simple chart based on a DM view
11. Conclusion and benefits
Reliability, auditability, versioning, history
Suited to production environments
Big Data, Data Engineering, and Data Lakes on AWS (javier ramirez)
Epic Games uses AWS services extensively to gain insights from player data and ensure Fortnite remains engaging for its over 125 million players. Telemetry data from clients is collected with Kinesis and analyzed in real-time using Spark on EMR. Game designers use these insights to inform decisions. Epic also uses S3 as a data lake, DynamoDB for real-time queries, and EMR for batch processing. This analytics platform on AWS allows constant feedback to optimize the player experience.
AWS Glue is a serverless data integration service that allows users to discover, prepare, and transform data for analytics and machine learning. It provides a fully managed extract, transform, and load (ETL) service on AWS. AWS Glue crawls data sources, automatically extracts metadata and stores it in a centralized data catalog. It then executes ETL jobs developed by users to clean, enrich and move data between various data stores.
This document provides an overview of migrating applications and workloads to AWS. It discusses key considerations for different migration approaches including "forklift", "embrace", and "optimize". It also covers important AWS services and best practices for architecture design, high availability, disaster recovery, security, storage, databases, auto-scaling, and cost optimization. Real-world customer examples of migration lessons and benefits are also presented.
AWS CLOUD 2017 - Introduction to Fast Data Querying and Processing with Amazon Athena and Glue (Sangpil Kim, Solutions Architect) (Amazon Web Services Korea)
The document introduces Amazon Athena and AWS Glue. It summarizes that Amazon Athena allows users to interactively query data stored in Amazon S3 using standard SQL. It also summarizes that AWS Glue is a fully managed ETL service that automates data extraction, transformation and loading processes. Glue discovers how data is organized, crawls data sources to infer schemas, automatically generates ETL code and manages execution of data workflows.
This document provides an overview of big data concepts and architectures, as well as AWS big data services. It begins with introducing big data challenges around variety, volume, and velocity of data. It then covers the Hadoop ecosystem including HDFS, MapReduce, Hive, Pig and Spark. The document also discusses data lake architectures and how AWS services like S3, Glue, Athena, EMR, Redshift, QuickSight can be used to build them. Specific services covered in more detail include Kinesis, MSK, Glue, EMR and Redshift. Real-world examples of big data usage are also presented.
AWS Certified Solutions Architect Professional Course S15-S18 (Neal Davis)
This deck contains the slides from our AWS Certified Solutions Architect Professional video course. It covers:
Section 15 Analytics Services
Section 16 Monitoring, Logging and Auditing
Section 17 Security: Defense in Depth
Section 18 Cost Management
Full course can be found here: https://digitalcloud.training/courses/aws-certified-solutions-architect-professional-video-course/
5. Data preparation is hard
Lots of data! Data grows fast (10x every 5 years) and is more diverse
Most jobs are hand-coded, brittle and error prone, and need customization
Infrastructure management: machine / instance sizing, cluster lifecycle management, scheduling and monitoring, managing metastores
6. AWS Glue has evolved
Then: a fully managed extract-transform-load (ETL) service, for developers, built by developers
Now: a serverless data preparation service for ETL developers, data engineers, data scientists, business analysts, and more
8. Building data lakes
Amazon S3 provides the data lake storage: break silos and store data in Amazon S3
Sources include Amazon RDS, other databases, on-premises data, and streaming data
AWS Glue jobs and workflows ingest, process, and refine data in stages
AWS Glue crawlers load and maintain the Data Catalog
AWS Lake Formation permissions secure the data lake
Access data lakes via a variety of cloud analytic engines
13. AWS Glue Usage and Pricing
ETL Jobs
No resources to manage
Charged based on Data Processing Units (DPUs) at $0.44 per DPU-hour; each DPU provides 4 vCPU and 16 GB of memory
Three types: Apache Spark, Python Shell, Spark Streaming
Data Catalog
Free for the first million objects stored (table, table version, partition, or database)
$1.00 per 100,000 objects stored above 1M, per month
Crawlers
Charged based on Data Processing Units (DPUs): $0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run
With AWS Glue, you only pay for the time your ETL job takes to run.
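To make the pricing model concrete, here is a minimal sketch (using the $0.44 per DPU-hour rate above; it ignores any per-job billing minimums) of estimating the cost of a single job run:

```python
def glue_job_cost(num_dpus: int, runtime_minutes: float, price_per_dpu_hour: float = 0.44) -> float:
    """Estimate the cost of one AWS Glue job run as DPU-hours x hourly rate."""
    dpu_hours = num_dpus * (runtime_minutes / 60.0)
    return dpu_hours * price_per_dpu_hour

# Example: a 10-DPU Spark job that runs for 15 minutes -> 2.5 DPU-hours
print(f"${glue_job_cost(10, 15):.2f}")  # $1.10
```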
15. Security: IAM Permissions – A refresher
IAM Users
consist of a username and a password
IAM Groups
collection of users
IAM Role
an identity used to delegate access to AWS resources
IAM Service Role
a role that a service assumes to perform actions in your
account on your behalf
IAM Policy
an entity that, when attached to an identity, defines its permissions
16. AWS Glue Permissions
Follow the least privilege access principle
Requires an IAM Role
AWS Managed Policy: AWSGlueServiceRole
Custom Policy – fine-grained access
Some related services
Amazon S3, Amazon Redshift, Amazon CloudWatch
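A minimal boto3 sketch of setting up such a role with the AWS managed policy named above; the role name is a hypothetical example, and a production setup should narrow permissions per the least-privilege note:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="MyGlueServiceRole",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed policy referenced on the slide
iam.attach_role_policy(
    RoleName="MyGlueServiceRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```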
17. AWS Glue Components
Crawlers: load and maintain the Data Catalog; infer metadata (schema, table structure); support schema evolution
AWS Glue Data Catalog: Apache Hive Metastore compatible; many integrated analytic services
Extract, transform, and load: serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generate ETL code
Workflow management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; reliable execution
18. AWS Glue is used to cleanse, prep, and catalog
AWS Glue Data Catalog
Workflows orchestrate data flows and process data in stages
Crawlers populate and maintain the catalog
Jobs execute ETL transforms
19. What are crawlers?
Automatically discover new data and extract schema definitions
Detect schema changes and maintain tables
Detect Apache Hive-style partitions on Amazon S3
Built-in classifiers for popular data types; create your own custom classifiers using Grok expressions
Run on demand, on a schedule, or as part of a workflow
22. Use exclude patterns to remove unnecessary files
To ignore all metadata files in the year=2017 folders for the location s3://mydatasets:
Include path: s3://mydatasets
Exclude pattern: year=2017/**/METADATA.txt
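A sketch of creating and starting a crawler with that exclude pattern via boto3; the crawler name, role, database, and schedule are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="mydatasets-crawler",          # hypothetical name
    Role="MyGlueServiceRole",           # role from the earlier IAM sketch
    DatabaseName="mydatasets_db",
    Targets={
        "S3Targets": [{
            "Path": "s3://mydatasets",
            # Exclude pattern from the slide: skip METADATA.txt under year=2017
            "Exclusions": ["year=2017/**/METADATA.txt"],
        }]
    },
    Schedule="cron(0 2 * * ? *)",       # optional: run nightly at 02:00 UTC
)

glue.start_crawler(Name="mydatasets-crawler")
```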
23. Improve performance with multiple crawlers
Periodically audit long running crawlers to balance workloads
Often crawlers are processing multiple datasets / tables
Improve performance by using multiple crawlers
Crawler granularity is table or dataset
24. What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script or have one automatically generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, and monitored
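A minimal PySpark ETL script of the kind a job runs; the catalog database, table, mapping, and output path are illustrative assumptions rather than part of the original deck:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog (names assumed)
source = glue_context.create_dynamic_frame.from_catalog(
    database="mydatasets_db", table_name="sales"
)

# Apply a built-in transformation, then write the result back to S3 as Parquet
mapped = source.apply_mapping([
    ("id", "string", "id", "string"),
    ("amount", "double", "amount", "double"),
])
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://mydatasets/curated/sales/"},
    format="parquet",
)

job.commit()
```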
25. Under the hood: Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support for complex analytics
• AWS Glue builds on the Apache Spark runtime to offer ETL-specific functionality
Layering: Spark SQL pairs with AWS Glue ETL, Spark DataFrames pair with AWS Glue DynamicFrames, and both sit on Spark Core (RDDs)
26. Apache Spark – What is it?
Spark sits in the processing framework layer of the big data stack, alongside MapReduce and Tez, on top of a cluster resource management layer (YARN, Mesos) and a distributed storage layer (HDFS, Cassandra, NoSQL stores).
27. Let’s try that again...
Think of a beehive as your distributed storage
A beehive needs to have a queen
The queen serves as your Spark driver
The worker bees serve as your worker nodes
28. Putting it together...
The Spark driver (the queen) runs the main method, generates the Spark context, and has access to the resource manager
The resource manager allocates executors (the worker bees), each with its own cache, which carry out the work in parallel
29. DataFrames and DynamicFrames
DataFrames
Core data structure for Spark SQL
Like structured tables
Need a schema up front
Each row has the same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames, but for ETL
Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs
30. Dynamic Frame internals
Schema is per-record; no upfront schema needed
Easy to restructure, tag, and modify
Can be more compact than DataFrame rows
Many flows can be done in a single pass
Example dynamic records, each carrying its own schema (id, type, …):
{“id”:”2489”, “type”: ”CreateEvent”, ”payload”: {“creator”:…}, …}
{“id”:4391, “type”: “PullEvent”, ”payload”: {“assets”:…}, …}
{“id”:”6510”, “type”: “PushEvent”, ”payload”: {“pusher”:…}, …}
The Dynamic Frame schema is the union of the per-record schemas.
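The sample records above carry id as both a string and a number; a short sketch of handling that with a DynamicFrame (the S3 path is an assumption, and glue_context comes from a job skeleton like the one shown earlier):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the event records from an assumed S3 location
events = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://mydatasets/githubarchive/"]},
    format="json",
)

# "id" arrives as both string and int; resolve the choice to a single type
events = events.resolveChoice(specs=[("id", "cast:string")])

# Keep only the fields of interest and switch to a DataFrame for SQL-style work
trimmed = events.select_fields(["id", "type"])
trimmed.toDF().show(5)
```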
31. AWS Glue execution model: jobs and stages
A script is decomposed into jobs and stages. In this example, Job 1 consists of Stage 1 (Read → Filter → Apply Mapping) and Stage 2 (Repartition → Write), while Job 2 is a single stage (Read → Filter → Apply Mapping → Show).
32. AWS Glue execution model: jobs and stages
Actions (here, Write and Show) are what trigger execution.
33. AWS Glue execution model: jobs and stages
Each action ends a job; everything that action depends on is pulled into that job.
34. AWS Glue execution model: data partitions
• Apache Spark and AWS Glue are data parallel.
• Data is divided into partitions that are processed concurrently.
• 1 stage x 1 partition = 1 task
• The driver hands tasks to the executors; overall throughput is limited by the number of partitions.
35. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
36. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
(The job/stage diagram from the execution-model slides is repeated here to illustrate the first point: Job 1 with its Read → Filter → Apply Mapping and Repartition → Write stages, and Job 2 ending in Show.)
37. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
38. File formats
• Text – xSV, JSON
  • May or may not be compressed
  • Human readable when uncompressed
  • Not optimized for analytics
• Columnar – Parquet & ORC
  • Compressed in a binary format
  • Integrated indexes and stats
  • Optimized read performance when selecting only a subset of columns
• Row – Avro
  • Compressed in a binary format
  • Optimized read performance when selecting all columns of a subset of rows
39. Partitioning guidance
• Choose columns that have low cardinality (uniqueness)
• Partitioning on day/month/year has 365 unique values per year
• Partitioning on seconds has millions of values per year
• You can partition on any column, not just date
• For example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx
• Look at your query patterns – what data do you want to query, and what do you want to filter out?
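A sketch of writing output partitioned along the lines of the example path above; the bucket and partition columns are illustrative, and glue_context and the frame are assumed to come from a job like the earlier skeleton:

```python
# Write the frame to S3 as Parquet, partitioned by low-cardinality columns
# (bucket and column names mirror the example path and are illustrative)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://abc-corp-sales-data/",
        "partitionKeys": ["country", "state", "bu"],
    },
    format="parquet",
)
```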
40. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
41. Worker Types
Standard
Provide the maximum capacity in DPUs (max. 100)
Each DPU provides 4 vCPUs of compute capacity and 16 GB of memory, with 50 GB of disk and 2 executors
G.1X
Provide the number of workers (max. 299)
A worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk) and 1 executor per worker
Recommended for memory-intensive jobs
G.2X
Provide the number of workers (max. 149)
A worker maps to 2 DPUs (8 vCPU, 32 GB of memory, 128 GB disk) and 1 executor per worker
Recommended for memory-intensive jobs that run ML transforms
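A boto3 sketch of creating a job with an explicit worker type and worker count; the job name, script location, and sizing are assumptions for illustration:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="sales-etl",                                   # hypothetical name
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",                              # Apache Spark job type
        "ScriptLocation": "s3://mydatasets/scripts/sales_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
    WorkerType="G.1X",       # per the slide: pick G.1X/G.2X for memory-heavy jobs
    NumberOfWorkers=10,
)

run = glue.start_job_run(JobName="sales-etl")
print(run["JobRunId"])
```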
42. Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
• Use G.1X and G.2X instances when your jobs need lots of memory
• Executor memory issues happen most often during sort and shuffle operations
• The driver most often runs out of memory when processing a very large number of input partitions
43. What is an AWS Glue trigger?
Triggers are the “glue” in your AWS Glue ETL pipeline
Triggers
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
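A boto3 sketch of the trigger types described above: one scheduled trigger and one conditional trigger that chains a second job after the first succeeds. The names, schedule, and argument are assumptions:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger that starts the job nightly and passes a custom argument
glue.create_trigger(
    Name="nightly-sales-etl",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{
        "JobName": "sales-etl",
        "Arguments": {"--stage": "nightly"},   # unique parameters per run
    }],
    StartOnCreation=True,
)

# Conditional trigger that chains a second job once the first succeeds
glue.create_trigger(
    Name="after-sales-etl",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "sales-etl",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "sales-aggregation"}],
    StartOnCreation=True,
)
```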
44. Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
47. Example ETL flow
Goal: prepare and analyze POS data
Create and run a job that will
• Consume data in S3
• Join the data
• Select only the required columns
• Fill null values
• Write the results to a data lake on Amazon Simple Storage Service (Amazon S3)
Then monitor the running job and analyze the resulting dataset
(Pipeline steps: Join Data → Select Columns → Fill Null Values)
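A PySpark sketch of that flow using built-in Glue transforms; the catalog tables, join keys, column names, and output path are assumptions, and glue_context comes from a job skeleton like the earlier one:

```python
from awsglue.transforms import Join, SelectFields
from awsglue.dynamicframe import DynamicFrame

# Consume two POS datasets registered in the Data Catalog (names assumed)
sales = glue_context.create_dynamic_frame.from_catalog(
    database="pos_db", table_name="sales")
stores = glue_context.create_dynamic_frame.from_catalog(
    database="pos_db", table_name="stores")

# Join the data, then keep only the required columns
joined = Join.apply(sales, stores, "store_id", "store_id")
selected = SelectFields.apply(joined, paths=["store_id", "store_name", "amount", "sale_date"])

# Fill null values via a Spark DataFrame, then convert back to a DynamicFrame
filled_df = selected.toDF().fillna({"amount": 0.0})
result = DynamicFrame.fromDF(filled_df, glue_context, "result")

# Write the results to the data lake on Amazon S3 (path assumed)
glue_context.write_dynamic_frame.from_options(
    frame=result,
    connection_type="s3",
    connection_options={"path": "s3://mydatasets/curated/pos/"},
    format="parquet",
)
```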
48. What are workflows and how do they work?
DAGs with triggers, jobs, and crawlers
Graphical canvas for authoring workflows
Run / rerun and monitor workflow executions
Share parameters across entities in the workflow
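A boto3 sketch of assembling a small workflow DAG from a crawler, a conditional trigger, and a job; all names are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="pos-daily", Description="Prepare and analyze POS data")

# On-demand start trigger: run the crawler first
glue.create_trigger(
    Name="pos-daily-start",
    WorkflowName="pos-daily",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "mydatasets-crawler"}],
)

# Conditional trigger: run the ETL job once the crawler succeeds
glue.create_trigger(
    Name="pos-daily-etl",
    WorkflowName="pos-daily",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "mydatasets-crawler",
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "sales-etl"}],
    StartOnCreation=True,
)

run = glue.start_workflow_run(Name="pos-daily")
print(run["RunId"])
```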
52. Incremental data processing with job bookmarks
Track previously processed data
Enable | disable | pause bookmarks on sources
Roll back to a previous state if necessary
53. Incremental data processing with job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run (run 1, run 2, run 3, ...).
Example uses:
Process POS data files daily
Process log files hourly
Track timestamps or primary keys in DBs
Track generated foreign keys for normalization
54. Job bookmark options
Option | Behavior
Enable | Pick up from where you left off
Disable | Ignore and process the entire dataset every time
Pause | Temporarily disable advancing the bookmark
Examples:
Enable: Process the newest githubarchive partition
Disable: Process the entire githubarchive table
Pause: Process the previous githubarchive partition
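A sketch of how these options are passed in practice: the bookmark option is supplied as a job argument, and sources in the script carry a transformation_ctx so Glue can checkpoint them. Job and table names are assumptions:

```python
import boto3

glue = boto3.client("glue")

# Enable bookmarks for this run (other values: job-bookmark-disable, job-bookmark-pause)
glue.start_job_run(
    JobName="sales-etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# Inside the job script, give each source a transformation_ctx so Glue can
# track what it has already read, and commit the job to advance the bookmark:
# source = glue_context.create_dynamic_frame.from_catalog(
#     database="pos_db", table_name="sales", transformation_ctx="sales_source")
# ...
# job.commit()
```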
55. Job bookmark example
An input table partitioned by year/month/day/hour (e.g., year=2017/month=11/day=27..28/hour=...) is processed by a periodically run job. With bookmarks enabled, run 1 and run 2 each pick up only the partitions added since the previous run, writing to the output table while avoiding reprocessing previous input and avoiding generating duplicate output.
59. Key Concepts
Virtual Private Cloud (VPC)
allows you to specify an IP address range for the VPC, add subnets, associate security groups, and configure route tables
Subnet
a range of IP addresses in your VPC; a public subnet has internet access, a private subnet does not
VPN connection
links a Virtual Private Gateway (VGW) on the Amazon side to a Customer Gateway (CGW), a physical device on your corporate network
Security Groups
control inbound and outbound traffic for your instances
68. Recent AWS Glue innovations: 50+ new features and regions, including merge/transition/purge, SageMaker notebooks, AWS Glue streaming, vertical scaling, PartitionIndex, pause and resume workflows, Spark UI, crawler performance, custom JDBC certificates, AWS Glue VPC sharing, AWS Glue 2.0, C-based libraries, MongoDB and Amazon DocumentDB support, self-managed Kafka support, AWS Glue Studio, Spark 2.4.3, AVRO support, continuous logging, resource tags, Python shell jobs, AWS Glue workflows, Python 3.7 on Spark, wheel dependencies, job bookmarks, FindMatches ML transforms, AWS Glue ETL binaries, and new regions (Bahrain, Sao Paulo, Milan, Hong Kong, Stockholm, GovCloud, China Regions).
69. AWS Glue 2.0: New engine for real-time workloads
Fast and predictable: new job execution engine with a new scheduler, 10x faster job start times, predictable job latencies
Diverse workloads: enables micro-batching and latency-sensitive workloads
Cost effective: 1-minute minimum billing, 45% cost savings on average
70. AWS Glue Studio: New visual ETL interface
Makes it easy to author, run, and monitor AWS Glue ETL jobs
Author AWS Glue jobs visually without coding
Monitor 1000s of jobs through a single pane of glass
Distributed processing without the learning curve
Advanced transforms through code snippets
73. AWS Glue DataBrew
Visual data preparation for analytics and machine learning
Generally available!
74. Amazon Managed Workflows for Apache Airflow
Highly available, secure, and managed workflow orchestration for Apache Airflow
Preview
75. AWS Lake Formation
Build a secure data lake in days
Build data lakes quickly: move, store, catalog, and clean your data faster; transform to open formats like Parquet and ORC; ML-based deduplication and record matching
Simplify security management: centrally define security, governance, and auditing policies; enforce policies consistently across multiple services; integrates with IAM and KMS
Provide self-service access to data: build a data catalog that describes your data; enable analysts and data scientists to easily find relevant data; analyze with multiple analytics services without moving data
76. AWS API
Boto3 for Python
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html
Examples (sketched below):
Upload files to S3
Download files from S3
Run a Glue Job
Run a Workflow
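A boto3 sketch covering the examples listed above; the bucket, keys, job, and workflow names are illustrative assumptions:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Upload and download files to/from S3
s3.upload_file("sales.csv", "mydatasets", "raw/sales.csv")
s3.download_file("mydatasets", "curated/pos/part-00000.parquet", "part-00000.parquet")

# Run a Glue job and a workflow by name
job_run = glue.start_job_run(JobName="sales-etl")
workflow_run = glue.start_workflow_run(Name="pos-daily")
print(job_run["JobRunId"], workflow_run["RunId"])
```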