3. Agenda
❖ A brief introduction to Qubole
❖ Apache Airflow
❖ Operational Challenges in managing an ETL
❖ Alerts and Monitoring
❖ Quality Assurance in ETLs
3
4. About Qubole Data Service
❖ A self-service platform for big data analytics.
❖ Delivers best-in-class Apache tools such as Hadoop, Hive, and Spark, integrated into an enterprise-feature-rich platform optimized to run in the cloud.
❖ Enables users to focus on their data rather than the platform.
4
5. Data Team @ Qubole
❖ Data Warehouse for Qubole
❖ Provides Insights and Recommendations to users
❖ Just Another Qubole Account
❖ Enabling data-driven features within QDS
5
6. Multi-Tenant Nature of the Data Team
[Diagram: Qubole Distribution 1 (api.qubole.com) and Distribution 2 (azure.qubole.com), each feeding its own Data Warehouse.]
6
7. Apache Airflow For ETL
❖ Developer Friendly
❖ A rich collection of operators, CLI utilities, and a UI to author and manage your data pipelines.
❖ Horizontally scalable.
❖ Tight integration with Qubole (see the DAG sketch below).
7
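The Qubole integration exposes each ETL step as a regular Airflow task. Below is a minimal, illustrative DAG sketch; the DAG name, query, cluster label, and connection id are placeholders rather than Qubole's actual pipeline, and the import path is the Airflow 1.x-era contrib one (newer releases ship the operator in a separate provider package).

# Minimal sketch of a DAG that runs a Hive command on Qubole.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

dag = DAG(
    dag_id="example_qubole_etl",       # hypothetical DAG name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

load_daily_partition = QuboleOperator(
    task_id="load_daily_partition",
    command_type="hivecmd",            # submit a Hive command to QDS
    query="SELECT 1",                  # placeholder query
    cluster_label="default",           # which Qubole cluster runs the command
    qubole_conn_id="qubole_default",   # Airflow connection holding the API token
    dag=dag,
)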
9. Operational Challenges in the ETL World
❖ How do we achieve continuous integration and deployment for ETLs?
❖ How do we effectively manage configuration for ETLs in a multi-tenant environment?
❖ How do we make ETLs aware of data warehouse migrations?
9
12. Airflow Variables for ETL Configuration
❖ Stores information as key-value pairs in Airflow.
❖ Extensive support (CLI, UI, and API) to manage the variables.
❖ Can be used from within the Airflow script as Variable.get("variable_name") (see the sketch below).
12
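A minimal sketch of reading ETL configuration from an Airflow Variable, assuming the config is stored as JSON keyed per distribution; the variable name and keys are illustrative.

# Per-tenant ETL configuration held in a single JSON Airflow Variable.
from airflow.models import Variable

etl_config = Variable.get("warehouse_etl_config", deserialize_json=True)

# Pick the block for the distribution this DAG instance serves.
tenant_config = etl_config["api.qubole.com"]   # illustrative key
source_db = tenant_config["source_db"]
target_schema = tenant_config["target_schema"]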
13. Warehouse Management
❖ A leaf out of Ruby on Rails' book: Active Record migrations.
❖ Each migration is tagged and committed as a single commit to version control along with the ETL changes.
13
14. The PROCESS IS EASY
Fetch the current migration number from Airflow Variables → check out the target tag from version control → run any new relevant migrations → update the migration number. (A sketch of this loop follows.)
14
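A sketch of that loop, assuming migrations are numbered SQL files (e.g. 0042_add_column.sql) and the current migration number is tracked in an Airflow Variable; the variable name, file layout, and apply_migration helper are hypothetical.

import glob
import subprocess

from airflow.models import Variable


def apply_migration(path):
    # Placeholder: in practice this would submit the migration SQL to the
    # warehouse (for example through a Qubole command); here it only logs.
    print("applying", path)


def run_pending_migrations(target_tag, migrations_dir="migrations"):
    # 1. Fetch the current migration number from Airflow Variables.
    current = int(Variable.get("warehouse_migration_number", default_var=0))

    # 2. Check out the target tag from version control.
    subprocess.check_call(["git", "checkout", target_tag])

    # 3. Run any migration newer than the current number, in order.
    for path in sorted(glob.glob(migrations_dir + "/*.sql")):
        number = int(path.split("/")[-1].split("_")[0])
        if number > current:
            apply_migration(path)
            current = number

    # 4. Record the new migration number back in Airflow Variables.
    Variable.set("warehouse_migration_number", current)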
15. Deployment
❖ Traditional deployment is too messy when multiple users are handling Airflow.
❖ Data Apps are used for ETL deployment.
❖ Provides a CLI option like <ETL_NAME> deploy -r <version_tag> -d <start_date>
Check out the Airflow template file from version control → read config values from Airflow and translate them into the template → copy the final script file to the Airflow directory. (A sketch of this flow follows.)
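A sketch of the deploy step, under the assumption that each ETL is kept as a Jinja template rendered with config from Airflow Variables and dropped into the DAGs folder; the paths, variable names, and function shape are illustrative, not the actual Data Apps CLI.

import subprocess

from airflow.models import Variable
from jinja2 import Template


def deploy(etl_name, version_tag, start_date, dags_dir="/usr/local/airflow/dags"):
    # 1. Check out the Airflow template file for this release from version control.
    template_file = etl_name + ".py.j2"
    subprocess.check_call(["git", "checkout", version_tag, "--", template_file])

    # 2. Read config values from Airflow Variables and render them into the template.
    config = Variable.get(etl_name + "_config", deserialize_json=True)
    with open(template_file) as f:
        rendered = Template(f.read()).render(start_date=start_date, **config)

    # 3. Copy the final script file to the Airflow DAGs directory.
    with open(dags_dir + "/" + etl_name + ".py", "w") as f:
        f.write(rendered)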
19. IMPORTANCE OF DATA VALIDATION
❖ An application’s correctness depends on the correctness of its data.
❖ Increase confidence in data by quantifying data quality.
❖ Correcting existing data can be expensive: prevention is better than cure!
❖ Stop critical downstream tasks if the data is invalid.
19
20. TREND MONITORING
❖ Monitor dips, peaks, anomalies.
❖ Hard problem!
❖ Not real time.
❖ One size doesn’t fit all: different ETLs manipulate data in different ways.
❖ Difficult to maintain.
20
22. Using Apache Airflow Check Operators
Approach: extend the open-source Airflow check operator for queries running on the Qubole platform → run data validation queries → fail the operator if the validation fails. (A usage sketch follows.)
22
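A usage sketch of the Qubole check operator described above: the task is expected to fail (blocking downstream tasks) when the validation query comes back falsy, e.g. a zero count. The table, labels, and the dag object are placeholders carried over from the earlier DAG sketch.

from airflow.contrib.operators.qubole_check_operator import QuboleCheckOperator

row_count_check = QuboleCheckOperator(
    task_id="row_count_check",
    command_type="hivecmd",
    # An empty partition yields a falsy result, so the check fails and
    # downstream tasks are held back.
    query="SELECT COUNT(*) FROM warehouse.events WHERE dt = '{{ ds }}'",
    cluster_label="default",
    qubole_conn_id="qubole_default",
    dag=dag,
)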
25. 1. Compare Data across engines
Problem: Airflow check operators required pass_value to be defined before the ETL starts.
Use case: validating data import logic.
Solution: make pass_value an Airflow template field. This way it can be configured at run time; the pass value can be injected through multiple mechanisms once it is a template field. (See the sketch below.)
25
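A sketch of the run-time injection, assuming an upstream task (a hypothetical fetch_rds_count) pushes the source-side row count to XCom; because pass_value is a template field, it is rendered just before the check executes.

from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

warehouse_count_check = QuboleValueCheckOperator(
    task_id="warehouse_count_check",
    command_type="prestocmd",
    query="SELECT COUNT(*) FROM warehouse.accounts WHERE dt = '{{ ds }}'",
    # Rendered at run time, so the expected value comes from the upstream task
    # instead of being hard-coded when the DAG is authored.
    pass_value="{{ task_instance.xcom_pull(task_ids='fetch_rds_count') }}",
    tolerance=0.01,                    # allow a small relative drift between stores
    cluster_label="default",
    qubole_conn_id="qubole_default",
    dag=dag,
)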
27. 2. Validate multiline results
Problem: currently, Apache Airflow check operators consider only a single row for comparison.
Use case: run group queries and compare each of the values against the pass_value.
Solution: the Qubole check operator adds a `results_parser_callable` parameter. The function pointed to by `results_parser_callable` holds the logic that returns the list of records on which the checks are performed. (A sketch follows.)
27
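A sketch of such a parser for a group query (here, row counts per account id, the use case described later in the deck); the exact shape of the raw results handed to the callable depends on the command type, so the parsing below is illustrative.

from airflow.contrib.operators.qubole_check_operator import QuboleCheckOperator


def per_account_counts(results):
    # Assumes the raw command output arrives as tab-separated lines of
    # (account_id, row_count).
    records = []
    for line in results:
        _account_id, row_count = line.split("\t")
        records.append(int(row_count))   # 0 is falsy, so an empty account fails the check
    return records


per_account_check = QuboleCheckOperator(
    task_id="per_account_check",
    command_type="hivecmd",
    query=(
        "SELECT account_id, COUNT(*) "
        "FROM warehouse.table_usage WHERE dt = '{{ ds }}' "
        "GROUP BY account_id"
    ),
    results_parser_callable=per_account_counts,
    cluster_label="default",
    qubole_conn_id="qubole_default",
    dag=dag,
)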
30. ETL #1: Data Ingestion
Imports data from RDS tables into the data warehouse for analysis purposes.
Historical issues (mismatch with source data):
1. Data duplication.
2. Data missing for certain durations.
Checks employed:
- Count comparison across the two data stores: source and destination.
How the checks have helped us:
- Verify and rectify the upsert logic (which is not a plain copy of RDS).
PS: Runtime fetching of expected values!
30
31. ETL #2: Data Transformation
Repartitions a day’s worth of data into hourly partitions.
Historical issues:
1. Data ending up in a single partition (the default Hive partition).
2. Wrong ordering of values in fields.
Checks employed:
1. The number of partitions created is 24 (one for every hour).
2. Check the value of the critical field, “source”.
How the checks have helped us: verify and rectify the repartitioning logic. (An illustrative check follows.)
31
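An illustrative version of check 1 as a value check: the query counts the hourly partitions created for the day and the operator fails unless the result equals 24. Table and column names are placeholders, and a second, similar check could assert on the "source" field.

from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

hourly_partition_check = QuboleValueCheckOperator(
    task_id="hourly_partition_check",
    command_type="hivecmd",
    query="SELECT COUNT(DISTINCT hr) FROM warehouse.events_hourly WHERE dt = '{{ ds }}'",
    pass_value=24,                     # one partition per hour of the day
    cluster_label="default",
    qubole_conn_id="qubole_default",
    dag=dag,
)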
32. ETL #3: Cost Computation
Computes Qubole Compute Unit Hours (QCUH).
Situation: we are narrowing the granularity of cost computation from daily to hourly.
How have checks helped? They monitor the new data and raise an alarm in case of a mismatch between the trends of the old and new data.
32
33. ETL #4: Data Transformation
Parses customer queries and outputs table usage information.
Historical issues:
1. Data missing for a customer account.
2. Data loss due to different syntaxes across engines.
3. Data loss due to query syntax changes across different versions of data engines.
Checks employed:
1. Group by account ids; if any count is 0, raise an alert.
2. Group by engine type and account ids; if the error % is high, raise an alert.
How the checks have helped us:
- Insights into the amount of data loss.
- Provides feedback that has helped us make syntax checking more robust.
33
34. FEATURES
❖ Ability to plug in different alerting mechanisms.
❖ Dependency management and failure handling.
❖ Ability to parse the output of the assert query in a user-defined manner.
❖ Run-time fetching of the pass_value against which the comparison is made.
❖ Ability to generate a failure/success report.
34
35. LESSONS LEARNT
❖ One size doesn’t fit all: estimation of data trends is a difficult problem.
❖ Delegate the validation task to the ETL itself.
35
36. Source code has been contributed to Apache Airflow
AIRFLOW-2228: Enhancements in Check operator
AIRFLOW-2213: Adding Qubole Check Operator
36