A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Working alongside the Hive Metastore, these table formats aim to solve problems that have stood in the traditional data lake for a long time, with declared features such as ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
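To make these features concrete, here is a minimal PySpark sketch of an upsert (MERGE) and a time-travel read, using Delta Lake as one example of such a table format. The table path, columns, and session configuration are illustrative assumptions, not details from the deck.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes the delta-spark package is installed; the path and schema are made up.
spark = (SparkSession.builder
         .appName("table-format-features")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.createDataFrame([(1, "changed"), (3, "new")], ["id", "value"])

# Upsert: merge incoming rows into an existing Delta table by key.
target = DeltaTable.forPath(spark, "/tmp/events_delta")
(target.alias("t")
 .merge(updates.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events_delta")
v0.show()
```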
The evolution from monoliths to cloud-native microservices has driven a shift from monitoring to observability. OpenTelemetry, a merger of the OpenTracing and OpenCensus projects, is enabling Observability 2.0. This talk gives an overview of the OpenTelemetry project and then outlines some production-proven architectures for improving the observability of your applications and systems.
Running Airflow Workflows as ETL Processes on Hadoop (clairvoyantllc)
While working with Hadoop, you'll eventually encounter the need to schedule and run workflows to perform various operations like ingesting data or performing ETL. There are a number of tools available to assist you with this type of requirement, and one such tool that we at Clairvoyant have been looking to use is Apache Airflow. Apache Airflow is an Apache Incubator project that allows you to programmatically create workflows through a Python script. This provides a flexible and effective way to design your workflows with little code and setup. In this talk, we will discuss Apache Airflow and how we at Clairvoyant have utilized it for ETL pipelines on Hadoop.
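For illustration, a minimal Airflow DAG with two dependent ETL-style tasks might look like the sketch below (assuming Airflow 2.x); the DAG id, schedule, paths, and commands are hypothetical placeholders rather than anything from the talk.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical ingest-then-transform workflow scheduled once per day.
with DAG(
    dag_id="hadoop_etl_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="hdfs dfs -put -f /data/incoming/events.csv /raw/events/",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit /jobs/transform_events.py",
    )
    ingest >> transform  # transform runs only after ingest succeeds
```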
Choosing an HDFS data storage format - Avro vs. Parquet and more - StampedeCon... (StampedeCon)
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and plain text files for certain workloads. However, it isn't the case that one is always better than the others.
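As a rough sketch of how such comparisons can be set up, the snippet below writes the same DataFrame in plain text (CSV), Parquet, and Avro; the input path and column names are assumptions, and the Avro write assumes the spark-avro package is on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()
df = spark.read.option("header", True).csv("/data/events.csv")  # hypothetical input

# Same data, three storage formats. Columnar Parquet usually wins for
# analytical scans; row-oriented Avro favors write-heavy, whole-record access.
df.write.mode("overwrite").csv("/warehouse/events_text")
df.write.mode("overwrite").parquet("/warehouse/events_parquet")
df.write.mode("overwrite").format("avro").save("/warehouse/events_avro")

# A column-pruned scan: Parquet only reads the referenced columns from disk.
spark.read.parquet("/warehouse/events_parquet") \
    .groupBy("event_type").count().show()
```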
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
This document discusses building the data lakehouse, a new data architecture that combines the best of data warehouses and data lakes. It introduces the concept of the data lakehouse and compares it to traditional data warehouses and data lakes. The data lakehouse stores all types of data in an open, cloud-based environment and uses machine learning and analytics to deliver insights. It provides a single platform for various users and workloads, from data engineering to data science.
Realtime Indexing for Fast Queries on Massive Semi-Structured Data (ScyllaDB)
Rockset is a realtime indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset is automatically indexed, and a fully featured SQL engine powers fast queries over semi-structured data without requiring any database tuning. Rockset exploits the hardware fluidity available in the cloud and automatically grows and shrinks the cluster footprint based on demand. Available as a serverless cloud service, Rockset is used by developers to build data-driven applications and microservices.
In this talk, we discuss some of the key design aspects of Rockset, such as Smart Schema and Converged Index. We describe Rockset's Aggregator Leaf Tailer (ALT) architecture that provides low latency queries on large datasets. Then we describe how you can combine lightweight transactions in ScyllaDB with realtime analytics on Rockset to power a user-facing application.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 (Databricks)
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
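A minimal end-to-end Structured Streaming sketch, assuming only a local Spark installation, using the built-in rate source and a console sink:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for experimentation.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A continuous aggregation: counts per 30-second event-time window.
counts = events.groupBy(window("timestamp", "30 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")   # re-emit the full updated result table
         .format("console")
         .start())
query.awaitTermination()
```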
Build Real-Time Applications with Databricks Streaming (Databricks)
This document discusses using Databricks, Spark, and Power BI for real-time data streaming. It describes a use case of a fire department needing real-time reporting of equipment locations, personnel statuses, and active incidents. The solution involves ingesting event data using Azure Event Hubs, processing the stream using Databricks and Spark Structured Streaming, storing the results in Delta Lake, and visualizing the data in Power BI dashboards. It then demonstrates the architecture by walking through creating Delta tables, streaming from Event Hubs to Delta Lake, and running a sample event simulator.
NiFi Best Practices for the Enterprise (Gregory Keys)
The document discusses best practices for implementing Apache NiFi in an enterprise. It recommends establishing a Center of Excellence (COE) to align stakeholders, provide guidance, and develop standards and processes for NiFi deployment. The COE should work with business leaders to understand data flow needs and ensure NiFi is delivering business value. When scaling NiFi across a large enterprise, it may make sense to have multiple semi-autonomous NiFi clusters for different business groups rather than one large cluster. Reusable templates, components, and patterns can help with development efficiencies.
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single-stage sort merge join), the ability to short-circuit in a FILTER operation if the file is pre-sorted over the column in a filter predicate, and quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
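A sketch of writing and joining bucketed tables in Spark, under the assumption of hypothetical orders and users tables bucketed on user_id:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketing-sketch")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.read.parquet("/warehouse/orders_raw")  # hypothetical source

# Pay the shuffle/sort cost once at write time; the bucketed, sorted layout
# lets later equi-joins on user_id skip the shuffle (single-stage sort-merge join).
(orders.write
    .bucketBy(64, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("orders_bucketed"))

users = spark.table("users_bucketed")  # assumed bucketed the same way on user_id
joined = spark.table("orders_bucketed").join(users, "user_id")
joined.explain()  # the plan should show no Exchange before the sort-merge join
```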
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will discuss how to format your data and which options to use to maximize your read performance, including when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
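As one concrete instance of these options, the sketch below writes ORC with a bloom filter on a column and then reads back with a pushed-down predicate; the paths and column names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-options-sketch").getOrCreate()
df = spark.read.parquet("/warehouse/events_parquet")  # hypothetical source

# Write ORC with a bloom filter on a high-cardinality column so point lookups
# can skip stripes that cannot contain the value.
(df.write
   .option("orc.bloom.filter.columns", "user_id")
   .mode("overwrite")
   .orc("/warehouse/events_orc"))

# The filter below is pushed down and checked against ORC min/max statistics
# and the bloom filter before row data is decoded.
spark.read.orc("/warehouse/events_orc").where("user_id = 'u-123'").show()
```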
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
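As a rough sketch of what running Delta on top of an existing data lake looks like from Spark, the snippet below swaps a plain Parquet sink for a Delta one and converts an existing Parquet directory in place; the paths are placeholders and the session is assumed to be Delta-enabled.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package and Delta SQL extensions are configured.
spark = SparkSession.builder.appName("to-delta-sketch").getOrCreate()

df = spark.read.json("/raw/events")  # hypothetical source

# Before: df.write.mode("append").parquet("/lake/events")  -- no transactional guarantees.
# After: the same write through Delta; readers never see partial files and
# concurrent writers are serialized by the transaction log.
df.write.format("delta").mode("append").save("/lake/events_delta")

# An existing Parquet directory can also be converted in place.
spark.sql("CONVERT TO DELTA parquet.`/lake/events`")
```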
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service (Databricks)
Zeus is an efficient, highly scalable and distributed shuffle as a service which powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burned-out disks) and reliability and scalability challenges.
Substrait is a cross-language serialization format for relational algebra plans that aims to allow data computation engines and analysis tools to work together across programming languages. It defines common types, functions, and relational operators that multiple projects can reuse. Substrait plans are self-contained and serializable, with the goal of enabling innovation in frontend and backend systems independently. Several companies are actively working on integrations and the project is governed openly on GitHub.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard (Paris Data Engineers !)
Delta Lake is an open source framework living on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto table format for data lakes.
We’ll see all the good Delta Lake can do for your data, with ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... (Flink Forward)
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We will cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail, along with our contributions to the open-source community.
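A minimal sketch of the write side of such a pipeline, assuming the change records (for example, flattened Debezium events) are already available as a DataFrame and the hudi-spark bundle is on the classpath; the table name, key fields, and paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cdc-sketch").getOrCreate()
changes = spark.read.json("/cdc/orders_changes")  # hypothetical CDC extract

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the change records into the Hudi table; later changes for the same
# key win based on the precombine field.
(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/lake/orders_hudi"))
```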
The columnar roadmap: Apache Parquet and Apache Arrow (DataWorks Summit)
The Hadoop ecosystem has standardized on columnar formats: Apache Parquet for on-disk storage and Apache Arrow for in-memory processing. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
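To illustrate projection and predicate pushdown crossing the Parquet/Arrow boundary, here is a small PyArrow sketch; the dataset path and column names are assumptions.

```python
import pyarrow.parquet as pq

# Only the listed columns are read, and row groups whose statistics rule out
# the filter are skipped entirely.
table = pq.read_table(
    "/warehouse/events_parquet",
    columns=["user_id", "event_type", "ts"],
    filters=[("event_type", "=", "purchase")],
)

# The result is an Arrow table that can cross into pandas (or another engine)
# without a costly serialization step.
df = table.to_pandas()
print(df.head())
```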
In the session, we discussed the end-to-end working of Apache Airflow, focusing mainly on the "why, what, and how" factors. It covers DAG creation and implementation, architecture, and pros and cons. It also covers how a DAG is created to schedule a job, the steps required to create the DAG using a Python script, and finally a working demo.
Flink vs. Spark: This is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink and Apache Spark with a focus on real-time stream processing. Your feedback and comments are much appreciated.
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the practice of stringing multiple functions together to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
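The abstract does not include code, but a toy sketch of mapping fit/transform-style stages onto Ray tasks (not the presenter's actual abstraction) might look like this:

```python
import ray

ray.init()

# Stand-ins for real Scikit-Learn or Spark pipeline stages.
@ray.remote
def fit_scaler(partition):
    return sum(partition) / len(partition)  # "fit": compute a per-partition mean

@ray.remote
def transform(partition, mean):
    return [x - mean for x in partition]    # "transform": center the values

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
means = [fit_scaler.remote(p) for p in partitions]
shifted = [transform.remote(p, m) for p, m in zip(partitions, means)]

print(ray.get(shifted))  # partitions are fitted and transformed in parallel
```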
Introduction to Apache Flink - Fast and reliable big data processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine which is currently undergoing the incubation process at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Near Real-Time Data Warehousing with Apache Spark and Delta Lake (Databricks)
Timely data in a data warehouse is a challenge many of us face, and there is often no straightforward solution.
Using a combination of batch and streaming data pipelines you can leverage the Delta Lake format to provide an enterprise data warehouse at a near real-time frequency. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupling this with structured streaming, you can achieve a low latency data warehouse. In this talk, we’ll talk about how to use Delta Lake to improve the latency of ingestion and storage of your data warehouse tables. We’ll also talk about how you can use spark streaming to build the aggregations and tables that drive your data warehouse.
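A sketch of the pattern described above: a streaming aggregation written continuously into a Delta table. The Kafka broker, topic, and paths are placeholders; the session is assumed to be Delta-enabled and to have the Kafka connector available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("nrt-warehouse-sketch").getOrCreate()

orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS raw", "timestamp"))

# Hourly order counts, maintained continuously as events arrive.
hourly = orders.groupBy(window(col("timestamp"), "1 hour")).count()

(hourly.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/chk/hourly_orders")
    .start("/warehouse/hourly_orders"))
```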
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with Apache Spark's scalable data processing, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of analytics pipelines running on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
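For a flavor of what running Spark natively on Kubernetes looks like from the application side, here is a hedged configuration sketch; the API server URL, namespace, container image, and service account are placeholders, not values from the talk.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-k8s-sketch")
         .master("k8s://https://kubernetes.example.com:6443")
         .config("spark.kubernetes.namespace", "data-pipelines")
         .config("spark.kubernetes.container.image", "example.io/spark-py:3.5.0")
         .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
         .config("spark.executor.instances", "4")
         .getOrCreate())

# A trivial job just to exercise the executors scheduled as Kubernetes pods.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
```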
Building Reliable Lakehouses with Apache Flink and Delta Lake (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object store. The Flink/Delta Connector enables you to store data in Delta tables so that you harness Delta's reliability, ACID transactions, and scalability while maintaining Flink's end-to-end exactly-once processing. This ensures that data from Flink is written to Delta tables in an idempotent manner, so that even if the Flink pipeline is restarted from its checkpoint information, the pipeline guarantees no data is lost or duplicated, preserving Flink's exactly-once semantics.
by
Scott Sandre & Denny Lee
ApacheCon 2021 - Apache NiFi Deep Dive 300 (Timothy Spann)
21-September-2021 - ApacheCon - Tuesday 17:10 UTC - Apache NiFi Deep Dive 300
* https://github.com/tspannhw/EverythingApacheNiFi
* https://github.com/tspannhw/FLiP-ApacheCon2021
* https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
* https://github.com/tspannhw/FLiP-IoT
* https://github.com/tspannhw/FLiP-Energy
* https://github.com/tspannhw/FLiP-SOLR
* https://github.com/tspannhw/FLiP-EdgeAI
* https://github.com/tspannhw/FLiP-CloudQueries
* https://github.com/tspannhw/FLiP-Jetson
* https://www.linkedin.com/pulse/2021-schedule-tim-spann/
Tuesday 17:10 UTC
Apache NiFi Deep Dive 300
Timothy Spann
For Data Engineers who already have flows in production, I will dive deep into best practices, advanced use cases, performance optimizations, tips, tricks, edge cases, and interesting examples. This is a master class for those looking to quickly learn the things I have picked up after years in the field with Apache NiFi in production.
This will be interactive and I encourage questions and discussions.
You will take away examples and tips in slides, github, and articles.
This talk will cover:
Load Balancing
Parameters and Parameter Contexts
Stateless vs Stateful NiFi
Reporting Tasks
NiFi CLI
NiFi REST Interface
DevOps
Advanced Record Processing
Schemas
RetryFlowFile
Lookup Services
RecordPath
Expression Language
Advanced Error Handling Techniques
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
AIDevWorld 23 Apache NiFi 101 Introduction and Best Practices
https://sched.co/1RoAO
Timothy Spann, Cloudera, Principal Developer Advocate
In this talk, we will walk step by step through Apache NiFi from the first load to first application. I will include slides, articles and examples to take away as a Quick Start to utilizing Apache NiFi in your real-time dataflows. I will help you get up and running locally on your laptop, Docker or in CDP Public Cloud.
Wednesday November 1, 2023 12:00pm - 12:25pm PDT
VIRTUAL AI DevWorld -- Main Stage https://app.hopin.com/events/api-world-2023-ai-devworld/stages
Retail & E-Commerce AI (Industry AI Conference)
Session Type OPEN TALK
Track or Conference Retail & E-Commerce AI (Industry AI Conference), Industry AI Conference, VIRTUAL, Tensorflow & PyTorch & Open Source Frameworks (AI/ML Engineering Conference), AI/ML Engineering Conference, AI DevWorld
In-Person/Virtual Virtual, Virtual Exclusive
apache nifi
Timothy Spann
Cloudera
Principal Developer Advocate for Data in Motion
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
cloudera dataflow
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp (Timothy Spann)
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp
https://pulsar-summit.org/event/europe-2023/schedule
https://pulsar-summit.org/event/europe-2023/sessions/europe-2023-using-apache-nifi-with-apache-pulsar-for-fast-data-on-ramp
12:30 PM - 1:00 PM CEST, May 23
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp
As the Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed.
Timothy Spann
Principal Developer Advocate for Data in Motion @ Cloudera
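As a small illustration of the on-ramp idea, once NiFi has prepared an event it could be published to Pulsar with the Python client roughly like this; the broker URL, topic, and payload are placeholders.

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/ingest")

# In the talk's flow this payload would come out of a NiFi processor.
producer.send(b'{"source": "nifi", "event": "document-parsed"}')

client.close()
```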
Cask Webinar
Date: 08/10/2016
Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0
In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit.
Some of the highlights include:
- Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode - Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines
- Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics - Ability to report application usage of data sets.
- And much more!
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka (Edureka!)
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations (Apache Apex)
Presenter:
Chaitanya Chebolu, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover the use-case of ingesting data from Kafka and writing to HDFS with a couple of processing operators - Parser, Dedup, Transform.
Presentation on Presto (http://prestodb.io) basics, design and Teradata's open source involvement. Presented on Sept 24th 2015 by Wojciech Biela and Łukasz Osipiuk at the #20 Warsaw Hadoop User Group meetup http://www.meetup.com/warsaw-hug/events/224872317
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an... (Frank Munz)
The four most important open source big data technologies in the Oracle Big Data Cloud Service.
https://www.youtube.com/watch?v=OX9el8qXvQo
Real time cloud native open source streaming of any data to Apache Solr (Timothy Spann)
Real time cloud native open source streaming of any data to Apache Solr
Utilizing Apache Pulsar and Apache NiFi we can parse any document in real-time at scale. We receive a lot of documents via cloud storage, email, social channels and internal document stores. We want to make all the content and metadata available to Apache Solr for categorization, full text search, optimization and combination with other datastores. We will not only stream documents, but also all REST feeds, logs and IoT data. Once data is produced to Pulsar topics, it can instantly be ingested to Solr through the Pulsar Solr Sink.
Utilizing a number of open source tools, we have created a real-time, scalable, any-document parsing data flow. We use Apache Tika for document processing with real-time language detection, natural language processing with Apache OpenNLP, and sentiment analysis with Stanford CoreNLP, spaCy and TextBlob. We will walk everyone through creating an open source flow of documents utilizing Apache NiFi as our integration engine. We can convert PDF, Excel and Word to HTML and/or text. We can also extract the text to apply sentiment analysis and NLP categorization to generate additional metadata about our documents. We will also extract and parse images; if they contain text, we can extract it with TensorFlow and Tesseract.
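A minimal sketch of the Tika parsing step in Python (using the tika package, which delegates to a Tika server); the file path is a placeholder and the downstream NLP and Solr stages are omitted.

```python
from tika import parser  # requires the "tika" package and a Java runtime

# Extract text and metadata from any supported document format.
parsed = parser.from_file("/data/incoming/report.pdf")

text = parsed.get("content") or ""
metadata = parsed.get("metadata", {})

print(metadata.get("Content-Type"))
print(text[:500])  # this text could then be enriched with NLP and sent to Solr
```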
ApacheCon 2021: Apache NiFi 101 - introduction and best practices (Timothy Spann)
ApacheCon 2021: Apache NiFi 101- introduction and best practices
Thursday 14:10 UTC
Apache NiFi 101: Introduction and Best Practices
Timothy Spann
In this talk, we will walk step by step through Apache NiFi from the first load to first application. I will include slides, articles and examples to take away as a Quick Start to utilizing Apache NiFi in your real-time dataflows. I will help you get up and running locally on your laptop, Docker or in CDP Public Cloud.
DZone Zone Leader and Big Data MVB
@PaasDev
https://github.com/tspannhw https://www.datainmotion.dev/
https://github.com/tspannhw/SpeakerProfile
https://dev.to/tspannhw
https://sessionize.com/tspann/
https://www.slideshare.net/bunkertor
This document outlines the agenda and content for a presentation on xPatterns, a tool that provides APIs and tools for ingesting, transforming, querying and exporting large datasets on Apache Spark, Shark, Tachyon and Mesos. The presentation demonstrates how xPatterns has evolved its infrastructure to leverage these big data technologies for improved performance, including distributed data ingestion, transformation APIs, an interactive Shark query server, and exporting data to NoSQL databases. It also provides examples of how xPatterns has been used to build applications on large healthcare datasets.
This document summarizes a presentation about Apache Phoenix, an open-source project that allows HBase to be queried with SQL. It discusses what Phoenix is, why tracing is important, and the features of a new tracing web app created for Phoenix, including listing traces, visualizing trace distributions and individual trace details. Programming challenges in creating the app and new issues filed are also summarized.
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ... (Yahoo Developer Network)
Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI SQL. Like lambda architectures, it employs separate compute engines for different workloads - some call this an HTAP database (Hybrid Transactional/Analytical Processing). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store for fast short reads and writes and short range scans (Apache HBase), and an in-memory, cluster data flow engine for analytics (Apache Spark). It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with distributed multi-version concurrency control that provides the fine-grained concurrency required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, and cost-based optimizer, and present the detailed execution of operational queries on HBase and of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodion. We will end with some roadmap items under development involving new row-based and column-based storage encodings.
Speakers:
Monte Zweben, is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, he was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin (Alex Zeltov)
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark, the Spark components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data and then issue some SQL queries. We will be using Scala and/or PySpark for labs.
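A condensed sketch of that lab flow in PySpark, with made-up paths and column names:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-zeppelin-lab")
         .enableHiveSupport()
         .getOrCreate())

# Land raw data in HDFS-backed storage and register it as a Hive table.
raw = spark.read.option("header", True).csv("hdfs:///data/trips.csv")
raw.write.mode("overwrite").saveAsTable("trips")

# Explore and transform with SQL.
spark.sql("""
    SELECT pickup_borough, COUNT(*) AS trip_count
    FROM trips
    GROUP BY pickup_borough
    ORDER BY trip_count DESC
""").show()
```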
A gentle introduction to Apache Spark, from the concept of Resilient Distributed Datasets to deploying software to the core platform, Spark Streaming, and Spark SQL.
Transitioning Compute Models: Hadoop MapReduce to Spark (Slim Baltagi)
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
Present and future of unified, portable, and efficient data processing with A... (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
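To give a feel for the Beam model's portability, here is a small counting pipeline in the Python SDK; the input and output paths are placeholders, and the same code can target other runners (Dataflow, Flink, Spark) by changing the runner option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # local runner for the sketch

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("/data/access.log")
     | "ExtractStatus" >> beam.Map(lambda line: line.split()[-1])
     | "PairWithOne" >> beam.Map(lambda status: (status, 1))
     | "CountPerStatus" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda status, n: f"{status}\t{n}")
     | "Write" >> beam.io.WriteToText("/data/status_counts"))
```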
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Aqusag Technologies
In late April 2025, a significant portion of Europe, particularly Spain, Portugal, and parts of southern France, experienced widespread, rolling power outages that continue to affect millions of residents, businesses, and infrastructure systems.
Mobile App Development Company in Saudi ArabiaSteve Jonas
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxAnoop Ashok
In today's fast-paced retail environment, efficiency is key. Every minute counts, and every penny matters. One tool that can significantly boost your store's efficiency is a well-executed planogram. These visual merchandising blueprints not only enhance store layouts but also save time and money in the process.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades are installed “automatically” in the background, which significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understanding the difference between single- and multi-user scenarios
- Utilizing Client Clocking
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility that are elevating them into a new era of cryptocurrency.
How Can I use the AI Hype in my Business Context?Daniel Lehner
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI, but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a LinkedIn webinar for Tecnovy on 28.04.2025.
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPathCommunity
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxshyamraj55
We’re bringing the TDX energy to our community with 2 power-packed sessions:
🛠️ Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
📄 Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
HCL Nomad Web – Best Practices and Managing Multiuser Environments (German edition)panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is celebrated as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed “automatically” in the background, which significantly reduces the administrative effort compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar, we will examine effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder in the browser’s cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Using the Client Clocking feature
What is Model Context Protocol(MCP) - The new technology for communication bw...Vishnu Singh Chundawat
The MCP (Model Context Protocol) is a framework designed to manage context and interaction within complex systems. This SlideShare presentation will provide a detailed overview of the MCP Model, its applications, and how it plays a crucial role in improving communication and decision-making in distributed systems. We will explore the key concepts behind the protocol, including the importance of context, data management, and how this model enhances system adaptability and responsiveness. Ideal for software developers, system architects, and IT professionals, this presentation will offer valuable insights into how the MCP Model can streamline workflows, improve efficiency, and create more intuitive systems for a wide range of use cases.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
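As a rough illustration of what a Spark-free workload can look like, here is a small Python sketch using the deltalake package (the Python bindings for delta-rs); the local path, column names, and data are assumptions for illustration, not anything from the S&P Global talk.

# Minimal delta-rs sketch: write and read a Delta table without Spark
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write a small DataFrame as a Delta table on local disk (illustrative path)
df = pd.DataFrame({"symbol": ["WTI", "BRENT"], "price": [78.2, 82.5]})
write_deltalake("/tmp/prices_delta", df, mode="overwrite")

# Read it back and inspect the table's commit history
table = DeltaTable("/tmp/prices_delta")
print(table.to_pandas())
print(table.history())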
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...SOFTTECHHUB
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxJustin Reock
Building 10x Organizations with Modern Productivity Metrics
10x developers may be a myth, but 10x organizations are very real, as proven by the influential study performed in the 1980s, ‘The Coding War Games.’
Right now, here in early 2025, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience. It seems that with every new method we invent for the delivery of products, whether physical or virtual, we reinvent productivity philosophies to go alongside them.
But which of these approaches actually work? DORA? SPACE? DevEx? What should we invest in and create urgency behind today, so that we don’t find ourselves having the same discussion again in a decade?
E2E Data Pipeline - Apache Spark/Airflow/Livy
1. End to End Pipelines Using Apache Spark/Livy/Airflow
An integrated solution for batch data processing
Rikin Tanna and Karunasri Maram
Capital One Auto Finance
February 12, 2020
4. Solution Requirements
What should an end-to-end data pipeline provide?
● Scalable: handle jobs with growing data sets
● Parallel Execution: ability to run multiple jobs in parallel
● Open-Source Support: active contributions to the components used, to stay efficient
● Dynamic: generation of pipelines on demand to support varying characteristics
● Dependency Enabled: support ordering of tasks based on dependencies
5. Solution: Brief
A fully integrated big data pipeline, with just three components:
● Apache Spark: unified data analytics engine for large-scale data processing; served on an EMR cluster
● Apache Livy: REST interface to enable easy interaction with Apache Spark; served on the master node of the EMR cluster
● Apache Airflow: WMS to schedule, trigger, and monitor workflows on a single compute resource; served on a single compute instance, with the metadata DB on a separate RDS instance
7. What is Airflow?
An open-source platform to programmatically author, schedule, and monitor workflows.
● Dynamic: Airflow pipelines are configured as code, allowing for dynamic pipeline generation as DAGs
● Extensible: easily extend the library by creating your own operators and executors
● General-Purpose: Airflow is written in Python, and all pipelines are configured in Python
● Accessible: a rich UI allows non-technical users to monitor workflows
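To make “pipelines as code” concrete, here is a minimal, illustrative DAG written against Airflow 2.x (1.10-era import paths differ slightly); the DAG id, schedule, and task logic are placeholder assumptions, not part of the deck.

# A tiny two-task DAG: a bash extract step followed by a Python transform step
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder for real transformation logic
    print("transforming data")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    process = PythonOperator(task_id="transform", python_callable=transform)

    extract >> process  # enforce dependency order: extract before transform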
8. Why Airflow?
Comparison of common open-source workflow management systems (Airflow, Luigi, Oozie, Azkaban) across: dynamic pipelines, rich interactive UI, general-purpose usability, scalability, dependency management, and maturity/support.
9. Apache Airflow Architecture
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
● Metadata Database: stores the information needed by the scheduler/executor and webserver (task state, DAG definitions, log locations, etc.)
● Scheduler/Executor: process that uses DAG definitions and task states to push tasks onto a queue for execution
● Workers: process(es) that execute the logic of the tasks
● Webserver: process that renders the web UI, interacting with the metadata database to allow users to monitor and interact with workflows
10. How to Get Started with Apache Airflow
1. Install Python
2. “pip install apache-airflow”: install from PyPI using pip (AIRFLOW_HOME defaults to ~/airflow)
3. “airflow initdb”: initialize the metadata database
4. “airflow webserver -p 8080”: start the web server (default port is 8080)
5. “airflow scheduler”: start the scheduler (also starts executor processes)
6. Visit localhost:8080 in a browser and enable the example DAGs
Deeper Understanding
1. Connect to the database (using DataGrip or DBeaver) and view the tables. See how the data is altered as workflows execute and changes are made.
2. Dig into the source code (https://github.com/apache/airflow) and view the actions triggered by the scheduler CLI command.
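Note: the commands above reflect the Airflow 1.10.x releases that were current when this deck was presented (February 2020). On Airflow 2.x, “airflow initdb” has been replaced by “airflow db init” (and later by “airflow db migrate”), and the web UI additionally expects a login created with “airflow users create”; the webserver and scheduler commands are unchanged.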
12. Why Spark?
Comparison of common open-source big data processing systems (Spark, Flink, Storm) across: streaming, batch/interactive/iterative processing, general-purpose usability, scalability, product maturity, and community support.
17. Why Livy?
Comparison of common open-source Spark interfaces (Livy, spark-jobserver, Mist, Apache Toree) across: streaming jobs, batch jobs, general-purpose usability, high availability, support for the major languages (Scala/Java/Python), and dependency handling (no code changes required).
18. Apache Livy: A REST Service for Spark Jobs
○ Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
○ It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
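To show how the pieces connect in practice, here is an illustrative Python sketch of submitting a Spark batch job through Livy's REST API and polling for completion, the kind of call an Airflow task (for example a PythonOperator) could make; the Livy host, S3 path, and job arguments are assumptions, and Livy listens on port 8998 by default.

# Submit a Spark batch via Livy and wait for a terminal state
import time
import requests

LIVY_URL = "http://emr-master:8998"  # hypothetical Livy endpoint on the EMR master node

# POST /batches starts a Spark batch from a file reachable by the cluster
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={"file": "s3://my-bucket/jobs/etl_job.py", "args": ["2020-02-12"]},
    headers={"Content-Type": "application/json"},
)
batch_id = resp.json()["id"]

# Poll GET /batches/{id}/state until the job reaches a terminal state
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)

print(f"Batch {batch_id} finished with state: {state}")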
22. Failure Resiliency
● Airflow: current weakness
○ The current solution lacks resiliency in Airflow (a single EC2 instance)
○ Possible solution: containerize Airflow, deploy it on a pod with separate worker pods, and distribute tasks using an external queue
● Livy
○ Supports session recovery using ZooKeeper and reconnects to the existing session even if the server fails while executing a job
● Spark
○ Failed tasks can be re-launched in parallel on the other nodes in the cluster, distributing the recomputation across many nodes and recovering from failures quickly
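For reference, the ZooKeeper-backed session recovery mentioned above is enabled in Livy's configuration file; roughly the following livy.conf settings, where the ZooKeeper quorum address is an illustrative assumption:

# livy.conf (sketch): enable session recovery backed by ZooKeeper
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = zk-host1:2181,zk-host2:2181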