Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder of DataTorrent presented "Streaming Analytics with Apache Apex" as part of the Big Data, Berlin v 8.0 meetup organised on the 14th of July 2016 at the WeWork headquarters.
Zsolt Várnai, Principal Software Engineer at Skyscanner - "The advantages of... (Dataconomy Media)
Zsolt Várnai, Principal Software Engineer at Skyscanner, presented "The advantages of real-time monitoring in apps development" as part of the Big Data, Budapest v 3.0 meetup organised on the 19th of May 2016 at Skyscanner's headquarters.
How ironSource streams more than 200B events every month into a data warehouse
http://www.ironsrc.com/atom
https://www.youtube.com/watch?v=cF9iMolKT5w
Delivering digital transformation and business impact with IoT, machine lear... (Robert Sanders)
A world-leading manufacturer was in search of an IoT solution that could ingest, integrate, and manage data being generated from various types of connected machinery located on factory floors around the globe. The company needed to manage the devices generating the data, integrate the flow of data into existing back-end systems, run advanced analytics on that data, and then deliver services to generate real-time decision making at the edge.
In this session, learn how Clairvoyant, a leading systems integrator and Red Hat partner, was able to accelerate digital transformation for their customer using Internet of Things (IoT) and machine learning in a hybrid cloud environment. Specifically, Clairvoyant and Eurotech will discuss:
• The approach taken to optimize manufacturing processes to cut costs, minimize downtime, and increase efficiency.
• How a data processing pipeline for IoT data was built using an open, end-to-end architecture from Cloudera, Eurotech, and Red Hat.
• How analytics and machine learning inferencing powered at the IoT edge will allow predictions to be made and decisions to be executed in real time.
• The flexible and hybrid cloud environment designed to provide the key foundational elements to quickly and securely roll out IoT use cases.
Predicting Patient Outcomes in Real-Time at HCA (Sri Ambati)
Data Scientist Allison Baker and Development Manager of Data Products Cody Hall work with a talented team of data scientists, software engineers, and web developers, and are building the framework and infrastructure to support a real-time prediction application, with the ability to scale across the entire company. Paramount to these efforts has been the capability of integrating the architecture for software production with the predictive models generated by H2O. This talk will review the processes by which HCA is building a pipeline to predict patient outcomes in real time, relying heavily on H2O's POJO scoring API and implemented in Clojure for data processing. #h2ony
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
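As a hedged illustration of the workflow the talk describes (train in H2O, export a POJO, score inside a JVM service such as a Clojure application), here is a minimal Python sketch; the dataset path, feature columns, and output directory are hypothetical, not from the talk.

```python
# Minimal sketch: train an H2O model and export it as a POJO for JVM-side scoring.
# The CSV path, column names, and output directory are hypothetical placeholders.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

frame = h2o.import_file("patients.csv")          # hypothetical training data
predictors = ["age", "heart_rate", "lab_value"]  # hypothetical feature columns
response = "outcome"
frame[response] = frame[response].asfactor()     # treat the outcome as categorical

model = H2OGradientBoostingEstimator(ntrees=100)
model.train(x=predictors, y=response, training_frame=frame)

# Export a plain-Java POJO; a JVM service (e.g., one written in Clojure)
# can compile and call it for low-latency, in-process scoring.
model.download_pojo(path="./pojo_out")
```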
RWE & Patient Analytics Leveraging Databricks – A Use Case (Databricks)
Harini Gopalakrishnan & Martin Longpre from Sanofi present on leveraging real world data and evidence generation using Databricks. They discuss defining real world data and evidence, using advanced analytics for indication searching, and implementing a conceptual architecture in Databricks for privacy-preserved analysis. Their system offers secure data management, self-service analytics tools, and controls access and auditing. Databricks is customized for their needs with cluster policies, Gitlab integration, and IAM roles. They demonstrate their workflow and discuss future improvements to further enhance insights from real world data.
Testing the Data Warehouse—Big Data, Big Problems (TechWell)
Data warehouses are critical systems for collecting, organizing, and making information readily available for strategic decision making. The ability to review historical trends and monitor near real-time operational data is a key competitive advantage for many organizations. Yet the methods for assuring the quality of these valuable assets are quite different from those of transactional systems. Ensuring that appropriate testing is performed is a major challenge for many enterprises. Geoff Horne has led numerous data warehouse testing projects in both the telecommunications and ERP sectors. Join Geoff as he shares his approaches and experiences, focusing on the key “uniques” of data warehouse testing: methods for assuring data completeness, monitoring data transformations, measuring quality, and more. Geoff explores the opportunities for test automation as part of the data warehouse process, describing how you can harness automation tools to streamline the work and minimize overhead.
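One of the "uniques" Geoff names, assuring data completeness, is often automated with simple source-to-target reconciliation checks. A minimal sketch of the idea, with hypothetical database files and table names (sqlite3 stands in for any DB-API driver):

```python
# Minimal sketch of source-to-target reconciliation for data completeness:
# compare row counts and a column checksum between a source table and the
# warehouse table it loads into. Connections and names are hypothetical.
import sqlite3  # stand-in for any DB-API 2.0 driver

def fetch_one(conn, sql):
    return conn.execute(sql).fetchone()[0]

source = sqlite3.connect("source.db")        # hypothetical source system
warehouse = sqlite3.connect("warehouse.db")  # hypothetical warehouse

checks = {
    "row_count": "SELECT COUNT(*) FROM orders",
    "amount_checksum": "SELECT SUM(amount) FROM orders",
}

for name, sql in checks.items():
    src, tgt = fetch_one(source, sql), fetch_one(warehouse, sql)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{name}: source={src} target={tgt} -> {status}")
```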
Towards Personalization in Global Digital Health (Databricks)
The rapid expansion of mobile phone usage in low-income and middle-income countries has created unprecedented opportunities for applying AI to improve individual and population health.
At benshi.ai, a non-profit funded by the Bill and Melinda Gates Foundation, the goal is to transform health outcomes in resource-poor countries through advanced AI applications. We aim to do so by providing personalized predictions and recommendations that support diagnosis for medical care teams and frontline workers, and by nudging patients through personalized incentives towards improved disease treatment management and general wellness.
To this end, we have built an operational machine learning platform that provides personalized content and interventions in real time. Multiple engineering and machine learning decisions have been made to overcome different challenges and to build an experimentation engine and a centralized data and model management system for global health. Databricks served as a cornerstone upon which all our data/ML services were built. In particular, MLflow and dbx (an open-source tool from Databricks) have been crucial for the training, tracking, and management of our end-to-end model pipelines. From the data science perspective, our challenges involved causal inference analysis, behavioral time series forecasting, micro-randomized trials, and contextual-bandits-based experimentation at the individual level.
This talk will focus on how we overcome the technical challenges to build a state-of-the-art machine learning platform that serves to improve global health outcomes.
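Since the abstract credits MLflow (with dbx) for the training, tracking, and management of the model pipelines, a minimal MLflow tracking run may help make that concrete; the experiment name, parameters, and metric below are hypothetical:

```python
# Minimal sketch of MLflow experiment tracking as described in the abstract.
# Experiment name, parameters, and metric values are hypothetical.
import mlflow

mlflow.set_experiment("/benshi/treatment-adherence")  # hypothetical experiment

with mlflow.start_run(run_name="baseline-forecaster"):
    mlflow.log_param("model_type", "gradient_boosting")
    mlflow.log_param("horizon_days", 14)

    # ... train the model here ...
    mlflow.log_metric("val_auc", 0.87)  # hypothetical result

    # Artifacts (plots, serialized models, etc.) can be logged alongside
    # metrics, e.g. mlflow.sklearn.log_model(model, "model")
```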
- The document discusses data infrastructure at an online video syndication platform that handles 2-3 million streams per day.
- It describes different tools for data storage, analytics, and real-time processing including Hadoop, Spark, MongoDB, Logstash, Elasticsearch, and Storm.
- It also discusses best practices for data collection, formatting, analysis, and using data to detect issues through a case study on investigating a rapid decline in video streams.
Brokering Data: Accelerating Data Evaluation with Databricks White Label (Databricks)
As the data-as-a-service ecosystem continues to evolve, data brokers are faced with an unprecedented challenge – demonstrating the value of their data. Successfully crafting and selling a compelling data product relies on a broker’s ability to differentiate their product from the rest of the market. In smaller or static datasets, measures like row count and cardinality can speak volumes. However, when datasets run to terabytes or petabytes, differentiation becomes much more difficult. On top of that, “data quality” is a somewhat ill-defined term, and the definition of a “high quality dataset” can change daily or even hourly.
This breakout session will describe Veraset’s partnership with Databricks, and how we have white labeled Databricks to showcase and accelerate the value of our data. We’ll discuss the challenges that data brokers have faced to date and some of the primitives of our businesses that have guided our direction thus far. We will also actively demo our white label instance and notebook to show how we’ve been able to provide key insights to our customers and reduce the TTFB of data onboarding.
A Hadoop User Group (HUG) Ireland talk on Data Science production environments and their online setup using #ExpertModels, by Cronan McNamara, CEO @CremeGlobal
This document discusses Ford's data analytics strategy. It notes that the volume of data Ford collects is increasing significantly from connected vehicles and other sources. This includes up to 25 gigabytes per hour from individual vehicles. Ford is working to build applications and drive adoption of analytics across the company through education and training programs to democratize access to tools and infrastructure while ensuring privacy, security, and governance of customer data. The goal is to provide the right data, tools, and support to analysts and data scientists to improve products and services.
Building A Product Assortment Recommendation Engine (Databricks)
Amid the increasingly competitive brewing industry, the ability of retailers and brewers to provide optimal product assortments for their consumers has become a key goal for business stakeholders. Consumer trends, regional heterogeneities and massive product portfolios combine to scale the complexity of assortment selection. At AB InBev, we approach this selection problem through a two-step method rooted in statistical learning techniques. First, regression models and collaborative filtering are used to predict product demand in partnering retailers. The second step involves robust optimization techniques to recommend a set of products that enhance business-specified performance indicators, including retailer revenue and product market share.
With the ultimate goal of scaling our approach to over 100k brick-and-mortar retailers across the United States and online platforms, we have implemented our algorithms in custom-built Python libraries using Apache Spark. We package and deploy production versions of Python wheels to a hosted repository for installation to production infrastructure.
To orchestrate the execution of these processes at scale, we use a combination of the Databricks API, Azure App Configuration, Azure Functions, Azure Event Grid and some custom-built utilities to deploy the production wheels to on-demand and interactive Databricks clusters. From there, we monitor execution with Azure Application Insights and log evaluation metrics to Databricks Delta tables on ADLS. To create a full-fledged product and deliver value to customers, we built a custom web application using React and GraphQL which allows users to request assortment recommendations in a self-service, ad-hoc fashion.
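As a hedged sketch of the orchestration step described above, a one-off run of a production wheel can be submitted to an on-demand cluster through the Databricks Jobs API (2.1); the workspace URL, token, cluster spec, wheel path, package name, and entry point below are hypothetical placeholders, and the exact payload shape should be checked against current Databricks documentation:

```python
# Hedged sketch: submit a one-time Databricks job run that installs a Python
# wheel and invokes its entry point on an on-demand cluster (Jobs API 2.1).
# Host, token, cluster spec, wheel path, package, and entry point are hypothetical.
import requests

HOST = "https://<workspace>.azuredatabricks.net"  # hypothetical workspace URL
TOKEN = "dapi..."                                 # hypothetical access token

payload = {
    "run_name": "assortment-recommendation",
    "tasks": [{
        "task_key": "score",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 4,
        },
        "python_wheel_task": {
            "package_name": "assortment_engine",   # hypothetical package
            "entry_point": "run_recommendations",  # hypothetical entry point
        },
        "libraries": [{"whl": "dbfs:/wheels/assortment_engine-1.0-py3-none-any.whl"}],
    }],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])
```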
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo (Databricks)
This document discusses Databricks' goal of democratizing access to Spark. It introduces the Databricks cloud platform, which provides a hosted model for Spark with rapid releases, dynamic scaling, and security controls. The platform is used for just-in-time data warehousing, advanced analytics, and real-time use cases. Many companies struggle with the steep learning curve and costs of big data projects. To empower more developers, Databricks trained thousands on Spark and launched online courses with over 100,000 students. They are announcing the Databricks Community Edition, a free version of their platform, to further democratize access to Spark through mini clusters, notebooks, APIs, and continuous delivery of learning content.
How Yellowbrick Data Integrates to Existing Environments Webcast (Yellowbrick Data)
This document discusses how Yellowbrick can integrate into existing data environments. It describes Yellowbrick's data warehouse capabilities and how it compares to other solutions. The document recommends upgrading from single server databases or traditional MPP systems to Yellowbrick when data outgrows a single server or there are too many disparate systems. It also recommends moving from pre-configured or cloud-only systems to Yellowbrick to significantly reduce costs while improving query performance. The document concludes with a security demonstration using a netflow dataset.
Migrating Monitoring to Observability – How to Transform DevOps from being Re... (Liz Masters Lovelace)
With your Digital Transformation in full swing, it's time to transform the way you look at your systems and services. With the speed of DevOps, you need your monitoring to be faster, more agile, and more accurate. You can't afford for your systems to be down. It's time to look at monitoring from a different angle. Let's explore looking from the top down rather than the bottom up. For more information, please reach out to Craig Haessig. [email protected]
Yellowbrick Webcast with DBTA for Real-Time Analytics (Yellowbrick Data)
This document discusses key capabilities needed for real-time analytics. It notes that real-time data, combined with historical data, provides important context for decision making. Building data pipelines with fewer systems and steps leads to greater scalability and reliability. The document outlines needs for real-time analytics like ingesting streaming data, powering analytic applications, delivering massive capacity, and guaranteeing performance. It emphasizes that both real-time and historical data are important for analytics and that the right architecture can incorporate multiple data sources and workloads.
This presentation provides an overview of the StreamSets ETL tool.
StreamSets is a modern ETL tool designed to process streaming data.
StreamSets has two engines: Data Collector and Transformer (based on Apache Spark).
Stream processing for the practitioner: Blueprints for common stream processi... (Aljoscha Krettek)
Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink.
Topics include:
* Aggregating IoT event data, in which event-time-aware processing, handling of late data, and state are important
* Data enrichment, in which a stream of real-time events is “enriched” with data from a slowly changing database of supplemental data points
* Dynamic stream processing, in which a stream of control messages and dynamically updated user logic is used to process a stream of events for use cases such as alerting and fraud detection
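To make the first blueprint above concrete, here is a minimal, framework-free Python sketch of event-time tumbling windows with a watermark and allowed lateness; it models the idea only (Flink's real windowing, watermark, and state APIs are far richer), and the sensor events are hypothetical:

```python
# Framework-free sketch of the "aggregating IoT event data" blueprint:
# event-time tumbling windows with a watermark and allowed lateness.
# This models the idea only; real Flink jobs use its windowing/state APIs.
from collections import defaultdict

WINDOW = 60      # window size in seconds of event time
LATENESS = 30    # how far behind the watermark late events are still accepted

windows = defaultdict(list)  # window start -> buffered values
watermark = 0                # highest event time seen (a simplistic watermark)

def on_event(event_time, value):
    global watermark
    watermark = max(watermark, event_time)
    window_start = (event_time // WINDOW) * WINDOW
    if window_start + WINDOW + LATENESS <= watermark:
        print(f"dropped late event at t={event_time}")  # beyond allowed lateness
        return
    windows[window_start].append(value)

def fire_closed_windows():
    # Emit aggregates for windows the watermark has passed (incl. lateness).
    for start in sorted(w for w in windows if w + WINDOW + LATENESS <= watermark):
        values = windows.pop(start)
        print(f"window [{start},{start + WINDOW}): count={len(values)} sum={sum(values)}")

# Hypothetical out-of-order sensor readings: (event_time_seconds, value)
for t, v in [(5, 1.0), (62, 2.0), (30, 0.5), (130, 3.0), (70, 1.5), (200, 4.0)]:
    on_event(t, v)
    fire_closed_windows()
```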
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup (Romain Yon)
Original event: https://www.meetup.com/NYC-Machine-Learning/events/256605862/
--
"Doing large scale ML in production is hard" – Everyone who's tried
This talk is focused on ML systems, especially the less obvious pitfalls that have caused us trouble at Spotify.
This talk assumes a certain level of familiarity with ML: you'll get the most out of it if you have some experience with applied ML, ideally on production systems.
Romain Yon is a Staff ML Engineer at Spotify. Over the years, Romain has worked on many of the core ML systems that power Spotify today (Music Recommendation, Catalog Quality, Search Ranking, Ads, ...).
During the past year, Romain has been mostly focusing on designing reusable ML Infrastructure that can be leveraged throughout Spotify.
Prior to Spotify, Romain co-founded the startup https://linkurio.us while getting his MSc in ML from Georgia Tech.
Managing R&D Data on Parallel Compute Infrastructure (Databricks)
Clinical genomic analytics pipelines that use Databricks and Delta Lake to load individual reads from raw sequencing or base-call files have significant advantages over more traditional methods. Analysis pipelines that perform genomic mapping against purpose-built reference data artifacts persisted to tables allow for performance that is orders of magnitude greater than previous mapping methods. These scalable, reproducible, and potentially open-sourced methods have the ability to transform bioinformatics and R&D data management and governance.
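A hedged sketch of the pattern described, persisting individual reads to a Delta table so downstream mapping jobs query tabular data instead of re-parsing raw files; it assumes a Spark session with Delta Lake available, and the sample rows, partition column, and table name are hypothetical:

```python
# Hedged sketch: persist parsed sequencing reads to a Delta table so that
# downstream mapping jobs query tabular data instead of re-parsing raw files.
# Assumes a Spark session with Delta Lake; rows, columns, table are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reads-to-delta").getOrCreate()

# Hypothetical pre-parsed reads (in practice, parsed from FASTQ/BCL files).
reads = spark.createDataFrame(
    [("s1", "r1", "ACGTACGT", "IIIIIIII"),
     ("s1", "r2", "TTGACCGA", "IIIIHHHH")],
    ["sample_id", "read_id", "sequence", "quality"],
)

(reads.write
      .format("delta")
      .mode("append")
      .partitionBy("sample_id")  # illustrative partitioning by sample/run
      .saveAsTable("rnd.sequencing_reads"))
```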
The document discusses how GitLab.com builds its data services and products. It describes how GitLab.com uses its own DevOps platform to build an Enterprise Data Platform that analyzes data from GitLab.com. The data team faces challenges around scaling, visibility, and speed. To address these, the team takes actions like open sourcing tools, adopting DevOps practices, and establishing roles, processes, and technologies to build a trusted data model and framework. The key takeaways emphasize continuous iteration, discipline, automation, and living the company values.
This document discusses moving from traditional business intelligence (BI) tools to adopting machine learning (ML). It provides an overview of common BI workflows and limitations. It then introduces ML concepts like supervised, unsupervised, and reinforcement learning. The document outlines the typical ML pipeline including data wrangling, modeling, validation, and deployment. Finally, it discusses challenges of adopting ML and provides recommendations for getting started with ML using Python libraries and optimizing infrastructure costs.
Mastering Your Customer Data on Apache Spark by Elliott Cordo (Spark Summit)
This document discusses how Caserta Concepts used Apache Spark to help a customer master their customer data by cleaning, standardizing, matching, and linking over 6 million customer records and hundreds of millions of data points. Traditional customer data integration approaches were prohibitively expensive and slow for this volume of data. Spark enabled the data to be processed 10x faster by parallelizing data cleansing and transformation. GraphX was also used to model the data as a graph and identify linked customer records, reducing survivorship processing from 2 hours to under 5 minutes.
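The GraphX step here boils down to computing connected components over a graph whose edges are pairwise record matches; the same idea can be shown with a small framework-free union-find sketch (record IDs and match pairs are hypothetical):

```python
# Framework-free sketch of the linkage step: GraphX's connected components
# over "these two records match" edges, modeled here with union-find.
# Record IDs and match pairs are hypothetical.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

# Pairwise matches produced by cleansing/standardization (hypothetical).
matches = [("cust_1", "cust_2"), ("cust_2", "cust_9"), ("cust_4", "cust_5")]
for a, b in matches:
    union(a, b)

# Group records by component: each group is one "mastered" customer.
clusters = defaultdict(list)
for record in parent:
    clusters[find(record)].append(record)
print(dict(clusters))
# e.g. {'cust_1': ['cust_1', 'cust_2', 'cust_9'], 'cust_4': ['cust_4', 'cust_5']}
```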
Next Gen Big Data Analytics with Apache Apex discusses Apache Apex, an open source stream processing framework. It provides an overview of Apache Apex's capabilities for processing continuous, real-time data streams at scale. Specifically, it describes how Apache Apex allows for in-memory, distributed stream processing using a programming model of operators in a directed acyclic graph. It also covers Apache Apex's features for fault tolerance, dynamic scaling, and integration with Hadoop and YARN.
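The programming model mentioned, operators wired into a directed acyclic graph, can be illustrated with a tiny framework-free Python sketch; Apex's actual API is Java-based, so this only mirrors the shape of the model:

```python
# Framework-free sketch of the operator-DAG model the overview describes:
# each operator transforms a stream and emits to its downstream operators.
# Apache Apex's real API is Java; this only mirrors the shape of the model.

class Operator:
    def __init__(self, fn):
        self.fn = fn            # record -> iterable of output records
        self.downstream = []

    def connect(self, other):   # a "stream" edge in the DAG
        self.downstream.append(other)
        return other

    def emit(self, record):
        for out in self.fn(record):
            if self.downstream:
                for nxt in self.downstream:
                    nxt.emit(out)
            else:
                print("sink:", out)

# parse -> filter -> enrich, wired as a small linear DAG
parse  = Operator(lambda line: [line.split(",")])
keep   = Operator(lambda rec: [rec] if rec[1] == "click" else [])
enrich = Operator(lambda rec: [rec + ["us-east"]])  # hypothetical enrichment

parse.connect(keep).connect(enrich)

for line in ["u1,click", "u2,view", "u3,click"]:
    parse.emit(line)
# sink: ['u1', 'click', 'us-east']
# sink: ['u3', 'click', 'us-east']
```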
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, a rich set of functional building blocks and an easy-to-use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for a Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Intro to Apache Apex - Next Gen Platform for Ingest and Transform (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is an Apache Apex PMC member and a senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platforms and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks and an easy-to-use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, its programming model, and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with a growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been in development since 2012 and is used in production in industries such as online advertising, the Internet of Things (IoT) and financial services.
Architectural Comparison of Apache Apex and Spark Streaming (Apache Apex)
This presentation discusses architectural differences between Apache Apex and Spark Streaming. It discusses how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming (Apache Apex)
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform (Apache Apex)
Internet of Things (IoT) devices are becoming more ubiquitous in consumer, business and industrial landscapes. They are being widely used in applications ranging from home automation to the industrial internet. They pose a unique challenge in terms of the volume of data they produce, the velocity with which they produce it, and the variety of sources that must be handled. The challenge is to ingest and process this data at the speed at which it is being produced, in a real-time and fault-tolerant fashion. Apache Apex is an industrial-grade, scalable and fault-tolerant big data processing platform that runs natively on Hadoop. In this deck, you will see how Apex is being used in IoT applications and how enterprise features such as dimensional analytics, real-time dashboards and monitoring play a key role.
Presented by Pramod Immaneni, Principal Architect at DataTorrent and PPMC member of Apache Apex, in a BrightTALK webinar on Apr 6th, 2016
Apache Apex: Stream Processing Architecture and Applications (Thomas Weise)
Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, a rich set of functional building blocks and an easy-to-use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for a Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Apache Apex: Stream Processing Architecture and Applications (Comsysto Reply GmbH)
• Architecture highlights: high throughput, low latency, operability with stateful fault tolerance, strong processing guarantees, auto-scaling, etc.
• Application development model, unified approach for real-time and batch use cases
• Tools for ease of use, ease of operability and ease of management
• How customers use Apache Apex in production
BigDataSpain 2016: Introduction to Apache Apex (Thomas Weise)
Apache Apex is an open source stream processing platform, built for large scale, high-throughput, low-latency, high availability and operability. With a unified architecture it can be used for real-time and batch processing. Apex is Java based and runs natively on Apache Hadoop YARN and HDFS.
We will discuss the key features of Apache Apex and architectural differences from similar platforms and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, low latency SLA, high throughput and large scale ingestion.
Apex APIs and libraries of operators and examples focus on developer productivity. We will present the programming model with examples and how custom business logic can be easily integrated based on the Apex operator API.
We will cover integration with connectors to sources/destinations (including Kafka, JMS, SQL, NoSQL, files etc.), scalability with advanced partitioning, fault tolerance and processing guarantees, computation and scheduling model, state management, windowing and dynamic changes. Attendees will also learn how these features affect time to market and total cost of ownership and how they are important in existing Apex production deployments.
https://www.bigdataspain.org/
StreamAnalytix 2.0 is a multi-engine streaming analytics platform that allows users to deploy multiple streaming engines depending on their use case requirements. It features an easy-to-use drag-and-drop UI, support for predictive analytics, machine learning, and real-time dashboards. The platform provides a level of abstraction that gives customers flexibility in choosing the best streaming engine for their needs.
Introduction to Apache Apex and writing a big data streaming application (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup, where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th, 2016, and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (i.e., hosting, presenting, community leadership), please email [email protected]
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...Dataconomy Media
Kx Technology is an in-memory columnar database and programming system that is highly optimized for real-time streaming and historical time-series data analytics. It provides extreme performance at low latency and can scale to process massive data volumes without significant infrastructure. Kx has been widely adopted over two decades in the financial services industry for applications like market surveillance, risk management, and quantitative research.
Ingestion and Dimensions Compute and Enrich using Apache Apex (Apache Apex)
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
This talk will be a deep dive into ingesting unbounded file data and streaming data from Kafka into Hadoop. We will also cover data enrichment, dimensional compute, a customer use case, and a reference architecture.
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... (Big Data Spain)
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread.
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...Dataconomy Media
The challenges of increasing complexity of organizations, companies and projects are obvious and omnipresent. Everywhere there are connections and dependencies that are often not adequately managed or not considered at all because of a lack of technology or expertise to uncover and leverage the relationships in data and information. In his presentation, Axel Morgner talks about graph technology and knowledge graphs as indispensable building blocks for successful companies.
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...Dataconomy Media
The document discusses emerging technologies and their potential impacts, and questions how individuals and societies can responsibly address issues arising from new technologies. It notes that governments, regulators, and individuals struggle to understand new concepts that spread rapidly. It asks if there are existing systems or forms of cooperation that could help societies address responsibilities related to technologies, but offers no definitive solutions, mainly posing questions.
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...Dataconomy Media
Every day we are challenged with more data, more use cases, and an ever-increasing demand for analytics. In this talk, Bjorn will explain how autonomous data management and machine learning help innovators be more productive, and he will give examples of how to deliver new data-driven projects with less risk at lower cost.
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo... (Dataconomy Media)
This document contains an agenda and presentation materials for a talk on building and deploying an anti-money laundering (AML) model using DataRobot. The agenda includes introductions to DataRobot and AML, an AML demo, a real AML use case example, and a question and answer section. The presentation materials provide background on DataRobot, including its history and products. It also gives an overview of money laundering and how AML works, both traditionally using rule-based systems and how machine learning can help by reducing false positives and improving efficiency. A case study shows how DataRobot has helped other organizations with AML use cases.
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...Dataconomy Media
Trump, Brexit, Cambridge Analytica... In the last few years, we have had to confront the consequences of the use and misuse of data science algorithms in manipulating public opinion through social media. The use of private data to microtarget individuals is a daily practice (and a trillion-dollar industry), which has serious side-effects when the selling product is your political ideology. How can we cope with this new scenario?
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de... (Dataconomy Media)
The document discusses data innovation and Men on the Moon's approach. It notes that while there is a large amount of available data worldwide, only a small portion is used to create value. Most data science projects also fail. The document then outlines Men on the Moon's "Data Thinking" approach, which combines design thinking and data science. Their approach involves defining a data vision, identifying use cases, prototyping solutions, and enabling employees. The goal is to leverage data to create valuable solutions for people through data innovation.
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...Dataconomy Media
What does it take to build a good data product or service? Data practitioners always think about the technology, user experience and commercial viability. But rarely do they think about the implications of the systems they build. This talk will shed light on the impact of AI systems and the unintended consequences of the use of data in different products. It will also discuss our role, as data practitioners, in planting the seeds of fairness in the systems we build.
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...Dataconomy Media
People analytics uses data science techniques like machine learning and pattern recognition on employee data to generate insights and reports that can help businesses make smarter talent and operational decisions. These decisions can improve workforce effectiveness, engagement, recruitment, retention and performance while also increasing sales and reducing fraud and accidents. People analytics technologies include surveys, correlation analysis, machine learning and AI which can help companies improve their culture, develop employee skills and boost growth when the results are properly implemented.
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...Dataconomy Media
Cloud Infrastructure is a hostile environment: a power supply failure or a network outage leads to downtime and big losses. There is nothing we can trust: a single server, a server rack, even a whole datacenter can fail, and if an application is fragile by design, disruption is inevitable. We must distribute our application and diversify cloud data strategy to survive disturbances of any scale. Apache Cassandra is a cloud-native, platform-agnostic database that stores data with distributed redundancy, so it easily survives any issue. Want to know how Apple and Netflix handle petabytes of data while keeping it highly available? Join us and listen to a story of 10 little servers and no downtime!
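The distributed redundancy the talk leans on is configured per keyspace and per query in Cassandra; a minimal sketch with the Python driver, where the contact points, keyspace, and datacenter name are hypothetical:

```python
# Minimal sketch: create a Cassandra keyspace with multi-replica redundancy
# and write at quorum so single-node failures don't cause downtime.
# Contact points, keyspace, and datacenter name are hypothetical.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS kv (k text PRIMARY KEY, v text)")

# QUORUM (2 of 3 replicas) tolerates the loss of one replica per partition.
write = SimpleStatement("INSERT INTO kv (k, v) VALUES (%s, %s)",
                        consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("greeting", "hello"))
```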
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...Dataconomy Media
In the data industry, having correctly labelled datasets is vital. Timothy Thatcher explains how tagging your data at scale, while accounting for time, location, and complex hierarchical rules, can be handled.
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under... (Dataconomy Media)
This document discusses using machine learning to analyze individual and interpersonal behavior for clinical diagnosis and screening. It focuses on analyzing non-verbal behaviors like interpersonal synchronization that have been shown to be impaired in conditions like autism spectrum disorder. The document proposes that machine learning could provide an objective, automated tool for diagnosing conditions more quickly by analyzing video recordings of social interactions. This may help address bottlenecks in healthcare systems and allow earlier access to treatment.
Data Natives Berlin v 20.0 | "Serving A/B experimentation platform end-to-end"... (Dataconomy Media)
This document discusses the end-to-end experimentation platform at GetYourGuide for A/B testing. It outlines the challenges of running experiments such as imbalanced assignments, suspicious metric changes, and non-converging results. It also describes the tools used for planning experiments, monitoring assignments, performing daily checks, and analyzing results. The goal is to validate UX changes, estimate effects on customers, and make more objective decisions through A/B testing while addressing issues that could impact experiment quality.
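One of the daily checks mentioned, catching imbalanced assignments, is commonly implemented as a sample-ratio-mismatch test; a minimal sketch with hypothetical assignment counts:

```python
# Minimal sketch of a sample-ratio-mismatch (SRM) check: given an intended
# 50/50 split, are the observed assignment counts plausible?
# The counts are hypothetical.
from scipy.stats import chisquare

control, treatment = 50_310, 49_345          # observed assignments
total = control + treatment
expected = [total / 2, total / 2]            # intended 50/50 split

stat, p_value = chisquare([control, treatment], f_exp=expected)
print(f"chi2={stat:.2f} p={p_value:.4f}")
if p_value < 0.001:                          # conservative threshold for SRM
    print("Possible sample ratio mismatch: investigate the assignment pipeline.")
```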
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...Dataconomy Media
Cloud Infrastructure is a hostile environment: a power supply failure or a network outage leads to downtime and big losses. There is nothing we can trust: a single server, a server rack, even a whole datacenter can fail, and if an application is fragile by design, disruption is inevitable. We must distribute our application and diversify cloud data strategy to survive disturbances of any scale. Apache Cassandra is a cloud-native, platform-agnostic database that stores data with distributed redundancy, so it easily survives any issue. Want to know how Apple and Netflix handle petabytes of data while keeping it highly available? Join us and listen to a story of 10 little servers and no downtime!
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ... (Dataconomy Media)
Creativity is the mental ability to create new ideas and designs. Innovation, on the other hand, means developing useful solutions from new ideas. Creativity can, but need not, be goal-oriented, whereas innovation is always goal-oriented: innovation aims to achieve defined goals. The use of cloud services and technologies promises enterprise users many benefits in terms of more flexible use of IT resources and faster access to innovative solutions. That is why this talk examines the question of what role cloud computing plays for innovation in companies.
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -... (Dataconomy Media)
A presentation of the time series properties of financial instruments and the possibilities for frequency decomposition and information extraction using the FT, STFT, and wavelets, with an outlook on current research on wavelet neural networks.
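As a hedged illustration of the decomposition techniques listed, here is a multilevel discrete wavelet decomposition of a synthetic series with PyWavelets; the series is a hypothetical stand-in for real market data:

```python
# Hedged sketch: multilevel discrete wavelet decomposition of a price series,
# separating slow-moving components from high-frequency noise.
# The synthetic series is a hypothetical stand-in for real market data.
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512)
prices = np.cumsum(0.01 * rng.standard_normal(512)) + 0.5 * np.sin(2 * np.pi * 4 * t)

# Decompose into an approximation (low-frequency trend) + detail coefficients.
coeffs = pywt.wavedec(prices, wavelet="db4", level=4)
approx, details = coeffs[0], coeffs[1:]
print("approximation length:", len(approx))
print("detail lengths:", [len(d) for d in details])

# A crude denoising step: zero the finest detail level and reconstruct.
coeffs[-1] = np.zeros_like(coeffs[-1])
smoothed = pywt.waverec(coeffs, wavelet="db4")
```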
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Dataconomy Media
"With most machine learning (ML) and deep learning (DL) frameworks, it can take hours to move data for ETL, and hours to train models. It's also hard to scale, with data sets increasingly being larger than the capacity of any single server. The amount of the data also makes it hard to incrementally test and retrain models in near real-time.
Learn how Apache Ignite and GridGain help to address limitations like ETL costs, scaling issues and Time-To-Market for the new models and help achieve near-real-time, continuous learning.
Yuriy Babak, the head of ML/DL framework development at GridGain and an Apache Ignite committer, will explain how ML/DL works with Apache Ignite, and how to get started.
Topics include:
— Overview of distributed ML/DL including architecture, implementation, usage patterns, pros and cons
— Overview of Apache Ignite ML/DL, including built-in ML/DL algorithms, and how to implement your own
— Model inference with Apache Ignite, including how to train models with other libraries, like Apache Spark, and deploy them in Ignite
— How Apache Ignite and TensorFlow can be used together to build distributed DL model training and inference"
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Dataconomy Media
"Machine learning algorithms require significant amounts of training data which has been centralized on one machine or in a datacenter so far. For numerous applications, such need of collecting data can be extremely privacy-invasive. Recent advancements in AI research approach this issue by a new paradigm of training AI models, i.e., Federated Learning.
In federated learning, edge devices (phones, computers, cars etc.) collaboratively learn a shared AI model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. From personal data perspective, this paradigm enables a way of training a model on the device without directly inspecting users’ data on a server. This talk will pinpoint several examples of AI applications benefiting from federated learning and the likely future of privacy-aware systems."
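The core of the paradigm described, combining locally trained models instead of centralizing raw data, reduces to the federated-averaging step; a minimal numpy sketch with hypothetical client updates:

```python
# Minimal sketch of federated averaging (FedAvg): the server combines locally
# trained model weights, weighted by each client's example count, so raw
# training data never leaves the devices. Client updates are hypothetical.
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of per-client weight vectors by dataset size."""
    total = sum(client_sizes)
    stacked = np.stack(client_weights)       # shape: (n_clients, n_params)
    coeffs = np.array(client_sizes) / total  # contribution per client
    return coeffs @ stacked                  # shape: (n_params,)

# Three devices train locally and report weights plus local example counts.
updates = [np.array([0.9, -0.2]), np.array([1.1, -0.1]), np.array([1.0, -0.3])]
sizes = [120, 480, 240]

global_weights = federated_average(updates, sizes)
print(global_weights)  # the new shared model broadcast back to devices
```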
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is heralded as the next generation of the HCL Notes client and offers numerous advantages, such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed “automatically” in the background, which significantly reduces the administrative effort compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder in the browser’s cache (using OPFS)
- Understanding the differences between single-user and multi-user scenarios
- Using the Client Clocking feature
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfSoftware Company
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Utilizing Client Clocking
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPathCommunity
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility that are elevating them into a new era of cryptocurrency.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
Dev Dives: Automate and orchestrate your processes with UiPath MaestroUiPathCommunity
This session is designed to equip developers with the skills needed to build mission-critical, end-to-end processes that seamlessly orchestrate agents, people, and robots.
📕 Here's what you can expect:
- Modeling: Build end-to-end processes using BPMN.
- Implementing: Integrate agentic tasks, RPA, APIs, and advanced decisioning into processes.
- Operating: Control process instances with rewind, replay, pause, and stop functions.
- Monitoring: Use dashboards and embedded analytics for real-time insights into process instances.
This webinar is a must-attend for developers looking to enhance their agentic automation skills and orchestrate robust, mission-critical processes.
👨🏫 Speaker:
Andrei Vintila, Principal Product Manager @UiPath
This session streamed live on April 29, 2025, 16:00 CET.
Check out all our upcoming Dev Dives sessions at https://community.uipath.com/dev-dives-automation-developer-2025/.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around the adoption of GenAI in business: benefits, opportunities, and limitations. I also discussed how my research on the Theory of Cognitive Chasms helps address some of these issues.
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
- Artificial Intelligence Market Overview
- Strategies for AI Adoption in 2025
- Anticipated drivers of AI adoption and transformative technologies
- Benefits of AI and Big Data for your business
- Tips on how to prepare your business for innovation
- AI and data privacy: strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxAnoop Ashok
In today's fast-paced retail environment, efficiency is key. Every minute counts, and every penny matters. One tool that can significantly boost your store's efficiency is a well-executed planogram. These visual merchandising blueprints not only enhance store layouts but also save time and money in the process.
TrsLabs - Fintech Product & Business ConsultingTrs Labs
Hybrid Growth Mandate Model with TrsLabs
Strategic investments, inorganic growth, and business model pivoting are critical activities that businesses don't undertake or change every day. In cases like this, it may benefit your business to bring in a temporary external consultant.
An unbiased plan, driven by clear-cut deliverables and market dynamics and free from the influence of your internal office equations, empowers business leaders to make the right choices.
Getting things done within a budget and a timeframe is key to growing a business, no matter whether you are a start-up or a big company.
Talk to us and unlock the competitive advantage.
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Aqusag Technologies
In late April 2025, a significant portion of Europe, particularly Spain, Portugal, and parts of southern France, experienced widespread, rolling power outages that continue to affect millions of residents, businesses, and infrastructure systems.
Semantic Cultivators : The Critical Future Role to Enable AIartmondano
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
3. Industries & Use Cases
3
Financial Services Ad-Tech Telecom Manufacturing Energy IoT
Fraud and risk
monitoring
Real-time
customer facing
dashboards on
key performance
indicators
Call detail record
(CDR) &
extended data
record (XDR)
analysis
Supply chain
planning &
optimization
Smart meter
analytics
Data ingestion
and processing
Credit risk
assessment
Click fraud
detection
Understanding
customer
behavior AND
context
Preventive
maintenance
Reduce outages
& improve
resource
utilization
Predictive
analytics
Improve turn around
time of trade
settlement processes
Billing
optimization
Packaging and
selling
anonymous
customer data
Product quality &
defect tracking
Asset &
workforce
management
Data governance
• Large scale ingest and distribution
• Real-time ELTA (Extract Load Transform Analyze)
• Dimensional computation & aggregation
• Enforcing data quality and data governance requirements
• Real-time data enrichment with reference data
• Real-time machine learning model scoring
HORIZONTAL
4. Apache Apex
• In-memory, distributed stream processing
ᵒ Application logic broken into components (operators) that execute distributed in a cluster
ᵒ Unobtrusive Java API to express (custom) logic (see the operator sketch below)
ᵒ Maintain state and metrics in member variables
ᵒ Windowing, event-time processing
• Scalable, high throughput, low latency
ᵒ Operators can be scaled up or down at runtime according to the load and SLA
ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
ᵒ Automatically recover from node outages without having to reprocess from the beginning
ᵒ State is preserved: checkpointing, incremental recovery
ᵒ End-to-end exactly-once
• Operability
ᵒ System and application metrics, record/visualize data
ᵒ Dynamic changes, elasticity
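To make the "unobtrusive Java API" bullet concrete, here is a minimal sketch of a custom operator: state lives in ordinary member variables (checkpointed by the engine), ports are transient, and each operator instance runs single-threaded. The BaseOperator package has moved between Apex releases, so treat the imports as indicative.

```java
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Sketch of a custom operator that filters lines by a keyword.
public class LineFilterOperator extends BaseOperator
{
  private long matchedCount;          // checkpointed state / metric
  private String keyword = "error";   // hypothetical filter keyword

  public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> in = new DefaultInputPort<String>()
  {
    @Override
    public void process(String line)
    {
      if (line.contains(keyword)) {   // custom business logic goes here
        matchedCount++;
        out.emit(line);
      }
    }
  };

  public void setKeyword(String keyword) // settable from configuration
  {
    this.keyword = keyword;
  }
}
```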
6. Application Development Model
A Stream is a sequence of data tuples.
A typical Operator takes one or more input streams, performs computations & emits one or more output streams.
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel and each instance is single-threaded
A Directed Acyclic Graph (DAG) is made up of operators and streams (see the application sketch below).
[Diagram: a DAG of operators connected by streams; a tuple on an operator's output stream flows through filtered and enriched streams into downstream operators.]
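A minimal sketch of assembling such a DAG, reusing the LineFilterOperator from the previous sketch. The inline LineSource is a hypothetical stand-in for a real connector such as a Kafka input operator, while ConsoleOutputOperator comes from the Malhar operator library.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

// Assembling the DAG: operators are added by name, and streams connect
// an output port to one or more input ports.
@ApplicationAnnotation(name = "FilterDemo")
public class FilterApp implements StreamingApplication
{
  // Hypothetical source: emits a sample line whenever the engine asks
  // for tuples within a streaming window.
  public static class LineSource extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      out.emit("error: sample line");
    }
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    LineSource source = dag.addOperator("source", new LineSource());
    LineFilterOperator filter = dag.addOperator("filter", new LineFilterOperator());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    dag.addStream("lines", source.out, filter.in);
    dag.addStream("filtered", filter.out, console.input);
  }
}
```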
8. End-to-End Exactly Once
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint (see the sketch after this list)
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
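A connector-agnostic sketch of the idempotency piece of that equation (the atomic write helper is hypothetical; Apex's Malhar library ships transactionable output operators that implement this pattern for specific stores):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of idempotent, windowed writes: the external store atomically
// records both the rows and the ID of the last fully written window, so
// windows replayed from a checkpoint after a failure can be skipped.
public class IdempotentDbWriter
{
  private long lastCommittedWindow;           // read back from the store at setup
  private transient long currentWindow;
  private final transient List<String> buffer = new ArrayList<>();

  public void beginWindow(long windowId)
  {
    currentWindow = windowId;
    buffer.clear();
  }

  public void process(String tuple)
  {
    buffer.add(tuple);                        // stage rows for this window
  }

  public void endWindow()
  {
    if (currentWindow <= lastCommittedWindow) {
      return;                                 // replayed window: already written
    }
    // One transaction writes the rows AND the window marker, so either
    // both become visible or neither does (helper is hypothetical):
    // store.writeAtomically(buffer, currentWindow);
    lastCommittedWindow = currentWindow;
  }
}
```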
10. Scalability
[Diagrams: a logical DAG 0 → 1 → 2 → 3 shown with several physical deployments. With operator 1 split into three partitions (1a, 1b, 1c), a unifier merges the partitioned stream before operator 2. With operator 1 as (1a, 1b, 1c) and operator 2 as (2a, 2b), an N×M connection using a unifier per downstream partition has no bottleneck, while routing all upstream partitions through a single intermediate unifier creates a bottleneck at that unifier.]
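In code, a fixed partition count like the one diagrammed above can be requested via an operator attribute. A minimal sketch, continuing the earlier populateDAG() example and assuming the StatelessPartitioner bundled with Apex (package paths may differ across releases); the engine inserts the unifiers automatically:

```java
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.common.partitioner.StatelessPartitioner;

// Inside populateDAG(): run the filter operator as 3 partitions.
dag.setAttribute(filter, OperatorContext.PARTITIONER,
    new StatelessPartitioner<LineFilterOperator>(3));
```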
11. Advanced Partitioning
[Diagrams: a logical plan 0 → 1 → 2 → 3 → 4. With a parallel partition, the 1 → 2 → 3 chain is replicated per partition (1a → 2a → 3a, 1b → 2b → 3b) and merged by a unifier only before operator 4. Execution plans for an upstream operator with N = 4 partitions (uopr1–uopr4) feeding a downstream operator (dopr) across containers and NICs show a single unifier for M = 1, and, with K = 2, cascading unifiers that merge in stages so no single container or NIC becomes a hotspot.]
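The parallel-partition layout is requested per input port. Again a sketch under the same assumptions as the earlier populateDAG() fragments:

```java
import com.datatorrent.api.Context.PortContext;

// Inside populateDAG(): let the downstream operator inherit the upstream
// partitioning, so each chain (1a -> 2a, 1b -> 2b) runs in parallel
// without an intermediate unifier or network shuffle between them.
dag.setInputPortAttribute(filter.in, PortContext.PARTITION_PARALLEL, true);
```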
12. Dynamic Partitioning
• Partitioning can change while the application is running
ᵒ Change the number of partitions at runtime based on stats
ᵒ Determine the initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner (see the sketch after this list)
[Diagram: a running DAG 1 → 2 → 3 repartitioning under load, e.g. from partitions (1a, 1b) and (2a, 2b) to (2a, 2b, 2c, 2d) with (3a, 3b); unifiers not shown.]
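A sketch of the custom-scaler hook from the last bullet, based on the StatsListener contract in the Apex API; the stats accessor and the threshold policy here should be treated as indicative rather than exact:

```java
import com.datatorrent.api.StatsListener;

// The engine calls processStats() periodically with per-operator stats;
// setting repartitionRequired asks it to re-invoke the partitioner.
public class ThroughputScaler implements StatsListener
{
  @Override
  public Response processStats(BatchedOperatorStats stats)
  {
    Response response = new Response();
    // Hypothetical policy: request repartitioning when the processed
    // tuple rate (per-second moving average) drops below a threshold.
    response.repartitionRequired = stats.getTuplesProcessedPSMA() < 1_000;
    return response;
  }
}
```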
19. Maximize Revenue w/ real-time insights
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance.
Business Need:
• Ingest and analyze high-volume clicks & views in real-time to help customers improve revenue (200K events/second data flow)
• Report critical metrics for campaign monetization from auction and client logs (22 TB/day of data generated)
• Handle ever-increasing traffic with efficient resource utilization
• Always-on ad network
Apex-based Solution:
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built operators, including connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI & data visualization console
Client Outcome:
• Helps PubMatic deliver ad performance insights to publishers and advertisers in real-time instead of after 5+ hours
• Helps publishers visualize campaign performance and adjust ad inventory in real-time to maximize their revenue
• Enables PubMatic to reduce OPEX with efficient compute resource utilization
• Built-in fault tolerance ensures customers can always access the ad network
20. Industrial IoT applications
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need:
• Ingest and analyze high-volume, high-speed data from thousands of devices and sensors per customer in real-time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads
Apex-based Solution:
• Ingestion application using the DataTorrent Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console
Client Outcome:
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtime with centralized management & monitoring of devices
• Enables faster innovation with a short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling
21. Smart energy applications
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices and conducts 2 million remote operations per year.
Business Need:
• Ingest high-volume, high-speed data from millions of devices & sensors in real-time without data loss
• Make data accessible to applications without delay to improve customer service
• Capture & analyze historical data to understand & improve grid operations
• Reduce the cost, time, and pain of integrating with 3rd-party apps
• Centralized management of software & operations
Apex-based Solution:
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Pre-built operators
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
Client Outcome:
• Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
• Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
• Enables fast application development for faster time to market
• Helps Silver Spring Networks scale with easy-to-partition operators
• Automatic recovery from failures
22. More about the use cases
• PubMatic
ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0
ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
• Silver Spring Networks
ᵒ https://www.youtube.com/watch?v=8VORISKeSjI
ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks