A presentation discussing how to deploy big data solutions, and the difference between structured reporting systems, which feed business processes, and data science systems, which do the cool stuff.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
The document discusses Marketo's migration of their SaaS business analytics platform to Hadoop. It describes their requirement for near real-time processing of 1 billion activities per customer per day at scale. They conducted a technology selection process across various Hadoop components and chose HBase, Kafka and Spark Streaming. The implementation involved building expertise, designing and building their first cluster, implementing security including Kerberos, validating through passive testing, deploying the new system through a migration, and ongoing monitoring, patching and upgrading of the new platform. Challenges included managing expertise retention, ZooKeeper performance on VMs, Kerberos integration, and capacity planning for the shared Hadoop cluster.
This document discusses harnessing the power of Apache Hadoop. It summarizes the benefits of using Hadoop to derive value from large, diverse datasets. It then outlines the steps to install and deploy Hadoop, challenges of doing so, and advantages of using Cloudera's Distribution of Hadoop (CDH) and management tools to more easily operationalize Hadoop. The document promotes an upcoming webinar on managing the Hadoop lifecycle.
Webinar: Don't Leave Your Data in the Dark (DataStax)
As new types of data sources emerge from cloud, mobile devices, social media and machine sensor devices, traditional databases hit the ceiling due to today’s dynamic, data-volume driven business culture.
Join us in this online webinar and learn how you can incorporate a modern, NoSQL platform into daily operations to optimize and simplify data performance. DataStax recently announced DataStax Enterprise 4.0, a production-certified version of Apache Cassandra with an in-memory option, enterprise search, advanced security features and visual management tools. Give your developers a simple and powerful way to deliver the information your customers care about most—unconstrained by the complexities and high costs of traditional database systems.
Learn how to:
- Easily assign data based on its performance needs to traditional spinning disk, SSD or in-memory storage, all in the same database instance
- Leverage DataStax’s built-in enhancements for broader information search and analysis even with many thousands of concurrent requests
- Visually monitor, manage, and fine-tune your environment to get the most of your online data
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T... (VMware Tanzu)
Pivotal HAWQ, one of the world’s most advanced enterprise SQL on Hadoop technology, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your analytic efforts. The slides from this technical webinar present a deep dive on this powerful modern data architecture for analytics and data science.
Learn more here: http://pivotal.io/big-data/pivotal-hawq
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Building scalable data pipelines for big data involves dealing with legacy systems, implementing data lineage and provenance, managing the data lifecycle, and engineering pipelines that can handle large volumes of data. Effective data pipeline engineering requires understanding how to extract, transform and load data while addressing issues like privacy, security, and integrating diverse data sources. Frameworks like Cascading can help build pipelines, but proper testing and scaling is also required to develop robust solutions.
The document discusses the evolution of big data architectures from Hadoop and MapReduce to Lambda architecture and stream processing frameworks. It notes the limitations of early frameworks in terms of latency, scalability, and fault tolerance. Modern architectures aim to unify batch and stream processing for low latency queries over both historical and new data.
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ... (DataWorks Summit)
For over 30 years, Parametric has been a leading provider of model-based portfolios to institutional and private investors, with unique implementation and customization expertise. Much like other cutting-edge financial services providers, Parametric operates with highly diverse, fast moving data from which they glean insights. Data sources range from benchmark providers to electronic trading participants to stock exchanges etc. The challenge is to not just onboard the data but also to figure out how to monetize it when the schemas are fast changing. This presents a problem to traditional architectures where large teams are needed to design the new ETL flow. Organizations that are able to quickly adapt to new schemas and data sources have a distinct competitive advantage.
In this presentation and demo, architects from Parametric, Chris Gambino and Vamsi Chemitiganti, will present the data architecture designed in response to this business challenge. We discuss the approach (and trade-offs) to pooling, managing and processing the data using the latest techniques in data ingestion and pre-processing. The overall best practices in creating a central data pool are also discussed, with the goal of giving quantitative analysts the most accurate and up-to-date information for their models to work on. Attendees will be able to draw on their experiences, both from a business and a technology standpoint, on not just creating a centralized data platform but also being able to distribute it to different units.
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
Expert IT analyst groups like Wikibon forecast that NoSQL database usage will grow at a compound rate of 60% each year for the next five years, and Gartner Group says NoSQL databases are one of the top trends impacting information management in 2013. But is NoSQL right for your business? How do you know which business applications will benefit from NoSQL and which won't? What questions do you need to ask in order to make such decisions?
If you're wondering what NoSQL is and if your business can benefit from NoSQL technology, join DataStax for the webinar "How to Tell if Your Business Needs NoSQL". This to-the-point presentation will provide practical litmus tests to help you understand whether NoSQL is right for your use case, and supplies examples of NoSQL technology in action with leading businesses that demonstrate how and where NoSQL databases can have the greatest impact.
Speaker: Robin Schumacher, Vice President of Products at DataStax
Robin Schumacher has spent the last 20 years working with databases and big data. He comes to DataStax from EnterpriseDB, where he built and led a market-driven product management group. Previously, Robin started and led the product management team at MySQL for three years before they were bought by Sun (the largest open source acquisition in history), and then by Oracle. He also started and led the product management team at Embarcadero Technologies, which was the #1 IPO in 2000. Robin is the author of three database performance books and frequent speaker at industry events. Robin holds BS, MA, and Ph.D. degrees from various universities.
Enterprise Data Warehouse Optimization: 7 Keys to Success (Hortonworks)
You have a legacy system that no longer meets the demands of your current data needs, and replacing it isn't an option. But don't panic: modernizing your traditional enterprise data warehouse is easier than you may think.
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand... (DataStax Academy)
This document discusses using Spark and Cassandra for ad hoc analytics on Internet of Complex Things (IoCT) data. It describes modeling data in Cassandra, limitations of ad hoc queries in Cassandra, and how the Spark Cassandra connector enables running ad hoc queries in Spark by treating Cassandra tables as DataFrames that can be queried using SQL. It also covers running Spark SQL queries on Cassandra data using the JDBC server.
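To make the pattern described above concrete, here is a minimal sketch (not taken from the deck) of registering a Cassandra table as a Spark DataFrame via the spark-cassandra-connector and querying it with SQL; the keyspace, table, column names and host are placeholders invented for illustration:

```python
from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector package is on the classpath.
spark = (SparkSession.builder
         .appName("adhoc-cassandra-sketch")
         .config("spark.cassandra.connection.host", "10.0.0.10")  # placeholder host
         .getOrCreate())

# Expose a Cassandra table (placeholder keyspace/table/columns) as a DataFrame.
readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="iot", table="sensor_readings")
            .load())

# Register it for ad hoc SQL, which is the capability described above.
readings.createOrReplaceTempView("sensor_readings")
spark.sql("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM sensor_readings
    GROUP BY device_id
    ORDER BY avg_temp DESC
    LIMIT 10
""").show()
```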
This document summarizes Premal Shah's presentation on how 6sense instruments their systems to analyze customer data. 6sense uses Hadoop and other tools to ingest customer data from various sources, run modeling and scoring, and provide actionable insights to customers. They discuss the data pipeline, challenges of performance and scaling, and how they use metrics and tools like Sumo Logic and OpsClarity to optimize and monitor their systems.
Getting Ready to Use Redis with Apache Spark with Tague Griffith (Databricks)
This technical tutorial is designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. The session starts with a quick introduction to Redis and the capabilities Redis provides. It will cover the basic data types provided by Redis and the module system. Using an ad serving use case, Griffith will look at how Redis can improve the performance and reduce the cost of using complex ML-models in production.
You will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will also be discussed, focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these features.
By the end of the session, you should feel confident building a prototype/proof-of-concept application using Redis and Spark. You'll understand how Redis complements Spark, and how to use Redis to serve complex ML models with high performance.
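As a rough illustration of the serving pattern this session describes, the sketch below stores a trained model's coefficients in Redis and scores requests from them directly. It uses the plain redis-py client rather than the redis-ml module, and the key name, weights and feature layout are invented for the example:

```python
import json
import math

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

# The training side (e.g. a Spark job) would persist the fitted coefficients once;
# the weights below are invented placeholders.
model = {"weights": [0.42, -1.3, 0.07], "intercept": 0.5}
r.set("model:ctr:logreg", json.dumps(model))

def predict(features):
    """Score a logistic-regression model straight from Redis, with no Spark job in the request path."""
    m = json.loads(r.get("model:ctr:logreg"))
    z = m["intercept"] + sum(w * x for w, x in zip(m["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))

print(predict([1.0, 0.2, 3.5]))  # probability-like score between 0 and 1
```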
Lambda architecture for real-time big data (Trieu Nguyen)
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Analyzing the World's Largest Security Data Lake! (DataWorks Summit)
The document discusses Symantec's CloudFire Analytics platform for analyzing security data at scale. It describes how CloudFire provides Hadoop ecosystem tools on OpenStack virtual machines across 50+ data centers to support security product analytics. Key points covered include analytics services and data, administration and monitoring using tools like Ambari and OpsView, and plans for self-service analytics using dynamic clusters provisioned through CloudBreak integration.
Data Con LA 2020
Description
Join this session to learn how to build a modern cloud-scale data compute platform with code in just minutes!
Using the industry's first IDE for building data applications, developers can now create data marts and data applications while working interactively with large datasets. We will explore how easy it is to develop, test and operationalize powerful data compute applications over streaming data using SQL and Python in Xcalar, with its combination of declarative and visual imperative programming and eager execution.
You will see how you can reduce time to market for analyzing large volumes of data and building enterprise-level complex data compute applications.
You will learn how to increase your developer productivity with SQL and Python, and put your complex business logic and ML models into production pipelines with the fastest time to value in the industry.
Speaker
Nikita Ogievetsky, Xcalar, VP Product Engineering
The Future of Analytics, Data Integration and BI on Big Data Platforms (Mark Rittman)
The document discusses the future of analytics, data integration, and business intelligence (BI) on big data platforms like Hadoop. It covers how BI has evolved from old-school data warehousing to enterprise BI tools to utilizing big data platforms. New technologies like Impala, Kudu, and dataflow pipelines have made Hadoop fast and suitable for analytics. Machine learning can be used for automatic schema discovery. Emerging open-source BI tools and platforms, along with notebooks, bring new approaches to BI. Hadoop has become the default platform and future for analytics.
IoT devices generate high volume, continuous streams of data that must be analyzed in-memory – before they land on disk – to identify potential outliers/failures or business opportunities. Companies need to build robust yet flexible applications that can instantly act on the information derived from analyzing their IoT data. Attend this session to learn how you can easily handle real-time data acquisition across structured and semi-structured data, as well as windowing, fast in-memory streaming analytics, event correlation, visualization, alerts, workflows and smart data storage.
ProtectWise Revolutionizes Enterprise Network Security in the Cloud with Data... (DataStax Academy)
ProtectWise has revolutionized enterprise network security with its Security DVR Platform, which combines detection, visibility, and response capabilities into a single cloud-based solution. The Platform ingests and analyzes massive amounts of network data using technologies like Cassandra, Solr, and stream processing to detect threats, gain network visibility, and power responsive analytics over days, months, and years of historical data. A demo of the Security DVR Visualizer was provided.
Reliable Data Ingestion in Big Data / IoT (Guido Schmutz)
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past, some new tools have emerged which are especially capable of handling the process of integrating data from outside, often called data ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale in a horizontal fashion, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at message level and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem and show how they handle data ingestion in a Big Data solution architecture.
Kappa Architecture is an alternative to Lambda Architecture that simplifies real-time data processing. It uses a distributed log like Kafka to store all input data immutably to allow reprocessing from the beginning if the processing code changes. This avoids having to maintain separate batch and real-time processing systems. The ASPgems team has implemented Kappa Architecture for several clients using Kafka, Spark Streaming, and Cassandra to provide real-time analytics and metrics in sectors like telecommunications, IoT, insurance, and energy.
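A minimal sketch of the Kappa idea, assuming a Kafka topic named events, brokers at broker1:9092, and Spark Structured Streaming in place of the DStream-based Spark Streaming mentioned above: the whole history lives in the log, so reprocessing after a code change is simply re-running the job with startingOffsets set to earliest.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("kappa-metrics-sketch").getOrCreate()

schema = (StructType()
          .add("device", StringType())
          .add("value", DoubleType())
          .add("ts", TimestampType()))

# Read the topic from the beginning of the log: reprocessing is just re-running
# this job against the same immutable Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
          .option("subscribe", "events")                       # placeholder topic
          .option("startingOffsets", "earliest")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# A single streaming job produces the metrics; no separate batch layer to maintain.
metrics = (events
           .withWatermark("ts", "10 minutes")
           .groupBy(window("ts", "1 minute"), "device")
           .avg("value"))

query = (metrics.writeStream
         .outputMode("update")
         .format("console")  # a real deployment would sink to Cassandra instead
         .start())
query.awaitTermination()
```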
Pivotal - Advanced Analytics for Telecommunications (Hortonworks)
Innovative mobile operators need to mine the vast troves of unstructured data now available to them to help develop compelling customer experiences and uncover new revenue opportunities. In this webinar, you’ll learn how HDB’s in-database analytics enable advanced use cases in network operations, customer care, and marketing for better customer experience. Join us, and get started on your advanced analytics journey today!
The document provides an overview of machine learning concepts and techniques using Apache Spark. It discusses supervised and unsupervised learning methods like classification, regression, clustering and collaborative filtering. Specific algorithms like k-means clustering, decision trees and random forests are explained. It also introduces Apache Spark MLlib and how to build machine learning pipelines and models with Spark ML APIs.
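For illustration, here is a small Spark ML pipeline of the kind the summary refers to, with invented column names and toy data; a real job would train on a much larger dataset and score a held-out split:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Toy training data with invented column names.
df = spark.createDataFrame(
    [(5.1, 3.5, "yes"), (4.9, 3.0, "yes"), (6.3, 3.3, "no"), (5.8, 2.7, "no")],
    ["f1", "f2", "label_str"],
)

# A pipeline chains feature preparation and the estimator into one fit/transform unit.
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label_str", outputCol="label"),
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    DecisionTreeClassifier(labelCol="label", featuresCol="features"),
])

model = pipeline.fit(df)
# A real job would score a held-out test set; here we just transform the training data.
model.transform(df).select("f1", "f2", "label_str", "prediction").show()
```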
This document provides an overview of big data and discusses key concepts. It begins by defining big data and noting the increasing volume, velocity and variety of data being created. It then covers the big data landscape, including storage models and technologies like Hadoop, analytics techniques like machine learning, and visualization. Finally, it discusses business use cases and how big data is impacting industries and creating new business models through insights gained from data.
Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random Hadoop/big data presentations
The document provides an overview of big data concepts including definitions, statistics on data generation and internet usage, applications and examples, challenges, and data types. It discusses key big data concepts such as the 3Vs of volume, velocity and variety; more Vs including veracity, value and visualization; data science areas and skills; the data workflow; and examples from companies like UPS, Walmart, eBay, and Kaiser Permanente.
The document provides an introduction to big data, including definitions and characteristics. It discusses how big data can be described by its volume, variety, and velocity. It notes that big data is large and complex data that is difficult to process using traditional data management tools. Common sources of big data include social media, sensors, and scientific instruments. Challenges in big data include capturing, storing, analyzing, and visualizing large and diverse datasets that are generated quickly. Distributed file systems and technologies like Hadoop are well-suited for processing big data.
MapReduce allows distributed processing of large datasets across clusters of computers. It works by splitting the input data into independent chunks which are processed by the map function in parallel. The map function produces intermediate key-value pairs which are grouped by the reduce function to form the output data. Fault tolerance is achieved through replication of data across nodes and re-executing failed tasks. This makes MapReduce suitable for efficiently processing very large datasets in a distributed environment.
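The model is easy to see in miniature. The sketch below is a single-process word count that mimics the three MapReduce phases in plain Python; it is purely illustrative and is not Hadoop code:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in one input split."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/group: gather all values emitted for the same intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a final (key, result) pair."""
    return key, sum(values)

splits = ["big data is big", "data beats opinion"]  # stand-ins for input splits
intermediate = [pair for doc in splits for pair in map_phase(doc)]
results = [reduce_phase(k, v) for k, v in shuffle(intermediate).items()]
print(sorted(results))  # [('beats', 1), ('big', 2), ('data', 2), ('is', 1), ('opinion', 1)]
```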
The wave of big data is still near its peak, having risen to prominence only about five years ago. Many are still intrigued by it, while a fortunate few have had a real taste of it. Others linger around the topics, terminology and other buzz!
This is a series that attempts to get our arms around the domain and the key coordinates of the subject, and subsequently dwells a bit deeper on implementation challenges, navigating closer to their core: what tools and solution approaches exist, and how knowledge from other related fields of science fits into the overall ball game!
The main home for this series going forward will be www.ganaakruti.com.
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver Value to business stakeholders who are not data scientists!
This document provides a brief history of big data, from the earliest known uses of data storage thousands of years ago to modern applications of big data. It outlines key developments such as the creation of early data storage and analysis methods, the development of computerized data processing, and the growth of data collection and sharing through the internet and mobile technology. The document also discusses the increasing volume of data generated every day through online activities and defines some of the main challenges in working with big data today.
Digital infographics can use graphics to more easily convey information through visualization. There are different types of infographics including spatial, chronological, and quantitative infographics that use diagrams, charts, maps, and other visual elements to communicate information. Big data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations. Big data is used across many industries for applications like customer analytics, predictive maintenance, risk analysis, and more. Hadoop is an open-source software framework that allows distributed processing of large data sets across clusters of computers using MapReduce.
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
This presentation was prepared by one of our renowned tutors, Suraj.
If you are interested to learn more about Big Data, Hadoop or Data Science, then join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at [email protected]
BDA UNIT 1: big data – web analytics – big data applications – big data technolo... (BalachandarJ5)
UNDERSTANDING BIG DATA: Introduction to big data – convergence of key trends – unstructured data – industry examples of big data – web analytics – big data applications – big data technologies – introduction to Hadoop – open source technologies – cloud and big data – mobile business intelligence – crowdsourcing analytics – inter- and trans-firewall analytics.
Tools and Methods for Big Data Analytics by Dahl Winters (Melinda Thielbar)
Research Triangle Analysts October presentation on Big Data by Dahl Winters (formerly of Research Triangle Institute). Dahl takes her viewers on a whirlwind tour of big data tools such as Hadoop and big data algorithms such as MapReduce, clustering, and deep learning. These slides document the many resources available on the internet, as well as guidelines of when and where to use each.
Big Data Analytics and Hadoop are presented. Key points include:
- Big data is large and complex data that is difficult to process using traditional methods. Domains that produce large datasets include meteorology, physics simulations, and internet search.
- The four V's of big data are volume, velocity, variety, and veracity. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. Its core components are HDFS for storage and MapReduce for processing.
- Apache Hadoop has gained popularity for big data analytics due to its ability to process large amounts of data in parallel using commodity hardware, its scalability, and automatic failover. A Hadoop ecosystem of
This is a presentation I gave to SAP internally to discuss how their customers see DevOps and what challenges they have when trying to migrate to a DevOps operating model.
The document discusses building a big data lab using cloud services like Google Cloud Platform (GCP). It notes that traditional homebrew labs have limited resources while cloud-based labs provide infinite resources and utility billing. It emphasizes defining goals for the lab work, acquiring necessary skills and knowledge, and using public datasets to complement internal data. Choosing the right tools and cloud platform like GCP, AWS, or Azure is important for high performance analytics on large data volumes and formats.
Showing the challenges and opportunities within the SAP ecosystem for adopting DevOps practices. Discussing how ABAP, HANA, UI5, BObj, NW JAVA and SCP JAVA each have their own capabilities and challenges in adopting DevOps.
This document provides an overview of SAP HANA and business performance with SAP. It discusses the history of SAP HANA and how it has evolved from 2011 to provide real-time analysis, reporting and business capabilities. It also summarizes the HANA technology stack, database architecture, features, software lifecycle, infrastructure examples, backup/recovery process, user management and network connectivity.
Showing the challenges and opportunities within the SAP ecosystem for adopting DevOps practices. Discussing how ABAP, HANA, UI5, BObj, NW JAVA and SCP JAVA each have their own capabilities and challenges in adopting DevOps.
TEC118 – How Do You Manage the Configuration of Your Environments from Metal ... (Chris Kernaghan)
The document discusses configuration management in IT infrastructure. It describes how configuration management has evolved from manual processes using tools like Excel and Word documents to more automated approaches using infrastructure as code. It provides examples of configuration management systems like Puppet and Chef and shows their architectures and how they can be used to configure operating systems, databases, and applications in a consistent, repeatable manner. The presentation includes demonstrations of Puppet and Chef.
Automating Infrastructure as a Service Deployments and monitoring – TEC213 (Chris Kernaghan)
The document discusses automating infrastructure as a service deployments and monitoring. It covers several topics:
- IaaS environments allow for scalable cloud computing resources billed based on usage. SAP has been working with Amazon Web Services since 2008.
- Automation can schedule repetitive tasks, enable consistent processes, and provide auditable records. DevOps focuses on collaboration, automation, measurement, and sharing to create flexible infrastructure.
- Automating infrastructure provisioning, configuration management, change management, and exception monitoring can improve speed, reduce costs, and ensure compliance. Cloud security also needs automation to ensure data protection with the cloud's flexibility.
SAP TechEd 2012 Session Tec3438 Automate IaaS SAP deployments (Chris Kernaghan)
This document summarizes automation of infrastructure as a service deployments and monitoring. It discusses Infrastructure as a Service (IaaS) and how IaaS environments allow for scalable, on-demand provisioning of computing resources. It also discusses SAP's support for AWS and how Capgemini UK uses AWS for SAP deployments. The document advocates for automating infrastructure tasks to improve consistency, auditability and repeatability. It provides examples of automation for build processes, configuration management, change management, exception monitoring, and other areas. Overall, the document promotes automating infrastructure processes in IaaS environments to improve agility, reduce costs, and ensure compliance.
SAP TechEd 2013 session Tec118 managing your-environment (Chris Kernaghan)
This document discusses configuration management and provides examples of using Puppet and Chef for configuration management. It defines configuration management as managing the configuration of systems from hardware to applications. It explains that configuration management allows automating repetitive system administration tasks in a scheduled, consistent, auditable, and repeatable way. The document compares Puppet and Chef and provides examples of configuration scripts for each tool. It demos how to use Puppet and Chef to configure a system.
01 sap hana landscape and operations infrastructure v2 0 (Chris Kernaghan)
This document discusses SAP HANA landscape and operations infrastructure. It covers HANA editions and technical scenarios, the HANA database lifecycle including patching, installation, backup and restore. It also discusses using HANA as a platform and monitoring HANA performance. Additionally, it outlines different data load scenarios into HANA and provides tips for tools like SAP BODS and SLT. The document concludes that introducing HANA brings technical challenges across areas like reporting, data management, development and operations that require consideration.
This document discusses SAP HANA landscape and operations. It covers HANA editions and scenarios, the HANA database lifecycle including patching and backup, using HANA as a platform, performance monitoring, data load scenarios using tools like BODS, SLT, SRS and ESP, and provides contact information for the author.
Trends Artificial Intelligence - Mary Meeker (Clive Dickens)
Mary Meeker’s 2024 AI report highlights a seismic shift in productivity, creativity, and business value driven by generative AI. She charts the rapid adoption of tools like ChatGPT and Midjourney, likening today’s moment to the dawn of the internet. The report emphasizes AI’s impact on knowledge work, software development, and personalized services—while also cautioning about data quality, ethical use, and the human-AI partnership. In short, Meeker sees AI as a transformative force accelerating innovation and redefining how we live and work.
What is Oracle EPM? A Guide to Oracle EPM Cloud: Everything You Need to Know (SMACT Works)
In today's fast-paced business landscape, financial planning and performance management demand powerful tools that deliver accurate insights. Oracle EPM (Enterprise Performance Management) stands as a leading solution for organizations seeking to transform their financial processes. This comprehensive guide explores what Oracle EPM is, its key benefits, and how partnering with the right Oracle EPM consulting team can maximize your investment.
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf (Rejig Digital)
Unlock the future of oil & gas safety with advanced environmental detection technologies that transform hazard monitoring and risk management. This presentation explores cutting-edge innovations that enhance workplace safety, protect critical assets, and ensure regulatory compliance in high-risk environments.
🔍 What You’ll Learn:
✅ How advanced sensors detect environmental threats in real-time for proactive hazard prevention
🔧 Integration of IoT and AI to enable rapid response and minimize incident impact
📡 Enhancing workforce protection through continuous monitoring and data-driven safety protocols
💡 Case studies highlighting successful deployment of environmental detection systems in oil & gas operations
Ideal for safety managers, operations leaders, and technology innovators in the oil & gas industry, this presentation offers practical insights and strategies to revolutionize safety standards and boost operational resilience.
👉 Learn more: https://www.rejigdigital.com/blog/continuous-monitoring-prevent-blowouts-well-control-issues/
Mark Zuckerberg teams up with frenemy Palmer Luckey to shape the future of XR... (Scott M. Graffius)
Mark Zuckerberg teams up with frenemy Palmer Luckey to shape the future of XR/VR/AR wearables 🥽
Drawing on his background in AI, Agile, hardware, software, gaming, and defense, Scott M. Graffius explores the collaboration in “Meta and Anduril’s EagleEye and the Future of XR: How Gaming, AI, and Agile are Transforming Defense.” It’s a powerful case of cross-industry innovation—where gaming meets battlefield tech.
📖 Read the article: https://www.scottgraffius.com/blog/files/meta-and-anduril-eagleeye-and-the-future-of-xr-how-gaming-ai-and-agile-are-transforming-defense.html
#Agile #AI #AR #ArtificialIntelligence #AugmentedReality #Defense #DefenseTech #EagleEye #EmergingTech #ExtendedReality #ExtremeReality #FutureOfTech #GameDev #GameTech #Gaming #GovTech #Hardware #Innovation #Meta #MilitaryInnovation #MixedReality #NationalSecurity #TacticalTech #Tech #TechConvergence #TechInnovation #VirtualReality #XR
Data Virtualization: Bringing the Power of FME to Any Application (Safe Software)
Imagine building web applications or dashboards on top of all your systems. With FME’s new Data Virtualization feature, you can deliver the full CRUD (create, read, update, and delete) capabilities on top of all your data that exploit the full power of FME’s all data, any AI capabilities. Data Virtualization enables you to build OpenAPI compliant API endpoints using FME Form’s no-code development platform.
In this webinar, you’ll see how easy it is to turn complex data into real-time, usable REST API based services. We’ll walk through a real example of building a map-based app using FME’s Data Virtualization, and show you how to get started in your own environment – no dev team required.
What you’ll take away:
-How to build live applications and dashboards with federated data
-Ways to control what’s exposed: filter, transform, and secure responses
-How to scale access with caching, asynchronous web call support, with API endpoint level security.
-Where this fits in your stack: from web apps, to AI, to automation
Whether you’re building internal tools, public portals, or powering automation – this webinar is your starting point to real-time data delivery.
Neural representations have shown the potential to accelerate ray casting in a conventional ray-tracing-based rendering pipeline. We introduce a novel approach called Locally-Subdivided Neural Intersection Function (LSNIF) that replaces bottom-level BVHs used as traditional geometric representations with a neural network. Our method introduces a sparse hash grid encoding scheme incorporating geometry voxelization, a scene-agnostic training data collection, and a tailored loss function. It enables the network to output not only visibility but also hit-point information and material indices. LSNIF can be trained offline for a single object, allowing us to use LSNIF as a replacement for its corresponding BVH. With these designs, the network can handle hit-point queries from any arbitrary viewpoint, supporting all types of rays in the rendering pipeline. We demonstrate that LSNIF can render a variety of scenes, including real-world scenes designed for other path tracers, while achieving a memory footprint reduction of up to 106.2x compared to a compressed BVH.
https://arxiv.org/abs/2504.21627
Domino IQ – What to Expect, First Steps and Use Cases (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/domino-iq-what-to-expect-first-steps-and-use-cases/
HCL Domino iQ Server – From Ideas Portal to implemented Feature. Discover what it is, what it isn’t, and explore the opportunities and challenges it presents.
Key Takeaways
- What are Large Language Models (LLMs) and how do they relate to Domino iQ
- Essential prerequisites for deploying Domino iQ Server
- Step-by-step instructions on setting up your Domino iQ Server
- Share and discuss thoughts and ideas to maximize the potential of Domino iQ
Interested in leveling up your JavaScript skills? Join us for our Introduction to TypeScript workshop.
Learn how TypeScript can improve your code with static typing, better tooling, and cleaner architecture. Whether you're a beginner or have some experience with JavaScript, this session will give you a solid foundation in TypeScript and how to integrate it into your projects.
Workshop content:
- What is TypeScript?
- What is the problem with JavaScript?
- Why TypeScript is the solution
- Coding demo
Securiport is a border security systems provider with a progressive team approach to its task. The company acknowledges the importance of specialized skills in creating the latest in innovative security tech. The company has offices throughout the world to serve clients, and its employees speak more than twenty languages at the Washington D.C. headquarters alone.
Presentation given at the LangChain community meetup London
https://lu.ma/9d5fntgj
Coveres
Agentic AI: Beyond the Buzz
Introduction to AI Agent and Agentic AI
Agent Use case and stats
Introduction to LangGraph
Build agent with LangGraph Studio V2
DevOps in the Modern Era - Thoughtfully Critical Podcast (Chris Wahl)
https://youtu.be/735hP_01WV0
My journey through the world of DevOps! From the early days of breaking down silos between developers and operations to the current complexities of cloud-native environments. I'll talk about my personal experiences, the challenges we faced, and how the role of a DevOps engineer has evolved.
Your startup on AWS - How to architect and maintain a Lean and Mean account (angelo60207)
Prevent infrastructure costs from becoming a significant line item on your startup’s budget! Serial entrepreneur and software architect Angelo Mandato will share his experience with AWS Activate (startup credits from AWS) and knowledge on how to architect a lean and mean AWS account ideal for budget minded and bootstrapped startups. In this session you will learn how to manage a production ready AWS account capable of scaling as your startup grows for less than $100/month before credits. We will discuss AWS Budgets, Cost Explorer, architect priorities, and the importance of having flexible, optimized Infrastructure as Code. We will wrap everything up discussing opportunities where to save with AWS services such as S3, EC2, Load Balancers, Lambda Functions, RDS, and many others.
Exploring the advantages of on-premises Dell PowerEdge servers with AMD EPYC processors vs. the cloud for small to medium businesses’ AI workloads
AI initiatives can bring tremendous value to your business, but you need to support your new AI workloads effectively. That means choosing the best possible infrastructure for your needs—and many companies are finding that the cloud isn’t right for them. According to a recent Rackspace survey of IT executives, 69 percent of companies have moved some of their applications on-premises from the cloud, with half of those citing security and compliance as the reason and 44 percent citing cost.
On-premises solutions provide a number of advantages. With full control over your security infrastructure, you can be certain that all compliance requirements remain firmly in the hands of your IT team. Opting for on-premises also gives you the ability to design your infrastructure to the precise needs of that team and your new AI workloads. Depending on the workload, you may also see performance benefits, along with more predictable costs. As you start to build your next AI initiative, consider an on-premises solution utilizing AMD EPYC processor-powered Dell PowerEdge servers.
If You Use Databricks, You Definitely Need FME (Safe Software)
DataBricks makes it easy to use Apache Spark. It provides a platform with the potential to analyze and process huge volumes of data. Sounds awesome. The sales brochure reads as if it is a can-do-all data integration platform. Does it replace our beloved FME platform or does it provide opportunities for FME to shine? Challenge accepted
2. LEARN • NETWORK • COLLABORATE • INFLUENCE
Deploying Big Data platforms
LEARN • NETWORK • COLLABORATE • INFLUENCE
Chris Kernaghan
Principal Consultant
3. LEARN • NETWORK • COLLABORATE • INFLUENCE
Cholera epidemic – the first use of big data
4. LEARN • NETWORK • COLLABORATE • INFLUENCE
Big Data Epidemiology by Google
5. LEARN • NETWORK • COLLABORATE • INFLUENCE
How I really got started in Big Data
John, we need to give Chris more grey hair
Let's throw him into a Big Data demo
8. LEARN • NETWORK • COLLABORATE • INFLUENCE
Areas of focus
Data acquisition and curation
Data storage
Compute infrastructure
Analysis and Insight
Everything as Code*
* Well, as much as possible
9. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data Acquisition and curation
Areas of focus
11. LEARN • NETWORK • COLLABORATE • INFLUENCE
How big was the Panama Papers data set
12. LEARN • NETWORK • COLLABORATE • INFLUENCE
How big was the Panama Papers data set
13. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data Lake
Panama Papers Technology stack
SQL
14. LEARN • NETWORK • COLLABORATE • INFLUENCE
The tools used supported 370 journalists from around the world
Infrastructure was a pool of up to 40 servers run in AWS
15. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data quality and curation are not one-time activities
Remove the human element as much as possible
16. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data security
• Data lake
– What data do you collect
– Do you have restrictions on what data can be combined
– How long does your data live
17. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data security
• Geographical concerns
– Where does your data reside
18. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data security
• Authentication
– Who is accessing your data
19. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data Storage
Areas of focus
20. LEARN • NETWORK • COLLABORATE • INFLUENCE
How BIG is Big Data
22. LEARN • NETWORK • COLLABORATE • INFLUENCE
Storage Considerations
• IOPS are still important
– Big data still uses a lot of spinning disk
• Replication and Redundancy
– Eats a lot of disk space (see the sketch below)
• Build for failure
• Sometimes you have to go in-memory
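A back-of-the-envelope sketch of why the replication point above matters, assuming HDFS-style 3x replication and a 25% working-space reserve (both figures are illustrative, not from the deck):

```python
def usable_capacity_tb(raw_tb, replication_factor=3, overhead_fraction=0.25):
    """Rough usable capacity after replication and a working-space reserve."""
    return raw_tb * (1 - overhead_fraction) / replication_factor

raw = 100 * 12 * 4  # e.g. 100 nodes x 12 drives x 4 TB = 4800 TB of raw disk
print(f"{raw} TB raw -> ~{usable_capacity_tb(raw):.0f} TB usable")  # ~1200 TB
```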
23. LEARN • NETWORK • COLLABORATE • INFLUENCE
Compute infrastructure
Areas of focus
24. LEARN • NETWORK • COLLABORATE • INFLUENCE
Structured Reporting Versus Big Data/Science
Compute requirements
• Structured reporting systems run business processes
– Sized and static
– Under change control
– Business centric
25. LEARN • NETWORK • COLLABORATE • INFLUENCE
Structured Reporting Versus Big Data/Science
Compute requirements
• Data science systems answer difficult questions irregularly
– Cloud or heavy use of virtualisation
– Developer centric
– Rapidly evolving
26. LEARN • NETWORK • COLLABORATE • INFLUENCE
What you still need to remember
• Compute is cheap
• Scalability is critical
27. LEARN • NETWORK • COLLABORATE • INFLUENCE
What you still need to remember
• Software definition for consistency
• Automate as much as possible
28. LEARN • NETWORK • COLLABORATE • INFLUENCE
100 Hadoop nodes
122 GB RAM each = 12.2 TB RAM
Build time of 3 hrs
29. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
Disk definition
Network definition
Software install
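The disk definition, network definition and software install steps above could be captured in a single scripted build. The sketch below uses boto3 against AWS EC2; the AMI, subnet, security group and bootstrap URL are placeholders, and the instance type is chosen only because it roughly matches the 122 GB RAM nodes mentioned in the deck:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

USER_DATA = """#!/bin/bash
# Software install: fetch and run the Hadoop worker bootstrap (placeholder URL).
curl -s https://example.internal/bootstrap-hadoop-worker.sh | bash
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="r4.4xlarge",         # ~122 GB RAM class, purely illustrative
    MinCount=1,
    MaxCount=1,
    # Disk definition: a dedicated data volume alongside the root disk.
    BlockDeviceMappings=[
        {"DeviceName": "/dev/xvdb", "Ebs": {"VolumeSize": 2000, "VolumeType": "st1"}},
    ],
    # Network definition: placement into the cluster subnet and security group.
    SubnetId="subnet-0abc123",             # placeholder
    SecurityGroupIds=["sg-0abc123"],       # placeholder
    UserData=USER_DATA,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Role", "Value": "hadoop-worker"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```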
30. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
• Deployment was consistent for each and every node of the cluster
– Hostnames defined the same way
– Configuration files created the same way
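One plausible way to get hostnames and configuration files "created the same way" is simple templating, as in this sketch; the naming scheme, file layout and XML snippet are invented for illustration:

```python
from pathlib import Path

CORE_SITE_TEMPLATE = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://{namenode}:8020</value></property>
</configuration>
"""

def node_hostname(cluster, index):
    """Every hostname is derived the same way: <cluster>-worker-<zero-padded index>."""
    return f"{cluster}-worker-{index:03d}"

def render_configs(cluster, namenode, node_count, out_dir="rendered"):
    """Render an identical, parameterised configuration file for every node."""
    for i in range(1, node_count + 1):
        host = node_hostname(cluster, i)
        target = Path(out_dir) / host
        target.mkdir(parents=True, exist_ok=True)
        (target / "core-site.xml").write_text(CORE_SITE_TEMPLATE.format(namenode=namenode))

render_configs(cluster="demo", namenode="demo-master-001", node_count=100)
```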
31. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
• Faster deployment
– Automated build: 3 hrs to build and deploy 100 nodes
– Manual build: 800+ hrs to build and deploy 100 nodes
• Use of automated tools to detect failure and start a new node (Elastic Beanstalk)
32. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
• Reusability of script
– Heavy use of parameters means it is adaptable
• Use of Git meant distributed development was handled easily
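The "heavy use of parameters" point might look something like this at the top of such a build script, kept in Git and reused across clusters; all flag names and defaults are illustrative:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Provision a Hadoop cluster (illustrative).")
    parser.add_argument("--cluster-name", required=True, help="logical name used in hostnames and tags")
    parser.add_argument("--node-count", type=int, default=20, help="number of worker nodes")
    parser.add_argument("--instance-type", default="r4.4xlarge", help="VM size for worker nodes")
    parser.add_argument("--region", default="eu-west-1", help="cloud region to deploy into")
    parser.add_argument("--dry-run", action="store_true", help="print the plan without building anything")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Building {args.node_count} x {args.instance_type} nodes for cluster "
          f"'{args.cluster_name}' in {args.region} (dry run: {args.dry_run})")
    # The actual provisioning calls (like the EC2 sketch earlier) would go here.
```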
37. LEARN • NETWORK • COLLABORATE • INFLUENCE
Things to remember
• Remember the type of platform you are using
• Storage is cheap but not all storage is equal
• Scalability is critical
• Version control rocks
• Automate everything you can
• Value is in the data but not all data is valuable
• Data should not live forever
#5: 2008 H1N1 flu pandemic in US
CDC had out of date data
#7: Panama papers – transient use case
Under Armour – constant data use case answering lots of different questions
Common Sense Finance institution – transient audit data use case
Natures Hope – Pushing structured data into a data lake to provide better temperature control as part of their data lifecycle
Intel – using event streaming to drive manufacturing processes
#10: We are literally drowning in data – data lakes
What data do we acquire – sensor data, web data, social media, transactional data
What data is actually necessary, how long does it need to live for, what is its data life cycle
What data do we need that we do not have access to
How do we curate data for data lakes
#11: We are literally drowning in data – data lakes
What data do we acquire – sensor data, web data, social media, transactional data
What data is actually necessary, how long does it need to live for, what is its data life cycle
What data do we need that we do not have access to
How do we curate data for data lakes
#12: We have four developers and three journalists.
#14: Timeline
Working on the platform for 3 years across the various links
Processed the Panama Papers in around 12 months
#22: How do we store data – databases and files
Big data storage systems
HDFS
Cloud-based – S3 or Azure Storage
Databases – SQL and NoSQL
CSV
Hardware – massively scalable software defined infrastructures which expect failure
#29: John broke my cluster
20 nodes – scaled to 100 nodes