The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC - Data Con LA
Hadoop is moving towards improved security and real-time analytics. For security, Hadoop vendors have made acquisitions and implemented features like Kerberos authentication and Apache Sentry authorization. For real-time analytics, tools are focusing on real-time streaming (like Storm, Spark Streaming, and Samza) and real-time querying of data (like Hive on Tez, Impala, Drill, and Spark). The right tool depends on use cases, and enterprises should choose what is easiest and most scalable.
This was the first time I introduced the concept of Schema-on-Read vs. Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic session on May 28, 2009, in Santa Cruz, California.
Big SQL Competitive Summary - Vendor Landscape - Nicolas Morales
IBM's Big SQL is their SQL for Hadoop product that allows users to run SQL queries on Hadoop data. It uses the Hive metastore to catalog table definitions and shares data logic with Hive. Big SQL is architected for high performance with a massively parallel processing (MPP) runtime and runs directly on the Hadoop cluster with no proprietary storage formats required. The document compares Big SQL to other SQL on Hadoop solutions and outlines its performance and architectural advantages.
Which Hadoop Distribution to use: Apache, Cloudera, MapR or Hortonworks? - Edureka!
This document discusses various Hadoop distributions and how to choose between them. It introduces Apache Hadoop and describes popular distributions from Cloudera, Hortonworks, and MapR. Cloudera is based on open source Hadoop but adds proprietary tools, while Hortonworks uses only open source software. MapR takes a different approach, replacing HDFS with its own file system. The document advises trying the distributions' community editions to compare them and determine the features needed before selecting one.
The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.
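To make the comparison concrete, here is a minimal sketch of the batch-style access pattern that Hive serves (and that most of the engines above expose through their own drivers), using the PyHive client; the host, port, username, and the web_logs table are illustrative assumptions, not from the deck.

```python
# Minimal sketch: querying a Hive table over HiveServer2 with PyHive.
# Host, port, username, and the `web_logs` table are assumptions.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# An aggregate query typical of batch-oriented SQL-on-Hadoop workloads.
cursor.execute(
    "SELECT status_code, COUNT(*) AS hits "
    "FROM web_logs GROUP BY status_code"
)
for status_code, hits in cursor.fetchall():
    print(status_code, hits)

cursor.close()
conn.close()
```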
The document provides an overview of Hadoop, including:
- What Hadoop is and its core modules like HDFS, YARN, and MapReduce.
- Reasons for using Hadoop like its ability to process large datasets faster across clusters and provide predictive analytics.
- When Hadoop should and should not be used, for example large, diverse datasets (a good fit) versus low-latency real-time analytics (a poor fit).
- Options for deploying Hadoop including as a service on cloud platforms, on infrastructure as a service providers, or on-premise with different distributions.
- Components that make up the Hadoop ecosystem like Pig, Hive, HBase, and Mahout.
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for their global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA and SAP BusinessObjects enabling a broad range of new analytic applications.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK - huguk
This session will give you an update on what SUSE is up to in the Big Data arena. We will take a brief look at SUSE Linux Enterprise Server and why it makes the perfect foundation for your Hadoop deployment.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... - Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
The document discusses when to use Hadoop instead of a relational database management system (RDBMS) for advanced analytics. It provides examples of when queries like count distinct, cursors, and alter table statements become problematic in an RDBMS. It contrasts analyzing simple, transactional data like invoices versus complex, evolving data like customers or website visitors. Hadoop is better suited for problems involving complex objects, self-joins on large datasets, and matching large datasets. The document encourages structuring data in HDFS in a flexible way that fits the problem and use cases like simple counts on complex objects, self-self-self joins, and matching problems.
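To make the contrast concrete, here is a sketch (not from the talk) of the two query shapes it calls out, expressed in PySpark; the dataset path and column names are illustrative assumptions.

```python
# Sketch of the query shapes the talk calls out, in PySpark.
# The visits path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdbms-pain-points").getOrCreate()
visits = spark.read.parquet("hdfs:///data/visits.parquet")

# Count distinct over a huge table: easy to distribute across a cluster,
# painful on a single database server.
distinct_visitors = visits.agg(F.countDistinct("visitor_id")).first()[0]

# A self-join on a large dataset: visitors who returned within 7 days.
a, b = visits.alias("a"), visits.alias("b")
returns = a.join(
    b,
    (F.col("a.visitor_id") == F.col("b.visitor_id"))
    & (F.datediff(F.col("b.ts"), F.col("a.ts")).between(1, 7)),
)
print(distinct_visitors, returns.count())
```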
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how do you select the correct tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
This document provides an overview and comparison of RDBMS, Hadoop, and Spark. It introduces RDBMS and describes its use cases such as online transaction processing and data warehouses. It then introduces Hadoop and describes its ecosystem including HDFS, YARN, MapReduce, and related sub-modules. Common use cases for Hadoop are also outlined. Spark is then introduced along with its modules like Spark Core, SQL, and MLlib. Use cases for Spark include data enrichment, trigger event detection, and machine learning. The document concludes by comparing RDBMS and Hadoop, as well as Hadoop and Spark, and addressing common misconceptions about Hadoop and Spark.
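As a sketch of two of the Spark use cases named above, data enrichment and trigger event detection; the file paths and column names are assumptions for illustration only.

```python
# Sketch of two Spark use cases named above: data enrichment with a
# reference join, and trigger-event detection. Paths/columns assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-use-cases").getOrCreate()

# Data enrichment: join raw events against a reference table.
events = spark.read.json("hdfs:///data/events.json")
countries = spark.read.csv("hdfs:///ref/countries.csv", header=True)
enriched = events.join(countries, on="country_code", how="left")

# Trigger event detection: flag unusually large transactions.
triggers = enriched.filter(F.col("amount") > 10_000)
triggers.write.mode("overwrite").parquet("hdfs:///out/triggers")
```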
This document provides an overview of a SQL-on-Hadoop tutorial. It introduces the presenters and discusses why SQL is important for Hadoop, as MapReduce is not optimal for all use cases. It also notes that while the database community knows how to efficiently process data, SQL-on-Hadoop systems face challenges due to the limitations of running on top of HDFS and Hadoop ecosystems. The tutorial outline covers SQL-on-Hadoop technologies like storage formats, runtime engines, and query optimization.
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last! - Nicolas Morales
This document provides an overview of IBM's Big SQL product for running SQL queries on Hadoop data. It discusses how Big SQL uses a massively parallel processing (MPP) architecture to replace MapReduce for improved performance. Big SQL nodes run directly on the Hadoop cluster to process data locally. The document highlights Big SQL's full SQL query capabilities and support for analytic functions. It also notes how Big SQL leverages the existing Hive metadata and is designed to integrate with the broader Hadoop ecosystem.
This document summarizes Andrew Brust's presentation on using the Microsoft platform for big data. It discusses Hadoop and HDInsight, MapReduce, using Hive with ODBC and the BI stack. It also covers Hekaton, NoSQL, SQL Server Parallel Data Warehouse, and PolyBase. The presentation includes demos of HDInsight, MapReduce, and using Hive with the BI stack.
The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for their global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA, Sybase and SAP BusinessObjects enabling a broad range of new analytic applications.
Best Practices for Deploying Hadoop (BigInsights) in the Cloud - Leons Petražickis
This document provides best practices for optimizing the performance of InfoSphere BigInsights and InfoSphere Streams when deployed in the cloud. It discusses optimizing disk performance by choosing cloud providers and instances with good disk I/O, partitioning and formatting disks correctly, and configuring HDFS to use multiple data directories. It also discusses optimizing Java performance by correctly configuring JVM memory and optimizing MapReduce performance by setting appropriate values for map and reduce tasks based on machine resources.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
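The word-count job mentioned above can be sketched in a few lines of Python; this local simulation mirrors the map, shuffle/sort, and reduce stages that Hadoop distributes (with Hadoop Streaming, the two functions would become separate mapper and reducer scripts reading stdin).

```python
# Minimal word-count sketch in the MapReduce style the deck describes.
# Runs locally; Hadoop would distribute the same stages across a cluster.
from itertools import groupby

def mapper(lines):
    # Map stage: emit (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts map output by key before the reduce stage.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data on hadoop", "hadoop stores big data"]
    for word, total in reducer(mapper(text)):
        print(word, total)
```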
This document discusses modern data architecture and Apache Hadoop's role within it. It presents WANdisco and its Non-Stop Hadoop solution, which extends HDFS across multiple data centers to provide 100% uptime for Hadoop deployments. Non-Stop Hadoop uses WANdisco's patented distributed coordination engine to synchronize HDFS metadata across sites separated by wide area networks, enabling continuous availability of HDFS data and global HDFS deployments.
Overview of Big data, Hadoop and Microsoft BI - version1 - Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison... - Cloudera, Inc.
The document discusses integrating Hadoop with relational databases. It describes scenarios where reference data is stored in an RDBMS and used in Hadoop, Hadoop is used for offline analytics on data stored in an RDBMS, and MapReduce outputs are exported to an RDBMS. It then presents a case study on extending Sqoop for optimized Oracle integration and compares performance with and without the extension. Other tools for Hadoop-RDBMS integration are also briefly outlined.
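For flavor, a sketch of the kind of parallel Sqoop import the case study optimizes, launched from Python; the JDBC URL, credentials, table, and target directory are illustrative assumptions.

```python
# Sketch of a Sqoop import like those discussed above, launched from
# Python. The JDBC URL, user, table, and directory are assumptions.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@db.example.com:1521:ORCL",
        "--username", "etl_user",
        "--table", "ORDERS",
        "--target-dir", "/data/orders",
        "--num-mappers", "4",  # parallel map tasks doing the extract
    ],
    check=True,
)
```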
The document provides an overview of an experimentation platform built on Hadoop. It discusses experimentation workflows, why Hadoop was chosen as the framework, the system architecture, and challenges faced and lessons learned. Key points include:
- The platform supports A/B testing and reporting on hundreds of metrics and dimensions for experiments.
- Data is ingested from various sources and stored in Hadoop for analysis using technologies like Hive, Spark, and Scoobi.
- Challenges included optimizing joins and jobs for large datasets, addressing data skew, and ensuring job resiliency. Tuning configuration parameters and job scheduling helped improve performance.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to Apache Hadoop, DFS, and MapReduce.
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre... - Data Con LA
The team at Fandango heartily embraced NoSQL, using Couchbase to power a key media publishing system. The initial implementation was fraught with integration issues and high latency, and required a major effort to successfully refactor. My talk will outline the key organizational and architectural decisions that created deep systemic problems, and the steps taken to re-architect the system to achieve a high level of performance at scale.
Yarn cloudera-kathleenting061414 kate-ting - Data Con LA
This document summarizes Kathleen Ting's presentation on migrating to MapReduce v2 (MRv2) on YARN. The presentation covered the motivation for moving to MRv2 and YARN, including higher cluster utilization and lower costs. It then discussed common misconfiguration issues seen in support tickets, such as memory, thread pool size, and federation misconfigurations. Specific examples were provided for resolving task memory errors, JobTracker memory errors, and fetch failures in both MRv1 and MRv2. Recommendations were given for optimizing YARN memory usage and CPU isolation in containers.
Aziksa hadoop for business users2 santosh jha - Data Con LA
This document discusses big data, including its drivers, characteristics, use cases across different industries, and lessons learned. It provides examples of companies like Etsy, Macy's, Canadian Pacific, and Salesforce that are using big data to gain insights, increase revenues, reduce costs and improve customer experiences. Big data is being used across industries like financial services, healthcare, manufacturing, and media/entertainment for applications such as customer profiling, fraud detection, operations optimization, and dynamic pricing. While big data projects show strong financial benefits, the document cautions that not all projects are well-structured and Hadoop alone is not sufficient to meet all business analysis needs.
The document discusses various options for processing and aggregating data in MongoDB, including the Aggregation Framework, MapReduce, and connecting MongoDB to external systems like Hadoop. The Aggregation Framework is described as a flexible way to query and transform data in MongoDB using a JSON-like syntax and pipeline stages. MapReduce is presented as more versatile but also more complex to implement. Connecting to external systems like Hadoop allows processing large amounts of data across clusters.
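A minimal sketch of the Aggregation Framework pipeline style described above, using PyMongo; the database, collection, and field names are illustrative assumptions.

```python
# Sketch of a MongoDB Aggregation Framework pipeline via PyMongo.
# Database, collection, and fields are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Pipeline stages: filter, then group-and-sum, then sort.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])
```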
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio... - Data Con LA
Apache Solr makes it so easy to interactively visualize and explore your data. Create a dashboard, add some facets, select some values, cross it with the time… and just look at the results.
Apache Spark is the growing framework for performing streaming computations, which makes it ideal for real time indexing.
Solr also comes with new Analytics Facets which are a major weapon added to the arsenal of the data explorer. They bring another dimension: calculations. We can now do the equivalent of SQL, just in a much simpler and faster way. These calculations can operate over buckets of data. For example, it is now possible to see the sum of Web traffic by country over the time, the median price of some categories of products, which ads are bringing more money by location...
This talk puts into practice some of the leading features of Solr Search. It presents the main types of facets/stats and which advanced properties and usage make them shine. A demo in parallel with the open source Search App in Hue will demonstrate how these facets can power interactive widgets or your own analytic queries. The data will be indexed in real time from a live stream with Spark.
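As a rough illustration of the analytics-facet style the talk demos, here is a sketch that posts a terms facet with a per-bucket sum to Solr's JSON Request API using the requests library; the weblogs collection and field names are assumptions.

```python
# Sketch of a Solr JSON Facet request like the ones demoed, sent with
# the requests library. Collection name and fields are assumptions.
import requests

resp = requests.post(
    "http://localhost:8983/solr/weblogs/select",
    json={
        "query": "*:*",
        "limit": 0,  # we only want the facet buckets, not documents
        "facet": {
            "by_country": {
                "type": "terms",
                "field": "country",
                "facet": {"traffic": "sum(bytes)"},  # stat per bucket
            }
        },
    },
)
print(resp.json()["facets"]["by_country"]["buckets"])
```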
Kiji cassandra la june 2014 - v02 clint-kelly - Data Con LA
Big Data Camp LA 2014, Don't re-invent the Big-Data Wheel, Building real-time, Big Data applications on Cassandra with the open-source Kiji project by Clint Kelly of Wibidata
20140614 introduction to spark-ben white - Data Con LA
This document provides an introduction to Apache Spark. It begins by explaining how Spark improves upon MapReduce by leveraging distributed memory for better performance and supporting iterative algorithms. Spark is described as a general purpose computational framework that retains the advantages of MapReduce like scalability and fault tolerance, while offering more functionality through directed acyclic graphs and libraries for machine learning. The document then discusses getting started with Spark and its execution modes like standalone, YARN client, and YARN cluster. Finally, it introduces Spark concepts like Resilient Distributed Datasets (RDDs), which are collections of objects partitioned across a cluster.
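A minimal sketch of the RDD idea the slides introduce: a partitioned collection, transformed lazily and cached in memory so repeated actions avoid recomputation (the data here is synthetic).

```python
# Sketch of the RDD concept: a partitioned collection transformed
# lazily and cached in memory for iterative reuse.
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-intro")

numbers = sc.parallelize(range(1_000_000), numSlices=8)
squares = numbers.map(lambda x: x * x).cache()  # kept in memory

# Two actions reuse the cached RDD instead of recomputing it.
print(squares.sum())
print(squares.take(3))
sc.stop()
```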
This document discusses how big data analytics can provide insights from large amounts of structured and unstructured data. It provides examples of how big data has helped organizations reduce customer churn, improve customer acquisition, speed up loan approvals, and detect fraud. The document also outlines IBM's big data platform and analytics process for extracting value from large, diverse data sources.
Ag big datacampla-06-14-2014-ajay_gopal - Data Con LA
This document provides an overview of CARD.COM, a company that offers prepaid debit cards customized with different designs. They collect data from card transactions, member interactions on their site/app, and marketing platforms to test different designs and better understand customer behavior. Their goal is to use data science to personalize the financial experience for members and potentially offer services like credit scores for the unbanked. They are hiring various technical roles and use open source tools like R, Python, and PHP to build out their analytics platforms and infrastructure.
The document summarizes the Hadoop stack and its components for storing and analyzing big data. It includes file storage with HDFS, data processing with MapReduce, data access tools like Hive and Pig, and security/monitoring with Kerberos and Nagios. HDFS uses metadata to track file locations across data nodes in a fault-tolerant manner similar to a file system.
140614 bigdatacamp-la-keynote-jon hsieh - Data Con LA
The document discusses the evolution of big data stacks from their origins inspired by Google's systems through imitation via Hadoop-based stacks to ongoing innovation. It traces the development of major components like MapReduce, HDFS, HBase and their adoption beyond Google. It also outlines the timeline of open source projects and companies in this space from 2003 to the present.
Hadoop and NoSQL joining forces by Dale Kim of MapR - Data Con LA
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ... - Data Con LA
1. The document discusses lessons learned from designing data ingest systems. Key lessons include structuring endpoints wisely, accepting at-least-once semantics (sketched after this list), knowing that change data capture is difficult, understanding service level agreements, considering record format and schema, and tracking record lineage.
2. The document also provides examples of real-world data ingest scenarios and different implementation strategies with analyses of their tradeoffs. It concludes with recommendations to track errors and keep transformations minimal.
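One of those lessons, accepting at-least-once semantics, usually comes down to making the sink idempotent; a toy sketch follows, with the record shape and the in-memory store as stand-in assumptions for a durable key-value store.

```python
# Sketch of one lesson above: tolerate at-least-once delivery by making
# the sink idempotent. Record shape and the store are assumptions.
seen_ids = set()          # stand-in for a durable key-value store
store = {}

def ingest(record):
    # Dedupe on a stable record id so redelivery is harmless.
    if record["id"] in seen_ids:
        return
    seen_ids.add(record["id"])
    # Keep transformations minimal; tag lineage instead of rewriting.
    record["lineage"] = {"source": record.get("source", "unknown")}
    store[record["id"]] = record

for r in [{"id": 1, "v": "a"}, {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]:
    ingest(r)
print(len(store))  # 2: the duplicate delivery was absorbed
```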
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ... - Data Con LA
NoSQL has exploded on the developer scene promising alternatives to RDBMS that make rapidly developing, Internet scale applications easier than ever. However, as a trade off to the ease of development and scale, some of the familiarity with other well-known query interfaces such as SQL has been lost. Until now that is... N1QL (pronounced "nickel") is a SQL-like query language for querying JSON, which brings the familiarity of RDBMS back to the NoSQL world. In this session you will learn about the syntax and basics of this new language as well as integration with the Couchbase SDKs.
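For flavor, a sketch of issuing a N1QL query from Python, written against the 2.x-era Couchbase SDK; the cluster address, credentials, and the travel-sample bucket are assumptions, and newer SDK versions expose a different API.

```python
# Sketch of a N1QL query via the 2.x-era Couchbase Python SDK.
# Cluster address, credentials, and bucket are assumptions.
from couchbase.cluster import Cluster, PasswordAuthenticator
from couchbase.n1ql import N1QLQuery

cluster = Cluster("couchbase://localhost")
cluster.authenticate(PasswordAuthenticator("user", "password"))
bucket = cluster.open_bucket("travel-sample")

# SQL-like syntax over JSON documents, with a named parameter.
query = N1QLQuery(
    "SELECT name FROM `travel-sample` WHERE type = $doctype LIMIT 5",
    doctype="airline",
)
for row in bucket.n1ql_query(query):
    print(row["name"])
```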
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S... - Data Con LA
Investigated a couple of audio-based, deep learning strategies for identifying human vocalized car sounds. In one case Mel Frequency Cepstral Coefficients (MFCCs) were used as inputs into a supervised, logistic regression neural network. In a separate case, Short Term Fourier Transforms (STFT) were used to generate PCA-whitened spectrograms, which were used as inputs into a supervised, convolutional neural network. The MFCC method trained quickly on a relatively small dataset of 4 sounds. The STFT method resulted in a much larger input matrix, resulting in much longer times for converging onto a solution.
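The MFCC front end described here can be sketched with librosa; the audio file name, sample rate, and coefficient count are illustrative assumptions.

```python
# Sketch of the MFCC feature extraction step described above, using
# librosa. The wav path, sample rate, and n_mfcc are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("car_horn_vocalization.wav", sr=22050)

# 13 Mel Frequency Cepstral Coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Average over time to get one fixed-length vector per clip,
# suitable as input to a small supervised classifier.
clip_features = np.mean(mfcc, axis=1)
print(clip_features.shape)  # (13,)
```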
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite... - Data Con LA
This document discusses decision making systems and the lambda architecture. It introduces decision making algorithms like multi-armed bandits that balance exploration vs exploitation. Contextual multi-armed bandits are discussed as well. The lambda architecture is then described as having serving, speed, and batch layers to enable low latency queries, real-time updates, and batch model training. The software stack of Kafka, Spark/Spark Streaming, HBase and MLLib is presented as enabling scalable stream processing and machine learning.
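The exploration-versus-exploitation balance described above is easy to see in an epsilon-greedy multi-armed bandit; a self-contained sketch with made-up reward rates follows.

```python
# Sketch of the explore/exploit trade-off: an epsilon-greedy bandit.
# The reward probabilities are made-up assumptions.
import random

true_rates = [0.02, 0.05, 0.04]        # unknown to the algorithm
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]               # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_rates))               # explore
    else:
        arm = max(range(len(values)), key=values.__getitem__)  # exploit
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(values)  # estimates converge toward the true rates
```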
The document provides information about a training on big data and Hadoop. It covers topics like HDFS, MapReduce, Hive, Pig and Oozie. The training is aimed at CEOs, managers, developers and helps attendees get Hadoop certified. It discusses prerequisites for learning Hadoop, how Hadoop addresses big data problems, and how companies are using Hadoop. It also provides details about the curriculum, profiles of trainers and job roles working with Hadoop.
Hadoop essentials by shiva achari - sample chapter - Shiva Achari
Sample chapter of Hadoop Ecosystem
Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem
For more information: http://bit.ly/1AeruBR
At APTRON Delhi, we believe in hands-on learning. That's why our Hadoop training in Delhi is designed to give you practical experience working with Hadoop. You'll work on real-world projects and learn from experienced instructors who have worked with Hadoop in the industry.
https://bit.ly/3NnvsHH
Big Data is still a challenge for many companies to collect, process, and analyze large amounts of structured and unstructured data. Hadoop provides an open source framework for distributed storage and processing of large datasets across commodity servers to help companies gain insights from big data. While Hadoop is commonly used, Spark is becoming a more popular tool that can run 100 times faster for iterative jobs and integrates with SQL, machine learning, and streaming technologies. Both Hadoop and Spark often rely on the Hadoop Distributed File System for storage and are commonly implemented together in big data projects and platforms from major vendors.
Presented By :- Rahul Sharma
B-Tech (Cloud Technology & Information Security)
2nd Year 4th Sem.
Poornima University (I.Nurture),Jaipur
www.facebook.com/rahulsharmarh18
This slide deck gives simple and purposeful knowledge about popular Hadoop platforms.
From a simple definition to the importance of Hadoop in the modern era, the presentation also introduces Hadoop service providers along with its core components.
Do go through it once and comment below with your feedback. I am sure that this deck will help many in presenting the basics of Hadoop for their projects or business purposes.
The crisp information has been compiled from detailed material available on the internet as well as research papers.
The course additionally covers configuring, deploying, and maintaining a Hadoop cluster. The Hadoop Admin training focuses on practical hands-on exercises and encourages open discussion of how people are using Hadoop in enterprises managing massive data sets.
1. Hadoop adoption has matured from initial small deployments to scaling up across enterprises, but configuring and managing large Hadoop environments can be difficult and expensive.
2. Hadoop as a Service (HaaS) provides an alternative where enterprises can deploy Hadoop in the cloud to avoid the challenges of managing large on-premise clusters.
3. HaaS allows enterprises to focus on data analysis rather than infrastructure while reducing costs and providing scalability, high availability, and self-configuration capabilities not easily achieved on-premise.
This document discusses how Apache Hadoop provides a solution for enterprises facing challenges from the massive growth of data. It describes how Hadoop can integrate with existing enterprise data systems like data warehouses to form a modern data architecture. Specifically, Hadoop provides lower costs for data storage, optimization of data warehouse workloads by offloading ETL tasks, and new opportunities for analytics through schema-on-read and multi-use data processing. The document outlines the core capabilities of Hadoop and how it has expanded to meet enterprise requirements for data management, access, governance, integration and security.
Architecting the Future of Big Data and Search - Hortonworks
The document discusses the potential for integrating Apache Lucene and Apache Hadoop technologies. It covers their histories and current uses, as well as opportunities and challenges around making them work better together through tighter integration or code sharing. Developers and businesses are interested in ways to improve searching large amounts of data stored using Hadoop technologies.
Supporting Financial Services with a More Flexible Approach to Big Data - Hortonworks
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
Hybrid Data Warehouse Hadoop Implementations - David Portnoy
The document discusses the evolving relationship between data warehouse (DW) and Hadoop implementations. It notes that DW vendors are incorporating Hadoop capabilities while the Hadoop ecosystem is growing to include more DW-like functions. Major DW vendors will likely continue playing a key role by acquiring successful new entrants or incorporating their technologies. The optimal approach involves a hybrid model that leverages the strengths of DWs and Hadoop, with queries determining where data resides and processing occurs. SQL-on-Hadoop architectures aim to bridge the two worlds by bringing SQL and DW tools to Hadoop.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes key Hadoop components like HDFS for distributed file storage and MapReduce for distributed processing. Several companies that use Hadoop at large scale are mentioned, including Yahoo, Amazon and Facebook. Applications of Hadoop in healthcare for storing and analyzing large amounts of medical data are discussed. The document concludes that Hadoop is well-suited for big data applications due to its scalability, fault tolerance and cost effectiveness.
E2Matrix Jalandhar provides Big Data training based on current industry standards, helping attendees secure placements in their dream jobs at MNCs. E2Matrix provides Big Data training in Jalandhar, Amritsar, Ludhiana, Phagwara, Mohali, and Chandigarh, and is one of the best Big Data training institutes offering hands-on practical knowledge. At E2Matrix, Big Data training is conducted by subject-specialist corporate professionals with extensive experience managing real-time Big Data projects. E2Matrix implements a blend of academic learning and practical sessions to give students optimum exposure. At E2Matrix's well-equipped Big Data training institute, aspirants learn skills covering a Big Data overview, use cases, the data analytics process, data preparation and its tools, hands-on exercises using SQL and NoSQL databases, hands-on exercises on tool usage, an introduction to data analysis, classification, data visualization using R, and automation testing training on real-time projects.
Hortonworks - What's Possible with a Modern Data Architecture? - Hortonworks
This is Mark Ledbetter's presentation from the September 22, 2014 Hortonworks webinar “What’s Possible with a Modern Data Architecture?” Mark is vice president for industry solutions at Hortonworks. He has more than twenty-five years' experience in the software industry with a focus on retail and supply chain.
This document provides an overview of big data and Hadoop. It discusses what big data is, its types including structured, semi-structured and unstructured data. Some key sources of big data are also outlined. Hadoop is presented as a solution for managing big data through its core components like HDFS for storage and MapReduce for processing. The Hadoop ecosystem including other related tools like Hive, Pig, Spark and YARN is also summarized. Career opportunities in working with big data are listed at the end.
Transform Your Business with Big Data and Hortonworks - Hortonworks
This document summarizes a presentation about Hortonworks and how it can help companies transform their businesses with big data and Hortonworks' Hadoop distribution. Hortonworks is the sole distributor of an open source, enterprise-grade Hadoop distribution called Hortonworks Data Platform (HDP). HDP addresses enterprise requirements for mixed workloads, high availability, security and more. The presentation discusses how Hortonworks enables interoperability and supports customers. It also provides an overview of how Pactera can help clients with big data implementation, architecture, and analytics.
BIG Data & Hadoop Applications in Social Media - Skillspeed
This document discusses how major social media networks like Facebook, Twitter, LinkedIn, Pinterest, and Instagram utilize big data and Hadoop technologies. It provides examples of how each network uses Hadoop for tasks like storing user data, performing analytics, and generating personalized recommendations at massive scales as their user bases and data volumes grow enormously. The document also briefly outlines SkillSpeed's Hadoop training course, which covers topics like HDFS, MapReduce, Pig, Hive, HBase and more to prepare students for jobs working with big data.
1. LAUSD has been developing its enterprise data and reporting capabilities since 2000, with various systems and dashboards launched over the years to provide different types of data and reporting, including student outcomes and achievement reports, individual student records, and teacher/staff data.
2. Current tools include MyData (with over 20 million student records), GetData (with instructional and business data), Whole Child (with academic and wellness data), OpenData, and Executive Dashboards.
3. Upcoming improvements include dashboards for social-emotional learning, physical education, and tools to support the Intensive Diagnostic Education Centers and Black Student Achievement Plan initiatives.
The document discusses the County of Los Angeles' efforts to better coordinate services across various departments by creating an enterprise data platform. It notes that the county serves over 750,000 patients annually through its health systems and oversees many other services related to homelessness, justice, child welfare, and public health. The proposed data platform would create a unified client identifier and data store to integrate client records across departments in order to generate insights, measure outcomes, and improve coordination of services.
Fastly is an edge cloud platform provider that aims to upgrade the internet experience by making applications and digital experiences fast, engaging, and secure. It has a global network of 100+ points of presence across 30+ countries serving over 1 trillion daily requests. The presentation discusses how internet requests are handled traditionally versus more modern approaches using an edge cloud platform like Fastly. It emphasizes that the edge must be programmable, deliver general purpose compute anywhere, and provide high reliability, security, and data privacy by default.
The document summarizes how Aware Health can save self-insured employers millions of dollars by reducing unnecessary surgeries, imaging, and lost work time for musculoskeletal conditions. It notes that 95% of common spine, wrist, and other surgeries are no more effective than non-surgical treatments. Aware Health uses diagnosis without imaging to prevent chronic pain and has shown real-world savings of $9.78 to $78.66 per member per month for employers, a 96% net promoter score, and over $2 million in annual savings for one enterprise customer.
- Project Lightspeed is the next generation of Apache Spark Structured Streaming that aims to provide faster and simpler stream processing with predictable low latency.
- It targets reducing tail latency by up to 2x through faster bookkeeping and offset management. It also enhances functionality with advanced capabilities like new operators and easy to use APIs.
- Project Lightspeed also aims to simplify deployment, operations, monitoring and troubleshooting of streaming applications. It seeks to improve ecosystem support for connectors, authentication and authorization.
- Some specific improvements include faster micro-batch processing, enhancing Python as a first class citizen, and making debugging of streaming jobs easier through visualizations.
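For context, here is a minimal Structured Streaming micro-batch job of the kind Project Lightspeed targets, using Spark's built-in rate source; the sink choice and run duration are illustrative assumptions.

```python
# Sketch of a Structured Streaming micro-batch job, using the built-in
# 'rate' test source. Sink and run duration are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lightspeed-sketch").getOrCreate()

# The 'rate' source generates (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

counts = (
    stream.withWatermark("timestamp", "10 seconds")
          .groupBy(F.window("timestamp", "5 seconds"))
          .count()
)

query = (
    counts.writeStream.outputMode("update")
          .format("console")       # swap for a real sink in production
          .start()
)
query.awaitTermination(30)  # run for ~30 seconds, then fall through
spark.stop()
```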
Data Con LA 2022 - Using Google trends data to build product recommendations - Data Con LA
Mike Limcaco, Analytics Specialist / Customer Engineer at Google
Measure trends in a particular topic or search term on Google Search across the US, down to the city level. Integrate these data signals into analytic pipelines to drive product, retail, and media (video, audio, digital content) recommendations tailored to your audience segment. We'll discuss how Google's unique datasets can be used with Google Cloud smart analytic services to process, enrich and surface the most relevant product or content that matches the ever-changing interests of your local customer segment.
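A sketch of how such trend signals might be pulled into a pipeline with the BigQuery Python client; the public google_trends.top_terms table and its column names are assumptions based on the bigquery-public-data registry, so verify the schema before relying on this.

```python
# Sketch of pulling trend signals with the BigQuery client. The public
# table name and its columns are assumptions; verify before use.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT term, week, rank
    FROM `bigquery-public-data.google_trends.top_terms`
    WHERE dma_name LIKE '%Los Angeles%'
    ORDER BY week DESC, rank ASC
    LIMIT 25
"""
for row in client.query(sql).result():
    print(row.term, row.week, row.rank)
```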
Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed to data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day of work building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - Improving disaster response with machine learning - Data Con LA
Antje Barth, Principal Developer Advocate, AI/ML at AWS & Chris Fregly, Principal Engineer, AI & ML at AWS
The frequency and severity of natural disasters are increasing. In response, governments, businesses, nonprofits, and international organizations are placing more emphasis on disaster preparedness and response. Many organizations are accelerating their efforts to make their data publicly available for others to use. Repositories such as the Registry of Open Data on AWS and Humanitarian Data Exchange contain troves of data available for use by developers, data scientists, and machine learning practitioners. In this session, see how a community of developers came together though the AWS Disaster Response hackathon to build models to support natural disaster preparedness and response.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas - Data Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Data Con LA 2022 - Real world consumer segmentation - Data Con LA
Jaysen Gillespie, Head of Analytics and Data Science at RTB House
1. Shopkick has over 30M downloads, but the userbase is very heterogeneous. Anecdotal evidence indicated a wide variety of users for whom the app holds long-term appeal.
2. Marketing and other teams challenged Analytics to get beyond basic summary statistics and develop a holistic segmentation of the userbase.
3. Shopkick's data science team used SQL and Python to gather and clean data, and then performed a data-driven segmentation using a k-means algorithm (a minimal sketch follows this list).
4. Interpreting the results is more work -- and more fun -- than running the algo itself. We'll discuss how we transform from "segment 1", "segment 2", etc. to something that non-analytics users (Marketing, Operations, etc.) could actually benefit from.
5. So what? How did teams across Shopkick change their approach given what Analytics had discovered?
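A minimal sketch of the segmentation step in point 3, using scikit-learn with synthetic stand-in features; in practice the feature engineering and the interpretation of segments are where the real work lies.

```python
# Minimal sketch of k-means segmentation over engineered user features.
# The features here are synthetic stand-ins; k=5 is an assumption.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for per-user features, e.g. visits/week, kicks earned, recency.
X = rng.normal(size=(10_000, 3))

X_scaled = StandardScaler().fit_transform(X)   # k-means is scale-sensitive
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Segment sizes; naming what each segment *means* is the real work.
print(np.bincount(kmeans.labels_))
```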
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo... - Data Con LA
Ravi Pillala, Chief Data Architect & Distinguished Engineer at Intuit
TurboTax is one of the best-known consumer software brands, serving 385K+ concurrent users at its peak. In this session, we start by looking at how user behavioral data and tax domain events are captured in real time using the event bus and analyzed to drive real-time personalization with various TurboTax data pipelines. We will also look at solutions performing analytics that make use of these events, with the help of Kafka, Apache Flink, Apache Beam, Spark, Amazon S3, Amazon EMR, Redshift, Athena and AWS Lambda functions. Finally, we look at how SageMaker is used to create the TurboTax model to predict if a customer is at risk or needs help.
Data Con LA 2022 - Moving Data at Scale to AWS - Data Con LA
George Mansoor, Chief Information Systems Officer at California State University
Overview of the CSU Data Architecture on moving on-prem ERP data to the AWS Cloud at scale using Delphix for Data Replication/Virtualization and AWS Data Migration Service (DMS) for data extracts
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI - Data Con LA
Anand Ranganathan, Chief AI Officer at Unscrambl
Conversational AI is getting more and more widely used for customer support and employee support use-cases. In this session, I'm going to talk about how it can be extended for data analysis and data science use-cases ... i.e., how users can interact with a bot to ask analytical questions on data in relational databases.
This allows users to explore complex datasets using a combination of text and voice questions, in natural language, and then get back results in a combination of natural language and visualizations. Furthermore, it allows collaborative exploration of data by a group of users in a channel in platforms like Microsoft Teams, Slack or Google Chat.
For example, a group of users in a channel can ask questions to a bot in plain English like "How many cases of Covid were there in the last 2 months by state and gender" or "Why did the number of deaths from Covid increase in May 2022", and jointly look at the results that come back. This facilitates data awareness, data-driven collaboration and joint decision making among teams in enterprises and outside.
In this talk, I'll describe how we can bring together various features including natural-language understanding, NL-to-SQL translation, dialog management, data story-telling, semantic modeling of data and augmented analytics to facilitate collaborate exploration of data using conversational AI.
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ... - Data Con LA
Anil Inamdar, VP & Head of Data Solutions at Instaclustr
The most modernized enterprises utilize polyglot architecture, applying the best-suited database technologies to each of their organization's particular use cases. To successfully implement such an architecture, though, you need a thorough knowledge of the expansive NoSQL data technologies now available.
Attendees of this Data Con LA presentation will come away with:
-- A solid understanding of the decision-making process that should go into vetting NoSQL technologies and how to plan out their data modernization initiatives and migrations.
-- They will learn the types of functionality that best match the strengths of NoSQL key-value stores, graph databases, columnar databases, document-type databases, time-series databases, and more.
-- Attendees will also understand how to navigate database technology licensing concerns, and to recognize the types of vendors they'll encounter across the NoSQL ecosystem. This includes sniffing out open-core vendors that may advertise as "open source" but are driven by a business model that hinges on achieving proprietary lock-in.
-- Attendees will also learn to determine if vendors offer open-code solutions that apply restrictive licensing, or if they support true open source technologies like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more that offer total portability and true freedom of use.
Data Con LA 2022 - Intro to Data Science - Data Con LA
Zia Khan, Computer Systems Analyst and Data Scientist at LearningFuze
This Data Science tutorial is designed for people who are new to Data Science. It is a beginner-level session, so no prior coding or technical knowledge is required; just bring your laptop with WiFi capability. The session starts with a review of what data science is, the amount of data we generate, and how companies are using that data to get insight. We will pick a business use case and define the data science process, followed by a hands-on lab using Python and Jupyter notebook. During the hands-on portion we will work with the pandas, numpy, matplotlib and sklearn modules and use a machine learning algorithm to approach the business use case.
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment - Data Con LA
Mariana Danilovic, Managing Director at Infiom, LLC
We will address:
(1) Community creation and engagement using tokens and NFTs
(2) Organization of DAO structures and ways to incentivize Web3 communities
(3) DeFi business models applied to Web3 ventures
(4) Why Metaverse matters for new entertainment and community engagement models.
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat... - Data Con LA
Curtis ODell, Global Director Data Integrity at Tricentis
Join me to learn about a new end-to-end data testing approach designed for modern data pipelines that fills dangerous gaps left by traditional data management tools—one designed to handle structured and unstructured data from any source. You'll hear how you can use unique automation technology to reach up to 90 percent test coverage rates and deliver trustworthy analytical and operational data at scale. Several real world use cases from major banks/finance, insurance, health analytics, and Snowflake examples will be presented.
Key Learning Objectives
1. Data journeys are complex: you have to ensure the integrity of the data end to end across the journey, from source to final reporting, for compliance.
2. Data management tools do not test data; at best they profile and monitor, leaving serious gaps in your data testing coverage.
3. Automation integrated with DevOps and DataOps CI/CD processes is key to solving this.
4. How this approach has impact in your vertical.
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
1. The document discusses methods for predicting and engineering viral Super Bowl ads, including a panel-based analysis of video content characteristics and a deep learning model measuring social media effects.
2. It provides examples of ads from Super Bowl 2022 that scored well using these methods, such as BMW and Budweiser ads, and compares predicted viral rankings to actual results.
3. The document also demonstrates how to systematically test, tweak, and target an ad campaign like Bajaj Pulsar's to increase virality through modifications to title, thumbnail, tags and content based on audience feedback.
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
Jai Bansal, Senior Manager, Data Science at Aetna
This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.
To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.
We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.
This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.
We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.
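As a rough illustration of the embedding idea described above (the talk does not specify Aetna's library or hyperparameters), one could treat each member's sequence of claim codes as a "sentence" and train word2vec on those sequences, for example with gensim; the codes and settings below are made up:

from gensim.models import Word2Vec

# Each "sentence" is one member's anonymized sequence of claim codes.
member_journeys = [
    ["ICD10:E11", "CPT:83036", "NDC:0378-1105"],
    ["ICD10:I10", "CPT:99213", "NDC:0071-0222", "ICD10:E11"],
]

# Learn low-dimensional dense vectors for the high-cardinality codes.
model = Word2Vec(member_journeys, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv["ICD10:E11"][:5])           # dense vector for a diagnosis code
print(model.wv.most_similar("ICD10:E11"))  # codes that co-occur in journeys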
Data Con LA 2022 - Data Streaming with KafkaData Con LA
Jie Chen, Manager Advisory, KPMG
Data is the new oil. However, many organizations have fragmented data siloed in lines of business. In this talk, we will focus on identifying the legacy patterns and their limitations, and on introducing the new patterns built around Kafka's core design ideas. The goal is to tirelessly pursue better solutions that let organizations overcome bottlenecks in their data pipelines and modernize their digital assets, ready to scale their businesses. In summary, we will walk through three use cases and recommend dos and don'ts, with takeaways for data engineers, data scientists, and data architects developing forefront data-oriented skills.
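A minimal sketch of the publish/subscribe pattern that motivates Kafka's design, using the kafka-python client; the broker address and topic name are assumptions:

from kafka import KafkaProducer, KafkaConsumer

# Producers publish events once, to a topic, instead of into one silo.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b'{"order_id": 1, "amount": 42.0}')
producer.flush()

# Any number of downstream consumers read the same stream independently.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)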
1. THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS
Subash D'Souza
Hadoop Innovation Summit 2014
2. WHO AM I?
Recognized as a Champion of Big Data by Cloudera
Co-Organizer – Los Angeles Hadoop User Group
Organizer – Los Angeles HBase User Group
Organizer – Los Angeles Big Data Users Group
Organizer – Big Data Camp LA
Speaker – Big Data Camp LA 2013
Leading a BOF Session at Hadoop Summit Europe 2014
Author – HBase Developer's Cookbook (Out Fall 2014)
Technical Reviewer – Apache Flume: Distributed Log Collection for Hadoop
3. HADOOP: OLD & NEW
Hadoop was first released in 2006.
It is based on the GFS and MapReduce papers released by Google.
Ever since, adoption has been massive and rapid.
Companies like Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the Social Security Administration have adopted Hadoop.
Hadoop 2.0, AKA YARN, went GA in September 2013.
It is backwards compatible with Hadoop 1.0 APIs.
It replaces the JobTracker and TaskTrackers with an Application Master, Resource Manager and Node Managers.
4. A BRIEF HISTORY
2002: Doug Cutting launches the Nutch project
2003: Google releases the GFS paper
2004: Google releases the MapReduce paper; Nutch adds a distributed file system
2005: MapReduce implemented in Nutch
2006: Hadoop spun out of the Nutch project at Yahoo
2008: Cloudera founded; Hadoop breaks the Terasort world record
2009: MapR founded
2010: HBase, Zookeeper, Flume and more added to CDH
2011: Hortonworks founded
2012: Impala (SQL on Hadoop) launched
2013: YARN goes GA; Hadoop 2.0 with HA available
2014: Stinger/Tez to be released
5. PREVIOUSLY, THE STATE OF DATA
As a data analyst, you previously could not ask the questions you wanted to ask, because you did not have the data points available.
Corollary: you could not think of questions to ask of your data, because you did not know you had access to those data points.
7. FOCUS
There is no standard way to get to the data.
This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools to pull the data is huge and ever expanding.
As a company, what do you choose? What do you focus on?
Question: do you replace your current data infrastructure, or do you augment it?
16. CHOICES
Hortonworks: completely open source; everything on their platform is available from the Apache Hadoop distribution. Available as a free download or with paid support.
Cloudera: offers the open source Apache Hadoop distribution as well as management tools built for the Cloudera distribution. Available as a free download, or with paid support including the additional tools.
MapR: offers a version of Hadoop that replaces HDFS with the proprietary MFS (MapR File System); everything else in their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.
17. ADVANTAGES OF YARN
Ability to handle multi-tenant clients, i.e. running multiple applications atop the same framework (multi-tenancy).
Splits the work of the JobTracker into a Resource Manager and an Application Master, so one process no longer has to both allocate resources and manage the tasks.
Ability to restart jobs from the place where they failed.
Scales well beyond the limitations of MR1 (around 4,000 nodes).
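As a small illustration of the Resource Manager's role (my sketch, not from the deck): in Hadoop 2.x the ResourceManager exposes a REST API that can list every application sharing the cluster, which makes the multi-tenancy visible. The hostname below is an assumption; 8088 is the default RM web port.

import requests

RM = "http://resourcemanager-host:8088"

# Query the YARN ResourceManager REST API for all applications.
resp = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=10)
resp.raise_for_status()
apps = resp.json().get("apps") or {}

# Multi-tenancy in practice: MapReduce, Spark, Tez, etc. share one cluster.
for app in apps.get("app", []):
    print(app["id"], app["applicationType"], app["state"], app["user"])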
22. SQL ON HADOOP VS. TRADITIONAL RDBMS
Data on Hadoop is not as responsive as in an RDBMS.
Data in Hadoop can scale much better than in an RDBMS.
Data in Hadoop can be accessed using a variety of mechanisms such as Hive, Impala, Drill, etc.; the query engines are abstracted from the Hadoop (HDFS) storage layer. The same cannot be said of an RDBMS, where the query engine is tied to its own storage and you would need to move data between one system and another; for example, Oracle cannot pull from SQL Server and vice versa.
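A hedged sketch of that abstraction: because the engines share the Hive metastore and read the same HDFS files, the same query can be sent to HiveServer2 or to Impala just by changing the client connection. Hostnames, ports, and the table name are illustrative assumptions.

from pyhive import hive                              # pip install pyhive
from impala.dbapi import connect as impala_connect   # pip install impyla

SQL = "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page"

# Batch-oriented engine: Hive (MapReduce/Tez under the hood).
hive_cur = hive.connect(host="hadoop-edge", port=10000).cursor()
hive_cur.execute(SQL)

# Low-latency MPP engine: Impala, reading the very same HDFS files.
impala_cur = impala_connect(host="hadoop-edge", port=21050).cursor()
impala_cur.execute(SQL)

print(hive_cur.fetchall(), impala_cur.fetchall())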
23. QUESTION?
Do we augment or replace our current data infrastructure?
Answer: augment.
Why? Combine the best of both worlds: keep aggregated data in your existing data stores, and keep all the detail data, over its full lifetime, in Hadoop.
Of course, you will have different SLAs based on the query you ask.
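A minimal sketch of what "augment" can look like in practice (my illustration, with made-up hostnames and tables): low-latency aggregate queries stay on the existing store, while detail/lifetime scans are routed to Hadoop.

import sqlite3             # stand-in for the existing warehouse/RDBMS
from pyhive import hive    # pip install pyhive (HiveServer2 client)

warehouse = sqlite3.connect("warehouse.db")             # aggregated, recent
detail = hive.connect(host="hadoop-edge", port=10000)   # full-detail history

def run_query(sql, needs_full_history=False):
    """Route to Hadoop only when the query needs detail/lifetime data."""
    conn = detail if needs_full_history else warehouse
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

# Fast SLA: pre-aggregated daily revenue from the warehouse.
run_query("SELECT day, SUM(revenue) FROM daily_sales GROUP BY day")
# Relaxed SLA: raw event scan over years of detail data in Hive.
run_query("SELECT user_id, COUNT(*) FROM raw_events GROUP BY user_id",
          needs_full_history=True)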
25. STARTUPS VS. MATURE COMPANIES
Startups working in data should consider going with YARN to gain its advantages.
Mature companies tend to be conservative and hence will look to the more established use cases of MR1.
Both startups and mature companies should look at the advantages of YARN, as well as at applying more near-real-time SQL-on-Hadoop.
26. GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES
Getting started with Hadoop: an opportunity to hit the ground running with YARN plus bleeding-edge technologies.
Established companies with a Hadoop practice tend to be conservative, but that shouldn't prevent them from coming up with a migration plan to YARN.
27. REAL TIME ANALYTICS
Kiji
HBase
Storm
Shark
Redshift
Impala
Stinger
Drill
Accumulo
Presto
HAWQ
IBM Big SQL
32. FUTURE OF HADOOP: YARN & NEAR-REAL-TIME SQL-ON-HADOOP
Multi-Tenancy
HA (High Availability)
Tools for SQL-on-Hadoop:
Impala
Stinger/Tez
Drill
Shark
33. WHAT DO YOU CHOOSE?
The choices are huge.
The toolsets are varied.
First, focus on the problems you are trying to solve. Don't choose Hadoop because it is the latest buzzword; make sure there is a real need to solve.
Focus on developers and administrators, and ensure that whatever toolset you choose, they either have the relevant skillset, will be given training, or will be supplemented by relevant resources brought in from outside (whether through hiring or consulting).
REMEMBER THE PROBLEM SET!!! i.e. what you are trying to solve.
34. CAVEATS
Work is still being done on bringing real-time SQL-on-Hadoop to YARN.
Impala has Llama for this.
The Stinger preview for Hive is currently available.
HBase on YARN (HOYA) is also actively being worked on.
Since YARN is a low-level API, some abstraction is needed; this is available in tools such as Samza and Weave.
35. BIG DATA = BIG IMPACT
Ken Rudin, Director of Analytics, Facebook:
"You need to go the last mile and evangelize your insights so that people actually act on them and there is impact."
"It doesn't matter how brilliant our analyses are. If nothing changes, we have made no impact."
36. GIVING BACK
Hadoop is an open source project.
Work on Hadoop and the ecosystem tools is done by committers and contributors, some of whom do this in their own personal time, reporting and fixing bugs as well as adding new functionality.
Please give back, either by becoming a contributor (testing, filing bugs) or by getting your use case for Hadoop out there (at meetups and/or conferences such as this one), so others can learn from the issues you have faced as well as see the rapid adoption of the platform continue.