Putting Analytics in Big Data Analytics
Jake Cornelius
Director of Product Management, Pentaho Corporation
Learn more @ http://www.cloudera.com/hadoop/
1. The document discusses Pentaho's approach to big data analytics using a component-based data integration and visualization platform.
2. The platform allows business analysts and data scientists to prepare and analyze big data without advanced technical skills.
3. It provides a visual interface for building reusable data pipelines that can be run locally or deployed to Hadoop for analytics on large datasets.
The document discusses the importance of a hybrid data model for Hadoop-driven analytics. It notes that traditional data warehousing is not suitable for large, unstructured data in Hadoop environments due to limitations in handling data volume, variety, and velocity. The hybrid model combines a data lake in Hadoop for raw, large-scale data with data marts and warehouses. It argues that Pentaho's suite provides tools to lower technical barriers for extracting, transforming, and loading (ETL) data between the data lake and marts/warehouses, enabling analytics on Hadoop data.
Pentaho provides open source business analytics tools, including Kettle for extraction, transformation and loading (ETL) of data, and Weka for machine learning and data mining. Kettle allows users to run ETL jobs directly on Hadoop clusters, and its JDBC layer enables SQL queries to be pushed down to databases for better performance. While bringing Weka analytics to Hadoop data provides gains, challenges include ensuring that the machine learning algorithms are truly parallel and keeping clients notified of database updates.
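To make the Kettle point concrete, here is a minimal sketch of running a Kettle transformation from Java, assuming the Kettle 4.x embedding API; the transformation file name is hypothetical, standing in for a pipeline designed in the visual editor:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunKettleTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (plugins, variables, logging).
            KettleEnvironment.init();
            // Load a transformation built in the visual designer (hypothetical file name).
            TransMeta meta = new TransMeta("weblog_enrichment.ktr");
            Trans trans = new Trans(meta);
            trans.execute(null);       // start with no runtime arguments
            trans.waitUntilFinished(); // block until the ETL run completes
            if (trans.getErrors() > 0) {
                throw new IllegalStateException("Transformation finished with errors");
            }
        }
    }

The same .ktr file can be executed locally during development and later deployed to run against a Hadoop cluster, which is what makes the pipelines reusable.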
Big Data Integration Webinar: Getting Started With Hadoop Big Data (Pentaho)
This document discusses getting started with big data analytics using Hadoop and Pentaho. It provides an overview of installing and configuring Hadoop and Pentaho on a single machine or cluster. Dell's Crowbar tool is presented as a way to quickly deploy Hadoop clusters on Dell hardware in about two hours. The document also covers best practices like leveraging different technologies, starting with small datasets, and not overloading networks. A demo is given and contact information provided.
Why Your Product Needs an Analytic Strategy (Pentaho)
The document discusses strategies for enhancing products with analytics capabilities. It outlines three strategic approaches: 1) enhance current software products with analytics, 2) target new opportunities using existing data through direct data monetization or new products/services, and 3) reinvent value propositions using new data technologies like big data. The document provides examples of implementing analytics capabilities for different user personas and considerations for analytics deployments. It argues that analytics can provide benefits like improved decisions, customer stickiness, and new revenue opportunities.
Exclusive Verizon Employee Webinar: Getting More From Your CDR Data (Pentaho)
This document discusses a project between Pentaho and Verizon to leverage big data analytics. Verizon generates vast amounts of call detail record (CDR) data from mobile networks that is currently stored in a data warehouse for 2 years and then archived to tape. Pentaho's platform will help optimize the data warehouse by using Hadoop to store all CDR data history. This will free up data warehouse capacity for high value data and allow analysis of the full 10 years of CDR data. Pentaho tools will ingest raw CDR data into Hadoop, execute MapReduce jobs to enrich the data, load results into Hive, and enable analyzing the data to understand calling patterns by geography over time.
The document outlines Pentaho's roadmap and focus areas for business analytics products. It discusses enhancements planned for Pentaho Business Analytics 5.1, including new features for analyzing MongoDB data and improved visualizations. It also summarizes R&D activities like integrating real-time data processing with Storm and Spark. The roadmap focuses on hardening the Pentaho platform for large enterprises, extending capabilities for big data engineering and analytics, and improving embedded analytics.
The document discusses Pentaho's business intelligence (BI) platform for big data analytics. It describes Pentaho as providing a modern, unified platform for data integration and analytics that allows for native integration into the big data ecosystem. It highlights Pentaho's open source development model and that it has over 1,000 commercial customers and 10,000 production deployments. Several use cases are presented that demonstrate how Pentaho helps customers unlock value from big data stores.
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ... (Pentaho)
This document discusses approaches to implementing Hadoop, NoSQL, and analytical databases. It describes:
1) The current landscape of big data databases including Hadoop, NoSQL, and analytical databases that are often used together but come from different vendors with different interfaces.
2) Common uses of transactional databases, Hadoop, NoSQL databases, and analytical databases.
3) The complexity of current implementation approaches that involve multiple coding steps across various tools.
4) How Pentaho provides a unified platform and visual tools to reduce the time and effort needed for implementation by eliminating disjointed steps and enabling non-coders to develop workflows and analytics for big data.
30 for 30: Quick Start Your Pentaho Evaluation (Pentaho)
These slides are from our recent 30 for 30 webinar, tailored towards people who have downloaded the Pentaho evaluation and want to know more about all the data integration and business analytics components that are part of the trial, how to easily integrate data, and best practices for installing and developing content.
Check out this presentation from Pentaho and ESRG to learn why product managers should understand Big Data and hear about real-life products that have been elevated with these innovative technologies.
Learn more in the brief that inspired the presentation, Product Innovation with Big Data: http://www.pentaho.com/resources/whitepaper/product-innovation-big-data
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model (DataWorks Summit)
This document discusses Dignity Health's move to using Hadoop for healthcare analytics to build better predictive models. It outlines their goals of saving costs and lives by leveraging over 30 TB of clinical data using Hadoop and SAS technologies on their Dignity Health Insights platform. The presentation agenda covers Dignity Health, healthcare analytics challenges, their big data ecosystem architecture featuring Hadoop, and how they are using this infrastructure for applications like sepsis surveillance analytics.
Breakout: Operational Analytics with Hadoop (Cloudera, Inc.)
Operationalizing models and responding to large volumes of data, fast, requires bolt-on systems that can struggle with processing (transforming the data), consistency (always responding to data), and scalability (processing and responding to large volumes of data). If data volumes become too large, these traditional systems fail to deliver their responses, resulting in significant losses to organizations. Join this breakout to learn how to overcome the roadblocks.
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014 (Pentaho)
Bo Borland's presentation at MongoDB World in NYC, June 24, 2014. Data Integration and Advanced Analytics for MongoDB: Blend, Enrich and Analyze Disparate Data in a Single MongoDB View
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight (Precisely)
The document discusses moving legacy data and workloads from traditional data warehouses to Hadoop. It describes how ELT processes on dormant data waste resources and how offloading this data to Hadoop can optimize costs and performance. The presentation includes a demonstration of using Tableau for self-service analytics on data in Hadoop and a case study of a financial organization reducing ELT development time from weeks to hours by offloading mainframe data to Hadoop.
All data accessible to all my organization - Presentation at OW2con'19, June... (OW2)
This document discusses how Dremio provides a unified access point for data across an entire organization. It summarizes how Dremio allows various users, including data engineers, scientists, analysts and business users, to access all kinds of data sources through SQL or REST APIs. Dremio also enables features like data catalogs, collaborative workspaces, and workload monitoring that help organizations better manage and govern their data.
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight (Steven Totman)
Demand for quicker access to multiple integrated sources of data continues to rise. Immediate access to data stored in a variety of systems - such as mainframes, data warehouses, and data marts - to mine visually for business intelligence is the competitive differentiation enterprises need to win in today’s economy.
Stop playing the waiting game and learn about a new end-to-end solution for combining, analyzing, and visualizing data from practically any source in your enterprise environment.
Leading organizations are already taking advantage of this architectural innovation to gain modern insights while reducing costs and propelling their businesses ahead of the competition.
Are you tired of waiting? Don't let your architecture hold you back. Access this webinar and hear from a team of industry experts on how you can Break the Barriers to Big Data Insight.
Explore how data integration (or “mashups”) can maximize analytic value and help business teams create streamlined data pipelines that enable ad-hoc analytic inquiries. You’ll learn why businesses are increasingly focused on blending data on demand and at the source, the concrete analytic advantages that this approach delivers, and the types of architectures required for delivering trusted, blended data. We provide a checklist to assess your data integration needs and capabilities, and review some real-world examples of how blending various data types has created significant analytic value and concrete business impact.
Rob Peglar - Introduction to Analytics, Big Data and Hadoop (Ghassan Al-Yafie)
This document provides an introduction to analytics and big data using Hadoop. It discusses the growth of digital data and challenges of big data. Hadoop is presented as a solution for storing and processing large, unstructured datasets across commodity servers. The key components of Hadoop - HDFS for distributed storage and MapReduce for distributed processing - are described at a high level. Examples of industries using big data analytics are also listed.
Evolution from Apache Hadoop to the Enterprise Data Hub by Cloudera - ArabNet... (ArabNet ME)
A new foundation for the Modern Information Architecture.
Speaker: Amr Awadallah, CTO & Cofounder, Cloudera
Our legacy information architecture is not able to cope with the realities of today's business. It cannot scale to meet our SLAs due to the separation of storage and compute, economically store the volumes and types of data we currently confront, provide the agility necessary for innovation, or, most importantly, provide a full 360-degree view of our customers, products, and business. In this talk Dr. Amr Awadallah will present the Enterprise Data Hub (EDH) as the new foundation for the modern information architecture. Built with Apache Hadoop at the core, the EDH is an extremely scalable, flexible, and fault-tolerant data processing system designed to put data at the center of your business.
This document summarizes Patrick de Vries' presentation on connecting everything at the Hadoop Summit 2016. The presentation discusses KPN's use of Hadoop to manage increasing data and network capacity needs. It outlines KPN's data flow process from source systems to Hadoop for processing and generating reports. The presentation also covers lessons learned in implementing Hadoop including having strong executive support, addressing cultural challenges around data ownership, and leveraging existing investments. Finally, it promotes joining a new TELCO Hadoop community for telecommunications providers to share use cases and lessons.
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ... (MongoDB)
Drawing on Pentaho's wide experience in solving customers' big data issues, Davy Nys will position the importance of analytics in the IoT:
[-] Understanding the challenges behind data integration & analytics for IoT
[-] Future proofing your information architecture for IoT
[-] Delivering IoT analytics, now and tomorrow
[-] Real customer examples of where Pentaho can help
Better Together: The New Data Management Orchestra (Cloudera, Inc.)
To ingest, store, process and leverage big data for maximum business impact requires integrating systems, processing frameworks, and analytic deployment options. Learn how Cloudera’s enterprise data hub framework, MongoDB, and Teradata Data Warehouse working in concert can enable companies to explore data in new ways and solve problems that not long ago might have seemed impossible.
Gone are the days of NoSQL and SQL competing for center stage. Visionary companies are driving data subsystems to operate in harmony. So what’s changed?
In this webinar, you will hear from executives at Cloudera, Teradata and MongoDB about the following:
How to deploy the right mix of tools and technology to become a data-driven organization
Examples of three major data management systems working together
Real world examples of how business and IT are benefiting from the sum of the parts
Join industry leaders Charles Zedlewski, Chris Twogood and Kelly Stirman for this unique panel discussion, moderated by BI Research analyst, Colin White.
1. The document discusses a Gartner report that assesses 20 vendors of data science and machine learning platforms. It evaluates the platforms' abilities to support the full data science life cycle.
2. The report places vendors in four categories - Leaders, Challengers, Visionaries, and Niche Players. It outlines the strengths and cautions of platforms from vendors like Amazon Web Services, Alteryx, and Anaconda.
3. Key criteria for evaluating the platforms include ease of use, support for different personas, capabilities for tasks like modeling and deployment, and growth and innovation. The report aims to help users choose the right platform for their needs.
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight (Cloudera, Inc.)
Rethink data management and learn how to break down barriers to Big Data insight with Cloudera's enterprise data hub (EDH), Syncsort offload solutions, and Tableau Software visualization and analytics.
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It describes why a data warehouse may need Hadoop to handle big data from sources like social media, sensors and logs. Examples are given of using Hadoop for ETL and analytics. The presentation provides an overview of Hadoop and how to connect it to the data warehouse using tools like Sqoop and external tables. It also offers tips on getting started and avoiding common pitfalls.
The document summarizes Pentaho's open source business intelligence and data integration products, including their new capabilities for Hadoop and big data analytics. It discusses Pentaho's partnerships with Amazon Web Services and Cloudera to more easily integrate Hadoop data. It also outlines how Pentaho helps users analyze and visualize both structured and unstructured data from Hadoop alongside traditional data sources.
The document discusses big data and Hadoop. It notes that big data comes in terabytes and petabytes, sometimes generated daily. Hadoop is presented as a framework for distributed computing on large datasets using MapReduce. While Hadoop can store and process massive amounts of data across commodity servers, it was not designed for business intelligence requirements. The document proposes addressing this by adding data integration and transformation capabilities to Hadoop through tools like Pentaho Data Integration, to enable it to better meet the needs of big data analytics.
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho (BICC Thomas More)
7th BI congress of BICC-Thomas More: April 3, 2014
A travel report from Business Intelligence to Big Data
The travel industry is changing fast. This presentation takes a journey through classic and modern BI destinations, showing a series of snapshots of different use cases in the travel industry. During the session we highlight the capacity and flexibility a BI tool needs in order to guide you on your journey from classic BI implementations to modern big data challenges.
How advanced analytics is impacting the banking sector (Michael Haddad)
The document discusses how advanced analytics is impacting the banking sector. It covers topics like regulatory changes forcing banks to invest in compliance; new digital technologies changing how customers interact with banks; and data analytics helping banks reduce risk, deliver personalized services, and retain skills. It also discusses Hitachi Data Systems' acquisition of Pentaho and how their combined platform can provide unified data integration and business analytics across structured, unstructured, and streaming data sources.
Putting Business Intelligence to Work on Hadoop Data Stores (DATAVERSITY)
An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to a lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting of large datasets stored in Hadoop files. Hive also provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds) of nodes with a user-friendly ETL (extract, transform and load) tool that will then allow you to move your Hadoop data into a relational data mart or warehouse where you can use BI tools for analysis.
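To ground the HiveQL point, here is a minimal Java sketch of querying Hive over JDBC, assuming the original HiveServer driver (org.apache.hadoop.hive.jdbc.HiveDriver) that shipped with early Hive releases; the weblogs table is hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver and connect to HiveServer.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // HiveQL reads like SQL; Hive compiles it into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT country, COUNT(*) AS hits FROM weblogs GROUP BY country");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }

Each such query may take minutes because it runs as batch MapReduce jobs, which is exactly the latency constraint described above.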
Advanced Reporting and ETL for MongoDB: Easily Build a 360-Degree View of You... (MongoDB)
The document discusses Pentaho's analytics and ETL solutions for MongoDB. It provides an overview of the Pentaho company and its platform for unified business analytics and data integration. It then outlines how Pentaho can be used to build a 360-degree view of customers by extracting, transforming and loading data from source systems into MongoDB and performing analytics and reporting on the MongoDB data. It demonstrates these capabilities with examples and screenshots.
Big Data has been a "buzz word" for a few years now, and it's generated a fair amount of hype. But, while the technology landscape is still evolving, product companies in the software, web, and hardware areas have actually led the way in delivering real value from data sources like weblogs, sensors, and social media as well as systems like Hadoop, NoSQL, and Analytical Databases. These organizations have built "Big Data Apps" that leverage fast, flexible data frameworks to solve a wide array of user problems, scale to massive audiences, and deliver superior predictive intelligence.
Join this webinar to learn why product managers should understand Big Data and hear about real-life products that have been elevated with these innovative technologies. You will hear from:
- Ben Hopkins, Product Marketing Manager at Pentaho, who will discuss what Big Data means for product strategy and why it represents a new toolset for product teams to meet user needs and build competitive advantage
- Jim Stascavage, VP of Engineering at ESRG, who will discuss how his company has innovated with Big Data and predictive analytics to deliver technology products that optimize fuel consumption and maintenance cycles in the maritime and heavy industry sectors, leveraging trillions of sensor data points a year.
Who Should Attend
Product Managers, Product Marketing Managers, Project Managers, Development Managers, Product Executives, and anyone responsible for addressing customer needs & influencing product strategy.
Pentaho Big Data Analytics with Vertica and Hadoop (Mark Kromer)
Overview of the Pentaho Big Data Analytics Suite from the Pentaho + Vertica presentation at Big Data Techcon 2014 in Boston for the session called "The Ultimate Selfie | Picture Yourself with the Fastest Analytics on Hadoop with HP Vertica and Pentaho"
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... (NoSQLmatters)
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
MongoDB IoT City Tour LONDON: Analysing the Internet of Things: Davy Nys, Pen... (MongoDB)
1) The document discusses Pentaho's beliefs around Internet of Things (IoT) analytics, including applying the right data source and processing for different analytics needs, gaining insights by blending multiple data sources on demand, and planning for agility, flexibility and near real-time analytics.
2) It describes how emerging big data use cases demand blending different data sources and provides examples like improving operations and customer experience.
3) The document advocates an Extract-Transform-Report approach for IoT analytics that provides flexibility to integrate diverse data sources and enables real-time insights.
Blending Hadoop and MongoDB with Pentaho [11:10 am - 11:30 am]
For eCommerce companies, knowing how promoted wish-lists can spark consumer spending is an analytics goldmine. In this lightning talk, Bo Borland will demonstrate how Pentaho analytics can blend click-stream data about promoted wish-lists with sales transaction records using Hadoop, MongoDB and Pentaho to reveal patterns in online shopping behavior. Regardless of your industry or specific use model, come to this session to learn how to blend MongoDB data with any data source for greater business insight. Pentaho offers the first end-to-end analytic solution for MongoDB. From data ingestion to pixel perfect reporting and ad hoc “slice and dice” analysis, the solution meets today’s growing demand for a 360-degree view of your business.
Open Analytics 2014 - Pedro Alves - Innovation through Open Source (OpenAnalytics Spain)
Delivering the Future of Analytics: Innovation through Open Source Pentaho was born out of the desire to achieve positive, disruptive change in the business analytics market, dominated by bureaucratic megavendors offering expensive heavy-weight products built on outdated technology platforms. Pentaho’s open, embeddable data integration and analytics platform was developed with a strong open source heritage. This provided Pentaho a first-mover advantage to engage early with adopters of big data technologies and solve the difficult challenges of integrating both established and emerging data types to drive analytics. Continued technology innovations to support the big data ecosystem, have kept customers ahead of the big data curve. With the ability to drastically reduce the time to design, develop and deploy big data solutions, Pentaho counts numerous big data customers, both large and small, across the financial services, retail, travel, healthcare and government industries around the world.
Evolution of Big Data at Intel - Crawl, Walk and Run Approach (DataWorks Summit)
Intel's big data journey began in 2011 with an evaluation of Hadoop. Since then, Intel has expanded its use of Hadoop and Cloudera across multiple environments. Intel's 3-year roadmap focuses on evolving its Hadoop platform to support more advanced analytics, real-time capabilities, and integrating with traditional BI tools. Key strategies include designing for scalability, following an iterative approach to understand data, and leveraging open source technologies.
This document discusses strategies for filling a data lake by improving the process of data onboarding. It advocates using a template-based approach to streamline data ingestion from various sources and reduce dependence on hardcoded procedures. The key aspects are managing ELT templates and metadata through automated metadata extraction. This allows generating integration jobs dynamically based on metadata passed at runtime, providing flexibility to handle different source data with one template. It emphasizes reducing the risks associated with large data onboarding projects by maintaining a standardized and organized data lake.
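As a toy illustration of the template-plus-metadata idea described above, the sketch below assembles a load statement from a column-metadata map at runtime; the table and column names are invented for the example and do not come from the presentation:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TemplateJobGenerator {
        // One template serves every source table: the SQL is assembled
        // from metadata passed in at runtime rather than hardcoded.
        static String generateLoadSql(String sourceTable, String targetTable,
                                      Map<String, String> columns) {
            String columnList = String.join(", ", columns.keySet());
            return "INSERT INTO " + targetTable + " (" + columnList + ") "
                 + "SELECT " + columnList + " FROM " + sourceTable;
        }

        public static void main(String[] args) {
            Map<String, String> cols = new LinkedHashMap<>();  // preserves column order
            cols.put("customer_id", "BIGINT");
            cols.put("order_date", "DATE");
            cols.put("amount", "DECIMAL(10,2)");
            System.out.println(generateLoadSql("staging.orders", "lake.orders", cols));
        }
    }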
Web Briefing: Unlock the power of Hadoop to enable interactive analytics (Kognitio)
This document provides an agenda and summaries for a web briefing on unlocking the power of Hadoop to enable interactive analytics and real-time business intelligence. The agenda includes demonstrations on SQL and Hadoop with in-memory acceleration, interactive analytics with Hadoop, and modern data architectures. It also includes presentations on big data drivers and patterns, interoperating Hadoop with existing data tools, and using Hadoop to power new targeted applications.
Join Cloudian, Hortonworks and 451 Research for a panel-style Q&A discussion about the latest trends and technology innovations in Big Data and Analytics. Matt Aslett, Data Platforms and Analytics Research Director at 451 Research, John Kreisa, Vice President of Strategic Marketing at Hortonworks, and Paul Turner, Chief Marketing Officer at Cloudian, will answer your toughest questions about data storage, data analytics, log data, sensor data and the Internet of Things. Bring your questions or just come and listen!
Driving Real Insights Through Data Science (VMware Tanzu)
Major changes in industries have been brought about by the emergence of data-driven discoveries and applications. Many organizations are bringing together their data and looking to drive change. But the ability to generate new insights in real time from massive sets of data is still far from commonplace.
At this event, data technology experts and data scientists from Pivotal provided the latest business perspective on how data science and engineering can be used to accelerate the generation of new insights.
For information about upcoming Pivotal events, please visit: http://pivotal.io/news-events/#events
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists (Cloudera, Inc.)
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists (Cloudera, Inc.)
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 (Cloudera, Inc.)
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 (Cloudera, Inc.)
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 (Cloudera, Inc.)
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 (Cloudera, Inc.)
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 (Cloudera, Inc.)
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 (Cloudera, Inc.)
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the Platform (Cloudera, Inc.)
Cloudera SDX is by no means restricted to just the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 (Cloudera, Inc.)
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 (Cloudera, Inc.)
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 (Cloudera, Inc.)
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 (Cloudera, Inc.)
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Big Data Analytics Quick Research Guide (Arthur Morgan)
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Technology Trends in 2025: AI and Big Data Analytics (InData Labs)
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights (Andrew Marnell)
With expertise in data architecture, performance tracking, and revenue forecasting, Andrew Marnell plays a vital role in aligning business strategies with data insights. Andrew Marnell’s ability to lead cross-functional teams ensures businesses achieve sustainable growth and operational excellence.
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ... (SOFTTECHHUB)
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
HCL Nomad Web – Best Practices and Managing Multiuser Environments (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed “automatically” in the background, which significantly reduces the administrative effort compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Using the Client Clocking feature
HCL Nomad Web – Best Practices and Managing Multiuser Environments (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understand the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
Linux Support for SMARC: How Toradex Empowers Embedded Developers (Toradex)
Toradex brings robust Linux support to SMARC (Smart Mobility Architecture), ensuring high performance and long-term reliability for embedded applications. Here’s how:
• Optimized Torizon OS & Yocto Support – Toradex provides Torizon OS, a Debian-based easy-to-use platform, and Yocto BSPs for customized Linux images on SMARC modules.
• Seamless Integration with i.MX 8M Plus and i.MX 95 – Toradex SMARC solutions leverage NXP’s i.MX 8 M Plus and i.MX 95 SoCs, delivering power efficiency and AI-ready performance.
• Secure and Reliable – With Secure Boot, over-the-air (OTA) updates, and LTS kernel support, Toradex ensures industrial-grade security and longevity.
• Containerized Workflows for AI & IoT – Support for Docker, ROS, and real-time Linux enables scalable AI, ML, and IoT applications.
• Strong Ecosystem & Developer Support – Toradex offers comprehensive documentation, developer tools, and dedicated support, accelerating time-to-market.
With Toradex’s Linux support for SMARC, developers get a scalable, secure, and high-performance solution for industrial, medical, and AI-driven applications.
Do you have a specific project or application in mind where you're considering SMARC? We can help with a free compatibility check and a quick time-to-market.
For more information: https://www.toradex.com/computer-on-modules/smarc-arm-family
Generative Artificial Intelligence (GenAI) in Business (Dr. Tathagat Varma)
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx (shyamraj55)
We’re bringing the TDX energy to our community with 2 power-packed sessions:
🛠️ Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
📄 Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
How Can I use the AI Hype in my Business Context? (Daniel Lehner)
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a Linkedin webinar for Tecnovy on 28.04.2025.
What is Model Context Protocol (MCP) - The new technology for communication bw... (Vishnu Singh Chundawat)
The MCP (Model Context Protocol) is a framework designed to manage context and interaction within complex systems. This SlideShare presentation will provide a detailed overview of the MCP Model, its applications, and how it plays a crucial role in improving communication and decision-making in distributed systems. We will explore the key concepts behind the protocol, including the importance of context, data management, and how this model enhances system adaptability and responsiveness. Ideal for software developers, system architects, and IT professionals, this presentation will offer valuable insights into how the MCP Model can streamline workflows, improve efficiency, and create more intuitive systems for a wide range of use cases.
The Evolution of Meme Coins: A New Era for Digital Currency ppt.pdf (Abi john)
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I... (Impelsys Inc.)
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
2. Traditional BI
[Diagram: data flows from a data source into selected data mart(s); the rest of the data goes to tape or trash, leaving a trail of unanswerable questions ("?").]
3. Big Data Architecture
[Diagram: a data source feeds data lake(s), which in turn feed data mart(s), a data warehouse, and ad-hoc extracts.]
4. Pentaho Data Integration
[Diagram: Pentaho Data Integration is used to design, deploy, and orchestrate data movement between Hadoop and data marts, data warehouses, and analytical applications.]
5. [Diagram: the Big Data pyramid - raw data is loaded into files / HDFS (Hadoop), optimized in Hive and the data marts and data warehouse (RDBMS), and visualized by applications and systems in the web tier through reporting, dashboards, and analysis.]
6. [Diagram: the same pyramid by technology layer - HDFS and Hive within Hadoop at the base, a data mart in the RDBMS layer, and reporting / dashboards / analysis in the web tier at the top.]
7. Demo
8. Pentaho for Hadoop Announcements
• Pentaho for Hadoop Download Capability
• Includes support for development; production support will follow with GA
• Collaborative effort between Pentaho and the Pentaho Community
• 60+ beta sites over a three-month beta cycle
• Pentaho contributed code for API integration with HIVE to the open source Apache Foundation
• Pentaho and Cloudera Partnership
• Combines Pentaho's business intelligence and data integration capabilities with Cloudera's Distribution for Hadoop (CDH)
• Enables business users to take advantage of Hadoop with the ability to easily and cost-effectively mine, visualize and analyze their Hadoop data
9. Pentaho for Hadoop Announcements (cont)
• Pentaho and Impetus Technologies Partnership
• Incorporates Pentaho Agile BI and Pentaho BI Suite for Hadoop into the Impetus Large Data Analytics practice
• First major SI to adopt Pentaho for Hadoop
• Facilitates large data analytics projects including expert consulting services, best practices support in Hadoop implementations and nCluster, including deployment on private and public clouds
10. Pentaho for Hadoop Resources & Events
Resources
Download: www.pentaho.com/download/hadoop
Pentaho for Hadoop webpage - resources, press, events, partnerships and more: www.pentaho.com/hadoop
Big Data Analytics: 5-part video series with James Dixon, Pentaho CTO
Events
Hadoop World: NYC - Oct 12, Gold Sponsor, Exhibitor, Richard Daley presenting, ‘Putting Analytics in Big Data Analysis’
London Hadoop User Group - Oct 12, London
Agile BI Meets Big Data - Oct 13, New York City
11. Thank You.
Join the conversation. You can find us on:
Pentaho Facebook Group
@Pentaho
http://blog.pentaho.com
Pentaho - Open Source Business Intelligence Group
Editor's Notes
#3: In a traditional BI system where we have not been able to store all of the raw data, we have solved the problem by being selective.
Firstly we selected the attributes of the data that we knew we had questions about. Then we cleansed it and aggregated it to transaction levels or higher, and packaged it up in a form that is easy to consume. Then we put it into an expensive system that we could not scale, whether technically or financially. The rest of the data was thrown away or archived on tape, which, for the purposes of analysis, is the same as throwing it away.
TRANSITION
The problem is we don’t know what is in the data that we are throwing away or archiving. We can only answer the questions that we could predict ahead of time.
#4: When we look at the Big Data architecture we described before we recall that
* We want to store all of the data, so we can answer both known and unknown questions
* We want to satisfy our standard reporting and analysis requirements
* We want to satisfy ad-hoc needs by providing the ability to dip into the lake at any time to extract data
* We want to balance performance and cost as we scale
We need the ability to take the data in the Data Lake and easily convert it into data suitable for a data mart, data warehouse or ad-hoc data set - without requiring custom Java code
#5: Fortunately we have an embeddable data integration engine, written in Java
We have taken our Data Integration engine, PDI and integrated with Hadoop in a number of different areas:
* We have the ability to move files between Hadoop and external locations
* We have the ability to read and write to HDFS files during data transformations
* We have the ability to execute data transformations within the MapReduce engine (a hand-coded equivalent is sketched after this list for contrast)
* We have the ability to extract information from Hadoop and load it into external data bases and applications
* And we have the ability to orchestrate all of this so you can integrate Hadoop into the rest of your data architecture with scheduling, monitoring, logging etc
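For contrast with the visual approach described in these notes, below is a hand-coded sketch of the kind of MapReduce job PDI is meant to spare you from writing: counting hits per URL in a weblog. It assumes the Hadoop 2.x MapReduce API and a space-delimited log format in which field 7 holds the requested URL; both assumptions are illustrative, not from the presentation.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WeblogHitCount {
        public static class UrlMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text url = new Text();
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Emit (url, 1) per log line; field 7 is assumed to be the URL.
                String[] fields = line.toString().split(" ");
                if (fields.length > 6) {
                    url.set(fields[6]);
                    ctx.write(url, ONE);
                }
            }
        }
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text url, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(url, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "weblog hit count");
            job.setJarByClass(WeblogHitCount.class);
            job.setMapperClass(UrlMapper.class);
            job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }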
#6: Putting this into diagram form, so we can indicate the different layers in the architecture and also show the scale of the data, we get this Big Data pyramid.
* At the bottom of the pyramid we have Hadoop, containing our complete set of data.
* Higher up we have our data mart layer. This layer has less data in it, but has better performance.
* At the top we have application-level data caches.
* Looking down from the top, from the perspective of our users, they can see the whole pyramid - they have access to the whole structure. The only thing that varies is the query time, depending on what data they want.
* Here we see that the RDBMS layer lets us optimize access to the data. We can decide how much data we want to stage in this layer. If we add more storage in this layer, we can increase performance for a larger subset of the data lake, but it costs more money.
#7: In this demo we will show how easy it is to execute a series of Hadoop and non-Hadoop tasks; a hand-coded sketch of the HDFS and Hive steps appears after the step list. We are going to
TRANSITION 1
Get a weblog file from an FTP server
TRANSITION 2
Make sure the source file does not already exist within the Hadoop file system
TRANSITION 3
Copy the weblog file into Hadoop
TRANSITION 4
Read the weblog and process it - add metadata about the URLs, add geocoding, and enrich the operating system and browser attributes
TRANSITION 5
Write the results of the data transformation to a new, improved, data file
TRANSITION 6
Load the data into Hive
TRANSITION 7
Read an aggregated data set from Hadoop
TRANSITION 8
And write it into a database
TRANSITION 9
Slice and dice the data with the database
TRANSITION 10
And execute an ad-hoc query against Hadoop
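A hand-coded sketch of the HDFS and Hive steps from the demo above (checking for and copying the file into Hadoop, then loading it into Hive), assuming the Hadoop FileSystem API and the early HiveServer JDBC driver; all paths and the weblogs table name are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadWeblogIntoHive {
        public static void main(String[] args) throws Exception {
            // Copy the fetched, enriched weblog file into HDFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path src = new Path("/tmp/weblog_enriched.txt");
            Path dst = new Path("/user/pentaho/weblogs/weblog_enriched.txt");
            if (fs.exists(dst)) {
                fs.delete(dst, false); // mirror the "make sure it does not exist" step
            }
            fs.copyFromLocalFile(src, dst);

            // Register the file's contents with Hive so it can be queried.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            stmt.execute("LOAD DATA INPATH '/user/pentaho/weblogs/weblog_enriched.txt' "
                       + "OVERWRITE INTO TABLE weblogs");
            conn.close();
        }
    }

In the presentation these steps are individual job entries in Pentaho Data Integration, chained together visually rather than written by hand.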