As delivered at Trivadis Tech Event 2016 - how Big Data Discovery along with Python and pySpark was used to build predictive analytics models against wearables and smart home data
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill along with proprietary solutions from closed-source vendors have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.
In this session we’ll focus on the technical foundations of SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work, along with the more specialised underlying storage that each now works best with – and we’ll look to the future to see how SQL querying, data integration and analytics are likely to come together over the next five years to make Hadoop the default platform for running mixed old-world/new-world analytics workloads.
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Mark Rittman
The document discusses using Hadoop and NoSQL technologies like Apache HBase to perform social network analysis on Twitter data related to a company's website and blog. It describes ingesting tweet and website log data into Hadoop HDFS and processing it with tools like Hive. Graph algorithms from Oracle Big Data Spatial & Graph were then used on the property graph stored in HBase to identify influential Twitter users and communities. This approach provided real-time insights at scale compared to using a traditional relational database.
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Mark Rittman
As presented at OGh SQL Celebration Day in June 2016, NL. Covers new features in Big Data SQL including storage indexes, storage handlers and ability to install + license on commodity hardware
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we’ll look at the options that are available, compare database vendor solutions with their open-source alternative, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete “data fabric” solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
The Future of Analytics, Data Integration and BI on Big Data PlatformsMark Rittman
The document discusses the future of analytics, data integration, and business intelligence (BI) on big data platforms like Hadoop. It covers how BI has evolved from old-school data warehousing to enterprise BI tools to utilizing big data platforms. New technologies like Impala, Kudu, and dataflow pipelines have made Hadoop fast and suitable for analytics. Machine learning can be used for automatic schema discovery. Emerging open-source BI tools and platforms, along with notebooks, bring new approaches to BI. Hadoop has become the default platform and future for analytics.
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsMark Rittman
Mark Rittman, founder of Rittman Mead, discusses Oracle's approach to hybrid BI deployments and how it aligns with Gartner's vision of a modern BI platform. He explains how Oracle BI 12c supports both traditional top-down modeling and bottom-up data discovery. It also enables deploying components on-premises or in the cloud for flexibility. Rittman believes the future is bi-modal, with IT enabling self-service analytics alongside centralized governance.
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
Mark Rittman gave a presentation on the future of analytics on Oracle Big Data Appliance. He discussed how Hadoop has enabled highly scalable and affordable cluster computing using technologies like MapReduce, Hive, Impala, and Parquet. Rittman also talked about how these technologies have improved query performance and made Hadoop suitable for both batch and interactive/ad-hoc querying of large datasets.
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
This talk focuses on what a data reservoir is, how it relates to the RDBMS data warehouse, and how Big Data Discovery provides access to it for business and BI users
Unlock the value in your big data reservoir using oracle big data discovery a...Mark Rittman
The document discusses Oracle Big Data Discovery and how it can be used to analyze and gain insights from data stored in a Hadoop data reservoir. It provides an example scenario where Big Data Discovery is used to analyze website logs, tweets, and website posts and comments to understand popular content and influencers for a company. The data is ingested into the Big Data Discovery tool, which automatically enriches the data. Users can then explore the data, apply additional transformations, and visualize relationships to gain insights.
Mark Rittman presented on how a tweet about a smart kettle went viral. He analyzed the tweet data using Oracle Big Data Spatial and Graph on a Hadoop cluster. Over 3,000 tweets were captured from over 30 countries in 48 hours. Key influencers were identified using PageRank and by their large number of followers. Visualization tools like Cytoscape and Tom Sawyer Perspectives showed how the tweet spread over time and geography. The analysis revealed that the tweet went viral after being shared by the influential user @erinscafe on the first day.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform ... why IaaS and datawarehousing-as-a-service will have such a big impact, sooner than you think
The document discusses the evolution of big data architectures from Hadoop and MapReduce to Lambda architecture and stream processing frameworks. It notes the limitations of early frameworks in terms of latency, scalability, and fault tolerance. Modern architectures aim to unify batch and stream processing for low latency queries over both historical and new data.
Lambda architecture for real time big dataTrieu Nguyen
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and the creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Big Data 2.0: ETL & Analytics: Implementing a next generation platformCaserta
In our most recent Big Data Warehousing Meetup, we learned about transitioning from Big Data 1.0 with Hadoop 1.x with nascent technologies to the advent of Hadoop 2.x with YARN to enable distributed ETL, SQL and Analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian Engineer covered the complete data value chain of an Enterprise-ready platform including data connectivity, collection, preparation, optimization and analytics with end user access.
Access additional slides from this meetup here:
http://www.slideshare.net/CasertaConcepts/big-data-warehousing-meetup-january-20
For more information on our services or upcoming events, please visit http://www.actian.com/ or http://www.casertaconcepts.com/.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. Example applications and demos of Storm, HBase, Hive and Spark are also presented. The document highlights key aspects of using HDInsight including storage integration and tools for interactive analysis.
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
Data Engineer's Lunch #55: Get Started in Data EngineeringAnant Corporation
In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.
Accompanying Blog: Coming Soon!
Accompanying YouTube: Coming Soon!
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
[email protected]
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
There is a fundamental shift underway in IT to include open, software defined, distributed systems like Hadoop. As a result, every Oracle professional should strive to learn these new technologies or risk being left behind. This session is designed specifically for Oracle database professionals so they can better understand SQL on Hadoop and the benefits it brings to the enterprise. Attendees will see how SQL on Hadoop compares to Oracle in areas such as data storage, data ingestion, and SQL processing. Various live demos will provide attendees with a first-hand look at these new world technologies. Presented at Collaborate 18.
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures which have proven their suitability over years. This session discusses the different Big Data Architectures which have evolved over time, including traditional Big Data Architecture, Streaming Analytics architecture as well as Lambda and Kappa architecture and presents the mapping of components from both Open Source as well as the Oracle stack onto these architectures.
The right architecture is key for any IT project. This is true for big data projects as well, but there are not yet many standard architectures that have proven their suitability over the years.
This session discusses different Big Data Architectures which have evolved over time, including traditional Big Data Architecture, Event Driven architecture as well as Lambda and Kappa architecture.
Each architecture is presented in a vendor- and technology-independent way using a standard architecture blueprint. In a second step, these architecture blueprints are used to show how a given architecture can support certain use cases and which popular open source technologies can help to implement a solution based on a given architecture.
This presentation examines the main building blocks for building a big data pipeline in the enterprise. The content uses inspiration from some of the top big data pipelines in the world like the ones built by Netflix, Linkedin, Spotify or Goldman Sachs
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
This document discusses the application of PostgreSQL in a large social infrastructure project involving smart meter management. It describes three main missions: (1) loading 10 million datasets within 10 minutes, (2) saving data for 24 months, and (3) stabilizing performance for large scale SELECT statements. Various optimizations are discussed to achieve these missions, including data modeling, performance tuning, reducing data size, and controlling execution plans. The results showed that all three missions were successfully completed by applying PostgreSQL expertise and customizing it for the large-scale requirements of the project.
The document outlines the preparation and structure for an effective white paper. It recommends determining the audience and their needs, getting internal buy-in, and defining the paper's objectives, scope, and call to action. The basic outline includes introducing the problem and solution, detailing the high-level and technical aspects of the solution, and summarizing with a call to action. Finally, roll-out strategies are presented such as posting the paper online and distributing it at events to maximize exposure.
This document outlines a new approach called Whole of Institution Reporting which aims to provide a comprehensive picture of an organization's progress, challenges, and performance by collecting both quantitative and qualitative data from across all levels and departments on a regular basis. The goal is to give leadership a holistic understanding of what is happening throughout the entire institution to help identify issues, opportunities for improvement, and inform strategic decision making.
AI-powered conversational agent to enhance the student experience.
Blackboard Open LMS: An open source learning platform that brings together the best of open source and enterprise capabilities.
Blackboard Learn: The most widely used learning management system in the world.
Blackboard Services: We make Blackboard work for you.
Blackboard Analytics: Actionable insights to improve outcomes and optimize resources.
Blackboard Ally: Accessibility tool that produces accessible formats for all learners.
Blackboard Collaborate: Web conferencing for online and blended learning.
Blackboard Open Content
Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s Ne...Rittman Analytics
Mark Rittman presented at Big Data World in London in March 2017 on data integration and data warehousing for cloud, big data, and IoT. He discussed the history of data warehousing and how it has evolved from traditional RDBMS implementations to embrace big data technologies like Hadoop. He described how cloud data warehouse offerings from Google BigQuery and Amazon Redshift combine the scalability of big data with the structure of data warehousing. Rittman also covered new approaches to ETL using data pipelines, schema discovery using machine learning, emerging open-source BI tools, and his current work in these areas.
The document provides an overview of big data concepts and frameworks. It discusses the dimensions of big data including volume, velocity, variety, veracity, value and variability. It then describes the traditional approach to data processing and its limitations in dealing with large, complex data. Hadoop and its core components HDFS and YARN are introduced as the solution. Spark is presented as a faster alternative to Hadoop for processing large datasets in memory. Other frameworks like Hive, Pig and Presto are also briefly mentioned.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
A new big data architecture involves ingesting, processing, and analyzing large or complex data sources. It includes batch processing of stored data, real-time processing of streaming data, interactive exploration, and predictive analytics. The key components are data sources, storage like a data lake, batch and stream processing, an analytical data store, analysis/reporting, and orchestration. Batch jobs prepare stored data while stream processing handles real-time data before loading into the analytical data store for querying, exploration, and insights.
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
Mark Rittman from Rittman Mead presented on Oracle Big Data Discovery. He discussed how many organizations are running big data initiatives involving loading large amounts of raw data into data lakes for analysis. Oracle Big Data Discovery provides a visual interface for exploring, analyzing, and transforming this raw data. It allows users to understand relationships in the data, perform enrichments, and prepare the data for use in tools like Oracle Business Intelligence.
This document summarizes a presentation on using SQL Server Integration Services (SSIS) with HDInsight. It introduces Tillmann Eitelberg and Oliver Engels, who are experts on SSIS and HDInsight. The agenda covers traditional ETL processes, challenges of big data, useful Apache Hadoop components for ETL, clarifying statements about Hadoop and ETL, using Hadoop in the ETL process, how SSIS is more than just an ETL tool, tools for working with HDInsight, getting started with Azure HDInsight, and using SSIS to load and transform data on HDInsight clusters.
Transform from database professional to a Big Data architectSaurabh K. Gupta
This document discusses transitioning from an Oracle DBA to a Big Data architect. It provides an overview of big data, key technologies, and how DBAs can leverage their skills. The speaker is introduced as a data leader with Oracle experience who will cover how DBAs can contribute to big data. The agenda includes an overview of big data, designing big data solutions, common technologies, and building a big data team. Common data sources, acquisition methods, storage options, and analytics are also summarized.
A summarized version of a presentation on Big Data architecture, covering everything from the Big Data concept through to Hadoop and tools like Hive, Pig and Cassandra
This document provides an overview of the Hadoop ecosystem. It begins by defining big data and explaining how Hadoop uses MapReduce and HDFS to allow for distributed processing and storage of large datasets across commodity hardware. It then describes various components of the Hadoop ecosystem for acquiring, arranging, analyzing, and visualizing data, including Flume, Sqoop, Kafka, HDFS, HBase, Spark, Pig, Hive, Impala, Mahout, and HUE. Real-world use cases of Hadoop at companies like Facebook, Twitter, and NASA are also discussed. Overall, the document outlines the key elements that make up the Hadoop ecosystem for working with big data.
A big data architecture handles large or complex data through ingestion, processing, and analysis. It typically includes data sources, storage like a data lake, batch and stream processing, an analytical data store, analysis/reporting, and orchestration. Common components are Azure Data Lake Store, Azure Stream Analytics, HDInsight, and Azure Synapse Analytics which enable batch and stream processing, serving analytical data, and automating workflows.
Colorado Springs Open Source Hadoop/MySQL David Smelker
This document discusses MySQL and Hadoop integration. It covers structured versus unstructured data and the capabilities and limitations of relational databases, NoSQL, and Hadoop. It also describes several tools for integrating MySQL and Hadoop, including Sqoop for data transfers, MySQL Applier for streaming changes to Hadoop, and MySQL NoSQL interfaces. The document outlines the typical life cycle of big data with MySQL playing a role in data acquisition, organization, analysis, and decisions.
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
Businesses often have to interact with different data sources to get a unified view of the business or to resolve discrepancies. These EDW data repositories are often large and complex, are business critical, and cannot afford downtime. This session will share best practices and lessons learned for building a Data Fabric on Spark / Hadoop / Hive / NoSQL that provides a unified view, enables simplified access to the data repositories, resolves technical challenges and adds business value.
Minimizing the Complexities of Machine Learning with Data VirtualizationDenodo
Watch full webinar here: https://buff.ly/309CZ1Y
Advanced data science techniques, like machine learning, have proven an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
*How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
*How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
*How you can use the Denodo Platform with large data volumes in an efficient way
*About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB
T-Sciences offers iSpatial - a web-based Spatial Data Infrastructure (SDI) to enable integration of third-party applications with geo-visualization tools. The iHarvest tool further enables the mining and analysis of data aggregated in the iSpatial platform for spatio-temporal behavior modelling. At the back-end of both products is MongoDB, providing fundamental framework capabilities for the spatial indexing and data analysis techniques. Come witness how Thermopylae Sciences and Technology leveraged the aggregation framework, and extended the spatial capabilities of MongoDB to tackle dynamic spatio-behavioral data at scale.
The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data have also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems.
This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis uses cases. We’ll talk about what these technologies mean for the future of analyzing data with R.
Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.
This document provides a summary of Oracle OpenWorld 2014 discussions on database cloud, in-memory database, native JSON support, big data, and Internet of Things (IoT) technologies. Key points include:
- Database Cloud on Oracle offers pay-as-you-go pricing and self-service provisioning similar to on-premise databases.
- Oracle Database 12c includes an in-memory option that can provide up to 100x faster analytics queries and 2-4x faster transaction processing.
- Native JSON support in 12c allows storing and querying JSON documents within the database.
- Big data technologies like Oracle Big Data SQL and Oracle Big Data Discovery help analyze large and diverse data sets from sources like
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
This document provides an overview of big data and Hadoop. It defines big data as high-volume, high-velocity, and high-variety data that requires new techniques to capture value. Hadoop is introduced as an open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for storage and MapReduce for parallel processing. Benefits of Hadoop are its ability to handle large amounts of structured and unstructured data quickly and cost-effectively at large scales.
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
Mark Rittman, CTO of Rittman Mead, gave a keynote presentation on big data for Oracle developers and DBAs with a focus on Apache Spark, real-time analytics, and predictive analytics. He discussed how Hadoop can provide flexible, cheap storage for logs, feeds, and social data. He also explained several Hadoop processing frameworks like Apache Spark, Apache Tez, Cloudera Impala, and Apache Drill that provide faster alternatives to traditional MapReduce processing.
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive AnalyticsMark Rittman
This is a session for Oracle DBAs and devs that looks at cutting-edge big data techs like Spark, Kafka etc, and through demos shows how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...Mark Rittman
OBIEE12c comes with an updated version of Essbase that focuses entirely in this release on the query acceleration use-case. This presentation looks at this new release and explains how the new BI Accelerator Wizard manages the creation of Essbase cubes to accelerate OBIEE query performance
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Mark Rittman
This document summarizes a presentation about adding a Hadoop-based data reservoir to an Oracle data warehouse. The presentation discusses using a data reservoir to store large amounts of raw customer data from various sources to enable 360-degree customer analysis. It describes loading and integrating the data reservoir with the data warehouse using Oracle tools and how organizations can use it for more personalized customer marketing through advanced analytics and machine learning.
What is Big Data Discovery, and how it complements traditional business anal...Mark Rittman
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Mark Rittman
- Mark Rittman presented on deploying full OBIEE systems to Oracle Cloud. This involves migrating the data warehouse to Oracle Database Cloud Service, updating the RPD to connect to the cloud database, and uploading the RPD to Oracle BI Cloud Service. Using the wider Oracle PaaS ecosystem allows hosting a full BI platform in the cloud.
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Mark Rittman
Presentation from the Rittman Mead BI Forum 2015 masterclass, pt.2 of a two-part session that also covered creating the Discovery Lab. Goes through setting up Flume log + twitter feeds into CDH5 Hadoop using ODI12c Advanced Big Data Option, then looks at the use of OBIEE11g with Hive, Impala and Big Data SQL before finally using Oracle Big Data Discovery for faceted search and data mashup on-top of Hadoop
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...Mark Rittman
This document discusses an end-to-end example of using Hadoop, OBIEE, ODI and Oracle Big Data Discovery to analyze big data from various sources. It describes ingesting website log data and Twitter data into a Hadoop cluster, processing and transforming the data using tools like Hive and Spark, and using the results for reporting in OBIEE and data discovery in Oracle Big Data Discovery. ODI is used to automate the data integration process.
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015Mark Rittman
Slides from a two-day OBIEE11g seminar in Dubai, February 2015, at the Oracle University Expert Summit. Covers the following topics:
1. OBIEE 11g Overview & New Features
2. Adding Exalytics and In-Memory Analytics to OBIEE 11g
3. Source Control and Concurrent Development for OBIEE
4. No Silver Bullets - OBIEE 11g Performance in the Real World
5. Oracle BI Cloud Service Overview, Tips and Techniques
6. Moving to Oracle BI Applications 11g + ODI
7. Oracle Essbase and Oracle BI EE 11g Integration Tips and Techniques
8. OBIEE 11g and Predictive Analytics, Hadoop & Big Data
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIMark Rittman
The document discusses Oracle's Big Data SQL, which brings Oracle SQL capabilities to Hadoop data stored in Hive tables. It allows querying Hive data using standard SQL from Oracle Database and viewing Hive metadata in Oracle data dictionary tables. Big Data SQL leverages the Hive metastore and uses direct reads and SmartScan to optimize queries against HDFS and Hive data. This provides a unified SQL interface and optimized query processing for both Oracle and Hadoop data.
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cMark Rittman
This document discusses using Hadoop and Hive for ETL work. It provides an overview of using Hadoop for distributed processing and storage of large datasets. It describes how Hive provides a SQL interface for querying data stored in Hadoop and how various Apache tools can be used to load, transform and store data in Hadoop. Examples of using Hive to view table metadata and run queries are also presented.
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014
In this presentation we cover some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).
Part 4 - Hadoop Data Output and Reporting using OBIEE11gMark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle Big Data Discovery and Big Data SQL can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cMark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
Using Oracle Big Data Discovery as a Data Scientist's Toolkit
1. T : @markrittman
USING ORACLE BIG DATA DISCOVERY AS THE
DATA SCIENTIST'S TOOLKIT
Mark Rittman, Oracle ACE Director
TRIVADIS TECHEVENT 2016, ZÜRICH
2. •Oracle ACE Director, blogger + ODTUG member
•Regular columnist for Oracle Magazine
•Past ODTUG Executive Board Member
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Implementor, trainer, consultant + company founder
•Based in Brighton, UK
About The Presenter
2
3. •A visual front-end to the Hadoop data reservoir, providing end-user access to datasets
•Data sampled and loaded from Hadoop (Hive) into NoSQL Dgraph engine for fast analysis
•Catalog, profile, analyse and combine schema-on-read datasets across the Hadoop cluster
•Visualize and search datasets to gain insights, potentially load in summary form into DW
Oracle Big Data Discovery - What Is It?
3
7. Tools And Techniques Used By Data Scientists
7
IMPORTING AND TIDYING DATA
VISUALISING AND TRANSFORMING DATA
MODELING AND INFERRING
COMMUNICATING AND BUNDLING
8. Tools And Techniques Used By Data Scientists
8
IMPORTING AND TIDYING DATA
MODELING AND INFERRING
•Whilst Big Data Discovery 1.1 enabled data wrangling, it was single-row only
•No ability to aggregate data or perform inter-row calculations
•No special null handling or other regularly-used techniques
•No ability to materialise joins (only in data visualizations)
•No ability to access commonly-used R, Python and other stats libraries
•No solution for machine learning or predictive analytics
10. IMPORTING AND TIDYING DATA
METADATA AND DEVELOPER PRODUCTIVITY
COMMUNICATING AND BUNDLING
•Metadata Curation
•Attribute-level Search from Catalog
•Activity Hub
•Python Interface to BDD Datasets
•Streamlined UI
•Faster Data Indexing
•Activity Hub
•Sunburst Visualization
•Aggregation
•Materialised Joins
•Better Pan and Zoom
•Speed and Scale
New Features In Oracle Big Data Discovery 1.2
10
11. •Interactive tool designed to work with BDD without using Studio's front-end
•Exposes all BDD concepts (views, datasets, data sources etc)
•Supports Apache Spark
•HiveContext and SQLContext exposed
•BDD Shell SDK for easy access to BDD features and functionality
•Access to third-party libraries such as Pandas, Spark ML, numPy
•Use with a web-based notebook such as iPython, Jupyter or Zeppelin
Big Data Discovery Python Shell - What Is It?
11
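To make this concrete, below is a minimal sketch of the kind of BDD Shell / notebook session described above, using only standard pySpark APIs from the Spark 1.x era that ships with CDH 5.x. It assumes a SparkContext named sc is already provided by the BDD Shell or Jupyter session; the Hive database, table and column names are hypothetical examples rather than names from the actual project.

from pyspark.sql import HiveContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# HiveContext gives SQL access to the Hive tables sitting over the raw JSON data
hive_ctx = HiveContext(sc)
workouts = hive_ctx.sql(
    "SELECT distance, moving_time, total_elevation_gain "
    "FROM personal_lake.strava_workouts")

# Pull the result set into pandas for quick exploration in the notebook
workouts_pd = workouts.toPandas()
print(workouts_pd.describe())

# Fit a simple Spark MLlib regression: predict moving time from distance and climb
points = workouts.rdd.map(lambda r: LabeledPoint(
    r.moving_time, [r.distance, r.total_elevation_gain]))
model = LinearRegressionWithSGD.train(points, iterations=100, step=0.0001)
print(model.weights)

Because HiveContext and SQLContext are exposed directly, anything already modelled as a Hive table in the data reservoir is immediately queryable from Python, and third-party libraries such as pandas or Spark ML can be applied to the result in the same session.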
14. •Over the past months I’ve been on sabbatical, taking time out to look at new Hadoop tech
•Building prototypes, working with startups & analysts outside of core Oracle world
•Asking myself the question “What will an analytics platform look like in 5 years’ time?”
•But also during this time, getting fit, getting into cycling and losing 14kg over 12 months
•Using Wahoo Elemnt + Strava for workout recording
•Withings Wifi scales for weight + body fat measurement
•Jawbone UP3 for steps, sleep, resting heart rate
•All the time, collecting data and storing it in Hadoop
Personal Data Science Project - “Quantified Self”
14
15. •Quantified Self is about self-knowledge through numbers
•Decide on some goals, work out what metrics to track
•Use wearables and other smart devices to record steps,
heart rate, workouts, weight and other health metrics
•Plot, correlate, track trends and combine datasets
•For me, goal was to maintain new “healthy weight”
•Understand drivers of weight gain or loss
•See how sleep affected productivity
•Understand what behaviours led to a “good day”
Personal Data Science Project - “Quantified Self”
15
18. Smart Devices Logging Data To Hadoop Cluster
18
•Smart home devices: Philips Hue lighting; Nest Protect (x2), Thermostat and Cam; Withings Smart Scales; AirPlay speakers; door, motion, moisture and presence sensors
•Samsung SmartThings Hub (Z-Wave, Zigbee), with Apple HomeKit, Apple TV and Siri bridged in via a Homebridge HomeKit / SmartThings connector
•Cloud and app sources via the IFTTT Maker Channel: Gmail, Withings Scales, Strava, Jawbone UP, Weather, YouTube, iOS Photos, Twitter, RescueTime, Pocket, Instagram, Google Calendar, Facebook
•Events arrive in real-time as JSON via HTTP POST into LogStash
•Landed on a 6-node CDH5.8 Hadoop cluster in the garage, plus Oracle Big Data Discovery 1.2.0 on a 4-node VMware ESXi cluster
19. •Data extracted or transported to the target platform using LogStash or CSV file batch loads
•Landed into HDFS as JSON documents, then exposed as Hive tables using a storage handler
•Cataloged, visualised and analysed using Oracle Big Data Discovery + Python ML
Hadoop Cluster Dataset - “Personal Data Lake"
19
•Data transfer (“Data Factory”): LogStash via HTTP, manual CSV upload; data streams arrive as CSV, IFTTT events or API calls
•“Personal” Data Lake on a 6-node Hadoop cluster (CDH5.5): raw JSON log files in HDFS, each document an event, daily record or comms message - health data, unstructured comms data, smart home sensor data
•Hive tables with the Elasticsearch storage handler turn the index data into tabular format
•Discovery & Development Labs: Oracle Big Data Discovery 1.2 holding data sets and samples, models and programs
•Data access: Jupyter web notebook and Oracle DV Desktop; models built with BDD Shell, Python and Spark ML
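A hedged sketch of exposing those raw JSON log files as a Hive table, issued through the HiveContext from the earlier example - the HDFS path, table name, columns and particular JSON SerDe class are all assumptions (the deck only says “Hive JSONSerDe”, and separately uses an Elasticsearch storage handler for index data):

```python
# Hedged sketch: exposing raw JSON documents in HDFS as a Hive table, issued
# through the pySpark HiveContext from the earlier sketch. The HDFS path,
# table name and columns are hypothetical, and the Hive-HCatalog JSON SerDe
# shown is one common choice - the deck only says "Hive JSONSerDe".
hc.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS strava_workouts_raw (
    workout_date STRING,
    activity     STRING,
    distance_km  DOUBLE,
    moving_time  BIGINT
  )
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  LOCATION '/user/logstash/strava/'
""")
```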
20. •Uses IFTTT cloud workflow service to subscribe to events on wearables’ APIs
•Triggers HTTP GET request via IFTTT Maker Channel to Logstash running at home
•Event data sent as JSON documents, loaded into HDFS via webhdfs protocol
•Structured in Hadoop using Hive JSONSerDe
•Then loaded hourly into DGraph using Big Data Discovery dataprocessing CLI
•Event data automatically enriched, and can be joined to smart home data for analysis
Landing Wearables Data In Real-Time
20
1. New workout logged using Strava
2. Workout details uploaded to Strava using cloud API
3. IFTTT recipe gets workout event from Strava API, triggers an HTTP GET web request
4. JSON document received by Logstash, then forwarded to Hadoop using webhdfs PUT
5. JSON documents landed in HDFS in raw form, then structured using Hive JSONSerDe
6. Hive data uploaded into Oracle Big Data Discovery, visualised and wrangled, and modelled using pySpark
(Steps 1-3 run in the cloud; steps 4-6 at home)
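Step 4’s “forwarded to Hadoop using webhdfs PUT” is handled by Logstash’s webhdfs output; purely to illustrate the protocol underneath, a hedged Python sketch of the two-step WebHDFS create call (hostname, port and target path are assumptions):

```python
# Illustration only: the two-step WebHDFS create that Logstash's webhdfs
# output performs. PUT to the NameNode first, follow the 307 redirect to a
# DataNode, then PUT the document body. Host, port and path are assumptions.
import json
import requests

event = {"source": "strava", "type": "workout", "distance_km": 42.3}

namenode_url = ("http://cdh-nn1.home:50070/webhdfs/v1"
                "/user/logstash/strava/workout.json?op=CREATE&overwrite=true")

# Step 1: NameNode replies with a redirect to the DataNode that takes the data
r1 = requests.put(namenode_url, allow_redirects=False)
datanode_url = r1.headers["Location"]

# Step 2: send the JSON document body to the DataNode location
r2 = requests.put(datanode_url, data=json.dumps(event),
                  headers={"Content-Type": "application/octet-stream"})
r2.raise_for_status()
```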
21. •All smart device events and sensor readings are routed through the Samsung SmartThings hub
•Including Apple HomeKit devices, through custom integration
•Event data uploads to the SmartThings cloud service + storage
•Custom Groovy SmartApp subscribes to device events, transmits JSON documents to Logstash using HTTP GET requests
•Then the process flow is the same as with wearables and social media / comms data
Landing Smart Home Data In Real-Time
21
1. Sensor or other smart device raises a SmartThings event
2. Event logged in the Samsung SmartThings cloud service from the SmartThings Hub
3. SmartApp subscribes to device events, forwards them as JSON documents using HTTP GET requests
4. JSON document received by Logstash, then forwarded to Hadoop using webhdfs PUT
5. JSON documents landed in HDFS in raw form, then structured using Hive JSONSerDe
6. Hive data uploaded into Oracle Big Data Discovery, visualised and wrangled, and modelled using pySpark
(Steps 1-3 run in the cloud; steps 4-6 at home)
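The forwarding SmartApp itself is Groovy on the SmartThings platform; to keep the examples in one language, here is a hedged Python sketch of the same idea - a device event serialised as JSON and pushed to a Logstash HTTP listener. The endpoint, port and event fields are assumptions, and it is shown as a POST for simplicity where the SmartApp described above uses GET requests:

```python
# Hedged sketch in Python (the real SmartApp is Groovy): serialise a device
# event as JSON and push it to a Logstash HTTP listener. Endpoint, port and
# event fields are assumptions.
import json
import requests

event = {
    "device": "hallway-motion-sensor",
    "capability": "motionSensor",
    "value": "active",
    "timestamp": "2016-09-15T07:42:10Z",
}

resp = requests.post("http://logstash.home:8080/",
                     data=json.dumps(event),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
```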
22. •As well as visualising the combined dataset, we could also use “machine learning”
•Find correlations, predict outcomes based on regression analysis, classify and cluster data
•Run algorithms on the full dataset to answer questions like:
•“What are the biggest determinants of weight gain or loss for me?”
•“On a good day, what is the typical combination of behaviours I exhibit?”
•“If I raised my cadence RPM average, how much further could I cycle per day?”
•“Is working late or missing lunch self-defeating in terms of overall weekly output?”
And Use Machine Learning For Insights…
22
MODELING AND INFERRING
23. •Analysis started with data from Jawbone UP2 ecosystem (manual export, and via IFTTT events)
•Base activity data (steps, active time, active calories expended)
•Sleep data (time asleep, time in-bed, light and deep sleep, resting heart-rate)
•Mood if recorded; food ingested if recorded
•Workout data as provided by Strava integration
•Weight data as provided by Withings integration
Initial Base Dataset - Jawbone Up Extract
23
24. •Understand the “spread” of data using histograms
•Use box-plot charts to identify outliers and range of “usual” values
•Sort attributes by strongest correlation to a target attribute
Perform Exploratory Analysis On Data
24
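The same exploration can be reproduced from the notebook side; a hedged pandas sketch, with the extract file and column names assumed:

```python
# Hedged pandas sketch of the exploratory steps above: histograms for spread,
# box plots for outliers, attributes ranked by correlation with the target.
# The extract file and column names are hypothetical.
import matplotlib.pyplot as plt
import pandas as pd

daily_pd = pd.read_csv("jawbone_daily_summary.csv")

# Spread of each numeric attribute
daily_pd.hist(figsize=(10, 8))
plt.show()

# Outliers and the range of "usual" values
daily_pd.boxplot(column=["steps", "sleep_hours", "weight_kg"])
plt.show()

# Attributes sorted by strength of correlation with weight change
corr = daily_pd.corr()["weight_change_kg"].drop("weight_change_kg")
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```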
25. •Initial row-wise preparation and wrangling of the data using BDD’s Groovy-based transformations
Transform (“Wrangle”) Data As Needed
25
26. •Very typical with self-recorded healthcare and workout data
•Most machine-learning algorithms expect every attribute to have a value per row
•Self-recorded data is typically sporadically recorded, lots of gaps in data
•Need to decide what to do with columns of poorly populated values
Dealing With Missing Data (“Nulls”)
26
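A hedged pandas sketch of one way to handle the gaps - the 60% threshold, the forward-fill choice and the column names are assumptions, not anything BDD does itself:

```python
# Hedged pandas sketch of one approach: drop attributes that are mostly empty,
# carry slowly-changing readings forward, and treat missing counts as zero.
# The 60% threshold and column names are assumptions, not BDD behaviour.
import pandas as pd

daily_pd = pd.read_csv("jawbone_daily_summary.csv", parse_dates=["day"])

# Drop attributes with fewer than 60% of rows populated - too sparse to model
daily_pd = daily_pd.loc[:, daily_pd.notnull().mean() >= 0.6]

# Slowly-changing measurements: carry the last recorded value forward
daily_pd["weight_kg"] = daily_pd["weight_kg"].ffill()

# Count-style measurements: a missing day means nothing was recorded
daily_pd["workout_minutes"] = daily_pd["workout_minutes"].fillna(0)
```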
27. •Previous versions of BDD allowed you to create joins for views
•Used in visualisations; equivalent to a SQL view, i.e. SELECT only
•BDD 1.2.x allows you to add new joined attributes to the data view, i.e. materialise them
•In this instance, used to bring in data on emails and on geolocation
Joining Datasets To Materialize Related Data
27
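The equivalent materialised join expressed in pySpark, as a hedged sketch - the table names and join key are hypothetical, and `hc` is the HiveContext from the earlier sketches:

```python
# Hedged pySpark sketch of materialising joined attributes rather than just
# viewing them: join the daily health dataset to email counts and geolocation,
# then persist the result as a new Hive table. Names and keys are hypothetical.
daily     = hc.table("jawbone_daily")
emails    = hc.table("gmail_daily_counts")      # e.g. day, emails_sent
locations = hc.table("ios_photos_geolocation")  # e.g. day, home_or_away

enriched = (daily
            .join(emails, on="day", how="left")
            .join(locations, on="day", how="left"))

# Persist so the joined attributes are part of the dataset, not just a view
enriched.write.mode("overwrite").saveAsTable("jawbone_daily_enriched")
```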
28. •Aggregating to week level is the only sensible option when looking at change in weight compared to the prior period
•Change compared to the previous day is too granular
Aggregate Data To Week Level
28
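A hedged pandas sketch of that weekly roll-up and the week-over-week change calculation, with the column names assumed:

```python
# Hedged pandas sketch of the week-level roll-up and the week-over-week
# weight change used as the target attribute. Column names are assumptions.
import pandas as pd

daily_pd = pd.read_csv("jawbone_daily_summary.csv",
                       parse_dates=["day"], index_col="day")

weekly = daily_pd.resample("W").agg({
    "steps":           "sum",
    "workout_minutes": "sum",
    "emails_sent":     "sum",
    "sleep_hours":     "mean",
    "weight_kg":       "mean",
})

# Change in weight compared to the prior week
weekly["weight_change_kg"] = weekly["weight_kg"].diff()
```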
29. NOW FOR THE CLEVER BIT
MODELING AND INFERRING
29
33. Use Linear Regression on BDD Dataset via Python
33
•To answer the question - which metric is the most influential when it comes to weight change?
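A hedged Spark ML sketch of that regression step - the `weekly_summary` table and its columns are hypothetical, and `hc` is the HiveContext from the earlier sketches:

```python
# Hedged Spark ML sketch of the regression step: assemble candidate metrics
# into a feature vector, fit a linear model of weekly weight change, and look
# at the coefficients. `weekly_summary` and its columns are hypothetical.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

weekly_df = hc.table("weekly_summary").na.drop()

features = ["steps", "workout_minutes", "sleep_hours", "emails_sent"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
training = assembler.transform(weekly_df)

lr = LinearRegression(featuresCol="features", labelCol="weight_change_kg")
model = lr.fit(training)

# Coefficients line up with the feature list; comparing magnitudes as
# "influence" only makes sense if the features are on comparable scales
for name, coef in zip(features, model.coefficients):
    print(name, coef)
print("r2:", model.summary.r2)
```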
34. And the Answer … Amount of Sleep Each Night
34
•Most influential variable/attribute in my weight loss / gain is “# of emails sent”
•Inverse correlation - the more emails I sent, the more weight I lost - but why?
•In my case - an unusual set of circumstances that led to late nights and bursts of intense work
•So busy I skipped meals, didn’t snack - stress and overwork, perhaps
•And then compensated once the work was over by getting out on the bike and exercising
•Correlation and most influential variable will probably change over time
•This is where the data, measuring it, and analysing it comes in
•Useful basis for experimenting
•And bring in the Smart Home data too
35. •Load device + event data into Cloudera Kudu rather than HDFS + Hive
•Current limitation is that Big Data Discovery does not work with Kudu or Impala
•But useful for real-time metrics (BDD requires batch ingest, and samples the data)
•Use Kafka for more reliable event routing
•Push email, social media, saved documents etc into Cloudera Search
•Do more on the machine learning / data integration + correlation side
For The Future..?
35
39. THANK YOU
39
40. T : @markrittman
USING ORACLE BIG DATA DISCOVERY AS THE
DATA SCIENTIST'S TOOLKIT
Mark Rittman, Oracle ACE Director
TRIVADIS TECHEVENT 2016, ZÜRICH