Architecting a Next Generation Data Platform (hadooparchbook)
This document summarizes a presentation on Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingestion of streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
The document discusses creating a modern data architecture using a data lake. It describes Zaloni as a provider of data lake management solutions, including a data lake management and governance platform and self-service data platform. It outlines key features of a data lake such as storing different types of data, creating standardized datasets, and providing shorter time to insights. The document also discusses Zaloni's data lake maturity model and reference architecture.
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
Innovation in the Data Warehouse - StampedeCon 2016 (StampedeCon)
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we'll dive into the decision points necessary for your own cluster, for example: cloud vs. on premises, physical vs. virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we'll share some lessons learned about which pieces of our architecture worked well and rant about those which didn't. No deep Hadoop knowledge is necessary; the talk suits both architect and executive levels.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Spark in the Enterprise - 2 Years Later by Alan Saldich (Spark Summit)
Over the past 2 years, Cloudera has focused on improving and supporting Apache Spark. They have integrated Spark with Hadoop components like YARN, HBase, and Kafka. Cloudera engineers have also contributed security, monitoring, and governance features to Spark. More than 200 customers now use Spark for tasks like ETL, machine learning, and streaming analytics. Customers want Spark to have security comparable to databases, high performance, and simplicity. Cloudera is developing technologies like Sentry and Kudu to meet these needs and make Spark more powerful and useful for enterprises.
GeoWave: Open Source Geospatial/Temporal/N-dimensional Indexing for Accumulo,... (DataWorks Summit)
GeoWave is an open-source library that connects geospatial software with distributed computing frameworks. GeoWave leverages the scalability of a distributed key-value store for effective storage, retrieval, and analysis of massive geospatial datasets. It uses a space filling curve to preserve locality between multi-dimensional objects and the single dimensional sort order imposed by key-value stores. What this means to a user is that distributed spatial and spatial-temporal retrieval and analysis can be effectively accomplished at a massive scale.
At its core, GeoWave solves the problem of multi-dimensional indexing, and particularly extends this capability to spatial/temporal use cases. GeoWave supports raster, vector, and point cloud data, and provides common spatial algorithms that can be extended to create deep analytic capabilities. It also performs fast subsampling via distributed rendering that integrates with GeoServer, so that a user can interactively visualize data at map scale regardless of density.
Our goal in presenting GeoWave to the Hadoop Summit is to introduce it to the big data community. We will present GeoWave at a moderate level of detail, to include a short demonstration, and hopefully answer any questions regarding maturity, suitability and implementation details.
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
Speaker: Russ Savage, from Cask
Big Data Applications Meetup, 09/14/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://youtu.be/4j78g3WvC4Y
About the talk:
As data lake sizes grow, and more users begin exploring and including that data in their everyday analysis, keeping track of the sources for data becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis is leveraging the most accurate and up to date information. In this talk, we will explore the different techniques available to keep track of your data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
The document describes Big Data Ready Enterprise (BDRE), an open source product that addresses common challenges in implementing and operating big data solutions at large scale. It provides out-of-the-box features to accelerate implementations using pluggable architecture, community support, and distribution compatibility. The document outlines BDRE's key benefits and capabilities for data ingestion, workflow automation, operational metadata management, and more. It also provides examples of BDRE implementations and screenshots of the product's interface.
Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all.
Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
The Future of Analytics, Data Integration and BI on Big Data Platforms (Mark Rittman)
The document discusses the future of analytics, data integration, and business intelligence (BI) on big data platforms like Hadoop. It covers how BI has evolved from old-school data warehousing to enterprise BI tools to utilizing big data platforms. New technologies like Impala, Kudu, and dataflow pipelines have made Hadoop fast and suitable for analytics. Machine learning can be used for automatic schema discovery. Emerging open-source BI tools and platforms, along with notebooks, bring new approaches to BI. Hadoop has become the default platform and future for analytics.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... (Agile Testing Alliance)
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was done at #doppa17 DevOps++ Global Summit 2017. All the copyrights are reserved with the author
Build Big Data Enterprise solutions faster on Azure HDInsight (DataWorks Summit)
Hadoop and Spark are big data frameworks used to extract useful insights from data, spanning a variety of scenarios from ingestion, data prep, and data management to processing, analyzing and visualizing data. Each step requires specialized toolsets to be productive. In this talk I will share solution examples in the Big Data ecosystem such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise; take advantage of all the benefits of HDInsight and get the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Lambda-less Stream Processing @Scale in LinkedIn
The document discusses challenges with stream processing including data accuracy and reprocessing. It proposes a "lambda-less" approach using windowed computations and handling late and out-of-order events to produce eventually correct results. Samza is used in LinkedIn's implementation to store streaming data locally using RocksDB for processing within configurable windows. The approach avoids code duplication compared to traditional lambda architectures while still supporting reprocessing through resetting offsets. Challenges remain in merging online and reprocessed results at large scale.
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi... (Data Con LA)
This document discusses how Redis can be used for analytics at high speeds. It provides examples of how Redis data structures and operations allow for real-time bidding, recommendations, and time-series analytics. Redis on flash is presented as a cost-effective way to achieve high performance by using flash as an extension of RAM. Redis modules are introduced as a way to extend Redis capabilities with features like full text search, graphs, and SQL.
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre... (Data Con LA)
This talk will present how to build data pipelines with no code using the open-source, Apache 2.0, Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.
The document discusses rolling out a Hadoop-based data lake for self-service analytics within a corporate environment. It describes the background and motivation for implementing the data lake. Key challenges addressed include security, governance, and change management. Lessons learned include the importance of guidelines, reusable components, integration testing, and understanding users' diverse needs.
Architecting a Next Generation Data Platform – Strata Singapore 2017 (Jonathan Seidman)
This document discusses the high-level architecture for a data platform to support a customer 360 view using data from connected vehicles (taxis). The architecture includes data sources, streaming data ingestion using Kafka, schema validation, stream processing for transformations and routing, and storage for analytics, search and long-term retention. The presentation covers design considerations for reliability, scalability and processing of both streaming and batch data to meet requirements like querying, visualization, and batch processing of historical data.
Turn Data Into Actionable Insights - StampedeCon 2016 (StampedeCon)
At Monsanto, emerging technologies such as IoT, advanced imaging and geospatial platforms, and molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a cloud-based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled a collaborative discovery analytics environment for data science teams to perform model development; provisioned data through APIs and streams; deployed models to production through our auto-scaling big data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale; and integrated analytics with our core product platforms to turn data into actionable insights.
The Hadoop Guarantee: Keeping Analytics Running On Time (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Pepperdata
Live Webcast September 15, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=32f198185d9d0c4cf32c27bdd1498b2a
Industry researchers agree: the importance of Hadoop will continue to grow as more companies recognize the range of benefits they can reap, from lower-cost storage to better business insights. At the same time, advances in the Hadoop ecosystem are addressing many of the key concerns that have hampered adoption, including performance and reliability. As a result, Hadoop is fast becoming a first-class citizen in the world of enterprise computing.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop ecosystem is evolving into a mature foundation for managing enterprise data. He’ll be briefed by Sean Suchter of Pepperdata, who will explain how his company’s software brings predictability and reliability to Hadoop through dynamic, policy-based controls and monitoring. He’ll show how to guarantee service-level agreements by slowing down low-priority tasks as needed. He’ll also discuss the holy grail of Hadoop: how to enable mixed workloads.
Visit InsideAnalysis.com for more information.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
The document discusses Hortonworks DataFlow (HDF), which is a platform for data in motion. HDF allows users to collect data at the edge, route and process streaming data with Apache NiFi and Kafka, and analyze, visualize, predict and prescribe outcomes from the data using HDF platform services. The HDF platform provides scalable stream processing, security, data provenance, and management capabilities for data in motion applications across the enterprise.
This document provides an overview of Riak TS, Basho's new purpose-built time series database. It describes Riak TS's key features like high write throughput, efficient range query support, and horizontal scalability. It also outlines Riak TS's data modeling approach of co-locating and partitioning time-series data, its SQL-like query language, and provides examples of its performance and roadmap. Finally, it demonstrates a potential use case application called UNCORKD for tracking wine check-ins and reviews.
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios (kcmallu)
What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?
Hadoop and the Data Warehouse: Point/Counter Point (Inside Analysis)
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
How Oracle has managed to separate the SQL engine of its flagship database to process queries, along with the access drivers that make it possible to read data both from files on the Hadoop Distributed File System and from the data warehousing tool, Hive.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
Enterprise Hadoop is Here to Stay: Plan Your Evolution Strategy (Inside Analysis)
The Briefing Room with Neil Raden and Teradata
Live Webcast on August 19, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=1acd0b7ace309f765dc3196001d26a5e
Modern enterprises have been able to solve information management woes with the data warehouse, now a staple across the IT landscape that has evolved to a high level of sophistication and maturity with thousands of global implementations. Today’s modern enterprise has a similar challenge; big data and the fast evolution of the Hadoop ecosystem create plenty of new opportunities but also a significant number of operational pains as new solutions emerge.
Register for this episode of The Briefing Room to hear veteran Analyst Neil Raden as he explores the details and nature of Hadoop’s evolution. He’ll be briefed by Cesar Rojas of Teradata, who will share how Teradata solves some of the Hadoop operational challenges. He will also explain how the integration between Hadoop and the data warehouse can help organizations develop a more responsive and robust data management environment.
Visit InsideAnalysis.com for more information.
Hadoop and NoSQL joining forces by Dale Kim of MapR (Data Con LA)
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, gives customers a new option to tackle data growth and deploy big data analysis to better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: solve big data problems with Hadoop; deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data; and implement Hadoop using the HDS Hadoop reference architecture. For more information on the Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Hadoop and the Data Warehouse: When to Use Which (DataWorks Summit)
In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use MPP relational database in a data warehouse, discovery platform, or other workload-specific applications.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing from Teradata and Jim Walker of Hortonworks will examine how big data technologies are being used today by practical big data practitioners.
How Hewlett Packard Enterprise Gets Real with IoT Analytics (Arcadia Data)
Learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale.
Big Data Analytics with Hadoop, MongoDB and SQL Server (Mark Kromer)
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes (Arcadia Data)
The use of data lakes continue to grow, and a recent survey by Eckerson Group shows that organizations are getting real value from their deployments. However, there’s still a lot of room for improvement when it comes to giving business users access to the wealth of potential insights in the data lake.
While the data management aspect has been fairly well understood over the years, the success of business intelligence (BI) and analytics on data lakes lags behind. In fact, organizations often struggle with data lakes because they are only accessible by highly-skilled data scientists and not by business users. But BI tools have been able to access data warehouses for years, so what gives?
In this talk, we’ll discuss:
- Why traditional BI tools are architected well for data warehouses, but not data lakes.
- Why every organization should have two BI standards: one for data warehouses and one for data lakes.
- Innovative capabilities provided by BI for data lakes
Vmware Serengeti - Based on Infochimps Ironfan (Jim Kaskade)
This document discusses virtualizing Hadoop for the enterprise. It begins with discussing trends driving changes in enterprise IT like cloud, mobile apps, and big data. It then discusses how Hadoop can address big, fast, and flexible data needs. The rest of the document discusses how virtualizing Hadoop through solutions like Project Serengeti can provide enterprises with elasticity, high availability, and operational simplicity for their Hadoop implementations. It also discusses how virtualization allows enterprises to integrate Hadoop with other workloads and data platforms.
Hadoop-based data lakes have become increasingly popular within today’s modern data architectures for their ability to scale, handle data variety, and keep costs low. Many organizations start slow with their data lake initiatives, but as the lakes grow bigger they suffer challenges with data consistency, quality and security, and end up losing confidence in those initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes, their relationship with productivity, and how they help organizations meet regulatory and compliance requirements. The talk advocates bringing a different mindset to designing and implementing flexible governance mechanisms on Hadoop data lakes.
This document discusses Hortonworks and its mission to enable modern data architectures through Apache Hadoop. It provides details on Hortonworks' commitment to open source development through Apache, engineering Hadoop for enterprise use, and integrating Hadoop with existing technologies. The document outlines Hortonworks' services and the Hortonworks Data Platform (HDP) for storage, processing, and management of data in Hadoop. It also discusses Hortonworks' contributions to Apache Hadoop and related projects as well as enhancing SQL capabilities and performance in Apache Hive.
The document summarizes Oracle's Big Data Appliance and solutions. It discusses the Big Data Appliance hardware which includes 18 servers with 48GB memory, 12 Intel cores, and 24TB storage per node. The software includes Oracle Linux, Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and Oracle Loader for Hadoop. Oracle Loader for Hadoop can be used to load data from Hadoop into Oracle Database in online or offline mode. The Big Data Appliance provides an optimized platform for storing and analyzing large amounts of data and is integrated with Oracle Exadata.
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics (Mark Rittman)
This is a session for Oracle DBAs and devs that looks at cutting-edge big data technologies like Spark, Kafka, etc., and through demos shows how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling.
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES (Matt Stubbs)
Date: 13th November 2018
Location: Self-Service Analytics Theatre
Time: 14:30 - 15:00
Speaker: Zaf Khan
Organisation: Arcadia Data
About: The use of data lakes continue to grow, and a recent survey by Eckerson Group shows that organizations are getting real value from their deployments. However, there’s still a lot of room for improvement when it comes to giving business users access to the wealth of potential insights in the data lake.
While the data management aspect has been fairly well understood over the years, the success of business intelligence (BI) and analytics on data lakes lags behind. In fact, organizations often struggle with data lakes because they are only accessible by highly-skilled data scientists and not by business users. But BI tools have been able to access data warehouses for years, so what gives?
In this talk, we’ll discuss:
• Why traditional BI tools are architected well for data warehouses, but not data lakes.
• Why every organization should have two BI standards: one for data warehouses and one for data lakes.
• Innovative capabilities provided by BI for data lakes
1. beyond mission critical virtualizing big data and hadoop (Chiou-Nan Chen)
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
Data lakes often fail because they are only accessible by highly-skilled data scientists and not by business users. But BI tools have been able to access data warehouses for years, so what gives?
In this talk, we’ll discuss:
- Why existing BI tools are architected well for data warehouses, but not data lakes.
- The pros and cons of each architecture.
- Why every organization should have two BI standards: one for data warehouses and one for data lakes.
The practice of big data - making big data approachable
1. Big Data in Practice: A Pragmatic approach to Adoption and Value creation
Raj Nair
Data Management Practitioner and Consultant
Content NOT FOR DISTRIBUTION: Property of Raj Nair
2. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
3. Every Day Big Data
Reaching scale-up limits on your server
Represents tools, technologies, frameworks for storage and processing at scale
Represents Opportunity
6. Big Data 1.0 – The Hadoop Ecosystem
Software library
Framework for large scale distributed processing
Ability to scale to thousands of computers
7. Design Principles
- Large data sets
- Classic Hadoop MapReduce: batch processing
- Moving computation is cheaper than moving data
- Hardware failure, redundancy
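To make the classic batch MapReduce model above concrete, here is a minimal word count sketch using the standard Hadoop Java MapReduce API. The input and output paths are placeholders passed on the command line; the map tasks run next to the data blocks and only the small aggregated results move across the network, which is the "moving computation is cheaper than moving data" principle in practice.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts per word; also reused as a combiner
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```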
8. What is this you call data?
Unlearn current notion of “Data”
Native Data Source
9. HDFS: storage and archival
MapReduce: programming library
Crunch: data pipeline processing
HBase: real-time access (low latency)
Pig: M/R abstraction
Hive: data warehouse (high latency)
Sqoop: data transfer
Flume: data streaming
Layers: data processing, workload management, data movement
10. Component | Purpose | Use it for
HDFS | Distributed Storage | Raw data storage and archival
Flume | Data Movement | Continuous streaming into HDFS
Sqoop | Data Movement | Data transfer from RDBMS to HDFS/HBase
HBase | Workload Mgmt | Near real-time read/write access to large data sets
Hive | Workload Mgmt | Analytical queries; data warehouse
MapReduce | Data Processing | Low-level custom code for data processing
Crunch | Data Processing (Java) | Coding M/R pipelines, aggregations
Pig | Data Processing | Scripting language; similar to Crunch
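As an illustration of the "near real-time read/write access" row above, the following sketch uses the HBase 1.x Java client to write one cell and read it back. The table name, column family, qualifier and row key are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EventStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {

      // Write one cell (row key, column family "d", qualifier "temp" are made up)
      Put put = new Put(Bytes.toBytes("device42-20140301"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
      table.put(put);

      // Read the same cell back with low latency
      Get get = new Get(Bytes.toBytes("device42-20140301"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
      System.out.println("temp = " + Bytes.toString(value));
    }
  }
}
```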
11. A Powerful Paradigm
Hadoop keeps the layers separate: storage layer, processing engine, query engine, and metadata. Multiple query engines can run over data kept in its native format.
Oracle, SQL Server and DB2, by contrast, bundle storage and query into tightly integrated proprietary stacks that cannot free your data.
12. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
13. Opportunity…
Transform Data Processing
Exploration
Information Enrichment
Data Archival
14. Data Processing Pipeline
Several sources
Varying Frequencies
Varying Formats
Quality check
Validations, Scrubbing
Transformations/Rules
Prune app data sources
Discard/Archive
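The quality check, validation and scrubbing steps of such a pipeline can be written directly against Hadoop. Below is a minimal sketch using the Apache Crunch Java API (listed earlier as the option for coding M/R pipelines) that keeps only well-formed comma-separated records; the expected field count, the validation rule, and the input/output paths are assumptions for illustration.

```java
import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

public class ScrubPipeline {
  public static void main(String[] args) {
    // Runs as a MapReduce pipeline on the cluster
    Pipeline pipeline = new MRPipeline(ScrubPipeline.class);
    PCollection<String> raw = pipeline.readTextFile(args[0]);   // e.g. /data/raw/orders

    // Validation/scrubbing step: drop records that do not have the expected shape
    PCollection<String> clean = raw.filter(new FilterFn<String>() {
      @Override
      public boolean accept(String line) {
        String[] fields = line.split(",", -1);
        return fields.length == 5 && !fields[0].trim().isEmpty();
      }
    });

    pipeline.writeTextFile(clean, args[1]);                     // e.g. /data/clean/orders
    pipeline.done();
  }
}
```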
15. ETL Engine
Data Warehouse
Data Storage
16. From Source to Business Value
The traditional pipeline: sourcing, staging, validations, scrubbing, business rules, mapping, transforms, loading (shoe-horning data into a relational fit), distribution, prep and tuning of data stores, and archiving/purging.
The cost: minutes to hours per step and hours overall, only a subset of the data retained, reliability concerns, and missed SLAs = business frustration.
17. From Source to Business Value
Significantly more data sources
Highly scalable, significantly performant data processing
New business value,
Faster time to value
18. Data Exploration
Large reservoir of data
Descriptive Statistics
Central Tendencies
Dispersion
Visualization
Surprise Me!
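As a small, local illustration of the descriptive statistics listed above (central tendency and dispersion), the sketch below computes count, min/max, mean and standard deviation over a sampled numeric column in plain Java. On a real reservoir of data these summaries would normally be pushed down to the cluster (Hive, Impala or Spark) rather than computed on one machine; the sample values here are made up.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

public class ExploreColumn {
  public static void main(String[] args) {
    // Hypothetical sample drawn from one numeric column of a much larger data set
    double[] sample = {12.0, 15.5, 9.8, 22.1, 14.3, 30.7, 11.2};

    // Central tendency and range
    DoubleSummaryStatistics stats = DoubleStream.of(sample).summaryStatistics();
    double mean = stats.getAverage();

    // Dispersion: population variance and standard deviation
    double variance = DoubleStream.of(sample)
        .map(v -> (v - mean) * (v - mean))
        .average()
        .orElse(0.0);
    double stdDev = Math.sqrt(variance);

    System.out.printf("count=%d min=%.2f max=%.2f mean=%.2f stddev=%.2f%n",
        stats.getCount(), stats.getMin(), stats.getMax(), mean, stdDev);
  }
}
```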
19. Data Exploration
Courtesy: Data Science Central
http://www.datasciencecentral.com/profiles/blogs/r-hadoop-data-analytics-heaven
23. Data Archival
Storage in Native Format
Redundancy , Replication
Easily accessible, inexpensive
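A minimal sketch of archiving a file into HDFS in its native format using the Hadoop FileSystem Java API. The local source path, target directory layout and replication factor of 3 are assumptions for illustration; redundancy comes from HDFS block replication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveToHdfs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("/var/exports/orders-2014-03-01.csv");   // hypothetical source file
    Path archive = new Path("/archive/orders/2014/03/01/orders.csv");

    // Copy the raw file as-is (native format) and rely on HDFS replication for redundancy
    fs.mkdirs(archive.getParent());
    fs.copyFromLocalFile(local, archive);
    fs.setReplication(archive, (short) 3);

    fs.close();
  }
}
```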
24. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
25. Practical Adoption
Big Data Technologies don’t solve all problems
Leveraging existing investments
Complexities of existing systems
26. Proof of Concept
Use your own data – realistic results
Focus on very specific pain points
Know what you are going to measure
27. Data Processing Engine
Data Warehouse
Data Storage
28. Data Processing Engine
Data Warehouse
Data Storage
Keep all your raw data
Cheaper Hardware
Low cost per byte $$
High value per byte
Offload from RDBMS
Improve scale, performance
Leverage existing tools
29. Hardware on a budget
Master ($5,000): 12 cores, 32 GB RAM, 2 TB SATA drives, 7.2K RPM
Workers ($5,000 each): 4 nodes, each with 12 cores, 16 GB RAM, 4 TB SATA drives, 7.2K RPM
4-port 10 Gig switch: $1,500
Grand total: $5,000 + 4 x $5,000 + $1,500 = $26,500, comfortably under $30,000
Software costs? $0
30. Exploratory BI / Analysis
Data Storage
Makes Data exploration practically cheaper and faster
Use existing visualization tools (Tableau or other)
Check for integration with R
31. Data Architecture
•Single Important factor
•Don’t miss technology trends
But ….
It’s more about the battle plan
32. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
33. SQL on Hadoop
Impala (Cloudera): MPP engine
Tez (Hortonworks): SQL on Hive
Phoenix (Apache): SQL on HBase
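These SQL-on-Hadoop engines can be reached from existing Java tooling over plain JDBC. The sketch below targets HiveServer2 (Impala exposes a similar JDBC endpoint on a different port); the host, port, credentials, table and query are assumptions, and the Hive JDBC driver needs to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoopQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver and URL; host/port/database are placeholders
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT product_id, COUNT(*) AS orders " +
             "FROM orders GROUP BY product_id ORDER BY orders DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("product_id") + "\t" + rs.getLong("orders"));
      }
    }
  }
}
```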
34. In memory and Real Time
Spark: 100x faster than M/R
Storm: event processing
Apache Drill: low-latency ad hoc queries; interactive queries at scale
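For comparison with the MapReduce word count earlier, here is the same aggregation as a minimal sketch in Spark's Java API (Spark 2.x signatures), which keeps intermediate data in memory between stages. The local[*] master setting and the paths are placeholders; on a cluster the job would typically run under YARN.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" is only for trying the job out on one machine
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);             // e.g. an HDFS directory
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile(args[1]);
    }
  }
}
```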
35. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
36. Where can I get Hadoop?
Distributors
Open Source Apache Project
And these guys…
Cloud
37. Conclusion
The Power & Paradigm of Distributed Computing
“Nativity” of Data – Unlearn old notions
Identify, understand your data processing pipeline
POC with a measurable, specific use case
Data Architecture – key to sustainable scalability
Stay informed