A presentation discussing how to deploy big data solutions, and the difference between structured reporting systems, which feed business processes, and data science systems, which do the cool stuff.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
The document discusses Marketo's migration of their SaaS business analytics platform to Hadoop. It describes their requirement for near real-time processing of 1 billion activities per customer per day at scale. They conducted a technology selection process across various Hadoop components and chose HBase, Kafka and Spark Streaming. The implementation involved building expertise, designing and building their first cluster, implementing security including Kerberos, validating through passive testing, deploying the new system through a migration, and ongoing monitoring, patching and upgrading of the new platform. Challenges included managing expertise retention, ZooKeeper performance on VMs, Kerberos integration, and capacity planning for the shared Hadoop cluster.
This document discusses harnessing the power of Apache Hadoop. It summarizes the benefits of using Hadoop to derive value from large, diverse datasets. It then outlines the steps to install and deploy Hadoop, challenges of doing so, and advantages of using Cloudera's Distribution of Hadoop (CDH) and management tools to more easily operationalize Hadoop. The document promotes an upcoming webinar on managing the Hadoop lifecycle.
Webinar: Don't Leave Your Data in the Dark (DataStax)
As new types of data sources emerge from cloud, mobile devices, social media and machine sensor devices, traditional databases hit the ceiling due to today’s dynamic, data-volume driven business culture.
Join us in this online webinar and learn how you can incorporate a modern, NoSQL platform into daily operations to optimize and simplify data performance. DataStax recently announced DataStax Enterprise 4.0, a production-certified version of Apache Cassandra with an in-memory option, enterprise search, advanced security features and visual management tools. Give your developers a simple and powerful way to deliver the information your customers care about most—unconstrained by the complexities and high costs of traditional database systems.
Learn how to:
- Easily assign data based on its performance needs to traditional spinning disk, SSD or in-memory storage, all in the same database instance
- Leverage DataStax’s built-in enhancements for broader information search and analysis even with many thousands of concurrent requests
- Visually monitor, manage, and fine-tune your environment to get the most of your online data
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T... (VMware Tanzu)
Pivotal HAWQ, one of the world’s most advanced enterprise SQL on Hadoop technology, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your analytic efforts. The slides from this technical webinar present a deep dive on this powerful modern data architecture for analytics and data science.
Learn more here: http://pivotal.io/big-data/pivotal-hawq
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Building scalable data pipelines for big data involves dealing with legacy systems, implementing data lineage and provenance, managing the data lifecycle, and engineering pipelines that can handle large volumes of data. Effective data pipeline engineering requires understanding how to extract, transform and load data while addressing issues like privacy, security, and integrating diverse data sources. Frameworks like Cascading can help build pipelines, but proper testing and scaling is also required to develop robust solutions.
The document discusses the evolution of big data architectures from Hadoop and MapReduce to Lambda architecture and stream processing frameworks. It notes the limitations of early frameworks in terms of latency, scalability, and fault tolerance. Modern architectures aim to unify batch and stream processing for low latency queries over both historical and new data.
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ... (DataWorks Summit)
For over 30 years, Parametric has been a leading provider of model-based portfolios to institutional and private investors, with unique implementation and customization expertise. Much like other cutting-edge financial services providers, Parametric operates with highly diverse, fast moving data from which they glean insights. Data sources range from benchmark providers to electronic trading participants to stock exchanges etc. The challenge is to not just onboard the data but also to figure out how to monetize it when the schemas are fast changing. This presents a problem to traditional architectures where large teams are needed to design the new ETL flow. Organizations that are able to quickly adapt to new schemas and data sources have a distinct competitive advantage.
In this presentation and demo, architects from Parametric, Chris Gambino and Vamsi Chemitiganti, will present the data architecture designed in response to this business challenge. We discuss the approach (and trade-offs) to pooling, managing and processing the data using the latest techniques in data ingestion and pre-processing. The overall best practices in creating a central data pool are also discussed, with the goal of giving quantitative analysts the most accurate and up-to-date information for their models to work on. Attendees will be able to draw on their experiences, both from a business and a technology standpoint, on not just creating a centralized data platform but also being able to distribute it to different units.
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
Expert IT analyst groups like Wikibon forecast that NoSQL database usage will grow at a compound rate of 60% each year for the next five years, and Gartner Group says NoSQL databases are one of the top trends impacting information management in 2013. But is NoSQL right for your business? How do you know which business applications will benefit from NoSQL and which won't? What questions do you need to ask in order to make such decisions?
If you're wondering what NoSQL is and if your business can benefit from NoSQL technology, join DataStax for the webinar "How to Tell if Your Business Needs NoSQL". This to-the-point presentation will provide practical litmus tests to help you understand whether NoSQL is right for your use case, and supplies examples of NoSQL technology in action with leading businesses that demonstrate how and where NoSQL databases can have the greatest impact.
Speaker: Robin Schumacher, Vice President of Products at DataStax
Robin Schumacher has spent the last 20 years working with databases and big data. He comes to DataStax from EnterpriseDB, where he built and led a market-driven product management group. Previously, Robin started and led the product management team at MySQL for three years before they were bought by Sun (the largest open source acquisition in history), and then by Oracle. He also started and led the product management team at Embarcadero Technologies, which was the #1 IPO in 2000. Robin is the author of three database performance books and frequent speaker at industry events. Robin holds BS, MA, and Ph.D. degrees from various universities.
Enterprise Data Warehouse Optimization: 7 Keys to Success (Hortonworks)
You have a legacy system that no longer meets the demands of your current data needs, and replacing it isn't an option. But don't panic: modernizing your traditional enterprise data warehouse is easier than you may think.
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand... (DataStax Academy)
This document discusses using Spark and Cassandra for ad hoc analytics on Internet of Complex Things (IoCT) data. It describes modeling data in Cassandra, limitations of ad hoc queries in Cassandra, and how the Spark Cassandra connector enables running ad hoc queries in Spark by treating Cassandra tables as DataFrames that can be queried using SQL. It also covers running Spark SQL queries on Cassandra data using the JDBC server.
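To make the pattern described above concrete, here is a minimal sketch (not taken from the deck) of registering a Cassandra table as a Spark DataFrame via the spark-cassandra-connector and querying it with SQL; the keyspace, table, column names and host are placeholders invented for illustration:

```python
from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector package is on the classpath.
spark = (SparkSession.builder
         .appName("adhoc-cassandra-sketch")
         .config("spark.cassandra.connection.host", "10.0.0.10")  # placeholder host
         .getOrCreate())

# Expose a Cassandra table (placeholder keyspace/table/columns) as a DataFrame.
readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="iot", table="sensor_readings")
            .load())

# Register it for ad hoc SQL, which is the capability described above.
readings.createOrReplaceTempView("sensor_readings")
spark.sql("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM sensor_readings
    GROUP BY device_id
    ORDER BY avg_temp DESC
    LIMIT 10
""").show()
```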
This document summarizes Premal Shah's presentation on how 6sense instruments their systems to analyze customer data. 6sense uses Hadoop and other tools to ingest customer data from various sources, run modeling and scoring, and provide actionable insights to customers. They discuss the data pipeline, challenges of performance and scaling, and how they use metrics and tools like Sumo Logic and OpsClarity to optimize and monitor their systems.
Getting Ready to Use Redis with Apache Spark with Tague Griffith (Databricks)
This technical tutorial is designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. The session starts with a quick introduction to Redis and the capabilities Redis provides. It will cover the basic data types provided by Redis and the module system. Using an ad serving use case, Griffith will look at how Redis can improve the performance and reduce the cost of using complex ML-models in production.
You will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will also be discussed, focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these features.
By the end of the session, you should feel confident building a prototype/proof-of-concept application using Redis and Spark. You'll understand how Redis complements Spark, and how to use Redis to serve complex ML models with high performance.
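As a rough illustration of the serving pattern this session describes, the sketch below stores a trained model's coefficients in Redis and scores requests from them directly. It uses the plain redis-py client rather than the redis-ml module, and the key name, weights and feature layout are invented for the example:

```python
import json
import math

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

# The training side (e.g. a Spark job) would persist the fitted coefficients once;
# the weights below are invented placeholders.
model = {"weights": [0.42, -1.3, 0.07], "intercept": 0.5}
r.set("model:ctr:logreg", json.dumps(model))

def predict(features):
    """Score a logistic-regression model straight from Redis, with no Spark job in the request path."""
    m = json.loads(r.get("model:ctr:logreg"))
    z = m["intercept"] + sum(w * x for w, x in zip(m["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))

print(predict([1.0, 0.2, 3.5]))  # probability-like score between 0 and 1
```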
Lambda architecture for real-time big data (Trieu Nguyen)
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Analyzing the World's Largest Security Data Lake! (DataWorks Summit)
The document discusses Symantec's CloudFire Analytics platform for analyzing security data at scale. It describes how CloudFire provides Hadoop ecosystem tools on OpenStack virtual machines across 50+ data centers to support security product analytics. Key points covered include analytics services and data, administration and monitoring using tools like Ambari and OpsView, and plans for self-service analytics using dynamic clusters provisioned through CloudBreak integration.
Data Con LA 2020
Description
Join this session to learn how to build a modern cloud-scale data compute platform with code in just minutes!
Using the industry's first IDE for building data applications, developers can now create data marts and data applications while working interactively with large datasets. We will explore how easy it is to develop, test and operationalize powerful data compute applications over streaming data using SQL and Python in Xcalar, with its combination of declarative and visual imperative programming and eager execution.
You will see how you can reduce time to market for analyzing large volumes of data and building enterprise-level complex data compute applications.
You will learn how to increase your developer productivity with SQL and Python, and put your complex business logic and ML models into production pipelines with the fastest time to value in the industry.
Speaker
Nikita Ogievetsky, Xcalar, VP Product Engineering
The Future of Analytics, Data Integration and BI on Big Data Platforms (Mark Rittman)
The document discusses the future of analytics, data integration, and business intelligence (BI) on big data platforms like Hadoop. It covers how BI has evolved from old-school data warehousing to enterprise BI tools to utilizing big data platforms. New technologies like Impala, Kudu, and dataflow pipelines have made Hadoop fast and suitable for analytics. Machine learning can be used for automatic schema discovery. Emerging open-source BI tools and platforms, along with notebooks, bring new approaches to BI. Hadoop has become the default platform and future for analytics.
IoT devices generate high volume, continuous streams of data that must be analyzed in-memory – before they land on disk – to identify potential outliers/failures or business opportunities. Companies need to build robust yet flexible applications that can instantly act on the information derived from analyzing their IoT data. Attend this session to learn how you can easily handle real-time data acquisition across structured and semi-structured data, as well as windowing, fast in-memory streaming analytics, event correlation, visualization, alerts, workflows and smart data storage.
ProtectWise Revolutionizes Enterprise Network Security in the Cloud with Data... (DataStax Academy)
ProtectWise has revolutionized enterprise network security with its Security DVR Platform, which combines detection, visibility, and response capabilities into a single cloud-based solution. The Platform ingests and analyzes massive amounts of network data using technologies like Cassandra, Solr, and stream processing to detect threats, gain network visibility, and power responsive analytics over days, months, and years of historical data. A demo of the Security DVR Visualizer was provided.
Reliable Data Ingestion in Big Data / IoT (Guido Schmutz)
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past, some new tools have emerged which are especially capable of handling the process of integrating data from outside, often called data ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale in a horizontal fashion, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at message level and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem and show how they handle data ingestion in a Big Data solution architecture.
Kappa Architecture is an alternative to Lambda Architecture that simplifies real-time data processing. It uses a distributed log like Kafka to store all input data immutably to allow reprocessing from the beginning if the processing code changes. This avoids having to maintain separate batch and real-time processing systems. The ASPgems team has implemented Kappa Architecture for several clients using Kafka, Spark Streaming, and Cassandra to provide real-time analytics and metrics in sectors like telecommunications, IoT, insurance, and energy.
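A minimal sketch of the Kappa idea, assuming a Kafka topic named events, brokers at broker1:9092, and Spark Structured Streaming in place of the DStream-based Spark Streaming mentioned above: the whole history lives in the log, so reprocessing after a code change is simply re-running the job with startingOffsets set to earliest.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("kappa-metrics-sketch").getOrCreate()

schema = (StructType()
          .add("device", StringType())
          .add("value", DoubleType())
          .add("ts", TimestampType()))

# Read the topic from the beginning of the log: reprocessing is just re-running
# this job against the same immutable Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
          .option("subscribe", "events")                       # placeholder topic
          .option("startingOffsets", "earliest")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# A single streaming job produces the metrics; no separate batch layer to maintain.
metrics = (events
           .withWatermark("ts", "10 minutes")
           .groupBy(window("ts", "1 minute"), "device")
           .avg("value"))

query = (metrics.writeStream
         .outputMode("update")
         .format("console")  # a real deployment would sink to Cassandra instead
         .start())
query.awaitTermination()
```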
Pivotal - Advanced Analytics for Telecommunications (Hortonworks)
Innovative mobile operators need to mine the vast troves of unstructured data now available to them to help develop compelling customer experiences and uncover new revenue opportunities. In this webinar, you’ll learn how HDB’s in-database analytics enable advanced use cases in network operations, customer care, and marketing for better customer experience. Join us, and get started on your advanced analytics journey today!
The document provides an overview of machine learning concepts and techniques using Apache Spark. It discusses supervised and unsupervised learning methods like classification, regression, clustering and collaborative filtering. Specific algorithms like k-means clustering, decision trees and random forests are explained. It also introduces Apache Spark MLlib and how to build machine learning pipelines and models with Spark ML APIs.
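For illustration, here is a small Spark ML pipeline of the kind the summary refers to, with invented column names and toy data; a real job would train on a much larger dataset and score a held-out split:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Toy training data with invented column names.
df = spark.createDataFrame(
    [(5.1, 3.5, "yes"), (4.9, 3.0, "yes"), (6.3, 3.3, "no"), (5.8, 2.7, "no")],
    ["f1", "f2", "label_str"],
)

# A pipeline chains feature preparation and the estimator into one fit/transform unit.
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label_str", outputCol="label"),
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    DecisionTreeClassifier(labelCol="label", featuresCol="features"),
])

model = pipeline.fit(df)
# A real job would score a held-out test set; here we just transform the training data.
model.transform(df).select("f1", "f2", "label_str", "prediction").show()
```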
This document provides an overview of big data and discusses key concepts. It begins by defining big data and noting the increasing volume, velocity and variety of data being created. It then covers the big data landscape, including storage models and technologies like Hadoop, analytics techniques like machine learning, and visualization. Finally, it discusses business use cases and how big data is impacting industries and creating new business models through insights gained from data.
Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random Hadoop/big data presentations
The document provides an overview of big data concepts including definitions, statistics on data generation and internet usage, applications and examples, challenges, and data types. It discusses key big data concepts such as the 3Vs of volume, velocity and variety; more Vs including veracity, value and visualization; data science areas and skills; the data workflow; and examples from companies like UPS, Walmart, eBay, and Kaiser Permanente.
The document provides an introduction to big data, including definitions and characteristics. It discusses how big data can be described by its volume, variety, and velocity. It notes that big data is large and complex data that is difficult to process using traditional data management tools. Common sources of big data include social media, sensors, and scientific instruments. Challenges in big data include capturing, storing, analyzing, and visualizing large and diverse datasets that are generated quickly. Distributed file systems and technologies like Hadoop are well-suited for processing big data.
MapReduce allows distributed processing of large datasets across clusters of computers. It works by splitting the input data into independent chunks which are processed by the map function in parallel. The map function produces intermediate key-value pairs which are grouped by the reduce function to form the output data. Fault tolerance is achieved through replication of data across nodes and re-executing failed tasks. This makes MapReduce suitable for efficiently processing very large datasets in a distributed environment.
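The model is easy to see in miniature. The sketch below is a single-process word count that mimics the three MapReduce phases in plain Python; it is purely illustrative and is not Hadoop code:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in one input split."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/group: gather all values emitted for the same intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a final (key, result) pair."""
    return key, sum(values)

splits = ["big data is big", "data beats opinion"]  # stand-ins for input splits
intermediate = [pair for doc in splits for pair in map_phase(doc)]
results = [reduce_phase(k, v) for k, v in shuffle(intermediate).items()]
print(sorted(results))  # [('beats', 1), ('big', 2), ('data', 2), ('is', 1), ('opinion', 1)]
```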
The wave of big data is still near its peak, having risen to prominence only about five years ago. Many are still intrigued by it, while a fortunate few have had a real taste of it. Others linger around the topics, terminology and other buzz!
This is a series that attempts to get our arms around the domain and the key coordinates of the subject, and subsequently dwells a bit deeper on implementation challenges, navigating closer to their core: what tools and solution approaches exist, and how knowledge from other related fields of science fits into the overall ball game!
The main home for this series going forward will be www.ganaakruti.com.
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver Value to business stakeholders who are not data scientists!
This document provides a brief history of big data, from the earliest known uses of data storage thousands of years ago to modern applications of big data. It outlines key developments such as the creation of early data storage and analysis methods, the development of computerized data processing, and the growth of data collection and sharing through the internet and mobile technology. The document also discusses the increasing volume of data generated every day through online activities and defines some of the main challenges in working with big data today.
Digital infographics can use graphics to more easily convey information through visualization. There are different types of infographics including spatial, chronological, and quantitative infographics that use diagrams, charts, maps, and other visual elements to communicate information. Big data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations. Big data is used across many industries for applications like customer analytics, predictive maintenance, risk analysis, and more. Hadoop is an open-source software framework that allows distributed processing of large data sets across clusters of computers using MapReduce.
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
This presentation was prepared by one of our renowned tutors, Suraj.
If you are interested to learn more about Big Data, Hadoop or Data Science, then join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at [email protected]
BDA UNIT 1: big data – web analytics – big data applications – big data technolo... (BalachandarJ5)
UNDERSTANDING BIG DATA: Introduction to big data – convergence of key trends – unstructured data – industry examples of big data – web analytics – big data applications – big data technologies – introduction to Hadoop – open source technologies – cloud and big data – mobile business intelligence – crowdsourcing analytics – inter- and trans-firewall analytics.
Tools and Methods for Big Data Analytics by Dahl Winters (Melinda Thielbar)
Research Triangle Analysts October presentation on Big Data by Dahl Winters (formerly of Research Triangle Institute). Dahl takes her viewers on a whirlwind tour of big data tools such as Hadoop and big data algorithms such as MapReduce, clustering, and deep learning. These slides document the many resources available on the internet, as well as guidelines of when and where to use each.
Big Data Analytics and Hadoop are presented. Key points include:
- Big data is large and complex data that is difficult to process using traditional methods. Domains that produce large datasets include meteorology, physics simulations, and internet search.
- The four V's of big data are volume, velocity, variety, and veracity. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. Its core components are HDFS for storage and MapReduce for processing.
- Apache Hadoop has gained popularity for big data analytics due to its ability to process large amounts of data in parallel using commodity hardware, its scalability, and automatic failover. A Hadoop ecosystem of
This is a presentation I gave to SAP internally to discuss how their customers see DevOps and what challenges they have when trying to migrate to a DevOps operating model.
The document discusses building a big data lab using cloud services like Google Cloud Platform (GCP). It notes that traditional homebrew labs have limited resources while cloud-based labs provide infinite resources and utility billing. It emphasizes defining goals for the lab work, acquiring necessary skills and knowledge, and using public datasets to complement internal data. Choosing the right tools and cloud platform like GCP, AWS, or Azure is important for high performance analytics on large data volumes and formats.
Showing the challenges and opportunities within the SAP ecosystem for adopting DevOps practices. Discussing how ABAP, HANA, UI5, BObj, NW JAVA and SCP JAVA each have their own capabilities and challenges in adopting DevOps.
This document provides an overview of SAP HANA and business performance with SAP. It discusses the history of SAP HANA and how it has evolved from 2011 to provide real-time analysis, reporting and business capabilities. It also summarizes the HANA technology stack, database architecture, features, software lifecycle, infrastructure examples, backup/recovery process, user management and network connectivity.
Showing the challenges and opportunities within the SAP ecosystem for adopting DevOps practices. Discussing how ABAP, HANA, UI5, BObj, NW JAVA and SCP JAVA each have their own capabilities and challenges in adopting DevOps.
TEC118 – How Do You Manage the Configuration of Your Environments from Metal ... (Chris Kernaghan)
The document discusses configuration management in IT infrastructure. It describes how configuration management has evolved from manual processes using tools like Excel and Word documents to more automated approaches using infrastructure as code. It provides examples of configuration management systems like Puppet and Chef and shows their architectures and how they can be used to configure operating systems, databases, and applications in a consistent, repeatable manner. The presentation includes demonstrations of Puppet and Chef.
Automating Infrastructure as a Service Deployments and monitoring – TEC213 (Chris Kernaghan)
The document discusses automating infrastructure as a service deployments and monitoring. It covers several topics:
- IaaS environments allow for scalable cloud computing resources billed based on usage. SAP has been working with Amazon Web Services since 2008.
- Automation can schedule repetitive tasks, enable consistent processes, and provide auditable records. DevOps focuses on collaboration, automation, measurement, and sharing to create flexible infrastructure.
- Automating infrastructure provisioning, configuration management, change management, and exception monitoring can improve speed, reduce costs, and ensure compliance. Cloud security also needs automation to ensure data protection with the cloud's flexibility.
SAP TechEd 2012 Session Tec3438 Automate IaaS SAP deployments (Chris Kernaghan)
This document summarizes automation of infrastructure as a service deployments and monitoring. It discusses Infrastructure as a Service (IaaS) and how IaaS environments allow for scalable, on-demand provisioning of computing resources. It also discusses SAP's support for AWS and how Capgemini UK uses AWS for SAP deployments. The document advocates for automating infrastructure tasks to improve consistency, auditability and repeatability. It provides examples of automation for build processes, configuration management, change management, exception monitoring, and other areas. Overall, the document promotes automating infrastructure processes in IaaS environments to improve agility, reduce costs, and ensure compliance.
SAP TechEd 2013 session Tec118 managing your-environment (Chris Kernaghan)
This document discusses configuration management and provides examples of using Puppet and Chef for configuration management. It defines configuration management as managing the configuration of systems from hardware to applications. It explains that configuration management allows automating repetitive system administration tasks in a scheduled, consistent, auditable, and repeatable way. The document compares Puppet and Chef and provides examples of configuration scripts for each tool. It demos how to use Puppet and Chef to configure a system.
01 sap hana landscape and operations infrastructure v2 0 (Chris Kernaghan)
This document discusses SAP HANA landscape and operations infrastructure. It covers HANA editions and technical scenarios, the HANA database lifecycle including patching, installation, backup and restore. It also discusses using HANA as a platform and monitoring HANA performance. Additionally, it outlines different data load scenarios into HANA and provides tips for tools like SAP BODS and SLT. The document concludes that introducing HANA brings technical challenges across areas like reporting, data management, development and operations that require consideration.
This document discusses SAP HANA landscape and operations. It covers HANA editions and scenarios, the HANA database lifecycle including patching and backup, using HANA as a platform, performance monitoring, data load scenarios using tools like BODS, SLT, SRS and ESP, and provides contact information for the author.
Trends Artificial Intelligence - Mary Meeker (Clive Dickens)
Mary Meeker’s 2024 AI report highlights a seismic shift in productivity, creativity, and business value driven by generative AI. She charts the rapid adoption of tools like ChatGPT and Midjourney, likening today’s moment to the dawn of the internet. The report emphasizes AI’s impact on knowledge work, software development, and personalized services—while also cautioning about data quality, ethical use, and the human-AI partnership. In short, Meeker sees AI as a transformative force accelerating innovation and redefining how we live and work.
What is Oracle EPM? A Guide to Oracle EPM Cloud: Everything You Need to Know (SMACT Works)
In today's fast-paced business landscape, financial planning and performance management demand powerful tools that deliver accurate insights. Oracle EPM (Enterprise Performance Management) stands as a leading solution for organizations seeking to transform their financial processes. This comprehensive guide explores what Oracle EPM is, its key benefits, and how partnering with the right Oracle EPM consulting team can maximize your investment.
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf (Rejig Digital)
Unlock the future of oil & gas safety with advanced environmental detection technologies that transform hazard monitoring and risk management. This presentation explores cutting-edge innovations that enhance workplace safety, protect critical assets, and ensure regulatory compliance in high-risk environments.
🔍 What You’ll Learn:
✅ How advanced sensors detect environmental threats in real-time for proactive hazard prevention
🔧 Integration of IoT and AI to enable rapid response and minimize incident impact
📡 Enhancing workforce protection through continuous monitoring and data-driven safety protocols
💡 Case studies highlighting successful deployment of environmental detection systems in oil & gas operations
Ideal for safety managers, operations leaders, and technology innovators in the oil & gas industry, this presentation offers practical insights and strategies to revolutionize safety standards and boost operational resilience.
👉 Learn more: https://www.rejigdigital.com/blog/continuous-monitoring-prevent-blowouts-well-control-issues/
Mark Zuckerberg teams up with frenemy Palmer Luckey to shape the future of XR... (Scott M. Graffius)
Mark Zuckerberg teams up with frenemy Palmer Luckey to shape the future of XR/VR/AR wearables 🥽
Drawing on his background in AI, Agile, hardware, software, gaming, and defense, Scott M. Graffius explores the collaboration in “Meta and Anduril’s EagleEye and the Future of XR: How Gaming, AI, and Agile are Transforming Defense.” It’s a powerful case of cross-industry innovation—where gaming meets battlefield tech.
📖 Read the article: https://www.scottgraffius.com/blog/files/meta-and-anduril-eagleeye-and-the-future-of-xr-how-gaming-ai-and-agile-are-transforming-defense.html
#Agile #AI #AR #ArtificialIntelligence #AugmentedReality #Defense #DefenseTech #EagleEye #EmergingTech #ExtendedReality #ExtremeReality #FutureOfTech #GameDev #GameTech #Gaming #GovTech #Hardware #Innovation #Meta #MilitaryInnovation #MixedReality #NationalSecurity #TacticalTech #Tech #TechConvergence #TechInnovation #VirtualReality #XR
Data Virtualization: Bringing the Power of FME to Any Application (Safe Software)
Imagine building web applications or dashboards on top of all your systems. With FME’s new Data Virtualization feature, you can deliver the full CRUD (create, read, update, and delete) capabilities on top of all your data that exploit the full power of FME’s all data, any AI capabilities. Data Virtualization enables you to build OpenAPI compliant API endpoints using FME Form’s no-code development platform.
In this webinar, you’ll see how easy it is to turn complex data into real-time, usable REST API based services. We’ll walk through a real example of building a map-based app using FME’s Data Virtualization, and show you how to get started in your own environment – no dev team required.
What you’ll take away:
-How to build live applications and dashboards with federated data
-Ways to control what’s exposed: filter, transform, and secure responses
-How to scale access with caching, asynchronous web call support, with API endpoint level security.
-Where this fits in your stack: from web apps, to AI, to automation
Whether you’re building internal tools, public portals, or powering automation – this webinar is your starting point to real-time data delivery.
Neural representations have shown the potential to accelerate ray casting in a conventional ray-tracing-based rendering pipeline. We introduce a novel approach called Locally-Subdivided Neural Intersection Function (LSNIF) that replaces bottom-level BVHs used as traditional geometric representations with a neural network. Our method introduces a sparse hash grid encoding scheme incorporating geometry voxelization, a scene-agnostic training data collection, and a tailored loss function. It enables the network to output not only visibility but also hit-point information and material indices. LSNIF can be trained offline for a single object, allowing us to use LSNIF as a replacement for its corresponding BVH. With these designs, the network can handle hit-point queries from any arbitrary viewpoint, supporting all types of rays in the rendering pipeline. We demonstrate that LSNIF can render a variety of scenes, including real-world scenes designed for other path tracers, while achieving a memory footprint reduction of up to 106.2x compared to a compressed BVH.
https://arxiv.org/abs/2504.21627
Domino IQ – What to Expect, First Steps and Use Cases (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/domino-iq-what-to-expect-first-steps-and-use-cases/
HCL Domino iQ Server – From Ideas Portal to implemented Feature. Discover what it is, what it isn’t, and explore the opportunities and challenges it presents.
Key Takeaways
- What are Large Language Models (LLMs) and how do they relate to Domino iQ
- Essential prerequisites for deploying Domino iQ Server
- Step-by-step instructions on setting up your Domino iQ Server
- Share and discuss thoughts and ideas to maximize the potential of Domino iQ
Interested in leveling up your JavaScript skills? Join us for our Introduction to TypeScript workshop.
Learn how TypeScript can improve your code with static typing, better tooling, and cleaner architecture. Whether you're a beginner or have some experience with JavaScript, this session will give you a solid foundation in TypeScript and how to integrate it into your projects.
Workshop content:
- What is TypeScript?
- What is the problem with JavaScript?
- Why TypeScript is the solution
- Coding demo
Securiport is a border security systems provider with a progressive team approach to its task. The company acknowledges the importance of specialized skills in creating the latest in innovative security tech. The company has offices throughout the world to serve clients, and its employees speak more than twenty languages at the Washington D.C. headquarters alone.
Presentation given at the LangChain community meetup London
https://lu.ma/9d5fntgj
Coveres
Agentic AI: Beyond the Buzz
Introduction to AI Agent and Agentic AI
Agent Use case and stats
Introduction to LangGraph
Build agent with LangGraph Studio V2
DevOps in the Modern Era - Thoughtfully Critical Podcast (Chris Wahl)
https://youtu.be/735hP_01WV0
My journey through the world of DevOps! From the early days of breaking down silos between developers and operations to the current complexities of cloud-native environments. I'll talk about my personal experiences, the challenges we faced, and how the role of a DevOps engineer has evolved.
Your startup on AWS - How to architect and maintain a Lean and Mean account (angelo60207)
Prevent infrastructure costs from becoming a significant line item on your startup’s budget! Serial entrepreneur and software architect Angelo Mandato will share his experience with AWS Activate (startup credits from AWS) and knowledge on how to architect a lean and mean AWS account ideal for budget minded and bootstrapped startups. In this session you will learn how to manage a production ready AWS account capable of scaling as your startup grows for less than $100/month before credits. We will discuss AWS Budgets, Cost Explorer, architect priorities, and the importance of having flexible, optimized Infrastructure as Code. We will wrap everything up discussing opportunities where to save with AWS services such as S3, EC2, Load Balancers, Lambda Functions, RDS, and many others.
Exploring the advantages of on-premises Dell PowerEdge servers with AMD EPYC processors vs. the cloud for small to medium businesses’ AI workloads
AI initiatives can bring tremendous value to your business, but you need to support your new AI workloads effectively. That means choosing the best possible infrastructure for your needs—and many companies are finding that the cloud isn’t right for them. According to a recent Rackspace survey of IT executives, 69 percent of companies have moved some of their applications on-premises from the cloud, with half of those citing security and compliance as the reason and 44 percent citing cost.
On-premises solutions provide a number of advantages. With full control over your security infrastructure, you can be certain that all compliance requirements remain firmly in the hands of your IT team. Opting for on-premises also gives you the ability to design your infrastructure to the precise needs of that team and your new AI workloads. Depending on the workload, you may also see performance benefits, along with more predictable costs. As you start to build your next AI initiative, consider an on-premises solution utilizing AMD EPYC processor-powered Dell PowerEdge servers.
If You Use Databricks, You Definitely Need FME (Safe Software)
DataBricks makes it easy to use Apache Spark. It provides a platform with the potential to analyze and process huge volumes of data. Sounds awesome. The sales brochure reads as if it is a can-do-all data integration platform. Does it replace our beloved FME platform or does it provide opportunities for FME to shine? Challenge accepted
2. LEARN • NETWORK • COLLABORATE • INFLUENCE
Deploying Big Data platforms
LEARN • NETWORK • COLLABORATE • INFLUENCE
Chris Kernaghan
Principal Consultant
3. LEARN • NETWORK • COLLABORATE • INFLUENCE
Cholera epidemic – the first use of big data
4. LEARN • NETWORK • COLLABORATE • INFLUENCE
Big Data Epidemiology by Google
5. LEARN • NETWORK • COLLABORATE • INFLUENCE
How I really got started in Big Data
John, we need to give Chris more grey hair
Let's throw him into a Big Data demo
8. LEARN • NETWORK • COLLABORATE • INFLUENCE
Areas of focus
Data acquisition and curation
Data storage
Compute infrastructure
Analysis and Insight
Everything as Code*
* Well, as much as possible
9. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data Acquisition and curation
Areas of focus
11. LEARN • NETWORK • COLLABORATE • INFLUENCE
How big was the Panama Papers data set
12. LEARN • NETWORK • COLLABORATE • INFLUENCE
How big was the Panama Papers data set
13. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data Lake
Panama Papers Technology stack
SQL
14. LEARN • NETWORK • COLLABORATE • INFLUENCE
The tools used supported 370 journalists from around the world
Infrastructure was a pool of up to 40 servers run in AWS
15. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data quality and curation are not one-time activities
Remove the human element as much as possible
16. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data security
• Data lake
– What data do you collect
– Do you have restrictions on what data can be combined
– How long does your data live
17. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data security
• Geographical concerns
– Where does your data reside
18. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data security
• Authentication
– Who is accessing your data
19. LEARN • NETWORK • COLLABORATE • INFLUENCE
Data Storage
Areas of focus
20. LEARN • NETWORK • COLLABORATE • INFLUENCE
How BIG is Big Data
22. LEARN • NETWORK • COLLABORATE • INFLUENCE
Storage Considerations
• IOPS are still important
– Big data still uses a lot of spinning disk
• Replication and Redundancy
– Eats a lot of disk space (see the sketch below)
• Build for failure
• Sometimes you have to go in-memory
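A back-of-the-envelope sketch of why the replication point above matters, assuming HDFS-style 3x replication and a 25% working-space reserve (both figures are illustrative, not from the deck):

```python
def usable_capacity_tb(raw_tb, replication_factor=3, overhead_fraction=0.25):
    """Rough usable capacity after replication and a working-space reserve."""
    return raw_tb * (1 - overhead_fraction) / replication_factor

raw = 100 * 12 * 4  # e.g. 100 nodes x 12 drives x 4 TB = 4800 TB of raw disk
print(f"{raw} TB raw -> ~{usable_capacity_tb(raw):.0f} TB usable")  # ~1200 TB
```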
23. LEARN • NETWORK • COLLABORATE • INFLUENCE
Compute infrastructure
Areas of focus
24. LEARN • NETWORK • COLLABORATE • INFLUENCE
Structured Reporting Versus Big Data/Science
Compute requirements
• Structured reporting systems run business processes
– Sized and static
– Under change control
– Business centric
25. LEARN • NETWORK • COLLABORATE • INFLUENCE
Structured Reporting Versus Big Data/Science
Compute requirements
• Data science systems answer difficult questions irregularly
– Cloud or heavy use of virtualisation
– Developer centric
– Rapidly evolving
26. LEARN • NETWORK • COLLABORATE • INFLUENCE
What you still need to remember
• Compute is cheap
• Scalability is critical
27. LEARN • NETWORK • COLLABORATE • INFLUENCE
What you still need to remember
• Software definition for consistency
• Automate as much as possible
28. LEARN • NETWORK • COLLABORATE • INFLUENCE
100 Hadoop nodes
122 GB RAM each = 12.2 TB RAM
Build time of 3 hrs
29. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
Disk definition
Network definition
Software install
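The disk definition, network definition and software install steps above could be captured in a single scripted build. The sketch below uses boto3 against AWS EC2; the AMI, subnet, security group and bootstrap URL are placeholders, and the instance type is chosen only because it roughly matches the 122 GB RAM nodes mentioned in the deck:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

USER_DATA = """#!/bin/bash
# Software install: fetch and run the Hadoop worker bootstrap (placeholder URL).
curl -s https://example.internal/bootstrap-hadoop-worker.sh | bash
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="r4.4xlarge",         # ~122 GB RAM class, purely illustrative
    MinCount=1,
    MaxCount=1,
    # Disk definition: a dedicated data volume alongside the root disk.
    BlockDeviceMappings=[
        {"DeviceName": "/dev/xvdb", "Ebs": {"VolumeSize": 2000, "VolumeType": "st1"}},
    ],
    # Network definition: placement into the cluster subnet and security group.
    SubnetId="subnet-0abc123",             # placeholder
    SecurityGroupIds=["sg-0abc123"],       # placeholder
    UserData=USER_DATA,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Role", "Value": "hadoop-worker"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```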
30. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
• Deployment was consistent for each and every node of the cluster
– Hostnames defined the same way
– Configuration files created the same way
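One plausible way to get hostnames and configuration files "created the same way" is simple templating, as in this sketch; the naming scheme, file layout and XML snippet are invented for illustration:

```python
from pathlib import Path

CORE_SITE_TEMPLATE = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://{namenode}:8020</value></property>
</configuration>
"""

def node_hostname(cluster, index):
    """Every hostname is derived the same way: <cluster>-worker-<zero-padded index>."""
    return f"{cluster}-worker-{index:03d}"

def render_configs(cluster, namenode, node_count, out_dir="rendered"):
    """Render an identical, parameterised configuration file for every node."""
    for i in range(1, node_count + 1):
        host = node_hostname(cluster, i)
        target = Path(out_dir) / host
        target.mkdir(parents=True, exist_ok=True)
        (target / "core-site.xml").write_text(CORE_SITE_TEMPLATE.format(namenode=namenode))

render_configs(cluster="demo", namenode="demo-master-001", node_count=100)
```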
31. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
• Faster deployment
– Automated build: 3 hrs to build and deploy 100 nodes
– Manual build: 800+ hrs to build and deploy 100 nodes
• Use of automated tools to detect failure and start a new node (Elastic Beanstalk)
32. LEARN • NETWORK • COLLABORATE • INFLUENCE
Use of scripted builds from VM to application
• Reusability of script
– Heavy use of parameters means it is adaptable
• Use of Git meant distributed development was handled easily
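The "heavy use of parameters" point might look something like this at the top of such a build script, kept in Git and reused across clusters; all flag names and defaults are illustrative:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Provision a Hadoop cluster (illustrative).")
    parser.add_argument("--cluster-name", required=True, help="logical name used in hostnames and tags")
    parser.add_argument("--node-count", type=int, default=20, help="number of worker nodes")
    parser.add_argument("--instance-type", default="r4.4xlarge", help="VM size for worker nodes")
    parser.add_argument("--region", default="eu-west-1", help="cloud region to deploy into")
    parser.add_argument("--dry-run", action="store_true", help="print the plan without building anything")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Building {args.node_count} x {args.instance_type} nodes for cluster "
          f"'{args.cluster_name}' in {args.region} (dry run: {args.dry_run})")
    # The actual provisioning calls (like the EC2 sketch earlier) would go here.
```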
37. LEARN • NETWORK • COLLABORATE • INFLUENCE
Things to remember
• Remember the type of platform you are using
• Storage is cheap but not all storage is equal
• Scalability is critical
• Version control rocks
• Automate everything you can
• Value is in the data but not all data is valuable
• Data should not live forever
#5: 2008 H1N1 flu pandemic in US
CDC had out of date data
#7: Panama papers – transient use case
Under Armour – constant data use case answering lots of different questions
Common Sense Finance institution – transient audit data use case
Natures Hope – Pushing structured data into a data lake to provide better temperature control as part of their data lifecycle
Intel – using event streaming to drive manufacturing processes
#10: We are literally drowning in data – data lakes
What data do we acquire – sensor data, web data, social media, transactional data
What data is actually necessary, how long does it need to live for, what is its data life cycle
What data do we need that we do not have access to
How do we curate data for data lakes
#11: We are literally drowning in data – data lakes
What data do we acquire – sensor data, web data, social media, transactional data
What data is actually necessary, how long does it need to live for, what is its data life cycle
What data do we need that we do not have access to
How do we curate data for data lakes
#12: We have four developers and three journalists.
#14: Timeline
Working on the platform for 3 years across the various links
Processed the Panama Papers in around 12 months
#22: How do we store data – databases and files
Big data storage systems
HDFS
Cloud-based – S3 or Azure Storage
Databases – SQL and NoSQL
CSV
Hardware – massively scalable software defined infrastructures which expect failure
#29: John broke my cluster
20 nodes – scaled to 100 nodes