BIG DATA AND ANALYTICS CONCEPTS
BOHITESH MISRA
CHIEF TECHNOLOGY OFFICER
IT STARTUPS
BOHITESH.MISRA@GMAIL.COM
BUILDING A CONNECTED AND SMART ECOSYSTEM:
A ROADMAP TO BUSINESS NIRVANA
IoT → Big Data → Analytics
• The Internet of Things connects all manner of end-points, a treasure trove of data.
• Network and device proliferation enable access to a massive and growing amount of traditionally siloed information.
• Analytics and business intelligence tools empower decision makers as never before by extracting and presenting meaningful information in real time, helping us be more predictive than reactive.
GARTNER HYPE CYCLE - 2015
CONTENT
1. What is Big Data
2. Characteristics of Big Data
3. Why Big Data
4. How It Is Different
5. Big Data Sources
6. Tools Used in Big Data
7. Applications of Big Data
8. Risks of Big Data
9. Benefits of Big Data
10. How Big Data Impacts IT
11. Future of Big Data
BIG DATA
• Big Data may well be the Next Big Thing in the IT world.
• The first organizations to embrace big data were online
and startup firms. Firms like Google, eBay, LinkedIn, and
Facebook were built around big data from the beginning.
• Like many new information technologies, big data can bring
about dramatic cost reductions, substantial improvements in
the time required to perform a computing task, or new
product and service offerings.
WHAT IS BIG DATA?
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• Being bigger, it requires different approaches, techniques, tools and architecture
• It aims to solve new problems, or old problems in a better way
• Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
THREE CHARACTERISTICS OF BIG DATA
• Volume – data quantity
• Velocity – data speed
• Variety – data types
BIG DATA - VOLUME
•A typical PC might have had 10 gigabytes of storage in 2000.
•Today, Facebook ingests 500 terabytes of new data every day.
•A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
BIG DATA - VELOCITY
• Clickstreams and ad impressions capture user behavior at millions of events per second
• High-frequency stock trading algorithms reflect market changes within microseconds
• Machine-to-machine processes exchange data between billions of devices
• Infrastructure and sensors generate massive log data in real time
BIG DATA - VARIETY
• Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
• Traditional database systems were designed to address
smaller volumes of structured data, fewer updates or a
predictable, consistent data structure.
• Big Data analysis must therefore handle many different types of data
STORING BIG DATA
❖Analyzing your data characteristics
• Selecting data sources for analysis
• Eliminating redundant data
• Establishing the role of NoSQL
❖Overview of Big Data stores
• Data models: key value, graph, document, column-family
• Hadoop Distributed File System
• HBase
• Hive
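To make the data-model bullet above concrete, here is a minimal sketch contrasting the key-value and document models with plain Python dicts. The records, keys, and field names are invented for illustration.

```python
# Illustrative only: the same customer record under two NoSQL data models.

# Key-value model (e.g. Riak): an opaque value addressed by a single key.
kv_store = {}
kv_store["customer:1001"] = '{"name": "Asha", "city": "Kolkata"}'  # value is an opaque blob

# Document model (e.g. MongoDB, CouchDB): the store understands the structure,
# so individual fields can be queried and indexed.
doc_store = [
    {"_id": 1001, "name": "Asha", "city": "Kolkata", "orders": [501, 502]},
    {"_id": 1002, "name": "Ravi", "city": "Pune", "orders": []},
]
in_kolkata = [d for d in doc_store if d["city"] == "Kolkata"]
print(in_kolkata)
```

The graph and column-family models extend the same idea: the store's native structure (edges, column families) determines which queries are cheap.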
PROCESSING BIG DATA
❖Integrating disparate data stores
• Mapping data to the programming framework
• Connecting and extracting data from storage
• Transforming data for processing
• Subdividing data in preparation for Hadoop MapReduce
❖Employing Hadoop MapReduce
• Creating the components of Hadoop MapReduce jobs
• Distributing data processing across server farms
• Executing Hadoop MapReduce jobs
• Monitoring the progress of job flows
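As a concrete illustration of "creating the components of Hadoop MapReduce jobs", below is a hedged sketch of word count written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read stdin and write stdout. The script names and input/output paths are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts mapper output by key, so identical
# words arrive on consecutive lines; sum the counts per word.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

An illustrative submission would look like `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths assumed).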
WHY BIG DATA
•Facebook generates 10 TB of data daily
•Twitter generates 7 TB of data daily
•IBM claims 90% of today’s stored data was generated in just the last two years.
BIG DATA SOURCES
• Users
• Applications
• Systems
• Sensors
• Large and growing files (Big Data files)
DATA GENERATION POINTS - EXAMPLES
Mobile Devices
Readers/Scanners
Science facilities
Microphones
Cameras
Social Media
Programs/ Software
BIG DATA ANALYTICS
• Examining large amounts of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
TYPES OF TOOLS USED IN BIG DATA
• Where is processing hosted?
• Distributed Servers / Cloud
• Where is data stored?
• Distributed Storage
• What is the programming model?
• Distributed Processing (e.g. MapReduce)
• How is data stored & indexed?
• High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
• Analytic / Semantic Processing
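To make the "high-performance schema-free databases" bullet concrete, here is a minimal sketch using the PyMongo driver. It assumes a MongoDB server on localhost; the database, collection, and field names are invented for illustration.

```python
# Sketch only: requires `pip install pymongo` and a MongoDB server on localhost.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["analytics"]["click_events"]  # database/collection names are illustrative

# Schema-free: documents in the same collection may carry different fields.
events.insert_one({"user": "u42", "page": "/home", "ts": "2015-06-01T10:00:00"})
events.insert_one({"user": "u42", "page": "/cart", "device": "mobile"})

# Ad-hoc query without any predefined schema.
for doc in events.find({"user": "u42"}):
    print(doc)
```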
APPLICATIONS OF BIG DATA ANALYTICS
• Homeland Security
• Smarter Healthcare: integrated and smart patient care systems and processes
• Retail & Multi-channel Sales: highly personalized customer experience across channels and devices
• Telecom
• Manufacturing: intelligent interconnectivity across the enterprise for enhanced control, speed and efficiency
• Traffic Control
• Trading Analytics
• Search Quality
• Log Analysis
• Finance & Banking: seamless customer experience across all banking channels
HOW BIG DATA IMPACTS IT
• Big data is a disruptive force, presenting IT organizations with opportunities as well as challenges.
• By 2016 there will be 4.4 million IT jobs in Big Data, 1.9 million of them in the US alone.
• India will require a minimum of 1 lakh (100,000) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.
POTENTIAL VALUE OF BIG DATA
• $300 billion potential annual value to US health care.
• $600 billion potential annual consumer surplus from using personal location data.
• 60% potential increase in retailers’ operating margins.
BENEFITS OF BIG DATA
•Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse. It’s about the ability to make better decisions and take meaningful actions at the right time.
•Fast forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
•Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.
BENEFITS OF BIG DATA
• Our newest research finds that organizations are using big data to target
customer-centric outcomes, tap into internal data and build a better information
ecosystem.
• Big Data is already an important part of the $64 billion database and data analytics market
• It offers commercial opportunities comparable in scale to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
FUTURE OF BIG DATA
• An estimated $15 billion has been spent on software firms specializing only in data management and analytics.
• This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
• The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
INDIA – BIG DATA
• Gaining traction and market share
• Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%)
• The current market size is $200 million, expected to reach $1 billion by 2015
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals
BIG DATA ANALYTICS TECHNOLOGIES
NoSQL : non-relational or at least non-SQL database solutions
such as HBase (also a part of the Hadoop ecosystem),
Cassandra, MongoDB, Riak, CouchDB, and many others.
Hadoop: an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other components
THE FOUR PILLARS FOR AN EFFECTIVE BIG DATA STRATEGY
• Storage
• User Experience
• Digital Intelligence and Analytics
• Content Discovery and Management
Just these segments account for more than $10 billion in served, addressable markets.
MOTIVATION FOR SPECIALIZED BIG DATA SYSTEMS
• Cost of data storage is dropping, but rate of data capture is soaring
• Sources: online/digital, communications, messaging, usage, transactions…
• Furthermore, need for real-time data-driven insights is also more urgent
• Traditional data warehouses and RDBMS systems cannot keep up
• They are unable to capture, manage and optimize the volume and diversity of data
marketers are seeking to harness today
• Structured, unstructured, and semi-structured data are all essential ingredients in
today’s marketing mix; traditional systems cannot handle this
• Big Data systems: cluster-based, commodity-priced, distributed computing database management systems
• Most often based on Hadoop, but usable without MapReduce programming skills
• Key features: linear scalability, parallel computing, node redundancy, and
centralized access to data
• Server clusters behave like a massive single mainframe: What traditional
databases do in months, a Big Data management system can do in hours
INTERNET OF THINGS
&
PREDICTIVE ANALYTICS
INTERNET OF THINGS
• Each “thing” or connected device is part of the digital shadow of a person
• For there to be a market in the internet of things, two things must be true:
1) The “thing” in question must provide utility to the human, and
2) The digital shadow must provide value to an enterprise.
MARKET
• The “market” is made up of many parts :
➢from wearables to drivables to home and
➢industrial sensors and controllers.
• Each part is made up of segments :
➢Innovators,
➢Early adopters,
➢Pragmatists,
➢Conservatives, and
➢Laggards across many industries.
PREDICTIVE ANALYTICS
• From the data streams that implement the “digital shadows” of people, we
can use predictive analytics to understand their needs and behavior better
than ever before.
• Every new dimension of data increases the predictive power, enabling
enterprises to answer the question “what does the human want?”
INTERNET OF THINGS & PREDICTIVE ANALYTICS
• Transforming the internet of things and its sibling, predictive analytics, to be
programmable by the same labor pool that has developed the apps which drove
the mobile revolution makes basic economic sense.
• The data generated by the internet of things is coupled with :
➢data analysis,
➢data discovery tools, and
➢techniques to help business leaders identify emerging developments, such as machines that might need maintenance (to prevent costly breakdowns), or sudden shifts in customer or market conditions that might signal some action a company should take.
• With the internet of things, the physical world becomes a networked information system: sensors and actuators embedded in real physical objects, linked through wired and wireless networks via the Internet Protocol.
• This holds special value for manufacturing:
➢The potential for connected physical systems to improve productivity in the production process and
➢the supply chain is huge.
• Consider processes that govern themselves, where smart products can take corrective action to avoid damage and where individual parts are automatically replenished.
• Such technologies already exist and could drive the fourth industrial revolution, following the steam engine, the conveyor belt (the assembly line; think Ford Model T), and the first phase of IT and automation technology.
EXAMPLE 1 : AUTO INSURANCE
• The first-order vector was a connected accelerometer offered to drivers :
➢ to improve their insurance rates based on proven “safe driving” habits.
• Through this digital shadow, the insurance provider can make much better
actuarial predictions than through the coarse-grained data they had before
➢age,
➢gender, and
➢ traffic violations.
• This is interesting in the same way the BlackBerry was interesting: a basic capability adopted for basic business improvement.
• The second-order vector is much stronger :
➢the ability to transform the insurance market to better meet the needs of customers while
changing the rules of competition.
➢Based on real-time driving information, insurance companies can :
▪ move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange),
▪ bidding on drivers and
▪ providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive fast
tomorrow? Pay a little more but don’t worry about your “permanent record”.
• These outcomes are all based on tying the internet of things to predictive
analytics.
EXAMPLE 2 : HEALTH CARE
• The first-order vector is similar, a wearable accelerometer offered to patients :
➢ To improve traceability of their compliance with their exercise prescription,
➢Enabling better outcomes for cardiac patients.
➢Unlike prescription refills, exercise compliance has been untraceable before, so this digital
shadow is a breakthrough for medicine.
• Similar developments exist in digestible sensors within medications :
➢which activate only on contact with stomach acid,
➢providing higher truth and
➢better granularity than a monthly refill.
• In the second-order vector in healthcare, the ability to combine multiple streams of information that were previously invisible has the potential to drive better health outcomes through provably higher patient compliance.
• Sorting these data streams at scale will allow health providers and health insurance
companies to rapidly iterate health protocols across a population of humans, augmenting
human expertise with predictive analytics.
• Outcome-based analysis based on predictive models built from data can reduce :
➢waste,
➢error rates, and
➢lawsuits while driving better margins.
• Larger exchanges of this type of data will tend to :
➢ perform better,
➢creating a more effective market and
➢ a better pool of empirical research for science.
EXAMPLE 3 : AUTO COMPANIES
• They have installed thousands of "black boxes" inside their prototype and field-testing vehicles to capture second-by-second data from the dozens of control units which manage today's automobiles.
• These boxes simply plug into the vehicle's on-board diagnostic (OBD) port, which is typically located under the front dashboard.
• They collect 500-750 different vehicle performance parameters that add up to
terabytes of data in hours!
• The intent of the automakers for installing these boxes is to collect data which their
engineers can later analyze to fix bugs and improve on existing designs.
• For example, one car manufacturer found out from this data that their minivan batteries
would end up in a recall.
➢The problem was an underpowered alternator - it was not able to fully recharge the batteries
because the most common drive cycle for this particular minivan was less than 3 miles.
➢As a result, there appeared to be a lot of complaints about dead batteries and the company was
potentially facing the recall of millions of minivans which had this alternator.
➢The boxes collect information about driving cycles and this data was really useful in understanding
the real reason behind the dead batteries.
➢The test vehicles which had short drive cycles were the ones which reported dead batteries! Simply changing to a higher-capacity alternator would fix the problem.
➢Now it was an easy fix to extend this solution to the entire fleet.
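A hedged sketch of the kind of analysis described above: correlating drive-cycle length with dead-battery reports across test vehicles. The data and threshold are invented for illustration; the real analysis runs over terabytes of OBD logs.

```python
# Illustrative data: (vehicle_id, average drive-cycle miles, dead-battery report?)
drive_cycles = [
    ("V1", 2.1, True), ("V2", 2.8, True), ("V3", 12.4, False),
    ("V4", 1.9, True), ("V5", 25.0, False), ("V6", 2.5, False),
]

SHORT_CYCLE_MILES = 3.0  # the problematic cycle length from the example above

short = [v for v in drive_cycles if v[1] < SHORT_CYCLE_MILES]
rest = [v for v in drive_cycles if v[1] >= SHORT_CYCLE_MILES]
rate_short = sum(v[2] for v in short) / len(short)
rate_rest = sum(v[2] for v in rest) / len(rest)
print(f"dead-battery rate, short cycles: {rate_short:.0%}; others: {rate_rest:.0%}")
```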
ENDLESS OPPORTUNITY
The opportunities are literally endless,
➢Ranging from early fault detection (predicting when a particular component
is likely to fail)
➢To automatically adjusting driving route based on traffic pattern
predictions.
The ultimate test of predictive analytics in the internet of things is of course fully
autonomous systems, such as :
➢the Nissan car of 2020 or
➢the Google self-driving car of today.
In the end all autonomous systems will need the ability to build predictive
capabilities - in other words, machines must learn machine learning!
EXAMPLE 4 : GOOGLE’S SELF DRIVING CAR
Google claims that their self-driving car of today has logged more
than 300,000 miles with almost zero incidence of accidents.
The one time a minor crash did occur was when the car was rear-ended by a human-driven car!
So, when the technology is fully mature, it is not just parking valets who become obsolete; other higher-paying professions, such as automotive safety systems experts, may also need to look for other options!
Predictive analytics is the enabler that will make this happen.
EXAMPLE 5 : JET AIRLINER
• A jet airliner generates 20 terabytes of diagnostic data per hour of flight.
• The average oil platform has 40,000 sensors, generating data 24/7.
• M2M is now generating enormous volumes of data and is testing the capabilities of
traditional database technologies.
• To extract rich, real-time insight from the vast amounts of machine-generated data,
companies will have to build a technology foundation with speed and scale because raw
data, whatever the source, is only useful after it has been transformed into knowledge
through analysis.
• Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets to identify patterns and insights, and can perform analysis at massive scale with precision even as machine-generated data grows beyond the petabyte scale.
FINDING THE RIGHT ANALYTICS DATABASE TECHNOLOGY
• To find the right analytics database technology to capture, connect, and drive meaning from data, companies should consider the following requirements:
➢Real-time Analysis : Businesses can’t afford for data to get stale. Data solutions need to :
▪ load quickly and easily, and
▪ dynamically query, analyze, and communicate M2M information in real time,
without huge investments in IT administration, support, and tuning.
➢Flexible Querying and Ad-hoc Reporting : When intelligence needs change quickly, analytic tools can’t be constrained by data schemas that limit the number and type of queries that can be performed. This type of deeper analysis also cannot be constrained by tinkering or time-consuming manual configuration (such as indexing and managing data partitions) to create and change analytic queries.
➢Efficient Compression : Efficient data compression is key to enabling M2M data management within :
▪ a network node,
▪ a smart device, or
▪ a massive data center cluster.
Better compression allows for less storage capacity overall, as well as tighter data sampling and longer historical data sets, increasing the accuracy of query results.
➢Ease of Use and Cost : Data analysis must be affordable, easy to use, and simple to implement in order to justify the investment. This demands low-touch solutions that are optimized to deliver fast analysis of large volumes of data, with minimal hardware, administrative effort, and customization needed to set up or change query and reporting parameters.
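To illustrate the Efficient Compression point above: repetitive M2M telemetry compresses extremely well, which is easy to demonstrate with Python's built-in zlib. The payload here is simulated.

```python
import zlib

# Simulated M2M payload: highly repetitive sensor log lines.
readings = "".join(f"sensor_07,temp,21.{i % 10}\n" for i in range(10000)).encode()

compressed = zlib.compress(readings, 9)  # maximum compression level
ratio = len(readings) / len(compressed)
print(f"raw: {len(readings)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}x")  # repetitive telemetry often compresses 10x or more
```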
EXAMPLE 6 : UNION PACIFIC RAILROAD
• The railroad is using sensor and analytics technologies to predict and prevent train derailments.
• For example, the company has placed infrared sensors every 20 miles along its tracks to gather 20 million temperature readings of train wheels each day, looking for signs of overheating, which is a sign of impending failure.
• Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels.
• Data from such physical measurements is sent via fiber optic lines to Union Pacific’s data centers.
• Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing Union Pacific experts to determine, within minutes of capturing the data, whether a driver should pull a train over for inspection or reduce its speed until it reaches the next station to be repaired.
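A toy sketch of the pattern-matching idea: flag wheel-temperature readings that deviate sharply from the recent baseline. The window size, threshold, and data are invented; Union Pacific's actual algorithms are far more sophisticated.

```python
from collections import deque

def flag_overheating(readings, window=5, delta=30.0):
    """Yield indices whose temperature exceeds the rolling mean by `delta` degrees."""
    recent = deque(maxlen=window)
    for i, temp in enumerate(readings):
        if len(recent) == window and temp - sum(recent) / window > delta:
            yield i
        recent.append(temp)

wheel_temps = [80, 82, 81, 83, 82, 84, 130, 85]  # one anomalous spike
print(list(flag_overheating(wheel_temps)))       # -> [6]
```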
HOW TO ANALYZE MACHINE AND SENSOR DATA
• This example captures and refines data from heating, ventilation, and air conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks Data Platform, then analyzes the refined sensor data to maintain optimal building temperatures.
• Sensor data – A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the “internet of things.”
• Potential uses of sensor data – sensors can be used to collect data from many sources, such as:
➢To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break.
➢To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital statistics during recovery from a medical procedure.
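A hedged, Hadoop-free sketch of the refine-and-analyze step described above: compare each building's readings to its target temperature and report buildings drifting out of range. The field names, data, and tolerance are illustrative; the tutorial itself runs this at scale on the Hortonworks Data Platform.

```python
# Each record: (building_id, actual_temp_C, target_temp_C)
records = [
    ("B01", 22.5, 21.0), ("B01", 21.2, 21.0),
    ("B02", 18.0, 21.0), ("B02", 17.5, 21.0),
]

TOLERANCE = 1.5  # degrees C considered "in range"; illustrative

drift = {}
for building, actual, target in records:
    drift.setdefault(building, []).append(abs(actual - target))

for building, deltas in drift.items():
    avg = sum(deltas) / len(deltas)
    status = "OK" if avg <= TOLERANCE else "CHECK HVAC"
    print(f"{building}: mean deviation {avg:.1f} C -> {status}")
```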
APACHE HADOOP - HDFS
OUTLINE
• Architecture of Hadoop Distributed File System
• Hadoop usage
• Ideas for Hadoop related research
HADOOP, WHY?
• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Need common infrastructure
– Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor’s, but
• workloads are IO-bound, not CPU-bound
HIVE, WHY?
• Need a multi-petabyte warehouse
• Files are insufficient data abstractions
• Need tables, schemas, partitions, indices
• SQL is highly popular
• Need an open data format
– RDBMSs have a closed data format
– Need a flexible schema
• Hive is a Hadoop subproject!
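As an illustration of "tables, schemas, SQL" over Hadoop, here is a hedged sketch using the PyHive client. It assumes a reachable HiveServer2 on localhost; the table and column names are invented.

```python
# Sketch only: requires `pip install pyhive` and a reachable HiveServer2.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()

# Hive exposes files in HDFS as tables, so analysts can use familiar SQL.
cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs          -- illustrative table over files in HDFS
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cur.fetchall():
    print(page, hits)
```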
HADOOP
What is Hadoop?
 It is a framework for running applications on large clusters of commodity hardware, designed to store and process huge volumes of data
Hadoop includes:
 HDFS, a distributed filesystem
 Map/Reduce, an offline computing engine that implements this programming model on top of HDFS
Concept
Moving computation is more efficient than moving large data
• Data-intensive applications with petabytes of data
• Web pages: 20+ billion web pages x 20 KB = 400+ terabytes
• One computer can read 30-35 MB/sec from disk, so it would take about four months to read the web
• The same problem with 1,000 machines: < 3 hours
• Difficulties with a large number of machines:
• communication and coordination
• recovering from machine failure
• status reporting
• debugging
• optimization
• locality
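These back-of-envelope figures check out; a quick worked version using the slide's own numbers (decimal units assumed):

```python
pages = 20e9                      # 20+ billion web pages
page_size_bytes = 20_000          # ~20 KB each
total = pages * page_size_bytes   # ~4e14 bytes, i.e. 400+ terabytes
print(f"corpus: ~{total / 1e12:.0f} TB")

read_rate = 35e6                  # one disk reads ~30-35 MB/sec
days = total / read_rate / 86_400
print(f"one machine: ~{days:.0f} days (~4 months)")
print(f"1000 machines: ~{days * 24 / 1000:.1f} hours")
```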
WHO USES HADOOP?
• Facebook
• Amazon/A9
• Google
• IBM
• New York Times
• Yahoo!
• PowerSet
COMMODITY HARDWARE
Typically a 2-level architecture
– Nodes are commodity PCs
– 30-40 nodes per rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
GOALS OF HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where data
resides
– Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
HDFS ARCHITECTURE
Components: Client, NameNode, Secondary NameNode, and DataNodes (which report cluster membership to the NameNode).
NameNode : maps a file to a file-id and a list of blocks on DataNodes
DataNode : maps a block-id to a physical location on disk
SecondaryNameNode : periodic merge of the transaction log
DISTRIBUTED FILE SYSTEM
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
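A toy model of the block scheme above: split a file into 128 MB blocks and assign each block three replicas. The round-robin placement here is a simplification for illustration, not HDFS's real rack-aware policy.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024                    # 128 MB, as above
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # illustrative cluster

def place_blocks(file_size):
    """Return {block_id: [replica datanodes]} for a file -- toy round-robin."""
    n_blocks = -(-file_size // BLOCK_SIZE)         # ceiling division
    ring = itertools.cycle(datanodes)
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}

print(place_blocks(300 * 1024 * 1024))  # a 300 MB file -> 3 blocks, 3 replicas each
```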
HDFS – HADOOP DISTRIBUTED FILE SYSTEM
HADOOP CLUSTER ARCHITECTURE
• Map/Reduce master: the “JobTracker”
• Accepts MR jobs submitted by users
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors task and TaskTracker status; re-executes tasks upon failure
• Map/Reduce slaves: the “TaskTrackers”
• Run Map and Reduce tasks upon instruction from the JobTracker
• Manage storage and transmission of intermediate output
NAMENODE METADATA
• Metadata in memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A transaction log
– Records file creations, file deletions, etc.
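A toy in-memory model of the metadata described above, with a transaction log appended on each mutation. The structure is simplified for illustration and is not the NameNode's actual layout.

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    blocks: list                 # ordered block ids for this file
    replication: int = 3
    ctime: str = "2015-01-01"    # file attribute, e.g. creation time

@dataclass
class NameNode:
    files: dict = field(default_factory=dict)            # path -> FileMeta
    block_locations: dict = field(default_factory=dict)  # block id -> [DataNodes]
    txlog: list = field(default_factory=list)            # records creations/deletions

    def create(self, path, blocks):
        self.files[path] = FileMeta(blocks=list(blocks))
        self.txlog.append(("CREATE", path))

nn = NameNode()
nn.create("/logs/2015-06-01.log", blocks=[101, 102])
nn.block_locations[101] = ["dn1", "dn3", "dn4"]
print(nn.files, nn.txlog)
```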
DATANODE
• A Block Server
– Stores data in the local file system
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
DATA MODEL
• Files are broken into large blocks
– Typically 128 MB block size
– Blocks are replicated for reliability
• One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are randomly placed
• Understands rack locality
– Data placement is exposed so that computation can be migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the NameNode; clients access data directly from DataNodes
– Throughput of the file system scales nearly linearly with the number of nodes
DATA CORRECTNESS
• Uses checksums to validate data
– Uses CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
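A sketch of the scheme just described: CRC32 over 512-byte chunks, computed at write time and verified at read time. Python's zlib provides CRC32; the chunking mirrors the slide, not HDFS's exact on-disk format.

```python
import zlib

CHUNK = 512  # bytes per checksum, as above

def checksums(data: bytes):
    """CRC32 per 512-byte chunk -- what the client computes at write time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored):
    """At read time: recompute and compare; a mismatch means try another replica."""
    return checksums(data) == stored

block = b"x" * 1500
stored = checksums(block)
corrupted = block[:600] + b"?" + block[601:]     # flip one byte in chunk 1
print(verify(block, stored), verify(corrupted, stored))  # True False
```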
NAMENODE FAILURE
• A single point of failure – new version has a secondary
namenode
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
HADOOP MAP/REDUCE
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in a generic framework
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
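The shell analogy above maps directly onto code; here is a minimal in-process imitation of input | map | shuffle | reduce | output for word count (single machine only, for illustration):

```python
from collections import defaultdict

lines = ["big data and analytics", "big data tools"]   # input (cat *)

# map: emit (key, value) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle: group values by key (what `sort` enables in the shell pipeline)
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# reduce: aggregate each group (the `uniq -c` step)
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # output (cat > file)
```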
EXAMPLE - HADOOP AT FACEBOOK
• Production cluster
• 4800 cores, 600 machines, 16GB per machine – April 2009
• 8000 cores, 1000 machines, 32 GB per machine – July 2009
• 4 SATA disks of 1 TB each per machine
• 2 level network hierarchy, 40 machines per rack
• Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
• 800 cores, 16GB each
DATA FLOW
Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL
HADOOP AND HIVE USAGE
• Statistics :
• 15 TB uncompressed data ingested per day
• 55TB of compressed data scanned per day
• 3200+ jobs on production cluster per day
• 80M compute minutes per day
• Barrier to entry is reduced:
• 80+ engineers have run jobs on Hadoop platform
• Analysts (non-engineers) starting to use Hadoop through Hive
BIG DATA LEARNING PATH
• 1. Understand the difference between various data handling techniques like OLTP, OLAP, Data Mining, Data Warehouse, Data Mart, etc.
• 2. Understand various visualization techniques like Bar Chart, Heat Map, Tree Map, Density Map, etc.
• 3. Understand Data Mining / Analytics algorithms.
• 4. Identify various sources of data and identify data elements that need to be focused upon for Analytics.
• 6. Understand data quality checks and ensure data quality. Without quality data, analytics may deviate a lot from the actual scenario.
BIG DATA LEARNING PATH
• 7. Answer why you need a distributed system.
• 8. Study various data handling techniques provided by NoSQL databases like MongoDB, Cassandra, etc.
• 9. Find out how Hadoop or related Big Data techniques can be used for distributed data by using horizontal scalability techniques.
• 10. Finalize algorithms that will run on top of this data, and identify tools or develop programs for these algorithms.
• 11. Use the visualization techniques learnt in step 2 above to present the output.
• 12. Keep making the tool/program more and more intelligent as a continuous process.
Note: A combination of Relational and NoSQL databases may be required for
performing required analytics and/or generating visualizations.
CONCLUSION
• Why commodity hardware? Because it is cheaper and designed to tolerate faults
• Why HDFS? Network bandwidth vs. seek latency
• Why the Map/Reduce programming model?
Parallel programming
Large data sets
Moving computation to data
A single compute + data cluster
MORE IDEAS FOR FURTHER DISCUSSION AND RESEARCH
• Hadoop Log Analysis
• Failure prediction and root cause analysis
• Hadoop Data Rebalancing
• Based on access patterns and load
• Best use of flash memory?
• Design new topology based on commodity hardware
USEFUL LINKS
•HDFS Design:
• http://hadoop.apache.org/core/docs/current/hdfs_design.html
•Hadoop API:
• http://hadoop.apache.org/core/docs/current/api/
•Hive:
• http://hadoop.apache.org/hive/
THANK YOU DATA SCIENTISTS !

More Related Content

PPT
Big Data
Vinayak Kamath
 
PDF
The ABCs of Treating Data as Product
DATAVERSITY
 
PDF
Big data introduction
Chirag Ahuja
 
PDF
Modern Data architecture Design
Kujambu Murugesan
 
PPTX
What is Big Data?
Bernard Marr
 
PPTX
Big Data PPT by Rohit Dubey
Rohit Dubey
 
PPTX
Big data architectures and the data lake
James Serra
 
PPSX
Applications of Big Data Analytics in Businesses
T.S. Lim
 
Big Data
Vinayak Kamath
 
The ABCs of Treating Data as Product
DATAVERSITY
 
Big data introduction
Chirag Ahuja
 
Modern Data architecture Design
Kujambu Murugesan
 
What is Big Data?
Bernard Marr
 
Big Data PPT by Rohit Dubey
Rohit Dubey
 
Big data architectures and the data lake
James Serra
 
Applications of Big Data Analytics in Businesses
T.S. Lim
 

What's hot (20)

PDF
Enterprise Architecture vs. Data Architecture
DATAVERSITY
 
PPTX
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
PDF
Data Catalog as the Platform for Data Intelligence
Alation
 
PPTX
Databricks for Dummies
Rodney Joyce
 
PPTX
DW Migration Webinar-March 2022.pptx
Databricks
 
PDF
AIOps - The next 5 years
Moogsoft
 
PPTX
A Step-by-Step Guide to Metadata Management
SaachiShankar
 
PPTX
Data Lake Overview
James Serra
 
PPS
Data Warehouse 101
PanaEk Warawit
 
PDF
Data Architecture Strategies: Data Architecture for Digital Transformation
DATAVERSITY
 
PPTX
Big data
Nausheen Hasan
 
PDF
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
PPTX
Data Architecture Brief Overview
Hal Kalechofsky
 
PDF
Data Governance Best Practices
DATAVERSITY
 
PDF
Data Governance Takes a Village (So Why is Everyone Hiding?)
DATAVERSITY
 
PDF
Data Quality Best Practices
DATAVERSITY
 
PPTX
Big Data
Rohit Jain
 
PPTX
Most Common Data Governance Challenges in the Digital Economy
Robyn Bollhorst
 
PDF
Improving Data Literacy Around Data Architecture
DATAVERSITY
 
Enterprise Architecture vs. Data Architecture
DATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Data Catalog as the Platform for Data Intelligence
Alation
 
Databricks for Dummies
Rodney Joyce
 
DW Migration Webinar-March 2022.pptx
Databricks
 
AIOps - The next 5 years
Moogsoft
 
A Step-by-Step Guide to Metadata Management
SaachiShankar
 
Data Lake Overview
James Serra
 
Data Warehouse 101
PanaEk Warawit
 
Data Architecture Strategies: Data Architecture for Digital Transformation
DATAVERSITY
 
Big data
Nausheen Hasan
 
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Data Architecture Brief Overview
Hal Kalechofsky
 
Data Governance Best Practices
DATAVERSITY
 
Data Governance Takes a Village (So Why is Everyone Hiding?)
DATAVERSITY
 
Data Quality Best Practices
DATAVERSITY
 
Big Data
Rohit Jain
 
Most Common Data Governance Challenges in the Digital Economy
Robyn Bollhorst
 
Improving Data Literacy Around Data Architecture
DATAVERSITY
 
Ad

Similar to Big data and analytics (20)

PPTX
Kartikey tripathi
KARTIKEY TRIPATHI
 
DOCX
Content1. Introduction2. What is Big Data3. Characte.docx
dickonsondorris
 
PPTX
ppt final.pptx
kalai75
 
PPTX
Big_Data_ppt[1] (1).pptx
TanguturiAvinash
 
PDF
Bigdatappt 140225061440-phpapp01
nayanbhatia2
 
PPTX
Special issues on big data
Vedanand Singh
 
PPTX
Big data ppt
Nasrin Hussain
 
PPTX
Presentation on Big Data
Md. Salman Ahmed
 
PPTX
big-data-8722-m8RQ3h1.pptx
VaishnavGhadge1
 
PPTX
Big Data ppt
Vivek Gautam
 
PPTX
bigdata.pptx
KammetaJoshna
 
PPT
big data
subhakirthi
 
PPTX
Big data
Mahmudul Alam
 
PPTX
Big data
SaraRao3
 
PPTX
Bigdata " new level"
Vamshikrishna Goud
 
PPTX
Big data Analytics
Guduru Lakshmi Kiranmai
 
PPTX
Bigdata
sayan sarker
 
PPTX
BigDataFinal.pptx
PentaTech
 
PPTX
Big data
madhavsolanki
 
Kartikey tripathi
KARTIKEY TRIPATHI
 
Content1. Introduction2. What is Big Data3. Characte.docx
dickonsondorris
 
ppt final.pptx
kalai75
 
Big_Data_ppt[1] (1).pptx
TanguturiAvinash
 
Bigdatappt 140225061440-phpapp01
nayanbhatia2
 
Special issues on big data
Vedanand Singh
 
Big data ppt
Nasrin Hussain
 
Presentation on Big Data
Md. Salman Ahmed
 
big-data-8722-m8RQ3h1.pptx
VaishnavGhadge1
 
Big Data ppt
Vivek Gautam
 
bigdata.pptx
KammetaJoshna
 
big data
subhakirthi
 
Big data
Mahmudul Alam
 
Big data
SaraRao3
 
Bigdata " new level"
Vamshikrishna Goud
 
Big data Analytics
Guduru Lakshmi Kiranmai
 
Bigdata
sayan sarker
 
BigDataFinal.pptx
PentaTech
 
Big data
madhavsolanki
 
Ad

More from Bohitesh Misra, PMP (10)

PDF
Innovation in enterpreneurship_2021
Bohitesh Misra, PMP
 
PDF
Use of data science for startups_Sept 2021
Bohitesh Misra, PMP
 
PDF
Building castles on sand - Project Management in distributed project environment
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 3 - Green it_Smartdust
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 2 - Blockchain smart_contracts
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 1 - introduction
Bohitesh Misra, PMP
 
PDF
What is data science ?
Bohitesh Misra, PMP
 
PPTX
Business analytics why now_what next
Bohitesh Misra, PMP
 
PDF
Internet of Things (IoT) based Solar Energy System security considerations
Bohitesh Misra, PMP
 
Innovation in enterpreneurship_2021
Bohitesh Misra, PMP
 
Use of data science for startups_Sept 2021
Bohitesh Misra, PMP
 
Building castles on sand - Project Management in distributed project environment
Bohitesh Misra, PMP
 
Disruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
Bohitesh Misra, PMP
 
Disruptive technologies - Session 3 - Green it_Smartdust
Bohitesh Misra, PMP
 
Disruptive technologies - Session 2 - Blockchain smart_contracts
Bohitesh Misra, PMP
 
Disruptive technologies - Session 1 - introduction
Bohitesh Misra, PMP
 
What is data science ?
Bohitesh Misra, PMP
 
Business analytics why now_what next
Bohitesh Misra, PMP
 
Internet of Things (IoT) based Solar Energy System security considerations
Bohitesh Misra, PMP
 

Recently uploaded (20)

PPTX
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
PPTX
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PPTX
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PPTX
Presentation on animal welfare a good topic
kidscream385
 
PDF
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PPTX
Fuzzy_Membership_Functions_Presentation.pptx
pythoncrazy2024
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PPTX
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PPTX
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
Pipeline Automatic Leak Detection for Water Distribution Systems
Sione Palu
 
PPTX
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Presentation on animal welfare a good topic
kidscream385
 
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
Fuzzy_Membership_Functions_Presentation.pptx
pythoncrazy2024
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
Pipeline Automatic Leak Detection for Water Distribution Systems
Sione Palu
 
Data Security Breach: Immediate Action Plan
varmabhuvan266
 

Big data and analytics

  • 1. BIG DATA AND ANALYTICS CONCEPTS BOHITESH MISRA CHIEF TECHNOLOGY OFFICER IT STARTUPS [email protected]
  • 2. The Internet of Things connects all manner of end-points, a treasure trove of data Networks and device proliferation enable access to a massive and growing amount of traditionally siloed information Analytics and business intelligence tools empower decision makers as never before by extracting and presenting meaningful information in real-time, helping us be more predictive than reactive BUILDING A CONNECTED AND SMART ECOSYSTEM: A ROADMAP TO BUSINESS NIRVANA IoT Big Data Analytics
  • 4. CONTENT 1. What is Big Data 2. Characteristic of Big Data 3. Why Big Data 4. How it is Different 5. Big Data sources 6. Tools used in Big Data 7. Application of Big Data 8. Risks of Big Data 9. Benefits of Big Data 10.How Big Data Impact on IT 11.Future of Big Data
  • 5. BIG DATA • Big Data may well be the Next Big Thing in the IT world. • The first organizations to embrace big data were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. • Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.
  • 6. • ‘Big Data’ is similar to ‘small data’, but bigger in size • but having data bigger it requires different approaches, Techniques, tools and architecture • an aim to solve new problems or old problems in a better way • Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques. WHAT IS BIG DATA?
  • 8. THREE CHARACTERISTICS OF BIG DATA Volume • Data quantity Velocity • Data Speed Variety • Data Types
  • 9. BIG DATA - VOLUME •A typical PC might have had 10 gigabytes of storage in 2000. •Today, Facebook ingests 500 terabytes of new data every day. •Boeing 737 will generate 240 terabytes of flight data during a single flight across the US. • The smart phones, the data they create and consume; sensors embedded into everyday objects will soon result in billions of new, constantly-updated data feeds containing environmental, location, and other information, including video.
  • 10. BIG DATA - VELOCITY • Clickstreams and ad impressions capture user behavior at millions of events per second • high-frequency stock trading algorithms reflect market changes within microseconds • machine to machine processes exchange data between billions of devices • infrastructure and sensors generate massive log data in real-time
  • 11. BIG DATA - VARIETY • Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. • Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure. • Big Data analysis includes different types of data
  • 12. STORING BIG DATA ❖Analyzing your data characteristics • Selecting data sources for analysis • Eliminating redundant data • Establishing the role of NoSQL ❖Overview of Big Data stores • Data models: key value, graph, document, column-family • Hadoop Distributed File System • HBase • Hive
  • 13. PROCESSING BIG DATA ❖Integrating disparate data stores • Mapping data to the programming framework • Connecting and extracting data from storage • Transforming data for processing • Subdividing data in preparation for Hadoop MapReduce ❖Employing Hadoop MapReduce • Creating the components of Hadoop MapReduce jobs • Distributing data processing across server farms • Executing Hadoop MapReduce jobs • Monitoring the progress of job flows
  • 14. WHY BIG DATA •FB generates 10TB daily •Twitter generates 7TB of data Daily •IBM claims 90% of today’s stored data was generated in just the last two years.
  • 15. BIG DATA SOURCES Users Application Systems Sensors Large and growing files (Big data files)
  • 16. DATA GENERATION POINTS - EXAMPLES Mobile Devices Readers/Scanners Science facilities Microphones Cameras Social Media Programs/ Software
  • 17. BIG DATA ANALYTICS • Examining large amount of data • Appropriate information • Identification of hidden patterns, unknown correlations • Competitive advantage • Better business decisions: strategic and operational • Effective marketing, customer satisfaction, increased revenue
  • 18. • Where processing is hosted? • Distributed Servers / Cloud • Where data is stored? • Distributed Storage • What is the programming model? • Distributed Processing (e.g. MapReduce) • How data is stored & indexed? • High-performance schema-free databases (e.g. MongoDB) • What operations are performed on data? • Analytic / Semantic Processing TYPES OF TOOLS USED IN BIG-DATA
  • 19. Application Of Big Data analytics Homeland Security Smarter Healthcare Integrated and smart patient care systems and processes Retail & Multi-channel sales Highly personalized customer experience across channels and devices Telecom Manufacturing Intelligent interconnectivity across the enterprise for enhanced control, speed and efficiency Traffic Control Trading Analytics Search Quality Log Analysis Finance & Banking Seamless customer experience across all banking channels
  • 20. HOW BIG DATA IMPACTS ON IT • Big data is a troublesome force presenting opportunities with challenges to IT organizations. • By 2016 4.4 million IT jobs in Big Data ; 1.9 million is in US itself • India will require a minimum of 1 lakh data scientists in the next couple of years in addition to data analysts and data managers to support the Big Data space.
  • 21. POTENTIAL VALUE OF BIG DATA • $300 billion potential annual value to US health care. • $600 billion potential annual consumer surplus from using personal location data. • 60% potential in retailers’ operating margins.
  • 22. BENEFITS OF BIG DATA •Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse, It’s about the ability to make better decisions and take meaningful actions at the right time. •Fast forward to the present and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it. •Technologies such as MapReduce,Hive and Impala enable you to run queries without changing the data structures underneath.
  • 23. BENEFITS OF BIG DATA • Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem. • Big Data is already an important part of the $64 billion database and data analytics market • It offers commercial opportunities of a comparable scale to enterprise software in the late 1980s • And the Internet boom of the 1990s, and the social media explosion of today.
  • 24. FUTURE OF BIG DATA • $15 billion on software firms only specializing in data management and analytics. • This industry on its own is worth more than $100 billion and growing at almost 10% a year which is roughly twice as fast as the software business as a whole. • The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
  • 25. INDIA – BIG DATA • Gaining attraction and market • Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1 % ) • Current market size is $200 million. By 2015 $1 billion • The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals
  • 26. BIG DATA ANALYTICS TECHNOLOGIES NoSQL : non-relational or at least non-SQL database solutions such as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others. Hadoop: It is an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other software packages
  • 27. THE FOUR PILLARS FOR AN EFFECTIVE BIG DATA STRATEGY Storage User Experience Digital intelligence and Analytics Content Discovery and Management Just these segments account for more than $10 billion in served, addressable markets.
  • 28. MOTIVATION FOR SPECIALIZED BIG DATA SYSTEMS • Cost of data storage is dropping, but rate of data capture is soaring • Sources: online/digital, communications, messaging, usage, transactions… • Furthermore, need for real-time data-driven insights is also more urgent • Traditional data warehouses and RDBMS systems cannot keep up • They are unable to capture, manage and optimize the volume and diversity of data marketers are seeking to harness today • Structured, unstructured, and semi-structured data are all essential ingredients in today’s marketing mix; traditional systems cannot handle this • Big Data systems: cluster-based, commodity priced, distributed computing database management system • Most often based on Hadoop, but usable without MapReduce programming skills • Key features: linear scalability, parallel computing, node redundancy, and centralized access to data • Server clusters behave like a massive single mainframe: What traditional databases do in months, a Big Data management system can do in hours
  • 30. INTERNET OF THINGS • Each “thing” or connected device is part of the digital shadow of a person • For there to be a market in the internet of things, two things must be true: 1) The “thing” in question must provide utility to the human, and 2) The digital shadow must provide value to an enterprise.
  • 31. MARKET • The “market” is made up of many parts : ➢From wearable to drivable to home and ➢Industrial sensors and controllers, and • Each part is made up of segments : ➢Innovators, ➢Early adopters, ➢Pragmatists, ➢Conservatives, and ➢Laggards across many industries.
  • 32. PREDICTIVE ANALYTICS • From the data streams that implement the “digital shadows” of people, we can use predictive analytics to understand their needs and behavior better than ever before. • Every new dimension of data increases the predictive power, enabling enterprises to answer the question “what does the human want?”
  • 33. INTERNET OF THINGS & PREDICTIVE ANALYTICS • Transforming the internet of things and its sibling, predictive analytics, to be programmable by the same labor pool that has developed the apps which drove the mobile revolution makes basic economic sense. • Types of data generated by the internet of things is coupled with : ➢data analysis ➢data discovery tools and ➢ techniques to help business leaders identify emerging developments such as machines that might need maintenance : to prevent costly breakdowns or  sudden shifts in customer or market conditions that might signal some action a company should take.
  • 34. • The internet of things, the physical world will become a networked information system— through sensors and actuators embedded in real physical objects and linked through wired and wireless networks via the internet protocol. • This holds special value for manufacturing: ➢The potential for connected physical systems to improve productivity in the production process and ➢The supply chain is huge. • Consider processes that govern themselves, where smart products can take corrective action to avoid damages and where individual parts are automatically replenished. • Such technologies already exist and could drive the fourth industrial revolution— following the steam engine, the conveyor belt (assembly line - think ford model t), and the first phase of it and automation technology.
  • 35. EXAMPLE 1 : AUTO INSURANCE • The first-order vector was a connected accelerometer offered to drivers : ➢ to improve their insurance rates based on proven “safe driving” habits. • Through this digital shadow, the insurance provider can make much better actuarial predictions than through the coarse-grained data they had before ➢age, ➢gender, and ➢ traffic violations. • This is interesting in the same way the blackberry was interesting - a basic capability adopted for basic business improvement.
  • 36. • The second-order vector is much stronger : ➢the ability to transform the insurance market to better meet the needs of customers while changing the rules of competition. ➢based on real-time driving information insurance companies can : ▪ move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange), ▪ bidding on drivers and ▪ providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive fast tomorrow? Pay a little more but don’t worry about your “permanent record”. • These outcomes are all based on tying the internet of things to predictive analytics.
  • 37. EXAMPLE 2 : HEALTH CARE • The first-order vector is similar, a wearable accelerometer offered to patients : ➢ To improve traceability of their compliance with their exercise prescription, ➢Enabling better outcomes for cardiac patients. ➢Unlike prescription refills, exercise compliance has been untraceable before, so this digital shadow is a breakthrough for medicine. • Similar developments exist in digestible sensors within medications : ➢which activate only on contact with stomach acid, ➢providing higher truth and ➢better granularity than a monthly refill.
  • 38. • In second-order vector in healthcare ,the ability to combine multiple streams of information that were previously invisible has the potential to drive better health outcomes through provably higher patient compliance. • Sorting these data streams at scale will allow health providers and health insurance companies to rapidly iterate health protocols across a population of humans, augmenting human expertise with predictive analytics. • Outcome-based analysis based on predictive models built from data can reduce : ➢waste, ➢error rates, and ➢lawsuits while driving better margins. • Larger exchanges of this type of data will tend to : ➢ perform better, ➢creating a more effective market and ➢ a better pool of empirical research for science.
  • 39. EXAMPLE 3 : AUTO COMPANIES • They have installed thousands of "black boxes" inside their prototype and field testing vehicles to capture second by second data from the dozens of control units which manage today's automobiles. • These boxes simply plug into the vehicle's on-board diagnostic (obd) port which is typically located under the front dashboard of all cars. • They collect 500-750 different vehicle performance parameters that add up to terabytes of data in hours!
  • 40. • The intent of the automakers for installing these boxes is to collect data which their engineers can later analyze to fix bugs and improve on existing designs. • For example, one car manufacturer found out from this data that their minivan batteries would end up in a recall. ➢The problem was an underpowered alternator - it was not able to fully recharge the batteries because the most common drive cycle for this particular minivan was less than 3 miles. ➢As a result, there appeared to be a lot of complaints about dead batteries and the company was potentially facing the recall of millions of minivans which had this alternator. ➢The boxes collect information about driving cycles and this data was really useful in understanding the real reason behind the dead batteries. ➢The test vehicles which had short drive cycles were the ones which reported dead batteries! simply changing the alternator to higher capacity could fix the problem. ➢Now it was an easy fix to extend this solution to the entire fleet.
  • 41. ENDLESS OPPORTUNITY The opportunities are literally endless, ➢Ranging from early fault detection (predicting when a particular component is likely to fail) ➢To automatically adjusting driving route based on traffic pattern predictions. The ultimate test of predictive analytics in the internet of things is of course fully autonomous systems, such as : ➢the nissan car of 2020 or ➢ the google self driving car of today. In the end all autonomous systems will need the ability to build predictive capabilities - in other words, machines must learn machine learning!
  • 42. EXAMPLE 4 : GOOGLE’S SELF DRIVING CAR Google claims that their self-driving car of today has logged more than 300,000 miles with almost zero incidence of accidents. The one time a minor crash did occur was when the car was rear- ended by a human-driven car! So, when the technology is fully mature, it is not just parking valets who become obsolete, other higher paying professions such as automotive safety systems experts may also need to look for other options! Predictive analytics is the enabler that will make this happen.
  • 43. EXAMPLE 5 : JET AIRLINER • A jet airliner generates 20 terabytes of diagnostic data per hour of flight. • The average oil platform has 40,000 sensors, generating data 24/7. • M2M is now generating enormous volumes of data and is testing the capabilities of traditional database technologies. • To extract rich, real-time insight from the vast amounts of machine-generated data, companies will have to build a technology foundation with speed and scale because raw data, whatever the source, is only useful after it has been transformed into knowledge through analysis. • Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets to identify patterns and insights and can perform analysis at massive scale with precision even as machine-generated data grows beyond the petabyte scale
  • 44. FINDING RIGHT ANALYTICS DATABASE TECHNOLOGY • To find the right analytics database technology to capture, connect, and drive meaning from data, companies should consider the following requirements: ➢ Real-time Analysis : Businesses can’t afford for data to get stale. Data solutions need to : ▪ load quickly and easily, ▪ and must dynamically query, ▪ analyze, and ▪ communicate m2m information in real-time, without huge investments in it administration, support, and tuning. ➢Flexible Querying And Ad-hoc Reporting : When intelligence needs to change quickly, analytic tools can’t ▪ be constrained by data schemas that limit the number and ▪ type of queries that can be performed. This type of deeper analysis also cannot be constrained by tinkering or time- consuming manual configuration (such as indexing and managing data partitions) to create and change analytic queries.
  • 45. ➢Efficient Compression : Efficient data compression is key to enabling M2M data management within : ▪ A network node, ▪ Smart device, or ▪ Massive data center cluster. Better compression allows : ▪ For less storage capacity overall, ▪ As well as tighter data sampling and ▪ Longer historical data sets, ▪ Increasing the accuracy of query results. ➢Ease Of Use And Cost : Data analysis must be : ▪ Affordable, Easy-to-use, and ▪ Simple to implement in order to justify the investment. This demands low-touch solutions that are optimized to deliver : ▪ Fast analysis of large volumes of data, ▪ With minimal hardware, Administrative effort, and ▪ Customization needed to set up or ▪ Change query and reporting parameters.
  • 46. EXAMPLE 6 : UNION PACIFIC RAILROAD • The railroad is using sensor and analytics technologies to predict and prevent train derailments. • For example, the company has placed infrared sensors every 20 miles along its tracks to gather 20 million temperature readings of train wheels each day, looking for the overheating that signals impending failure. • Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels. • Data from such physical measurements is sent via fiber-optic lines to Union Pacific’s data centers. • Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing Union Pacific experts to determine, within minutes of capturing the data, whether a driver should pull a train over for inspection or reduce its speed until it reaches the next station for repair.
  • 47. HOW TO ANALYZE MACHINE AND SENSOR DATA • This example captures and refines data from heating, ventilation, and air-conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks Data Platform, then analyzes the refined sensor data to maintain optimal building temperatures. • Sensor data - A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the “Internet of Things.” • Potential uses of sensor data (a small refinement sketch follows below): ➢To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break. ➢To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital signs during recovery from a medical procedure.
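As a hedged illustration of the refinement step, the sketch below flags HVAC readings that drift outside a target comfort band. The record layout, field names, and thresholds are assumptions for illustration, not taken from the Hortonworks tutorial.

```java
// Hypothetical HVAC refinement: flag readings outside a comfort band.
import java.util.List;

public class HvacCheck {
    static final double TARGET = 21.0, TOLERANCE = 2.0;

    record Reading(String buildingId, double actualTemp) {}

    // A reading needs attention when it deviates beyond the tolerance.
    static boolean outOfBand(Reading r) {
        return Math.abs(r.actualTemp() - TARGET) > TOLERANCE;
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
                new Reading("building-1", 21.4),
                new Reading("building-7", 25.3));
        readings.stream()
                .filter(HvacCheck::outOfBand)
                .forEach(r -> System.out.println(
                        r.buildingId() + " outside comfort band: "
                        + r.actualTemp() + " C"));
    }
}
```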
  • 49. OUTLINE • Architecture of the Hadoop Distributed File System • Hadoop usage • Ideas for Hadoop-related research
  • 50. HADOOP, WHY? • Need to process multi-petabyte datasets • Expensive to build reliability into each application • Nodes fail every day – failure is expected rather than exceptional – the number of nodes in a cluster is not constant • Need common infrastructure – efficient, reliable, open source (Apache License) • The above goals are the same as Condor’s, but • workloads are I/O-bound, not CPU-bound
  • 51. HIVE, WHY? • Need a multi-petabyte warehouse • Files are insufficient data abstractions • Need tables, schemas, partitions, indices • SQL is highly popular • Need for an open data format – RDBMSs have closed data formats – need for a flexible schema • Hive is a Hadoop subproject!
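As a minimal sketch of how an analyst might reach Hive from application code, the snippet below issues a SQL-like query over JDBC against HiveServer2. The table, columns, and connection details are hypothetical placeholders; real deployments will differ.

```java
// A minimal sketch of querying Hive from Java over JDBC (HiveServer2).
// Table and column names are hypothetical; host/port are common defaults.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Tables, schemas, and partitions feel like SQL even though
            // the underlying data lives in HDFS files.
            ResultSet rs = stmt.executeQuery(
                "SELECT building_id, AVG(temp) AS avg_temp "
                + "FROM hvac_readings WHERE dt = '2015-06-01' "
                + "GROUP BY building_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}
```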
  • 52. HADOOP What is Hadoop?  It is a framework for running applications on large clusters of commodity hardware, used to store and process huge volumes of data. Hadoop includes:  HDFS, a distributed filesystem  Map/Reduce - Hadoop implements this programming model; it is an offline (batch) computing engine Concept: moving computation to the data is more efficient than moving large data to the computation
  • 53. • Data-intensive applications with petabytes of data • Web pages: 20+ billion web pages x 20 KB = 400+ terabytes • One computer can read 30-35 MB/sec from disk: ~four months to read the web • The same problem with 1000 machines: < 3 hours • Difficulties with a large number of machines: • communication and coordination • recovering from machine failure • status reporting • debugging • optimization • locality
  • 54. WHO USES HADOOP? • Facebook • Amazon/A9 • Google • IBM • New York Times • Yahoo! • PowerSet
  • 55. COMMODITY HARDWARE Typically a 2-level architecture – Nodes are commodity PCs – 30-40 nodes per rack – Uplink from the rack is 3-4 gigabit – Rack-internal is 1 gigabit
  • 56. GOALS OF HDFS • Very Large Distributed File System – 10K nodes, 100 million files, 10 PB • Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detects failures and recovers from them • Optimized for Batch Processing – Data locations are exposed so that computations can move to where the data resides – Provides very high aggregate bandwidth • Runs in user space on heterogeneous OSes
  • 57. HDFS ARCHITECTURE [Diagram: a Client interacting with the NameNode and the DataNodes; a Secondary NameNode alongside the NameNode; DataNodes reporting cluster membership] NameNode: maps a file to a file-id and a list of blocks on DataNodes. DataNode: maps a block-id to a physical location on disk. SecondaryNameNode: performs periodic merges of the transaction log.
  • 58. DISTRIBUTED FILE SYSTEM • Single Namespace for entire cluster • Data Coherency – Write-once-read-many access model – Client can only append to existing files • Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes • Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
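A minimal sketch of this write-once-read-many model through the standard HDFS client API (org.apache.hadoop.fs). The NameNode URI and file path below are placeholders; the point is that the client consults the NameNode for block locations but streams data directly to and from DataNodes.

```java
// A minimal sketch of the HDFS client API. URI and path are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/sensors/readings.txt");

        // Write once: the file is split into blocks (typically 128 MB)
        // and each block is replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("building-1,2015-06-01T12:00,21.5\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read many: the client asks the NameNode where the blocks live,
        // then reads the bytes directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```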
  • 60. HDFS – HADOOP DISTRIBUTED FILE SYSTEM
  • 61. HADOOP CLUSTER ARCHITECTURE • Map/Reduce Master “JobTracker” • Accepts MR jobs submitted by users • Assigns Map and Reduce tasks to TaskTrackers • Monitors task and TaskTracker status; re-executes tasks upon failure • Map/Reduce Slaves “TaskTrackers” • Run Map and Reduce tasks upon instruction from the JobTracker • Manage storage and transmission of intermediate output
  • 62. NAMENODE METADATA • Metadata in Memory – The entire metadata is held in main memory – No demand paging of metadata • Types of Metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor • A Transaction Log – Records file creations, file deletions, etc.
  • 63. DATANODE • A Block Server – Stores data in the local file system – Stores meta-data of a block – Serves data and meta-data to Clients • Block Report – Periodically sends a report of all existing blocks to the NameNode • Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 64. DATA MODEL • Files are broken into large blocks – Typically 128 MB block size – Blocks are replicated for reliability • One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are randomly placed • Understands rack locality – Data placement is exposed so that computation can be migrated to the data • Client talks to both the NameNode and the DataNodes – Data is not sent through the NameNode; clients access data directly from DataNodes – Throughput of the file system scales nearly linearly with the number of nodes
  • 65. DATA CORRECTNESS • Uses checksums to validate data – CRC32 • File creation – Client computes a checksum per 512 bytes – DataNode stores the checksums • File access – Client retrieves the data and checksums from the DataNode – If validation fails, the client tries other replicas
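The sketch below illustrates the idea of per-chunk CRC32 checksums using the 512-byte chunk size mentioned above. It shows the concept only and is not the HDFS implementation.

```java
// Illustrative per-chunk checksum validation in the spirit of HDFS,
// which checksums each 512-byte chunk of a file with CRC32.
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumExample {
    static final int CHUNK = 512;

    // Compute one CRC32 value per 512-byte chunk (as on file creation).
    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute and compare; a mismatch means the client
    // should fall back to another replica.
    static boolean valid(byte[] data, long[] expected) {
        return Arrays.equals(checksums(data), expected);
    }
}
```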
  • 66. NAMENODE FAILURE • The NameNode is a single point of failure – newer versions add a SecondaryNameNode (for checkpointing, not automatic failover) • The transaction log is stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) • A real HA solution still needs to be developed
  • 67. HADOOP MAP/REDUCE • The Map-Reduce programming model – Framework for distributed processing of large data sets – Pluggable user code runs in a generic framework • Common design pattern in data processing: cat * | grep | sort | uniq -c | cat > file input | map | shuffle | reduce | output • Natural for: – Log processing – Web search indexing – Ad-hoc queries (a minimal WordCount sketch follows below)
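The canonical example of this model is WordCount. The sketch below uses the Hadoop MapReduce Java API, with the map, shuffle, and reduce stages matching the pipeline above; input and output paths are passed as arguments.

```java
// Minimal WordCount with the Hadoop MapReduce API: map emits (word, 1),
// the shuffle groups by word, and reduce sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word after the shuffle.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```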
  • 68. EXAMPLE - HADOOP AT FACEBOOK • Production cluster • 4800 cores, 600 machines, 16GB per machine – April 2009 • 8000 cores, 1000 machines, 32 GB per machine – July 2009 • 4 SATA disks of 1 TB each per machine • 2 level network hierarchy, 40 machines per rack • Total cluster size is 2 PB, projected to be 12 PB in Q3 2009 • Test cluster • 800 cores, 16GB each
  • 69. DATA FLOW [Diagram: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster, with Oracle RAC and MySQL alongside]
  • 70. HADOOP AND HIVE USAGE • Statistics : • 15 TB uncompressed data ingested per day • 55TB of compressed data scanned per day • 3200+ jobs on production cluster per day • 80M compute minutes per day • Barrier to entry is reduced: • 80+ engineers have run jobs on Hadoop platform • Analysts (non-engineers) starting to use Hadoop through Hive
  • 71. BIG DATA LEARNING PATH • 1. Understand the differences between various data-handling techniques such as OLTP, OLAP, Data Mining, Data Warehouse, Data Mart, etc. • 2. Understand various visualization techniques such as Bar Chart, Heat Map, Tree Map, Density Map, etc. • 3. Understand Data Mining / Analytics algorithms. • 4. Identify various sources of data and the data elements that need to be focused on for analytics. • 6. Understand data-quality checks and ensure data quality; without quality data, analytics may deviate considerably from the actual scenario.
  • 72. BIG DATA LEARNING PATH • 7. Answer why you need a distributed system. • 8. Study the various data-handling techniques provided by NoSQL databases such as MongoDB, Cassandra, etc. • 9. Find out how Hadoop or related Big Data techniques can be used for distributed data through horizontal-scalability techniques. • 10. Finalize the algorithms that will run on top of this data, and identify tools or develop programs for these algorithms. • 11. Use the visualization techniques learnt in step 2 above to present the output. • 12. Keep making the tool/program more and more intelligent as a continuous process. Note: A combination of relational and NoSQL databases may be required for performing the required analytics and/or generating visualizations.
  • 73. CONCLUSION • Why commodity hardware? Because it is cheaper, and the system is designed to tolerate faults. • Why HDFS? Network bandwidth vs. seek latency: large blocks and streaming access favor high throughput. • Why the MapReduce programming model? It enables parallel programming over large data sets, moves computation to the data, and uses a single combined compute + data cluster.
  • 74. MORE IDEAS FOR FURTHER DISCUSSION AND RESEARCH • Hadoop log analysis • Failure prediction and root-cause analysis • Hadoop data rebalancing • Based on access patterns and load • Best use of flash memory? • Design of new topologies based on commodity hardware
  • 75. USEFUL LINKS •HDFS Design: • http://hadoop.apache.org/core/docs/current/hdfs_design.html •Hadoop API: • http://hadoop.apache.org/core/docs/current/api/ •Hive: • http://hadoop.apache.org/hive/
  • 76. THANK YOU DATA SCIENTISTS !