BIG DATA AND ANALYTICS CONCEPTS
BOHITESH MISRA
CHIEF TECHNOLOGY OFFICER
IT STARTUPS
BOHITESH.MISRA@GMAIL.COM
BUILDING A CONNECTED AND SMART ECOSYSTEM:
A ROADMAP TO BUSINESS NIRVANA
IoT → Big Data → Analytics
• The Internet of Things connects all manner of end-points, a treasure trove of data.
• Network and device proliferation enable access to a massive and growing amount of traditionally siloed information.
• Analytics and business intelligence tools empower decision makers as never before by extracting and presenting meaningful information in real time, helping us be more predictive than reactive.
GARTNER HYPE CYCLE - 2015
CONTENT
1. What is Big Data
2. Characteristics of Big Data
3. Why Big Data
4. How It Is Different
5. Big Data Sources
6. Tools Used in Big Data
7. Applications of Big Data
8. Risks of Big Data
9. Benefits of Big Data
10. How Big Data Impacts IT
11. Future of Big Data
BIG DATA
• Big Data may well be the Next Big Thing in the IT world.
• The first organizations to embrace big data were online
and startup firms. Firms like Google, eBay, LinkedIn, and
Facebook were built around big data from the beginning.
• Like many new information technologies, big data can bring
about dramatic cost reductions, substantial improvements in
the time required to perform a computing task, or new
product and service offerings.
WHAT IS BIG DATA?
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• Being bigger, it requires different approaches, techniques, tools and architecture
• It aims to solve new problems, or old problems in a better way
• Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
THREE CHARACTERISTICS OF BIG DATA
• Volume – data quantity
• Velocity – data speed
• Variety – data types
BIG DATA - VOLUME
•A typical PC might have had 10 gigabytes of storage in 2000.
•Today, Facebook ingests 500 terabytes of new data every day.
•A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
BIG DATA - VELOCITY
• Clickstreams and ad impressions capture user behavior at millions of events per second
• High-frequency stock trading algorithms reflect market changes within microseconds
• Machine-to-machine processes exchange data between billions of devices
• Infrastructure and sensors generate massive log data in real time
BIG DATA - VARIETY
• Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
• Traditional database systems were designed to address
smaller volumes of structured data, fewer updates or a
predictable, consistent data structure.
• Big Data analysis must therefore handle many different types of data
STORING BIG DATA
❖Analyzing your data characteristics
• Selecting data sources for analysis
• Eliminating redundant data
• Establishing the role of NoSQL
❖Overview of Big Data stores
• Data models: key value, graph, document, column-family
• Hadoop Distributed File System
• HBase
• Hive
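To make the data-model bullet above concrete, here is a minimal sketch contrasting the key-value and document models with plain Python dicts. The records, keys, and field names are invented for illustration.

```python
# Illustrative only: the same customer record under two NoSQL data models.

# Key-value model (e.g. Riak): an opaque value addressed by a single key.
kv_store = {}
kv_store["customer:1001"] = '{"name": "Asha", "city": "Kolkata"}'  # value is an opaque blob

# Document model (e.g. MongoDB, CouchDB): the store understands the structure,
# so individual fields can be queried and indexed.
doc_store = [
    {"_id": 1001, "name": "Asha", "city": "Kolkata", "orders": [501, 502]},
    {"_id": 1002, "name": "Ravi", "city": "Pune", "orders": []},
]
in_kolkata = [d for d in doc_store if d["city"] == "Kolkata"]
print(in_kolkata)
```

The graph and column-family models extend the same idea: the store's native structure (edges, column families) determines which queries are cheap.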
PROCESSING BIG DATA
❖Integrating disparate data stores
• Mapping data to the programming framework
• Connecting and extracting data from storage
• Transforming data for processing
• Subdividing data in preparation for Hadoop MapReduce
❖Employing Hadoop MapReduce
• Creating the components of Hadoop MapReduce jobs
• Distributing data processing across server farms
• Executing Hadoop MapReduce jobs
• Monitoring the progress of job flows
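As a concrete illustration of "creating the components of Hadoop MapReduce jobs", below is a hedged sketch of word count written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read stdin and write stdout. The script names and input/output paths are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts mapper output by key, so identical
# words arrive on consecutive lines; sum the counts per word.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

An illustrative submission would look like `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths assumed).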
WHY BIG DATA
•Facebook generates 10 TB of data daily
•Twitter generates 7 TB of data daily
•IBM claims 90% of today’s stored data was generated in just the last two years.
BIG DATA SOURCES
• Users
• Applications
• Systems
• Sensors
• Large and growing files (Big Data files)
DATA GENERATION POINTS - EXAMPLES
Mobile Devices
Readers/Scanners
Science facilities
Microphones
Cameras
Social Media
Programs/ Software
BIG DATA ANALYTICS
• Examining large amounts of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
TYPES OF TOOLS USED IN BIG DATA
• Where is processing hosted?
• Distributed Servers / Cloud
• Where is data stored?
• Distributed Storage
• What is the programming model?
• Distributed Processing (e.g. MapReduce)
• How is data stored & indexed?
• High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
• Analytic / Semantic Processing
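To make the "high-performance schema-free databases" bullet concrete, here is a minimal sketch using the PyMongo driver. It assumes a MongoDB server on localhost; the database, collection, and field names are invented for illustration.

```python
# Sketch only: requires `pip install pymongo` and a MongoDB server on localhost.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["analytics"]["click_events"]  # database/collection names are illustrative

# Schema-free: documents in the same collection may carry different fields.
events.insert_one({"user": "u42", "page": "/home", "ts": "2015-06-01T10:00:00"})
events.insert_one({"user": "u42", "page": "/cart", "device": "mobile"})

# Ad-hoc query without any predefined schema.
for doc in events.find({"user": "u42"}):
    print(doc)
```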
APPLICATIONS OF BIG DATA ANALYTICS
• Homeland Security
• Smarter Healthcare: integrated and smart patient care systems and processes
• Retail & Multi-channel Sales: highly personalized customer experience across channels and devices
• Telecom
• Manufacturing: intelligent interconnectivity across the enterprise for enhanced control, speed and efficiency
• Traffic Control
• Trading Analytics
• Search Quality
• Log Analysis
• Finance & Banking: seamless customer experience across all banking channels
HOW BIG DATA IMPACTS IT
• Big data is a disruptive force, presenting IT organizations with opportunities as well as challenges.
• By 2016 there will be 4.4 million IT jobs in Big Data, 1.9 million of them in the US alone.
• India will require a minimum of 1 lakh (100,000) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.
POTENTIAL VALUE OF BIG DATA
• $300 billion potential annual value to US health care.
• $600 billion potential annual consumer surplus from using personal location data.
• 60% potential increase in retailers’ operating margins.
BENEFITS OF BIG DATA
•Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse. It’s about the ability to make better decisions and take meaningful actions at the right time.
•Fast forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
•Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.
BENEFITS OF BIG DATA
• Our newest research finds that organizations are using big data to target
customer-centric outcomes, tap into internal data and build a better information
ecosystem.
• Big Data is already an important part of the $64 billion database and data analytics market
• It offers commercial opportunities comparable in scale to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
FUTURE OF BIG DATA
• An estimated $15 billion has been spent on software firms specializing only in data management and analytics.
• This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
• The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
INDIA – BIG DATA
• Gaining traction and market share
• Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%)
• The current market size is $200 million, expected to reach $1 billion by 2015
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals
BIG DATA ANALYTICS TECHNOLOGIES
NoSQL : non-relational or at least non-SQL database solutions
such as HBase (also a part of the Hadoop ecosystem),
Cassandra, MongoDB, Riak, CouchDB, and many others.
Hadoop: an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other components
THE FOUR PILLARS FOR AN EFFECTIVE BIG DATA STRATEGY
• Storage
• User Experience
• Digital Intelligence and Analytics
• Content Discovery and Management
Just these segments account for more than $10 billion in served, addressable markets.
MOTIVATION FOR SPECIALIZED BIG DATA SYSTEMS
• Cost of data storage is dropping, but rate of data capture is soaring
• Sources: online/digital, communications, messaging, usage, transactions…
• Furthermore, need for real-time data-driven insights is also more urgent
• Traditional data warehouses and RDBMS systems cannot keep up
• They are unable to capture, manage and optimize the volume and diversity of data
marketers are seeking to harness today
• Structured, unstructured, and semi-structured data are all essential ingredients in
today’s marketing mix; traditional systems cannot handle this
• Big Data systems: cluster-based, commodity-priced, distributed computing database management systems
• Most often based on Hadoop, but usable without MapReduce programming skills
• Key features: linear scalability, parallel computing, node redundancy, and
centralized access to data
• Server clusters behave like a massive single mainframe: What traditional
databases do in months, a Big Data management system can do in hours
INTERNET OF THINGS
&
PREDICTIVE ANALYTICS
INTERNET OF THINGS
• Each “thing” or connected device is part of the digital shadow of a person
• For there to be a market in the internet of things, two things must be true:
1) The “thing” in question must provide utility to the human, and
2) The digital shadow must provide value to an enterprise.
MARKET
• The “market” is made up of many parts :
➢from wearables to drivables to home and
➢industrial sensors and controllers.
• Each part is made up of segments :
➢Innovators,
➢Early adopters,
➢Pragmatists,
➢Conservatives, and
➢Laggards across many industries.
PREDICTIVE ANALYTICS
• From the data streams that implement the “digital shadows” of people, we
can use predictive analytics to understand their needs and behavior better
than ever before.
• Every new dimension of data increases the predictive power, enabling
enterprises to answer the question “what does the human want?”
INTERNET OF THINGS & PREDICTIVE ANALYTICS
• Transforming the internet of things and its sibling, predictive analytics, to be
programmable by the same labor pool that has developed the apps which drove
the mobile revolution makes basic economic sense.
• The data generated by the internet of things is coupled with :
➢data analysis,
➢data discovery tools, and
➢techniques to help business leaders identify emerging developments, such as machines that might need maintenance (to prevent costly breakdowns), or sudden shifts in customer or market conditions that might signal some action a company should take.
• With the internet of things, the physical world becomes a networked information system: sensors and actuators embedded in real physical objects, linked through wired and wireless networks via the Internet Protocol.
• This holds special value for manufacturing:
➢The potential for connected physical systems to improve productivity in the production process and
➢the supply chain is huge.
• Consider processes that govern themselves, where smart products can take corrective action to avoid damage and where individual parts are automatically replenished.
• Such technologies already exist and could drive the fourth industrial revolution, following the steam engine, the conveyor belt (the assembly line; think Ford Model T), and the first phase of IT and automation technology.
EXAMPLE 1 : AUTO INSURANCE
• The first-order vector was a connected accelerometer offered to drivers :
➢ to improve their insurance rates based on proven “safe driving” habits.
• Through this digital shadow, the insurance provider can make much better
actuarial predictions than through the coarse-grained data they had before
➢age,
➢gender, and
➢ traffic violations.
• This is interesting in the same way the BlackBerry was interesting: a basic capability adopted for basic business improvement.
• The second-order vector is much stronger :
➢the ability to transform the insurance market to better meet the needs of customers while
changing the rules of competition.
➢Based on real-time driving information, insurance companies can :
▪ move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange),
▪ bidding on drivers and
▪ providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive fast
tomorrow? Pay a little more but don’t worry about your “permanent record”.
• These outcomes are all based on tying the internet of things to predictive
analytics.
EXAMPLE 2 : HEALTH CARE
• The first-order vector is similar, a wearable accelerometer offered to patients :
➢ To improve traceability of their compliance with their exercise prescription,
➢Enabling better outcomes for cardiac patients.
➢Unlike prescription refills, exercise compliance has been untraceable before, so this digital
shadow is a breakthrough for medicine.
• Similar developments exist in digestible sensors within medications :
➢which activate only on contact with stomach acid,
➢providing higher truth and
➢better granularity than a monthly refill.
• In the second-order vector in healthcare, the ability to combine multiple streams of information that were previously invisible has the potential to drive better health outcomes through provably higher patient compliance.
• Sorting these data streams at scale will allow health providers and health insurance
companies to rapidly iterate health protocols across a population of humans, augmenting
human expertise with predictive analytics.
• Outcome-based analysis based on predictive models built from data can reduce :
➢waste,
➢error rates, and
➢lawsuits while driving better margins.
• Larger exchanges of this type of data will tend to :
➢ perform better,
➢creating a more effective market and
➢ a better pool of empirical research for science.
EXAMPLE 3 : AUTO COMPANIES
• They have installed thousands of "black boxes" inside their prototype and field-testing vehicles to capture second-by-second data from the dozens of control units which manage today's automobiles.
• These boxes simply plug into the vehicle's on-board diagnostic (OBD) port, which is typically located under the front dashboard.
• They collect 500-750 different vehicle performance parameters that add up to
terabytes of data in hours!
• The intent of the automakers for installing these boxes is to collect data which their
engineers can later analyze to fix bugs and improve on existing designs.
• For example, one car manufacturer found out from this data that their minivan batteries
would end up in a recall.
➢The problem was an underpowered alternator - it was not able to fully recharge the batteries
because the most common drive cycle for this particular minivan was less than 3 miles.
➢As a result, there appeared to be a lot of complaints about dead batteries and the company was
potentially facing the recall of millions of minivans which had this alternator.
➢The boxes collect information about driving cycles and this data was really useful in understanding
the real reason behind the dead batteries.
➢The test vehicles which had short drive cycles were the ones which reported dead batteries! Simply changing to a higher-capacity alternator would fix the problem.
➢Now it was an easy fix to extend this solution to the entire fleet.
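A hedged sketch of the kind of analysis described above: correlating drive-cycle length with dead-battery reports across test vehicles. The data and threshold are invented for illustration; the real analysis runs over terabytes of OBD logs.

```python
# Illustrative data: (vehicle_id, average drive-cycle miles, dead-battery report?)
drive_cycles = [
    ("V1", 2.1, True), ("V2", 2.8, True), ("V3", 12.4, False),
    ("V4", 1.9, True), ("V5", 25.0, False), ("V6", 2.5, False),
]

SHORT_CYCLE_MILES = 3.0  # the problematic cycle length from the example above

short = [v for v in drive_cycles if v[1] < SHORT_CYCLE_MILES]
rest = [v for v in drive_cycles if v[1] >= SHORT_CYCLE_MILES]
rate_short = sum(v[2] for v in short) / len(short)
rate_rest = sum(v[2] for v in rest) / len(rest)
print(f"dead-battery rate, short cycles: {rate_short:.0%}; others: {rate_rest:.0%}")
```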
ENDLESS OPPORTUNITY
The opportunities are literally endless,
➢Ranging from early fault detection (predicting when a particular component
is likely to fail)
➢To automatically adjusting driving route based on traffic pattern
predictions.
The ultimate test of predictive analytics in the internet of things is of course fully
autonomous systems, such as :
➢the Nissan car of 2020 or
➢the Google self-driving car of today.
In the end all autonomous systems will need the ability to build predictive
capabilities - in other words, machines must learn machine learning!
EXAMPLE 4 : GOOGLE’S SELF DRIVING CAR
Google claims that their self-driving car of today has logged more
than 300,000 miles with almost zero incidence of accidents.
The one time a minor crash did occur was when the car was rear-ended by a human-driven car!
So, when the technology is fully mature, it is not just parking valets who become obsolete; other higher-paying professions, such as automotive safety systems experts, may also need to look for other options!
Predictive analytics is the enabler that will make this happen.
EXAMPLE 5 : JET AIRLINER
• A jet airliner generates 20 terabytes of diagnostic data per hour of flight.
• The average oil platform has 40,000 sensors, generating data 24/7.
• M2M is now generating enormous volumes of data and is testing the capabilities of
traditional database technologies.
• To extract rich, real-time insight from the vast amounts of machine-generated data,
companies will have to build a technology foundation with speed and scale because raw
data, whatever the source, is only useful after it has been transformed into knowledge
through analysis.
• Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets to identify patterns and insights, and can perform analysis at massive scale with precision even as machine-generated data grows beyond the petabyte scale.
FINDING THE RIGHT ANALYTICS DATABASE TECHNOLOGY
• To find the right analytics database technology to capture, connect, and drive meaning from data, companies should consider the following requirements:
➢Real-time Analysis : Businesses can’t afford for data to get stale. Data solutions need to :
▪ load quickly and easily, and
▪ dynamically query, analyze, and communicate M2M information in real time,
without huge investments in IT administration, support, and tuning.
➢Flexible Querying and Ad-hoc Reporting : When intelligence needs change quickly, analytic tools can’t be constrained by data schemas that limit the number and type of queries that can be performed. This type of deeper analysis also cannot be constrained by tinkering or time-consuming manual configuration (such as indexing and managing data partitions) to create and change analytic queries.
➢Efficient Compression : Efficient data compression is key to enabling M2M data management within :
▪ a network node,
▪ a smart device, or
▪ a massive data center cluster.
Better compression allows for less storage capacity overall, as well as tighter data sampling and longer historical data sets, increasing the accuracy of query results.
➢Ease of Use and Cost : Data analysis must be affordable, easy to use, and simple to implement in order to justify the investment. This demands low-touch solutions that are optimized to deliver fast analysis of large volumes of data, with minimal hardware, administrative effort, and customization needed to set up or change query and reporting parameters.
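To illustrate the Efficient Compression point above: repetitive M2M telemetry compresses extremely well, which is easy to demonstrate with Python's built-in zlib. The payload here is simulated.

```python
import zlib

# Simulated M2M payload: highly repetitive sensor log lines.
readings = "".join(f"sensor_07,temp,21.{i % 10}\n" for i in range(10000)).encode()

compressed = zlib.compress(readings, 9)  # maximum compression level
ratio = len(readings) / len(compressed)
print(f"raw: {len(readings)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}x")  # repetitive telemetry often compresses 10x or more
```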
EXAMPLE 6 : UNION PACIFIC RAILROAD
• The railroad is using sensor and analytics technologies to predict and prevent train derailments.
• For example, the company has placed infrared sensors every 20 miles along its tracks to gather 20 million temperature readings of train wheels each day, looking for signs of overheating, which is a sign of impending failure.
• Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels.
• Data from such physical measurements is sent via fiber optic lines to Union Pacific’s data centers.
• Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing Union Pacific experts to determine, within minutes of capturing the data, whether a driver should pull a train over for inspection or reduce its speed until it reaches the next station to be repaired.
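A toy sketch of the pattern-matching idea: flag wheel-temperature readings that deviate sharply from the recent baseline. The window size, threshold, and data are invented; Union Pacific's actual algorithms are far more sophisticated.

```python
from collections import deque

def flag_overheating(readings, window=5, delta=30.0):
    """Yield indices whose temperature exceeds the rolling mean by `delta` degrees."""
    recent = deque(maxlen=window)
    for i, temp in enumerate(readings):
        if len(recent) == window and temp - sum(recent) / window > delta:
            yield i
        recent.append(temp)

wheel_temps = [80, 82, 81, 83, 82, 84, 130, 85]  # one anomalous spike
print(list(flag_overheating(wheel_temps)))       # -> [6]
```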
HOW TO ANALYZE MACHINE AND SENSOR DATA
• This example captures and refines data from heating, ventilation, and air conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks Data Platform, then analyzes the refined sensor data to maintain optimal building temperatures.
• Sensor data – A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the “internet of things.”
• Potential uses of sensor data – sensors can be used to collect data from many sources, such as:
➢To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break.
➢To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital statistics during recovery from a medical procedure.
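A hedged, Hadoop-free sketch of the refine-and-analyze step described above: compare each building's readings to its target temperature and report buildings drifting out of range. The field names, data, and tolerance are illustrative; the tutorial itself runs this at scale on the Hortonworks Data Platform.

```python
# Each record: (building_id, actual_temp_C, target_temp_C)
records = [
    ("B01", 22.5, 21.0), ("B01", 21.2, 21.0),
    ("B02", 18.0, 21.0), ("B02", 17.5, 21.0),
]

TOLERANCE = 1.5  # degrees C considered "in range"; illustrative

drift = {}
for building, actual, target in records:
    drift.setdefault(building, []).append(abs(actual - target))

for building, deltas in drift.items():
    avg = sum(deltas) / len(deltas)
    status = "OK" if avg <= TOLERANCE else "CHECK HVAC"
    print(f"{building}: mean deviation {avg:.1f} C -> {status}")
```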
APACHE HADOOP - HDFS
OUTLINE
• Architecture of Hadoop Distributed File System
• Hadoop usage
• Ideas for Hadoop related research
HADOOP, WHY?
• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Need common infrastructure
– Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor’s, but
• workloads are IO-bound, not CPU-bound
HIVE, WHY?
• Need a multi-petabyte warehouse
• Files are insufficient data abstractions
• Need tables, schemas, partitions, indices
• SQL is highly popular
• Need an open data format
– RDBMSs have a closed data format
– Need a flexible schema
• Hive is a Hadoop subproject!
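As an illustration of "tables, schemas, SQL" over Hadoop, here is a hedged sketch using the PyHive client. It assumes a reachable HiveServer2 on localhost; the table and column names are invented.

```python
# Sketch only: requires `pip install pyhive` and a reachable HiveServer2.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()

# Hive exposes files in HDFS as tables, so analysts can use familiar SQL.
cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs          -- illustrative table over files in HDFS
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cur.fetchall():
    print(page, hits)
```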
HADOOP
What is Hadoop?
 It is a framework for running applications on large clusters of commodity hardware, designed to store and process huge volumes of data
Hadoop includes:
 HDFS, a distributed filesystem
 Map/Reduce, an offline computing engine that implements this programming model on top of HDFS
Concept
Moving computation is more efficient than moving large data
• Data-intensive applications with petabytes of data
• Web pages: 20+ billion web pages x 20 KB = 400+ terabytes
• One computer can read 30-35 MB/sec from disk, so it would take about four months to read the web
• The same problem with 1,000 machines: < 3 hours
• Difficulties with a large number of machines:
• communication and coordination
• recovering from machine failure
• status reporting
• debugging
• optimization
• locality
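These back-of-envelope figures check out; a quick worked version using the slide's own numbers (decimal units assumed):

```python
pages = 20e9                      # 20+ billion web pages
page_size_bytes = 20_000          # ~20 KB each
total = pages * page_size_bytes   # ~4e14 bytes, i.e. 400+ terabytes
print(f"corpus: ~{total / 1e12:.0f} TB")

read_rate = 35e6                  # one disk reads ~30-35 MB/sec
days = total / read_rate / 86_400
print(f"one machine: ~{days:.0f} days (~4 months)")
print(f"1000 machines: ~{days * 24 / 1000:.1f} hours")
```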
WHO USES HADOOP?
• Facebook
• Amazon/A9
• Google
• IBM
• New York Times
• Yahoo!
• PowerSet
COMMODITY HARDWARE
Typically a 2-level architecture
– Nodes are commodity PCs
– 30-40 nodes per rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
GOALS OF HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where data
resides
– Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
HDFS ARCHITECTURE
Components: Client, NameNode, Secondary NameNode, and DataNodes (which report cluster membership to the NameNode).
NameNode : maps a file to a file-id and a list of blocks on DataNodes
DataNode : maps a block-id to a physical location on disk
SecondaryNameNode : periodic merge of the transaction log
DISTRIBUTED FILE SYSTEM
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
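A toy model of the block scheme above: split a file into 128 MB blocks and assign each block three replicas. The round-robin placement here is a simplification for illustration, not HDFS's real rack-aware policy.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024                    # 128 MB, as above
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # illustrative cluster

def place_blocks(file_size):
    """Return {block_id: [replica datanodes]} for a file -- toy round-robin."""
    n_blocks = -(-file_size // BLOCK_SIZE)         # ceiling division
    ring = itertools.cycle(datanodes)
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}

print(place_blocks(300 * 1024 * 1024))  # a 300 MB file -> 3 blocks, 3 replicas each
```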
HDFS – HADOOP DISTRIBUTED FILE SYSTEM
HADOOP CLUSTER ARCHITECTURE
• Map/Reduce master: the “JobTracker”
• Accepts MR jobs submitted by users
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors task and TaskTracker status; re-executes tasks upon failure
• Map/Reduce slaves: the “TaskTrackers”
• Run Map and Reduce tasks upon instruction from the JobTracker
• Manage storage and transmission of intermediate output
NAMENODE METADATA
• Metadata in memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A transaction log
– Records file creations, file deletions, etc.
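A toy in-memory model of the metadata described above, with a transaction log appended on each mutation. The structure is simplified for illustration and is not the NameNode's actual layout.

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    blocks: list                 # ordered block ids for this file
    replication: int = 3
    ctime: str = "2015-01-01"    # file attribute, e.g. creation time

@dataclass
class NameNode:
    files: dict = field(default_factory=dict)            # path -> FileMeta
    block_locations: dict = field(default_factory=dict)  # block id -> [DataNodes]
    txlog: list = field(default_factory=list)            # records creations/deletions

    def create(self, path, blocks):
        self.files[path] = FileMeta(blocks=list(blocks))
        self.txlog.append(("CREATE", path))

nn = NameNode()
nn.create("/logs/2015-06-01.log", blocks=[101, 102])
nn.block_locations[101] = ["dn1", "dn3", "dn4"]
print(nn.files, nn.txlog)
```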
DATANODE
• A Block Server
– Stores data in the local file system
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
DATA MODEL
• Files are broken into large blocks
– Typically 128 MB block size
– Blocks are replicated for reliability
• One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are randomly placed
• Understands rack locality
– Data placement is exposed so that computation can be migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the NameNode; clients access data directly from DataNodes
– Throughput of the file system scales nearly linearly with the number of nodes
DATA CORRECTNESS
• Uses checksums to validate data
– Uses CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
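A sketch of the scheme just described: CRC32 over 512-byte chunks, computed at write time and verified at read time. Python's zlib provides CRC32; the chunking mirrors the slide, not HDFS's exact on-disk format.

```python
import zlib

CHUNK = 512  # bytes per checksum, as above

def checksums(data: bytes):
    """CRC32 per 512-byte chunk -- what the client computes at write time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored):
    """At read time: recompute and compare; a mismatch means try another replica."""
    return checksums(data) == stored

block = b"x" * 1500
stored = checksums(block)
corrupted = block[:600] + b"?" + block[601:]     # flip one byte in chunk 1
print(verify(block, stored), verify(corrupted, stored))  # True False
```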
NAMENODE FAILURE
• A single point of failure – new version has a secondary
namenode
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
HADOOP MAP/REDUCE
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in a generic framework
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
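The shell analogy above maps directly onto code; here is a minimal in-process imitation of input | map | shuffle | reduce | output for word count (single machine only, for illustration):

```python
from collections import defaultdict

lines = ["big data and analytics", "big data tools"]   # input (cat *)

# map: emit (key, value) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle: group values by key (what `sort` enables in the shell pipeline)
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# reduce: aggregate each group (the `uniq -c` step)
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # output (cat > file)
```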
EXAMPLE - HADOOP AT FACEBOOK
• Production cluster
• 4800 cores, 600 machines, 16GB per machine – April 2009
• 8000 cores, 1000 machines, 32 GB per machine – July 2009
• 4 SATA disks of 1 TB each per machine
• 2 level network hierarchy, 40 machines per rack
• Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
• 800 cores, 16GB each
DATA FLOW
Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL
HADOOP AND HIVE USAGE
• Statistics :
• 15 TB uncompressed data ingested per day
• 55TB of compressed data scanned per day
• 3200+ jobs on production cluster per day
• 80M compute minutes per day
• Barrier to entry is reduced:
• 80+ engineers have run jobs on Hadoop platform
• Analysts (non-engineers) starting to use Hadoop through Hive
BIG DATA LEARNING PATH
• 1. Understand the difference between various data handling techniques like OLTP, OLAP, Data Mining, Data Warehouse, Data Mart, etc.
• 2. Understand various visualization techniques like Bar Chart, Heat Map, Tree Map, Density Map, etc.
• 3. Understand Data Mining / Analytics algorithms.
• 4. Identify various sources of data and identify data elements that need to be focused upon for Analytics.
• 6. Understand data quality checks and ensure data quality. Without quality data, analytics may deviate a lot from the actual scenario.
BIG DATA LEARNING PATH
• 7. Answer why you need a distributed system.
• 8. Study various data handling techniques provided by NoSQL databases like MongoDB, Cassandra, etc.
• 9. Find out how Hadoop or related Big Data techniques can be used for distributed data by using horizontal scalability techniques.
• 10. Finalize algorithms that will run on top of this data, and identify tools or develop programs for these algorithms.
• 11. Use the visualization techniques learnt in step 2 above to present the output.
• 12. Keep making the tool/program more and more intelligent as a continuous process.
Note: A combination of Relational and NoSQL databases may be required for
performing required analytics and/or generating visualizations.
CONCLUSION
• Why commodity hardware? Because it is cheaper and designed to tolerate faults
• Why HDFS? Network bandwidth vs. seek latency
• Why the Map/Reduce programming model?
Parallel programming
Large data sets
Moving computation to data
A single compute + data cluster
MORE IDEAS FOR FURTHER DISCUSSION AND RESEARCH
• Hadoop Log Analysis
• Failure prediction and root cause analysis
• Hadoop Data Rebalancing
• Based on access patterns and load
• Best use of flash memory?
• Design new topology based on commodity hardware
USEFUL LINKS
•HDFS Design:
• http://hadoop.apache.org/core/docs/current/hdfs_design.html
•Hadoop API:
• http://hadoop.apache.org/core/docs/current/api/
•Hive:
• http://hadoop.apache.org/hive/
THANK YOU DATA SCIENTISTS !

More Related Content

PPT
Big Data
Vinayak Kamath
 
PDF
The ABCs of Treating Data as Product
DATAVERSITY
 
PDF
Big data introduction
Chirag Ahuja
 
PDF
Modern Data architecture Design
Kujambu Murugesan
 
PPTX
What is Big Data?
Bernard Marr
 
PPTX
Big Data PPT by Rohit Dubey
Rohit Dubey
 
PPTX
Big data architectures and the data lake
James Serra
 
PPSX
Applications of Big Data Analytics in Businesses
T.S. Lim
 
Big Data
Vinayak Kamath
 
The ABCs of Treating Data as Product
DATAVERSITY
 
Big data introduction
Chirag Ahuja
 
Modern Data architecture Design
Kujambu Murugesan
 
What is Big Data?
Bernard Marr
 
Big Data PPT by Rohit Dubey
Rohit Dubey
 
Big data architectures and the data lake
James Serra
 
Applications of Big Data Analytics in Businesses
T.S. Lim
 

What's hot (20)

PDF
Enterprise Architecture vs. Data Architecture
DATAVERSITY
 
PPTX
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
PDF
Data Catalog as the Platform for Data Intelligence
Alation
 
PPTX
Databricks for Dummies
Rodney Joyce
 
PPTX
DW Migration Webinar-March 2022.pptx
Databricks
 
PDF
AIOps - The next 5 years
Moogsoft
 
PPTX
A Step-by-Step Guide to Metadata Management
SaachiShankar
 
PPTX
Data Lake Overview
James Serra
 
PPS
Data Warehouse 101
PanaEk Warawit
 
PDF
Data Architecture Strategies: Data Architecture for Digital Transformation
DATAVERSITY
 
PPTX
Big data
Nausheen Hasan
 
PDF
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
PPTX
Data Architecture Brief Overview
Hal Kalechofsky
 
PDF
Data Governance Best Practices
DATAVERSITY
 
PDF
Data Governance Takes a Village (So Why is Everyone Hiding?)
DATAVERSITY
 
PDF
Data Quality Best Practices
DATAVERSITY
 
PPTX
Big Data
Rohit Jain
 
PPTX
Most Common Data Governance Challenges in the Digital Economy
Robyn Bollhorst
 
PDF
Improving Data Literacy Around Data Architecture
DATAVERSITY
 
Enterprise Architecture vs. Data Architecture
DATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Data Catalog as the Platform for Data Intelligence
Alation
 
Databricks for Dummies
Rodney Joyce
 
DW Migration Webinar-March 2022.pptx
Databricks
 
AIOps - The next 5 years
Moogsoft
 
A Step-by-Step Guide to Metadata Management
SaachiShankar
 
Data Lake Overview
James Serra
 
Data Warehouse 101
PanaEk Warawit
 
Data Architecture Strategies: Data Architecture for Digital Transformation
DATAVERSITY
 
Big data
Nausheen Hasan
 
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Data Architecture Brief Overview
Hal Kalechofsky
 
Data Governance Best Practices
DATAVERSITY
 
Data Governance Takes a Village (So Why is Everyone Hiding?)
DATAVERSITY
 
Data Quality Best Practices
DATAVERSITY
 
Big Data
Rohit Jain
 
Most Common Data Governance Challenges in the Digital Economy
Robyn Bollhorst
 
Improving Data Literacy Around Data Architecture
DATAVERSITY
 
Ad

Similar to Big data and analytics (20)

PPTX
Kartikey tripathi
KARTIKEY TRIPATHI
 
DOCX
Content1. Introduction2. What is Big Data3. Characte.docx
dickonsondorris
 
PPTX
ppt final.pptx
kalai75
 
PPTX
Big_Data_ppt[1] (1).pptx
TanguturiAvinash
 
PDF
Bigdatappt 140225061440-phpapp01
nayanbhatia2
 
PPTX
Special issues on big data
Vedanand Singh
 
PPTX
Big data ppt
Nasrin Hussain
 
PPTX
Presentation on Big Data
Md. Salman Ahmed
 
PPTX
big-data-8722-m8RQ3h1.pptx
VaishnavGhadge1
 
PPTX
Big Data ppt
Vivek Gautam
 
PPTX
bigdata.pptx
KammetaJoshna
 
PPT
big data
subhakirthi
 
PPTX
Big data
Mahmudul Alam
 
PPTX
Big data
SaraRao3
 
PPTX
Bigdata " new level"
Vamshikrishna Goud
 
PPTX
Big data Analytics
Guduru Lakshmi Kiranmai
 
PPTX
Bigdata
sayan sarker
 
PPTX
BigDataFinal.pptx
PentaTech
 
PPTX
Big data
madhavsolanki
 
Kartikey tripathi
KARTIKEY TRIPATHI
 
Content1. Introduction2. What is Big Data3. Characte.docx
dickonsondorris
 
ppt final.pptx
kalai75
 
Big_Data_ppt[1] (1).pptx
TanguturiAvinash
 
Bigdatappt 140225061440-phpapp01
nayanbhatia2
 
Special issues on big data
Vedanand Singh
 
Big data ppt
Nasrin Hussain
 
Presentation on Big Data
Md. Salman Ahmed
 
big-data-8722-m8RQ3h1.pptx
VaishnavGhadge1
 
Big Data ppt
Vivek Gautam
 
bigdata.pptx
KammetaJoshna
 
big data
subhakirthi
 
Big data
Mahmudul Alam
 
Big data
SaraRao3
 
Bigdata " new level"
Vamshikrishna Goud
 
Big data Analytics
Guduru Lakshmi Kiranmai
 
Bigdata
sayan sarker
 
BigDataFinal.pptx
PentaTech
 
Big data
madhavsolanki
 
Ad

More from Bohitesh Misra, PMP (10)

PDF
Innovation in enterpreneurship_2021
Bohitesh Misra, PMP
 
PDF
Use of data science for startups_Sept 2021
Bohitesh Misra, PMP
 
PDF
Building castles on sand - Project Management in distributed project environment
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 3 - Green it_Smartdust
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 2 - Blockchain smart_contracts
Bohitesh Misra, PMP
 
PDF
Disruptive technologies - Session 1 - introduction
Bohitesh Misra, PMP
 
PDF
What is data science ?
Bohitesh Misra, PMP
 
PPTX
Business analytics why now_what next
Bohitesh Misra, PMP
 
PDF
Internet of Things (IoT) based Solar Energy System security considerations
Bohitesh Misra, PMP
 
Innovation in enterpreneurship_2021
Bohitesh Misra, PMP
 
Use of data science for startups_Sept 2021
Bohitesh Misra, PMP
 
Building castles on sand - Project Management in distributed project environment
Bohitesh Misra, PMP
 
Disruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
Bohitesh Misra, PMP
 
Disruptive technologies - Session 3 - Green it_Smartdust
Bohitesh Misra, PMP
 
Disruptive technologies - Session 2 - Blockchain smart_contracts
Bohitesh Misra, PMP
 
Disruptive technologies - Session 1 - introduction
Bohitesh Misra, PMP
 
What is data science ?
Bohitesh Misra, PMP
 
Business analytics why now_what next
Bohitesh Misra, PMP
 
Internet of Things (IoT) based Solar Energy System security considerations
Bohitesh Misra, PMP
 

Recently uploaded (20)

PPTX
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
PPTX
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PPTX
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PPTX
Presentation on animal welfare a good topic
kidscream385
 
PDF
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PPTX
Fuzzy_Membership_Functions_Presentation.pptx
pythoncrazy2024
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PPTX
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PPTX
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
Pipeline Automatic Leak Detection for Water Distribution Systems
Sione Palu
 
PPTX
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Presentation on animal welfare a good topic
kidscream385
 
Fundamentals and Techniques of Biophysics and Molecular Biology (Pranav Kumar...
RohitKumar868624
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
Fuzzy_Membership_Functions_Presentation.pptx
pythoncrazy2024
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
Pipeline Automatic Leak Detection for Water Distribution Systems
Sione Palu
 
Data Security Breach: Immediate Action Plan
varmabhuvan266
 

Big data and analytics

  • 1. BIG DATA AND ANALYTICS CONCEPTS BOHITESH MISRA CHIEF TECHNOLOGY OFFICER IT STARTUPS [email protected]
  • 2. The Internet of Things connects all manner of end-points, a treasure trove of data Networks and device proliferation enable access to a massive and growing amount of traditionally siloed information Analytics and business intelligence tools empower decision makers as never before by extracting and presenting meaningful information in real-time, helping us be more predictive than reactive BUILDING A CONNECTED AND SMART ECOSYSTEM: A ROADMAP TO BUSINESS NIRVANA IoT Big Data Analytics
  • 4. CONTENT 1. What is Big Data 2. Characteristic of Big Data 3. Why Big Data 4. How it is Different 5. Big Data sources 6. Tools used in Big Data 7. Application of Big Data 8. Risks of Big Data 9. Benefits of Big Data 10.How Big Data Impact on IT 11.Future of Big Data
  • 5. BIG DATA • Big Data may well be the Next Big Thing in the IT world. • The first organizations to embrace big data were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. • Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.
  • 6. • ‘Big Data’ is similar to ‘small data’, but bigger in size • but having data bigger it requires different approaches, Techniques, tools and architecture • an aim to solve new problems or old problems in a better way • Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques. WHAT IS BIG DATA?
  • 8. THREE CHARACTERISTICS OF BIG DATA Volume • Data quantity Velocity • Data Speed Variety • Data Types
  • 9. BIG DATA - VOLUME •A typical PC might have had 10 gigabytes of storage in 2000. •Today, Facebook ingests 500 terabytes of new data every day. •Boeing 737 will generate 240 terabytes of flight data during a single flight across the US. • The smart phones, the data they create and consume; sensors embedded into everyday objects will soon result in billions of new, constantly-updated data feeds containing environmental, location, and other information, including video.
  • 10. BIG DATA - VELOCITY • Clickstreams and ad impressions capture user behavior at millions of events per second • high-frequency stock trading algorithms reflect market changes within microseconds • machine to machine processes exchange data between billions of devices • infrastructure and sensors generate massive log data in real-time
  • 11. BIG DATA - VARIETY • Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. • Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure. • Big Data analysis includes different types of data
  • 12. STORING BIG DATA ❖Analyzing your data characteristics • Selecting data sources for analysis • Eliminating redundant data • Establishing the role of NoSQL ❖Overview of Big Data stores • Data models: key value, graph, document, column-family • Hadoop Distributed File System • HBase • Hive
  • 13. PROCESSING BIG DATA ❖Integrating disparate data stores • Mapping data to the programming framework • Connecting and extracting data from storage • Transforming data for processing • Subdividing data in preparation for Hadoop MapReduce ❖Employing Hadoop MapReduce • Creating the components of Hadoop MapReduce jobs • Distributing data processing across server farms • Executing Hadoop MapReduce jobs • Monitoring the progress of job flows
  • 14. WHY BIG DATA •FB generates 10TB daily •Twitter generates 7TB of data Daily •IBM claims 90% of today’s stored data was generated in just the last two years.
  • 15. BIG DATA SOURCES Users Application Systems Sensors Large and growing files (Big data files)
  • 16. DATA GENERATION POINTS - EXAMPLES Mobile Devices Readers/Scanners Science facilities Microphones Cameras Social Media Programs/ Software
  • 17. BIG DATA ANALYTICS • Examining large amount of data • Appropriate information • Identification of hidden patterns, unknown correlations • Competitive advantage • Better business decisions: strategic and operational • Effective marketing, customer satisfaction, increased revenue
  • 18. • Where processing is hosted? • Distributed Servers / Cloud • Where data is stored? • Distributed Storage • What is the programming model? • Distributed Processing (e.g. MapReduce) • How data is stored & indexed? • High-performance schema-free databases (e.g. MongoDB) • What operations are performed on data? • Analytic / Semantic Processing TYPES OF TOOLS USED IN BIG-DATA
  • 19. Application Of Big Data analytics Homeland Security Smarter Healthcare Integrated and smart patient care systems and processes Retail & Multi-channel sales Highly personalized customer experience across channels and devices Telecom Manufacturing Intelligent interconnectivity across the enterprise for enhanced control, speed and efficiency Traffic Control Trading Analytics Search Quality Log Analysis Finance & Banking Seamless customer experience across all banking channels
  • 20. HOW BIG DATA IMPACTS ON IT • Big data is a troublesome force presenting opportunities with challenges to IT organizations. • By 2016 4.4 million IT jobs in Big Data ; 1.9 million is in US itself • India will require a minimum of 1 lakh data scientists in the next couple of years in addition to data analysts and data managers to support the Big Data space.
  • 21. POTENTIAL VALUE OF BIG DATA • $300 billion potential annual value to US health care. • $600 billion potential annual consumer surplus from using personal location data. • 60% potential in retailers’ operating margins.
  • 22. BENEFITS OF BIG DATA •Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse, It’s about the ability to make better decisions and take meaningful actions at the right time. •Fast forward to the present and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it. •Technologies such as MapReduce,Hive and Impala enable you to run queries without changing the data structures underneath.
  • 23. BENEFITS OF BIG DATA • Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem. • Big Data is already an important part of the $64 billion database and data analytics market • It offers commercial opportunities of a comparable scale to enterprise software in the late 1980s • And the Internet boom of the 1990s, and the social media explosion of today.
  • 24. FUTURE OF BIG DATA • $15 billion on software firms only specializing in data management and analytics. • This industry on its own is worth more than $100 billion and growing at almost 10% a year which is roughly twice as fast as the software business as a whole. • The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
  • 25. INDIA – BIG DATA • Gaining attraction and market • Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1 % ) • Current market size is $200 million. By 2015 $1 billion • The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals
  • 26. BIG DATA ANALYTICS TECHNOLOGIES NoSQL : non-relational or at least non-SQL database solutions such as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others. Hadoop: It is an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other software packages
  • 27. THE FOUR PILLARS FOR AN EFFECTIVE BIG DATA STRATEGY Storage User Experience Digital intelligence and Analytics Content Discovery and Management Just these segments account for more than $10 billion in served, addressable markets.
  • 28. MOTIVATION FOR SPECIALIZED BIG DATA SYSTEMS • Cost of data storage is dropping, but rate of data capture is soaring • Sources: online/digital, communications, messaging, usage, transactions… • Furthermore, need for real-time data-driven insights is also more urgent • Traditional data warehouses and RDBMS systems cannot keep up • They are unable to capture, manage and optimize the volume and diversity of data marketers are seeking to harness today • Structured, unstructured, and semi-structured data are all essential ingredients in today’s marketing mix; traditional systems cannot handle this • Big Data systems: cluster-based, commodity priced, distributed computing database management system • Most often based on Hadoop, but usable without MapReduce programming skills • Key features: linear scalability, parallel computing, node redundancy, and centralized access to data • Server clusters behave like a massive single mainframe: What traditional databases do in months, a Big Data management system can do in hours
  • 30. INTERNET OF THINGS • Each “thing” or connected device is part of the digital shadow of a person • For there to be a market in the internet of things, two things must be true: 1) The “thing” in question must provide utility to the human, and 2) The digital shadow must provide value to an enterprise.
  • 31. MARKET • The “market” is made up of many parts : ➢From wearable to drivable to home and ➢Industrial sensors and controllers, and • Each part is made up of segments : ➢Innovators, ➢Early adopters, ➢Pragmatists, ➢Conservatives, and ➢Laggards across many industries.
  • 32. PREDICTIVE ANALYTICS • From the data streams that implement the “digital shadows” of people, we can use predictive analytics to understand their needs and behavior better than ever before. • Every new dimension of data increases the predictive power, enabling enterprises to answer the question “what does the human want?”
  • 33. INTERNET OF THINGS & PREDICTIVE ANALYTICS • Transforming the internet of things and its sibling, predictive analytics, to be programmable by the same labor pool that has developed the apps which drove the mobile revolution makes basic economic sense. • Types of data generated by the internet of things is coupled with : ➢data analysis ➢data discovery tools and ➢ techniques to help business leaders identify emerging developments such as machines that might need maintenance : to prevent costly breakdowns or  sudden shifts in customer or market conditions that might signal some action a company should take.
  • 34. • The internet of things, the physical world will become a networked information system— through sensors and actuators embedded in real physical objects and linked through wired and wireless networks via the internet protocol. • This holds special value for manufacturing: ➢The potential for connected physical systems to improve productivity in the production process and ➢The supply chain is huge. • Consider processes that govern themselves, where smart products can take corrective action to avoid damages and where individual parts are automatically replenished. • Such technologies already exist and could drive the fourth industrial revolution— following the steam engine, the conveyor belt (assembly line - think ford model t), and the first phase of it and automation technology.
  • 35. EXAMPLE 1 : AUTO INSURANCE • The first-order vector was a connected accelerometer offered to drivers : ➢ to improve their insurance rates based on proven “safe driving” habits. • Through this digital shadow, the insurance provider can make much better actuarial predictions than through the coarse-grained data they had before ➢age, ➢gender, and ➢ traffic violations. • This is interesting in the same way the blackberry was interesting - a basic capability adopted for basic business improvement.
  • 36. • The second-order vector is much stronger : ➢the ability to transform the insurance market to better meet the needs of customers while changing the rules of competition. ➢based on real-time driving information insurance companies can : ▪ move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange), ▪ bidding on drivers and ▪ providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive fast tomorrow? Pay a little more but don’t worry about your “permanent record”. • These outcomes are all based on tying the internet of things to predictive analytics.
  • 37. EXAMPLE 2 : HEALTH CARE • The first-order vector is similar, a wearable accelerometer offered to patients : ➢ To improve traceability of their compliance with their exercise prescription, ➢Enabling better outcomes for cardiac patients. ➢Unlike prescription refills, exercise compliance has been untraceable before, so this digital shadow is a breakthrough for medicine. • Similar developments exist in digestible sensors within medications : ➢which activate only on contact with stomach acid, ➢providing higher truth and ➢better granularity than a monthly refill.
  • 38. • In second-order vector in healthcare ,the ability to combine multiple streams of information that were previously invisible has the potential to drive better health outcomes through provably higher patient compliance. • Sorting these data streams at scale will allow health providers and health insurance companies to rapidly iterate health protocols across a population of humans, augmenting human expertise with predictive analytics. • Outcome-based analysis based on predictive models built from data can reduce : ➢waste, ➢error rates, and ➢lawsuits while driving better margins. • Larger exchanges of this type of data will tend to : ➢ perform better, ➢creating a more effective market and ➢ a better pool of empirical research for science.
  • 39. EXAMPLE 3 : AUTO COMPANIES • They have installed thousands of "black boxes" inside their prototype and field testing vehicles to capture second by second data from the dozens of control units which manage today's automobiles. • These boxes simply plug into the vehicle's on-board diagnostic (obd) port which is typically located under the front dashboard of all cars. • They collect 500-750 different vehicle performance parameters that add up to terabytes of data in hours!
  • 40. • The intent of the automakers for installing these boxes is to collect data which their engineers can later analyze to fix bugs and improve on existing designs. • For example, one car manufacturer found out from this data that their minivan batteries would end up in a recall. ➢The problem was an underpowered alternator - it was not able to fully recharge the batteries because the most common drive cycle for this particular minivan was less than 3 miles. ➢As a result, there appeared to be a lot of complaints about dead batteries and the company was potentially facing the recall of millions of minivans which had this alternator. ➢The boxes collect information about driving cycles and this data was really useful in understanding the real reason behind the dead batteries. ➢The test vehicles which had short drive cycles were the ones which reported dead batteries! simply changing the alternator to higher capacity could fix the problem. ➢Now it was an easy fix to extend this solution to the entire fleet.
  • 41. ENDLESS OPPORTUNITY The opportunities are literally endless, ➢Ranging from early fault detection (predicting when a particular component is likely to fail) ➢To automatically adjusting driving route based on traffic pattern predictions. The ultimate test of predictive analytics in the internet of things is of course fully autonomous systems, such as : ➢the nissan car of 2020 or ➢ the google self driving car of today. In the end all autonomous systems will need the ability to build predictive capabilities - in other words, machines must learn machine learning!
  • 42. EXAMPLE 4 : GOOGLE’S SELF DRIVING CAR Google claims that their self-driving car of today has logged more than 300,000 miles with almost zero incidence of accidents. The one time a minor crash did occur was when the car was rear- ended by a human-driven car! So, when the technology is fully mature, it is not just parking valets who become obsolete, other higher paying professions such as automotive safety systems experts may also need to look for other options! Predictive analytics is the enabler that will make this happen.
  • 43. EXAMPLE 5 : JET AIRLINER • A jet airliner generates 20 terabytes of diagnostic data per hour of flight. • The average oil platform has 40,000 sensors, generating data 24/7. • M2M is now generating enormous volumes of data and is testing the capabilities of traditional database technologies. • To extract rich, real-time insight from the vast amounts of machine-generated data, companies will have to build a technology foundation with speed and scale because raw data, whatever the source, is only useful after it has been transformed into knowledge through analysis. • Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets to identify patterns and insights and can perform analysis at massive scale with precision even as machine-generated data grows beyond the petabyte scale
  • 44. FINDING RIGHT ANALYTICS DATABASE TECHNOLOGY • To find the right analytics database technology to capture, connect, and drive meaning from data, companies should consider the following requirements: ➢ Real-time Analysis : Businesses can’t afford for data to get stale. Data solutions need to : ▪ load quickly and easily, ▪ and must dynamically query, ▪ analyze, and ▪ communicate m2m information in real-time, without huge investments in it administration, support, and tuning. ➢Flexible Querying And Ad-hoc Reporting : When intelligence needs to change quickly, analytic tools can’t ▪ be constrained by data schemas that limit the number and ▪ type of queries that can be performed. This type of deeper analysis also cannot be constrained by tinkering or time- consuming manual configuration (such as indexing and managing data partitions) to create and change analytic queries.
  • 45. ➢Efficient Compression : Efficient data compression is key to enabling M2M data management within : ▪ A network node, ▪ Smart device, or ▪ Massive data center cluster. Better compression allows : ▪ For less storage capacity overall, ▪ As well as tighter data sampling and ▪ Longer historical data sets, ▪ Increasing the accuracy of query results. ➢Ease Of Use And Cost : Data analysis must be : ▪ Affordable, Easy-to-use, and ▪ Simple to implement in order to justify the investment. This demands low-touch solutions that are optimized to deliver : ▪ Fast analysis of large volumes of data, ▪ With minimal hardware, Administrative effort, and ▪ Customization needed to set up or ▪ Change query and reporting parameters.
  • 46. EXAMPLE 6 : UNION PACIFIC RAILROAD • The railroad is using sensor and analytics technologies to predict and prevent train derailments. • For example, the company has placed infrared sensors every 20 miles along its tracks to gather 20 million temperature readings of train wheels each day, looking for the overheating that signals impending failure. • Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels. • Data from such physical measurements is sent via fiber-optic lines to Union Pacific’s data centers. • Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing Union Pacific experts to determine, within minutes of capturing the data, whether a driver should pull a train over for inspection or reduce its speed until it reaches the next station for repair.
  • 47. HOW TO ANALYZE MACHINE AND SENSOR DATA • This example captures and refines data from heating, ventilation, and air-conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks Data Platform, then analyzes the refined sensor data to maintain optimal building temperatures. • Sensor data - A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the “Internet of Things.” • Potential uses of sensor data (a small refinement sketch follows below): ➢To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break. ➢To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital signs during recovery from a medical procedure.
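As a hedged illustration of the refinement step, the sketch below flags HVAC readings that drift outside a target comfort band. The record layout, field names, and thresholds are assumptions for illustration, not taken from the Hortonworks tutorial.

```java
// Hypothetical HVAC refinement: flag readings outside a comfort band.
import java.util.List;

public class HvacCheck {
    static final double TARGET = 21.0, TOLERANCE = 2.0;

    record Reading(String buildingId, double actualTemp) {}

    // A reading needs attention when it deviates beyond the tolerance.
    static boolean outOfBand(Reading r) {
        return Math.abs(r.actualTemp() - TARGET) > TOLERANCE;
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
                new Reading("building-1", 21.4),
                new Reading("building-7", 25.3));
        readings.stream()
                .filter(HvacCheck::outOfBand)
                .forEach(r -> System.out.println(
                        r.buildingId() + " outside comfort band: "
                        + r.actualTemp() + " C"));
    }
}
```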
  • 49. OUTLINE • Architecture of the Hadoop Distributed File System • Hadoop usage • Ideas for Hadoop-related research
  • 50. HADOOP, WHY? • Need to process multi-petabyte datasets • Expensive to build reliability into each application • Nodes fail every day – failure is expected rather than exceptional – the number of nodes in a cluster is not constant • Need common infrastructure – efficient, reliable, open source (Apache License) • The above goals are the same as Condor’s, but • workloads are I/O-bound, not CPU-bound
  • 51. HIVE, WHY? • Need a multi-petabyte warehouse • Files are insufficient data abstractions • Need tables, schemas, partitions, indices • SQL is highly popular • Need for an open data format – RDBMSs have closed data formats – need for a flexible schema • Hive is a Hadoop subproject!
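As a minimal sketch of how an analyst might reach Hive from application code, the snippet below issues a SQL-like query over JDBC against HiveServer2. The table, columns, and connection details are hypothetical placeholders; real deployments will differ.

```java
// A minimal sketch of querying Hive from Java over JDBC (HiveServer2).
// Table and column names are hypothetical; host/port are common defaults.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Tables, schemas, and partitions feel like SQL even though
            // the underlying data lives in HDFS files.
            ResultSet rs = stmt.executeQuery(
                "SELECT building_id, AVG(temp) AS avg_temp "
                + "FROM hvac_readings WHERE dt = '2015-06-01' "
                + "GROUP BY building_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}
```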
  • 52. HADOOP What is Hadoop?  It is a framework for running applications on large clusters of commodity hardware, used to store and process huge volumes of data. Hadoop includes:  HDFS, a distributed filesystem  Map/Reduce - Hadoop implements this programming model; it is an offline (batch) computing engine Concept: moving computation to the data is more efficient than moving large data to the computation
  • 53. • Data-intensive applications with petabytes of data • Web pages: 20+ billion web pages x 20 KB = 400+ terabytes • One computer can read 30-35 MB/sec from disk: ~four months to read the web • The same problem with 1000 machines: < 3 hours • Difficulties with a large number of machines: • communication and coordination • recovering from machine failure • status reporting • debugging • optimization • locality
  • 54. WHO USES HADOOP? • Facebook • Amazon/A9 • Google • IBM • New York Times • Yahoo! • PowerSet
  • 55. COMMODITY HARDWARE Typically a 2-level architecture – Nodes are commodity PCs – 30-40 nodes per rack – Uplink from the rack is 3-4 gigabit – Rack-internal is 1 gigabit
  • 56. GOALS OF HDFS • Very Large Distributed File System – 10K nodes, 100 million files, 10 PB • Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detects failures and recovers from them • Optimized for Batch Processing – Data locations are exposed so that computations can move to where the data resides – Provides very high aggregate bandwidth • Runs in user space on heterogeneous OSes
  • 57. HDFS ARCHITECTURE [Diagram: a Client interacting with the NameNode and the DataNodes; a Secondary NameNode alongside the NameNode; DataNodes reporting cluster membership] NameNode: maps a file to a file-id and a list of blocks on DataNodes. DataNode: maps a block-id to a physical location on disk. SecondaryNameNode: performs periodic merges of the transaction log.
  • 58. DISTRIBUTED FILE SYSTEM • Single Namespace for entire cluster • Data Coherency – Write-once-read-many access model – Client can only append to existing files • Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes • Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
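A minimal sketch of this write-once-read-many model through the standard HDFS client API (org.apache.hadoop.fs). The NameNode URI and file path below are placeholders; the point is that the client consults the NameNode for block locations but streams data directly to and from DataNodes.

```java
// A minimal sketch of the HDFS client API. URI and path are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/sensors/readings.txt");

        // Write once: the file is split into blocks (typically 128 MB)
        // and each block is replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("building-1,2015-06-01T12:00,21.5\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read many: the client asks the NameNode where the blocks live,
        // then reads the bytes directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```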
  • 60. HDFS – HADOOP DISTRIBUTED FILE SYSTEM
  • 61. HADOOP CLUSTER ARCHITECTURE • Map/Reduce Master “JobTracker” • Accepts MR jobs submitted by users • Assigns Map and Reduce tasks to TaskTrackers • Monitors task and TaskTracker status; re-executes tasks upon failure • Map/Reduce Slaves “TaskTrackers” • Run Map and Reduce tasks upon instruction from the JobTracker • Manage storage and transmission of intermediate output
  • 62. NAMENODE METADATA • Metadata in Memory – The entire metadata is held in main memory – No demand paging of metadata • Types of Metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor • A Transaction Log – Records file creations, file deletions, etc.
  • 63. DATANODE • A Block Server – Stores data in the local file system – Stores meta-data of a block – Serves data and meta-data to Clients • Block Report – Periodically sends a report of all existing blocks to the NameNode • Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 64. DATA MODEL • Files are broken into large blocks – Typically 128 MB block size – Blocks are replicated for reliability • One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are randomly placed • Understands rack locality – Data placement is exposed so that computation can be migrated to the data • Client talks to both the NameNode and the DataNodes – Data is not sent through the NameNode; clients access data directly from DataNodes – Throughput of the file system scales nearly linearly with the number of nodes
  • 65. DATA CORRECTNESS • Uses checksums to validate data – CRC32 • File creation – Client computes a checksum per 512 bytes – DataNode stores the checksums • File access – Client retrieves the data and checksums from the DataNode – If validation fails, the client tries other replicas
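The sketch below illustrates the idea of per-chunk CRC32 checksums using the 512-byte chunk size mentioned above. It shows the concept only and is not the HDFS implementation.

```java
// Illustrative per-chunk checksum validation in the spirit of HDFS,
// which checksums each 512-byte chunk of a file with CRC32.
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumExample {
    static final int CHUNK = 512;

    // Compute one CRC32 value per 512-byte chunk (as on file creation).
    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute and compare; a mismatch means the client
    // should fall back to another replica.
    static boolean valid(byte[] data, long[] expected) {
        return Arrays.equals(checksums(data), expected);
    }
}
```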
  • 66. NAMENODE FAILURE • The NameNode is a single point of failure – newer versions add a SecondaryNameNode (for checkpointing, not automatic failover) • The transaction log is stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) • A real HA solution still needs to be developed
  • 67. HADOOP MAP/REDUCE • The Map-Reduce programming model – Framework for distributed processing of large data sets – Pluggable user code runs in a generic framework • Common design pattern in data processing: cat * | grep | sort | uniq -c | cat > file input | map | shuffle | reduce | output • Natural for: – Log processing – Web search indexing – Ad-hoc queries (a minimal WordCount sketch follows below)
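The canonical example of this model is WordCount. The sketch below uses the Hadoop MapReduce Java API, with the map, shuffle, and reduce stages matching the pipeline above; input and output paths are passed as arguments.

```java
// Minimal WordCount with the Hadoop MapReduce API: map emits (word, 1),
// the shuffle groups by word, and reduce sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word after the shuffle.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```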
  • 68. EXAMPLE - HADOOP AT FACEBOOK • Production cluster • 4800 cores, 600 machines, 16GB per machine – April 2009 • 8000 cores, 1000 machines, 32 GB per machine – July 2009 • 4 SATA disks of 1 TB each per machine • 2 level network hierarchy, 40 machines per rack • Total cluster size is 2 PB, projected to be 12 PB in Q3 2009 • Test cluster • 800 cores, 16GB each
  • 69. DATA FLOW [Diagram: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster, with Oracle RAC and MySQL alongside]
  • 70. HADOOP AND HIVE USAGE • Statistics : • 15 TB uncompressed data ingested per day • 55TB of compressed data scanned per day • 3200+ jobs on production cluster per day • 80M compute minutes per day • Barrier to entry is reduced: • 80+ engineers have run jobs on Hadoop platform • Analysts (non-engineers) starting to use Hadoop through Hive
  • 71. BIG DATA LEARNING PATH • 1. Understand the differences between various data-handling techniques such as OLTP, OLAP, Data Mining, Data Warehouse, Data Mart, etc. • 2. Understand various visualization techniques such as Bar Chart, Heat Map, Tree Map, Density Map, etc. • 3. Understand Data Mining / Analytics algorithms. • 4. Identify various sources of data and the data elements that need to be focused on for analytics. • 6. Understand data-quality checks and ensure data quality; without quality data, analytics may deviate considerably from the actual scenario.
  • 72. BIG DATA LEARNING PATH • 7. Answer why you need a distributed system. • 8. Study the various data-handling techniques provided by NoSQL databases such as MongoDB, Cassandra, etc. • 9. Find out how Hadoop or related Big Data techniques can be used for distributed data through horizontal-scalability techniques. • 10. Finalize the algorithms that will run on top of this data, and identify tools or develop programs for these algorithms. • 11. Use the visualization techniques learnt in step 2 above to present the output. • 12. Keep making the tool/program more and more intelligent as a continuous process. Note: A combination of relational and NoSQL databases may be required for performing the required analytics and/or generating visualizations.
  • 73. CONCLUSION • Why commodity hardware? Because it is cheaper, and the system is designed to tolerate faults. • Why HDFS? Network bandwidth vs. seek latency: large blocks and streaming access favor high throughput. • Why the MapReduce programming model? It enables parallel programming over large data sets, moves computation to the data, and uses a single combined compute + data cluster.
  • 74. MORE IDEAS FOR FURTHER DISCUSSION AND RESEARCH • Hadoop log analysis • Failure prediction and root-cause analysis • Hadoop data rebalancing • Based on access patterns and load • Best use of flash memory? • Design of new topologies based on commodity hardware
  • 75. USEFUL LINKS •HDFS Design: • http://hadoop.apache.org/core/docs/current/hdfs_design.html •Hadoop API: • http://hadoop.apache.org/core/docs/current/api/ •Hive: • http://hadoop.apache.org/hive/
  • 76. THANK YOU DATA SCIENTISTS !