DM - Topic Five
Technology
From single-location storage to storage across multiple locations
From serial and batch processing to parallel processing
More on streaming and large data processing
From a single technique to a combined approach
Challenges
More speed
More size
More variety
More value
More uncertainty
Big data technologies and applications
BD technologies
Acquisition and storage
Analysis and modeling
Visualization
BD applications and case studies
Why Study Big Data Technologies?
Divide work
Combine results
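The divide work, combine results idea can be illustrated with a minimal sketch. This is only an assumption-laden toy: the function names (`split_into_chunks`, `process_chunk`, `parallel_sum`) are hypothetical, and Python threads stand in for the separate machines a real big data system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, num_chunks):
    """Divide: cut the input into roughly equal contiguous chunks."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    """Work: each worker computes a partial result on its own chunk."""
    return sum(chunk)

def parallel_sum(items, num_workers=4):
    """Divide the input, process chunks concurrently, combine the results."""
    chunks = split_into_chunks(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine: merge the partial results into the final answer.
    return sum(partial_results)

if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
```

The pattern only works this simply because summation can be computed on pieces independently and then merged; tasks whose pieces depend on each other need more coordination.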
Big Data Open Source Tools
Typical big data management tools
Apache Storm
for real-time analysis of data streams
Apache Drill
for interactive and ad-hoc analysis
Apache Hadoop (for large-scale processing)
MapReduce (splits a data processing task into two phases: mapping and reducing)
HDFS (Hadoop Distributed File System)
HBase (non-relational distributed database that sits on top of HDFS, modeled on Google's Bigtable)
Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: collect all values belonging to the key and output
MapReduce Example - WordCount
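A minimal single-machine sketch of WordCount, following the map, group by key, and reduce steps. It is not Hadoop code: the function names (`map_phase`, `group_by_key`, `reduce_phase`, `word_count`) are hypothetical, and splitting on whitespace with lowercasing is an assumed tokenization.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def group_by_key(pairs):
    """Group by key: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single output pair."""
    return key, sum(values)

def word_count(documents):
    """Run map on each document, group the pairs, then reduce each group."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    groups = group_by_key(pairs)
    # Sorting the keys mimics the sort step of the shuffle stage.
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))

if __name__ == "__main__":
    print(word_count(["big data tools", "Big data applications"]))
```

In a real cluster the map calls would run on different nodes, and the grouping would happen while pairs are shuffled over the network to the reducers; here everything runs in one process to keep the data flow visible.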
Big Data Applications
Link analysis
Graph data processing
Data stream mining
Text mining
Large-scale machine learning through
Association mining
Classification
Clustering etc…
Review questions
Explain how MapReduce works.
Describe the motivations for big data technologies.
What are the major challenges in big data analysis?