
Data Mining - Topic Five

Current trends in data analytics (big data), DM and BI
Tibebe B. (PhD)
Topics
 Fundamental concepts and the need for business intelligence,
data mining and its flavors, big data analysis
 BI/DM/BDA applications, DA models and frameworks
 Data and data warehousing
 Data mining techniques: association rule mining, classification
and cluster analysis
 Including web/text mining, opinion mining, BI technologies, applications, and
case studies
 Current trends in data analytics and BI
Current trends in data mining/big data and BI

 Use (now widely recognized as …)
 The basis of success for a data-driven business
 A way to acquire and sustain competitive advantage via data
science/data analytics
 Organizations are working hard to harness data to add intelligence
to their business
 Data
 From database data to networked data, social media data, etc.
 From smaller sizes to bigger
 From simple forms to complex structures
Cont…

 Technology
 From single-location storage to storage in multiple locations
 From serial and batch processing to parallel processing
 More emphasis on streaming and large-scale data processing
 From single techniques to combined approaches
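These shifts can be illustrated with a small Python sketch (not from the slides; `process` is a made-up per-record workload, and threads stand in for the separate workers a cluster would use):

```python
from concurrent.futures import ThreadPoolExecutor

def process(record):
    # Stand-in for per-record work (a made-up example workload)
    return record * record

data = list(range(10))

# Serial processing: one record after another
serial = [process(r) for r in data]

# Parallel processing: the same records handled by a pool of workers
# (threads here; on a cluster these would be separate machines)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(process, data))

assert parallel == serial  # same results, work done concurrently
```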
Challenges

 More speed (velocity)
 More size (volume)
 More variety
 More value
 More uncertainty (veracity)
Big data technologies and applications
 BD technologies
 Acquisition and storage
 Analysis and modeling
 Visualization
 BD applications and case studies

Why Study Big Data Technologies?

 One of the hottest topics in both research and industry
 Highly demanded in the real world
 A promising future career
Research and development of big data systems: distributed systems (e.g.,
Hadoop), visualization tools, data warehouses, OLAP, data integration,
data quality control, …
Big data applications: social marketing, healthcare, …
Data analysis: getting value out of big data by discovering and applying
patterns; predictive analysis, business intelligence, privacy and
security, …
Big data storage and management is challenging
 Data volumes are massive
 Reliably storing PBs of data is challenging
 All kinds of failures: disk/hardware/network failures
 The probability of failure simply increases with the number of machines …
 Key challenges
 Challenges include analysis, capture, search, sharing, storage,
transfer, visualization, querying, updating, and information privacy
 While data can be accessed or acquired relatively easily through
APIs such as the Twitter API, storage and real-time analytics/decision
requirements remain challenging
Idea and Solution
 Issue: Copying data over a network takes time
 Idea:
 Bring computation close to the data
 Store files multiple times for reliability
 MapReduce addresses these problems
 Google’s computational/data manipulation model
 An elegant way to work with big data
 Storage infrastructure – file system
Google: GFS. Hadoop: HDFS (Hadoop Distributed File System)
 Programming model
MapReduce
Philosophy to Scale for Big Data Processing

 Divide the work
 Combine the results
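A minimal sketch of this divide-work / combine-results philosophy (illustrative only; the helper names `divide`, `work`, and `combine` are made up):

```python
def divide(data, n_chunks):
    # Split the input into roughly equal chunks of work
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def work(chunk):
    # Each worker computes a partial result, e.g. a partial sum
    return sum(chunk)

def combine(partials):
    # Merge the partial results into the final answer
    return sum(partials)

data = list(range(100))
partials = [work(c) for c in divide(data, 4)]
assert combine(partials) == sum(data)  # same answer as doing it all at once
```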
Big Data Open Source Tools
Typical big data management tools
 Big Data Management Tools
 Project Storm
 for real-time data stream analysis
 Apache Drill
 for interactive and ad-hoc analysis
 Apache Hadoop (for large-scale processing)
 MapReduce (works by splitting the data processing task into two
phases: mapping and reducing)
 HDFS (Hadoop Distributed File System)
 HBase (non-relational distributed DB that sits on top of HDFS, based on Google’s BigTable)
 Hive (batch-oriented data access layer on top of HDFS and MapReduce)
 Pig, Mahout, Spark
What is Hadoop?
 Open-source data storage and processing platform
 Before the advent of Hadoop, storage and processing of big data was a big
challenge
 Massively scalable, automatically parallelizable
 Based on work from Google
 Google: GFS + MapReduce + BigTable (not open)
 Hadoop: HDFS + Hadoop MapReduce + HBase (open source)
 Named by Doug Cutting in 2006 (then at Yahoo!), after his son's
toy elephant
Why Use Hadoop?
 Cheaper
 Scales easily to petabytes and beyond
 Faster
 Parallel data processing
 Better
 Suited for particular types of big data problems
Hadoop is a set of Apache Frameworks and more…
 Data storage (HDFS)
 Runs on commodity hardware (usually Linux)
 Horizontally scalable
 Processing (MapReduce)
 Parallelized (scalable) processing
 Fault Tolerant
 Other tools / frameworks, layered over the core:
 Monitoring & alerting: Greenplum, Cloudera
 Data access tools & libraries: HBase, Hive, Pig, Mahout
 Tools: Hue, Sqoop
 MapReduce API
 Hadoop Core – HDFS
Companies Using Hadoop
MapReduce
 Typical big data problem (storage and processing):
 Iterate over a large number of records
 Extract something of interest from each (Map)
 Shuffle and sort intermediate results
 Aggregate intermediate results (Reduce)
 Generate final output

 Key idea: provide a functional abstraction for these two operations
What is MapReduce?
 MapReduce is a computing paradigm (originated at Google)
 A programming model for parallel data processing
 MapReduce: Simplified Data Processing on Large Clusters
 Hadoop MapReduce is open-source software implementing
MapReduce
 Hadoop can run MapReduce programs written in various languages,
e.g. Java, Ruby, Python, C++
 For large-scale data processing
 Exploits large sets of commodity computers
 Executes processing in a distributed manner
 Offers high availability
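As an illustration of this language flexibility, with Hadoop Streaming a mapper can be an ordinary script that reads lines from stdin and writes tab-separated key-value pairs to stdout. A sketch of the mapper side in Python (illustrative, following the usual streaming convention; `stream_map` is a made-up name):

```python
import sys

def stream_map(lines):
    # Emit one (word, 1) pair per token, tab-separated, as
    # Hadoop Streaming expects on stdout
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

if __name__ == "__main__":
    for pair in stream_map(sys.stdin):
        print(pair)
```

A matching reducer script would read the sorted pairs back from stdin and sum the counts per word.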
Who is the father of MapReduce?
 Jeffrey (Jeff) Dean
 He is a Google Senior Fellow in the Systems and
Infrastructure Group
 Designed MapReduce, BigTable, etc.
 One of the most brilliant engineers, programmers, and computer scientists around
 Google “Who is Jeff Dean” and “Jeff Dean facts”
Data Structures in MapReduce
 Key-value pairs are the basic data structure in MapReduce
 Keys and values can be: integers, float, strings, raw bytes
 They can also be arbitrary data structures
 The design of MapReduce algorithms involves:
 Imposing the key-value structure on arbitrary datasets
E.g.: for a collection of Web pages, input keys may be URLs and
values may be the HTML content
 In some algorithms, input keys are not used (e.g., wordcount); in
others they uniquely identify a record
 Keys can be combined in complex ways to design various algorithms
Map and Reduce Functions
 Programmers specify two functions:
 map (k1, v1) → list [<k2, v2>]
Map transforms the input into key-value pairs to process
 reduce (k2, list [v2]) → [<k3, v3>]
Reduce aggregates the list of values for each key
All values with the same key are sent to the same reducer
 list [<k2, v2>] will be grouped according to key k2 as (k2, list [v2])
 The MapReduce environment takes charge of everything else…
 A complex program can be decomposed as a succession of Map and
Reduce tasks
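The two signatures above can be sketched as a toy single-machine MapReduce driver in Python (illustrative only; `run_mapreduce` is a made-up name, and a real framework would distribute each phase across machines):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (k1, v1) record yields a list of (k2, v2) pairs
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle phase: group all values that share the same key k2
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: aggregate the value list for each key into (k3, v3)
    return [reduce_fn(k2, values) for k2, values in groups.items()]
```

For instance, a `map_fn` that emits `(word, 1)` pairs and a `reduce_fn` that sums each key's values gives word counting.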
Hadoop MapReduce Brief Data Flow
 1. Mappers read from HDFS
 2. Map output is partitioned by key and sent to Reducers
 3. Reducers sort input by key
 4. Reduce output is written to HDFS
 Intermediate results are stored on local FS of Map and Reduce workers
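Step 2 above, partitioning map output by key, is typically done by hashing, in the spirit of Hadoop's default HashPartitioner. A sketch (Python's built-in `hash` stands in for a real, stable hash function):

```python
def partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers,
    # so every occurrence of a key lands on the same reducer
    return hash(key) % num_reducers

# Same key -> same partition, and the partition index is in range
p = partition("store", 4)
assert p == partition("store", 4)
assert 0 <= p < 4
```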
Example-Word Count
 We have a large file of words, one word to a line
 Count the number of times each distinct word appears in
the file
 Sample application: analyze web server logs to find
popular URLs
WordCount - Mapper
 Reads in an input pair <k1, v1>
 Outputs pairs <k2, v2>
 Let’s count the number of occurrences of each word in user queries (or Tweets/Blogs)
 The input to the mapper will be <queryID, QueryText>:
<Q1,“The teacher went to the store. The store was closed; the
store opens in the morning. The store opens at 9am.” >
 The output would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store,1> <the, 1> <store,
1> <was, 1> <closed, 1> <the, 1> <store,1> <opens, 1> <in, 1> <the, 1>
<morning, 1> <the 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
WordCount - Reducer
 Accepts the Mapper output (k2, v2), and aggregates values
on the key to generate (k3, v3)
 For our example, the reducer input would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
<opens,1> <in, 1> <the, 1> <morning, 1> <the 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
 The output would be:
<The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1>
<closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
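The whole WordCount flow on this sample text can be reproduced in a few lines of Python (a single-machine sketch; tokens are lowercased and punctuation is stripped, so "The" and "the" merge as in the output above):

```python
import re
from collections import defaultdict

text = ("The teacher went to the store. The store was closed; the "
        "store opens in the morning. The store opens at 9am.")

# Mapper: emit a <word, 1> pair for every token
pairs = [(w, 1) for w in re.findall(r"[a-z0-9]+", text.lower())]

# Shuffle + Reduce: group the pairs by word and sum the counts
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

assert counts["the"] == 6 and counts["store"] == 4 and counts["opens"] == 2
```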
Map-Reduce: A diagram
 Map: read the input (a big document) and produce a set of
key-value pairs
 Group by key: collect all pairs with the same key
(hash merge, shuffle, sort, partition)
 Reduce: collect all values belonging to each key and
output the result
MapReduce Example - WordCount
Big Data Applications
 Link analysis
 Graph data processing
 Data stream mining
 Text mining
 Large-scale machine learning through
 Association mining
 Classification
 Clustering etc…
Review questions
 Explain how MapReduce works.
 Describe the motivations for big data technologies.
 What are the major challenges in big data analysis?
Thank you
