SYBCA Big Data
COLLEGE, BALEWADI, PUNE – 45
SUBJECT: BIG DATA
UNIT 1: Introduction to Big Data
What is Big Data?
What makes data “Big” Data?
Big Data Definition
◦ No single standard definition…
Characteristics of Big Data:
1-Scale (Volume)
◦ Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 ZB
◦ Data volume is increasing exponentially
Exponential increase in collected/generated data
Characteristics of Big Data:
2-Complexity (Variety)
◦ Various formats, types, and structures
◦ Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
◦ Static data vs. streaming data
◦ A single application can be
generating/collecting many types of data
Characteristics of Big Data:
3-Speed (Velocity)
◦ Data is being generated fast and needs to be processed fast
◦ Online Data Analytics
◦ Late decisions ➔ missing opportunities
◦ Examples
◦ E-Promotions: Based on your current location, your purchase history, and what you like ➔ send promotions right now for the store next to you
Big Data: 3V’s
Some Make it 4V’s
Harnessing Big Data
Who’s Generating Big Data
◦ Mobile devices (tracking all objects all the time)
The Model Has Changed…
◦ The Model of Generating/Consuming Data has Changed
◦ Old Model: Few companies are generating data, all others are consuming data
◦ New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
Value of Big Data Analytics
◦ Big data is more real-time in
nature than traditional DW
applications
◦ Traditional DW architectures
(e.g. Exadata, Teradata) are not
well-suited for big data apps
◦ Shared nothing, massively
parallel processing, scale out
architectures are well-suited for
big data apps
Challenges in Handling Big Data
What Technology Do We Have for Big Data?
Big Data Technology
What You Will Learn…
◦ We focus on Hadoop/MapReduce technology
◦ Learn the platform (how it is designed and works)
◦ How big data are managed in a scalable, efficient way
◦ Learn to write Hadoop jobs in different languages (see the sketch below)
◦ Programming Languages: Java, C, Python
◦ High-Level Languages: Apache Pig, Hive
◦ Learn advanced analytics tools on top of Hadoop
◦ RHadoop: Statistical tools for managing big data
◦ Mahout: Data mining and machine learning tools over big data
◦ Learn state-of-the-art technology from recent research papers
◦ Optimizations, indexing techniques, and other extensions to Hadoop
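To give a flavor of such a job, here is a minimal word-count sketch in Python for Hadoop Streaming (a sketch only, not course-provided code; the file names are illustrative):

    # mapper.py -- reads raw text from stdin, emits (word, 1) pairs
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- receives pairs sorted by word, sums the counts
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

A typical (installation-dependent) invocation: hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input <in> -output <out>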
Course Logistics
◦ Web Page: http://web.cs.wpi.edu/~cs525/s13-MYE/
Textbook & Reading List
◦ No specific textbook
◦ Big Data is a relatively new topic (so no fixed syllabus)
◦ Reading List
◦ We will cover state-of-the-art technology from research papers in major conferences
◦ Many Hadoop-related papers are available on the course website
◦ Related books:
◦ Hadoop: The Definitive Guide [pdf]
Requirements & Grading
◦ Seminar-Type Course
◦ Students will read research papers and present them (Reading List)
◦ Hands-on Course (done in teams of two)
◦ No written homework or exams
◦ Several coding projects covering the entire semester
Requirements & Grading (Cont’d)
◦ Reviews
◦ When a team is presenting (not the instructor), the other students should prepare
a review on the presented paper
◦ The course website gives guidelines on how to write good reviews
Late Submission Policy
◦ For Projects
◦ One day late → 10% off the max grade
◦ Two days late → 20% off the max grade
◦ Three days late → 30% off the max grade
◦ Beyond that, no late submission is accepted
◦ Submissions:
◦ Submitted via blackboard system by the due date
◦ Demonstrated to the instructor within the following week
◦ For Reviews
◦ No late submissions
◦ Students may skip at most 4 reviews
◦ Submissions:
◦ Given to the instructor at the beginning of class
More about Projects
◦ A virtual machine is created including the needed platform for the
projects
◦ Ubuntu OS (Version 12.10)
◦ Hadoop platform (Version 1.1.0)
◦ Apache Pig (Version 0.10.0)
◦ Mahout library (Version 0.7)
◦ RHadoop
◦ In addition to other software packages
◦ Download it from the course website (link)
◦ Username and password will be sent to you
◦ Needs VirtualBox (VBox) [free]
Next Step from You…
1. Form teams of two
2. Visit the course website (Reading List), each team selects its first
paper to present (1st come 1st served)
◦ Send me your top 2–3 choices
3. You have until Jan 20th
◦ Otherwise, I’ll randomly form teams and assign papers
4. Use Blackboard “Discussion” forum for posts or for searching for
teammates
UNIT 2: Introduction to Data Science
Topics
◦ databases and data architectures
◦ databases in the real world
◦ scaling, data quality, distributed
◦ machine learning/data mining/statistics
◦ information retrieval
◦ Data Science is currently a popular interest of
employers
◦ our Industrial Affiliates Partners say there is
high demand for students trained in Data
Science
◦ databases, warehousing, data architectures
◦ data analytics – statistics, machine learning
◦ Big Data – gigabytes/day or more
◦ Examples:
◦ Walmart, cable companies (ads linked to content,
viewer trends), airlines/Orbitz, HMOs, call centers,
Twitter (500M tweets/day), traffic surveillance
cameras, detecting fraud, identity theft...
◦ supports “Business Intelligence”
◦ quantitative decision-making and control
◦ finance, inventory, pricing/marketing, advertising
◦ need data for identifying risks, opportunities,
conducting “what-if” analyses
Data Architectures
◦ traditional databases (CSCE 310/608)
◦ tables, fields
◦ tuples = records or rows
◦ <yellowstone,WY,6000000 acres,geysers>
◦ key = field with unique values
◦ can be used as a reference from one table into another
◦ important for avoiding redundancy (normalization), since redundancy risks inconsistency
◦ join – combining 2 tables using a key
◦ metadata – data about the data
◦ names of the fields, types (string, int, real, mpeg...)
◦ also things like source, date, size, completeness/sampling
Combined table (unnormalized):
Name          | HomeTown       | Grad school      | PhD  | teaches  | title
John Flaherty | Houston, TX    | Rice             | 2005 | CSCE 411 | Design and Analysis of Algorithms
Susan Jenkins | Omaha, NE      | Univ of Michigan | 2004 | CSCE 121 | Introduction to Computing in C++
Susan Jenkins | Omaha, NE      | Univ of Michigan | 2004 | CSCE 206 | Programming in C
Bill Jones    | Pittsburgh, PA | Carnegie Mellon  | 1999 | CSCE 314 | Programming Languages
Bill Jones    | Pittsburgh, PA | Carnegie Mellon  | 1999 | CSCE 206 | Programming in C

Instructors:
Name          | HomeTown       | Grad school      | PhD
John Flaherty | Houston, TX    | Rice             | 2005
Susan Jenkins | Omaha, NE      | Univ of Michigan | 2004
Bill Jones    | Pittsburgh, PA | Carnegie Mellon  | 1999

TeachingAssignments:
Name          | teaches
John Flaherty | CSCE 411
Susan Jenkins | CSCE 121
Susan Jenkins | CSCE 206
Bill Jones    | CSCE 314
Bill Jones    | CSCE 206

Courses:
course   | title
CSCE 411 | Design and Analysis of Algorithms
CSCE 121 | Introduction to Computing in C++
CSCE 314 | Programming Languages
CSCE 206 | Programming in C
◦ SQL: Structured Query Language
>SELECT Name, HomeTown FROM Instructors WHERE PhD<2000;
Bill Jones Pittsburgh, PA
>SELECT TeachingAssignments.teaches
 FROM Instructors JOIN TeachingAssignments
 ON Instructors.Name=TeachingAssignments.Name
 WHERE Instructors.GradSchool='Carnegie Mellon';
CSCE 314
CSCE 206
because they were both taught by Bill Jones
◦ SQL servers
◦ centralized database, required for concurrent access by multiple users
◦ ODBC: Open DataBase Connectivity – protocol to connect to servers and do queries and updates from languages like Java, C, Python (see the sketch after this list)
◦ Oracle, IBM DB2 - industrial strength SQL databases
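A minimal sketch of an ODBC query from Python (assuming the third-party pyodbc driver is installed; the DSN and credentials are placeholders):

    import pyodbc

    # connect through an ODBC Data Source Name -- placeholder settings
    conn = pyodbc.connect("DSN=CourseDB;UID=student;PWD=secret")
    cur = conn.cursor()
    # parameterized query against the Instructors table from above
    cur.execute("SELECT Name, HomeTown FROM Instructors WHERE PhD < ?", 2000)
    for row in cur.fetchall():
        print(row.Name, row.HomeTown)   # -> Bill Jones Pittsburgh, PA
    conn.close()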
◦ some efficiency issues with real databases (illustrated in the sqlite3 sketch after this list)
◦ indexing
◦ how to efficiently find all songs written by Paul Simon in a
database with 10,000,000 entries?
◦ data structures for representing sorted order on fields
◦ disk management
◦ databases are often too big to fit in RAM, leave most of it
on disk and swap in blocks of records as needed – could
be slow
◦ concurrency
◦ transaction semantics: either all updates happen as a batch or none (commit or rollback)
◦ like delete one record and simultaneously add another but
guarantee not to leave in an inconsistent state
◦ other users might be blocked till done
◦ query optimization
◦ the order in which you JOIN tables can drastically affect
the size of the intermediate tables
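Both the indexing and the transaction points can be illustrated with Python's built-in sqlite3 module (a toy sketch, not tied to any particular course database):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songs (title TEXT, writer TEXT)")

    # indexing: a B-tree index on 'writer' makes the lookup below a
    # seek instead of a scan over (potentially) 10,000,000 rows
    conn.execute("CREATE INDEX idx_writer ON songs(writer)")
    conn.execute("INSERT INTO songs VALUES ('The Boxer', 'Paul Simon')")
    print(conn.execute(
        "SELECT title FROM songs WHERE writer='Paul Simon'").fetchall())

    # transaction semantics: delete one record and add another as one
    # unit; any exception rolls everything back (commit or rollback)
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("DELETE FROM songs WHERE title='The Boxer'")
            conn.execute("INSERT INTO songs VALUES ('Cecilia', 'Paul Simon')")
    except sqlite3.Error:
        pass  # the rollback has already restored a consistent state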
◦ Unstructured data
◦ raw text
◦ documents, digital libraries
◦ grep, substring indexing, regular expressions
◦ like find all instances of “[aA]g+ies” including “agggggies” (see the sketch after this list)
◦ Information Retrieval (CSCE 470)
◦ look for synonyms, similar words (like “car” and “auto”)
◦ tf-idf (term frequency / inverse document frequency) – weighting for important words
◦ LSI (latent semantic indexing) – e.g. ‘dogs’ is similar to ‘canines’ because they are used
similarly (both near ‘bark’ and ‘bite’)
◦ Natural Language parsing
◦ extracting requirements from job postings
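A quick check of that pattern with Python's re module (a minimal sketch; the sample sentence is invented):

    import re

    # capital or lowercase 'a', one or more 'g's, then 'ies'
    pattern = re.compile(r"[aA]g+ies")
    text = "The Aggies won; go agggggies!"
    print(pattern.findall(text))   # ['Aggies', 'agggggies']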
◦ Unstructured data
◦ images, video (BLOBs=binary large objects)
◦ how to extract features? index them? search them?
◦ color histograms
◦ convolutions/transforms for pattern matching
◦ looking for ICBM missiles in aerial photos of Cuba
◦ streams
◦ sports ticker, radio, stock quotes...
◦ XML files
◦ with tags indicating field names (parsed in the sketch below)
<course>
<name>CSCE 411</name>
<title>Design and Analysis of Algorithms</title>
</course>
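That record can be parsed with Python's standard xml.etree.ElementTree module (a small sketch using the course example above):

    import xml.etree.ElementTree as ET

    doc = """<course>
      <name>CSCE 411</name>
      <title>Design and Analysis of Algorithms</title>
    </course>"""

    course = ET.fromstring(doc)
    # the tags serve as field names, the enclosed text as field values
    for field in course:
        print(field.tag, "=", field.text)
    # name = CSCE 411
    # title = Design and Analysis of Algorithms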
◦ Object databases
[Diagram: linked objects; a university object (Texas A&M; College Station, TX; Div 1A; 53,299 students) linked via ClassOfferedAt to a class object (CHEM 102, Intro to Chemistry; TR, 3:00-4:00; prereq: CHEM 101), which is linked via TaughtBy to an Instructor/Employee object (Dr. Frank Smith; 302 Miller St.; PhD, Cornell; 13 years experience)]
◦ OLAP servers (see http://technet.microsoft.com/en-us/library/ms174587.aspx)
– multi-dimensional tables of aggregated sales in different regions in recent quarters, rather than “every transaction”
– users can still look at seasonal or geographic trends in different product categories
– project data onto 2D spreadsheets, graphs
[Diagram: a data warehouse holding every transaction ever recorded, with nightly updates and summaries feeding an OLAP server]
◦ data integrity
◦ missing values
◦ how to interpret? not available? 0? use the mean?
◦ duplicated values
◦ including partial matches (Jon Smith=John Smith?)
◦ inconsistency:
◦ multiple addresses for person
◦ out-of-date data
◦ inconsistent usage:
◦ does “destination” mean of first leg or whole flight?
◦ outliers:
◦ salaries that are negative, or in the trillions
◦ most databases allow “integrity constraints” to be defined that validate newly entered data
◦ Interoperability
◦ how can data from one database be compared or combined with another?
◦ what if fields are not the same, or not present, or used differently?
◦ think of medical or insurance records
◦ translation/mapping of terms
◦ standards
◦ units like ft/s, or gallons, etc.
◦ identifiers like SSN, UIN, ISBN
◦ “federated” databases – queries that combine information across multiple
servers
◦ “Data cleansing” (see the sketch after this list)
◦ filling in missing data (imputing values)
◦ detecting and removing outliers
◦ smoothing
◦ removing noise by averaging values together
◦ filtering, sampling
◦ keeping only selected representative values
◦ feature extraction
◦ e.g. in a photo database, which people are wearing glasses? which have more than
one person? which are outdoors?
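A minimal cleansing sketch with pandas (assuming it is available; the toy salary column is invented):

    import pandas as pd

    df = pd.DataFrame({"salary": [52000, None, 61000, -5, 3.2e12, 58000]})

    # detect and remove outliers: negative or trillion-scale salaries
    ok = df["salary"].isna() | df["salary"].between(0, 1e7)
    df = df[ok]

    # impute missing values with the mean of the remaining data
    df["salary"] = df["salary"].fillna(df["salary"].mean())

    # smoothing: average neighboring values with a rolling window
    df["smoothed"] = df["salary"].rolling(window=2, min_periods=1).mean()
    print(df)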
Data Mining/Data Analytics
◦ finding patterns in the data
◦ statistics
◦ machine learning (CSCE 633)
◦ Numerical data
◦ correlations
◦ multivariate regression
◦ fitting “models”
◦ predictive equations that fit the data
◦ from a real estate database of home sales, we get
◦ housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood
◦ ANOVA for testing differences between groups
◦ R is one of the most commonly used software packages for doing statistical analysis (see the numpy sketch after this bullet)
◦ can load a data table, calculate means and correlations, fit distributions, estimate
parameters, test hypotheses, generate graphs and histograms
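R is the tool the slides name; here is an equivalent least-squares fit of the housing-price model in Python/numpy, on invented toy data constructed to follow the slide's equation:

    import numpy as np

    # toy rows: [SqFt, DistanceToSchools, AverageOfNeighborhood]
    X = np.array([[1500, 2.0, 180000],
                  [2200, 0.5, 250000],
                  [1800, 1.0, 210000],
                  [2600, 3.0, 300000]], dtype=float)
    y = np.array([167988.0, 244997.0, 200994.0, 289982.0])  # sale prices

    # multivariate regression: coefficients minimizing squared error
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coeffs)  # ~[100, -6, 0.1], matching the slide's fitted model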
◦ clustering (see the sketch after this block)
◦ similar photos, documents, cases
◦ discovery of “structure” in the data
◦ example: accident database
◦ some clusters might be identified with “accidents involving a tractor trailer” or
“accidents at night”
◦ top-down vs. bottom-up clustering methods
◦ granularity: how many clusters?
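A small clustering sketch with scikit-learn's k-means (assuming it is installed; the two accident features are invented for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # toy accident records: [hour_of_day, vehicle_weight_tons]
    accidents = np.array([[2, 1.5], [3, 1.2], [23, 1.4],
                          [14, 20.0], [15, 18.5], [13, 22.0]])

    # granularity: k=2 is our choice here; picking k is part of the problem
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(accidents)
    print(labels)  # e.g. [0 0 0 1 1 1]: "night" vs. "tractor trailer" accidents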
◦ decision trees (classifiers)
◦ what factors, decisions, or treatments led to different
outcomes?
◦ recursive partitioning algorithms
◦ related methods
◦ “discriminant” analysis
◦ what factors lead to return of product?
◦ extract “association rules”
◦ boxer dogs tend to have congenital defects
◦ covers 5% of patients with 80% confidence
[Figure: Toy Sales; from Basic Statistics for Business and Economics, Lind et al. (2009), Ch. 16; credit: Frank Curriero]
UNIT 3: Introduction to Machine Learning
What is machine learning?
◦ A branch of artificial intelligence, concerned with
the design and development of algorithms that
allow computers to evolve behaviors based on
empirical data.
Training and testing
[Diagram: input samples feed a learning method that builds a system (training / data acquisition); the trained system is then applied to the universal set (unobserved) in testing / practical usage]
◦ Supervised learning
◦ Unsupervised learning
◦ Semi-supervised learning
Machine learning structure
◦ Supervised learning
Machine learning structure
◦ Unsupervised learning
What are we seeking?
◦ Supervised: low out-of-sample (E-out) error, or maximize probabilistic terms
◦ Aggregation
◦ Bagging (bootstrap + aggregation), AdaBoost, Random forest
Learning techniques
• Linear classifier: f(x) = sign(w · x), where w is a d-dim vector (learned)
◦ Techniques:
◦ Perceptron
◦ Logistic regression
◦ Support vector machine (SVM)
◦ Ada-line
◦ Multi-layer perceptron (MLP)
Learning techniques
Using the perceptron learning algorithm (PLA)
Training error rate: 0.10
Testing error rate: 0.156
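A from-scratch PLA sketch in Python (toy data; the error rates above come from the course's own dataset, not from this example):

    import numpy as np

    def pla(X, y, max_iters=1000):
        # X has a leading bias column; labels y are in {-1, +1}
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            mistakes = np.where(np.sign(X @ w) != y)[0]
            if len(mistakes) == 0:
                break                  # all points classified correctly
            i = mistakes[0]
            w += y[i] * X[i]           # PLA update: w <- w + y_i * x_i
        return w

    X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0],
                  [1, -1.5, -1.0], [1, -2.0, -2.5]])
    y = np.array([1, 1, -1, -1])
    w = pla(X, y)
    print(np.sign(X @ w))  # matches y on this separable toy set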
Learning techniques
Using logistic regression
Training error rate: 0.11
Testing error rate: 0.145
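The same kind of toy data through scikit-learn's logistic regression (a sketch; again, the slide's error rates are from the course dataset):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.5, -1.0], [-2.0, -2.5]])
    y = np.array([1, 1, 0, 0])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X))        # predicted class labels
    print(clf.predict_proba(X))  # class probabilities (the probabilistic terms)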
Learning techniques
• Non-linear case
[Plots: linear vs. rectangular decision boundaries]
◦ Descriptive Statistics
◦ Statistical Modeling
◦ Regressions: Linear and Logistic
◦ Probit, Tobit Models
◦ Time Series
◦ Multivariate Functions
◦ Inbuilt Packages, contributed packages
Descriptive Statistics