SYBCA Big Data
COLLEGE, BALEWADI, PUNE – 45
SUBJECT: BIG DATA
UNIT 1: Introduction to Big Data
What is Big Data?
What makes data “Big” Data?
Big Data Definition
◦ No single standard definition…
Characteristics of Big Data:
1-Scale (Volume)
◦ Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 ZB
◦ Data volume is increasing exponentially
Exponential increase in collected/generated data
Characteristics of Big Data:
2-Complexity (Variety)
◦ Various formats, types, and structures
◦ Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
◦ Static data vs. streaming data
◦ A single application can be
generating/collecting many types of data
Characteristics of Big Data:
3-Speed (Velocity)
◦ Data is being generated fast and needs to be processed fast
◦ Online Data Analytics
◦ Late decisions ➔ missing opportunities
◦ Examples
◦ E-Promotions: Based on your current location, your purchase history, and what you like ➔ send promotions right now for the store next to you
Big Data: 3V’s
Some Make it 4V’s
Harnessing Big Data
Who’s Generating Big Data
◦ Mobile devices (tracking all objects all the time)
The Model Has Changed…
◦ The Model of Generating/Consuming Data has Changed
◦ Old Model: Few companies are generating data, all others are consuming data
◦ New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
Value of Big Data Analytics
◦ Big data is more real-time in
nature than traditional DW
applications
◦ Traditional DW architectures
(e.g. Exadata, Teradata) are not
well-suited for big data apps
◦ Shared nothing, massively
parallel processing, scale out
architectures are well-suited for
big data apps
Challenges in Handling Big Data
What Technology Do We Have for Big Data?
Big Data Technology
What You Will Learn…
◦ We focus on Hadoop/MapReduce technology
◦ Learn the platform (how it is designed and works)
◦ How big data are managed in a scalable, efficient way
◦ Learn to write Hadoop jobs in different languages (see the sketch below)
◦ Programming Languages: Java, C, Python
◦ High-Level Languages: Apache Pig, Hive
◦ Learn advanced analytics tools on top of Hadoop
◦ RHadoop: Statistical tools for managing big data
◦ Mahout: Data mining and machine learning tools over big data
◦ Learn state-of-the-art technology from recent research papers
◦ Optimizations, indexing techniques, and other extensions to Hadoop
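To give a flavor of such a job, here is a minimal word-count sketch in Python for Hadoop Streaming (a sketch only, not course-provided code; the file names are illustrative):

    # mapper.py -- reads raw text from stdin, emits (word, 1) pairs
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- receives pairs sorted by word, sums the counts
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

A typical (installation-dependent) invocation: hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input <in> -output <out>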
Course Logistics
◦ Web Page: http://web.cs.wpi.edu/~cs525/s13-MYE/
Textbook & Reading List
◦ No specific textbook
◦ Big Data is a relatively new topic (so no fixed syllabus)
◦ Reading List
◦ We will cover state-of-the-art technology from research papers in major conferences
◦ Many Hadoop-related papers are available on the course website
◦ Related books:
◦ Hadoop: The Definitive Guide [pdf]
Requirements & Grading
◦ Seminar-Type Course
◦ Students will read research papers and present them (Reading List)
◦ Hands-on Course (done in teams of two)
◦ No written homework or exams
◦ Several coding projects covering the entire semester
Requirements & Grading (Cont’d)
◦ Reviews
◦ When a team is presenting (not the instructor), the other students should prepare
a review on the presented paper
◦ The course website gives guidelines on how to write good reviews
Late Submission Policy
◦ For Projects
◦ One day late → 10% off the max grade
◦ Two days late → 20% off the max grade
◦ Three days late → 30% off the max grade
◦ Beyond that, no late submission is accepted
◦ Submissions:
◦ Submitted via blackboard system by the due date
◦ Demonstrated to the instructor within the following week
◦ For Reviews
◦ No late submissions
◦ Students may skip at most 4 reviews
◦ Submissions:
◦ Given to the instructor at the beginning of class
More about Projects
◦ A virtual machine is created including the needed platform for the
projects
◦ Ubuntu OS (Version 12.10)
◦ Hadoop platform (Version 1.1.0)
◦ Apache Pig (Version 0.10.0)
◦ Mahout library (Version 0.7)
◦ RHadoop
◦ In addition to other software packages
◦ Download it from the course website (link)
◦ Username and password will be sent to you
◦ Needs VirtualBox (VBox) [free]
Next Step from You…
1. Form teams of two
2. Visit the course website (Reading List), each team selects its first
paper to present (1st come 1st served)
◦ Send me your top 2–3 choices
3. You have until Jan 20th
◦ Otherwise, I’ll randomly form teams and assign papers
4. Use Blackboard “Discussion” forum for posts or for searching for
teammates
UNIT 2: Introduction to Data Science
Topics
◦ databases and data architectures
◦ databases in the real world
◦ scaling, data quality, distributed
◦ machine learning/data mining/statistics
◦ information retrieval
◦ Data Science is currently a popular interest of
employers
◦ our Industrial Affiliates Partners say there is
high demand for students trained in Data
Science
◦ databases, warehousing, data architectures
◦ data analytics – statistics, machine learning
◦ Big Data – gigabytes/day or more
◦ Examples:
◦ Walmart, cable companies (ads linked to content,
viewer trends), airlines/Orbitz, HMOs, call centers,
Twitter (500M tweets/day), traffic surveillance
cameras, detecting fraud, identity theft...
◦ supports “Business Intelligence”
◦ quantitative decision-making and control
◦ finance, inventory, pricing/marketing, advertising
◦ need data for identifying risks, opportunities,
conducting “what-if” analyses
Data Architectures
◦ traditional databases (CSCE 310/608)
◦ tables, fields
◦ tuples = records or rows
◦ <yellowstone,WY,6000000 acres,geysers>
◦ key = field with unique values
◦ can be used as a reference from one table into another
◦ important for avoiding redundancy (normalization), since redundancy risks inconsistency
◦ join – combining 2 tables using a key
◦ metadata – data about the data
◦ names of the fields, types (string, int, real, mpeg...)
◦ also things like source, date, size, completeness/sampling
Combined table (unnormalized):
Name          | HomeTown       | Grad school      | PhD  | teaches  | title
John Flaherty | Houston, TX    | Rice             | 2005 | CSCE 411 | Design and Analysis of Algorithms
Susan Jenkins | Omaha, NE      | Univ of Michigan | 2004 | CSCE 121 | Introduction to Computing in C++
Susan Jenkins | Omaha, NE      | Univ of Michigan | 2004 | CSCE 206 | Programming in C
Bill Jones    | Pittsburgh, PA | Carnegie Mellon  | 1999 | CSCE 314 | Programming Languages
Bill Jones    | Pittsburgh, PA | Carnegie Mellon  | 1999 | CSCE 206 | Programming in C

Instructors:
Name          | HomeTown       | Grad school      | PhD
John Flaherty | Houston, TX    | Rice             | 2005
Susan Jenkins | Omaha, NE      | Univ of Michigan | 2004
Bill Jones    | Pittsburgh, PA | Carnegie Mellon  | 1999

TeachingAssignments:
Name          | teaches
John Flaherty | CSCE 411
Susan Jenkins | CSCE 121
Susan Jenkins | CSCE 206
Bill Jones    | CSCE 314
Bill Jones    | CSCE 206

Courses:
course   | title
CSCE 411 | Design and Analysis of Algorithms
CSCE 121 | Introduction to Computing in C++
CSCE 314 | Programming Languages
CSCE 206 | Programming in C
◦ SQL: Structured Query Language
>SELECT Name, HomeTown FROM Instructors WHERE PhD<2000;
Bill Jones Pittsburgh, PA
>SELECT TeachingAssignments.teaches
 FROM Instructors JOIN TeachingAssignments
 ON Instructors.Name=TeachingAssignments.Name
 WHERE Instructors.GradSchool='Carnegie Mellon';
CSCE 314
CSCE 206
because they were both taught by Bill Jones
◦ SQL servers
◦ centralized database, required for concurrent access by multiple users
◦ ODBC: Open DataBase Connectivity – protocol to connect to servers and do queries and updates from languages like Java, C, Python (see the sketch after this list)
◦ Oracle, IBM DB2 - industrial strength SQL databases
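A minimal sketch of an ODBC query from Python (assuming the third-party pyodbc driver is installed; the DSN and credentials are placeholders):

    import pyodbc

    # connect through an ODBC Data Source Name -- placeholder settings
    conn = pyodbc.connect("DSN=CourseDB;UID=student;PWD=secret")
    cur = conn.cursor()
    # parameterized query against the Instructors table from above
    cur.execute("SELECT Name, HomeTown FROM Instructors WHERE PhD < ?", 2000)
    for row in cur.fetchall():
        print(row.Name, row.HomeTown)   # -> Bill Jones Pittsburgh, PA
    conn.close()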
◦ some efficiency issues with real databases (illustrated in the sqlite3 sketch after this list)
◦ indexing
◦ how to efficiently find all songs written by Paul Simon in a
database with 10,000,000 entries?
◦ data structures for representing sorted order on fields
◦ disk management
◦ databases are often too big to fit in RAM, leave most of it
on disk and swap in blocks of records as needed – could
be slow
◦ concurrency
◦ transaction semantics: either all updates happen as a batch or none (commit or rollback)
◦ like delete one record and simultaneously add another but
guarantee not to leave in an inconsistent state
◦ other users might be blocked till done
◦ query optimization
◦ the order in which you JOIN tables can drastically affect
the size of the intermediate tables
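Both the indexing and the transaction points can be illustrated with Python's built-in sqlite3 module (a toy sketch, not tied to any particular course database):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songs (title TEXT, writer TEXT)")

    # indexing: a B-tree index on 'writer' makes the lookup below a
    # seek instead of a scan over (potentially) 10,000,000 rows
    conn.execute("CREATE INDEX idx_writer ON songs(writer)")
    conn.execute("INSERT INTO songs VALUES ('The Boxer', 'Paul Simon')")
    print(conn.execute(
        "SELECT title FROM songs WHERE writer='Paul Simon'").fetchall())

    # transaction semantics: delete one record and add another as one
    # unit; any exception rolls everything back (commit or rollback)
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("DELETE FROM songs WHERE title='The Boxer'")
            conn.execute("INSERT INTO songs VALUES ('Cecilia', 'Paul Simon')")
    except sqlite3.Error:
        pass  # the rollback has already restored a consistent state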
◦ Unstructured data
◦ raw text
◦ documents, digital libraries
◦ grep, substring indexing, regular expressions
◦ like find all instances of “[aA]g+ies” including “agggggies” (see the sketch after this list)
◦ Information Retrieval (CSCE 470)
◦ look for synonyms, similar words (like “car” and “auto”)
◦ tf-idf (term frequency / inverse document frequency) – weighting for important words
◦ LSI (latent semantic indexing) – e.g. ‘dogs’ is similar to ‘canines’ because they are used
similarly (both near ‘bark’ and ‘bite’)
◦ Natural Language parsing
◦ extracting requirements from job postings
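A quick check of that pattern with Python's re module (a minimal sketch; the sample sentence is invented):

    import re

    # capital or lowercase 'a', one or more 'g's, then 'ies'
    pattern = re.compile(r"[aA]g+ies")
    text = "The Aggies won; go agggggies!"
    print(pattern.findall(text))   # ['Aggies', 'agggggies']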
◦ Unstructured data
◦ images, video (BLOBs=binary large objects)
◦ how to extract features? index them? search them?
◦ color histograms
◦ convolutions/transforms for pattern matching
◦ looking for ICBM missiles in aerial photos of Cuba
◦ streams
◦ sports ticker, radio, stock quotes...
◦ XML files
◦ with tags indicating field names (parsed in the sketch below)
<course>
<name>CSCE 411</name>
<title>Design and Analysis of Algorithms</title>
</course>
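That record can be parsed with Python's standard xml.etree.ElementTree module (a small sketch using the course example above):

    import xml.etree.ElementTree as ET

    doc = """<course>
      <name>CSCE 411</name>
      <title>Design and Analysis of Algorithms</title>
    </course>"""

    course = ET.fromstring(doc)
    # the tags serve as field names, the enclosed text as field values
    for field in course:
        print(field.tag, "=", field.text)
    # name = CSCE 411
    # title = Design and Analysis of Algorithms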
◦ Object databases
[Diagram: linked objects; a university object (Texas A&M; College Station, TX; Div 1A; 53,299 students) linked via ClassOfferedAt to a class object (CHEM 102, Intro to Chemistry; TR, 3:00-4:00; prereq: CHEM 101), which is linked via TaughtBy to an Instructor/Employee object (Dr. Frank Smith; 302 Miller St.; PhD, Cornell; 13 years experience)]
◦ OLAP servers (see http://technet.microsoft.com/en-us/library/ms174587.aspx)
– multi-dimensional tables of aggregated sales in different regions in recent quarters, rather than “every transaction”
– users can still look at seasonal or geographic trends in different product categories
– project data onto 2D spreadsheets, graphs
[Diagram: a data warehouse holding every transaction ever recorded, with nightly updates and summaries feeding an OLAP server]
◦ data integrity
◦ missing values
◦ how to interpret? not available? 0? use the mean?
◦ duplicated values
◦ including partial matches (Jon Smith=John Smith?)
◦ inconsistency:
◦ multiple addresses for person
◦ out-of-date data
◦ inconsistent usage:
◦ does “destination” mean of first leg or whole flight?
◦ outliers:
◦ salaries that are negative, or in the trillions
◦ most databases allow “integrity constraints” to be defined that validate newly entered data
◦ Interoperability
◦ how can data from one database be compared or combined with another?
◦ what if fields are not the same, or not present, or used differently?
◦ think of medical or insurance records
◦ translation/mapping of terms
◦ standards
◦ units like ft/s, or gallons, etc.
◦ identifiers like SSN, UIN, ISBN
◦ “federated” databases – queries that combine information across multiple
servers
◦ “Data cleansing” (see the sketch after this list)
◦ filling in missing data (imputing values)
◦ detecting and removing outliers
◦ smoothing
◦ removing noise by averaging values together
◦ filtering, sampling
◦ keeping only selected representative values
◦ feature extraction
◦ e.g. in a photo database, which people are wearing glasses? which have more than
one person? which are outdoors?
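A minimal cleansing sketch with pandas (assuming it is available; the toy salary column is invented):

    import pandas as pd

    df = pd.DataFrame({"salary": [52000, None, 61000, -5, 3.2e12, 58000]})

    # detect and remove outliers: negative or trillion-scale salaries
    ok = df["salary"].isna() | df["salary"].between(0, 1e7)
    df = df[ok]

    # impute missing values with the mean of the remaining data
    df["salary"] = df["salary"].fillna(df["salary"].mean())

    # smoothing: average neighboring values with a rolling window
    df["smoothed"] = df["salary"].rolling(window=2, min_periods=1).mean()
    print(df)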
Data Mining/Data Analytics
◦ finding patterns in the data
◦ statistics
◦ machine learning (CSCE 633)
◦ Numerical data
◦ correlations
◦ multivariate regression
◦ fitting “models”
◦ predictive equations that fit the data
◦ from a real estate database of home sales, we get
◦ housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood
◦ ANOVA for testing differences between groups
◦ R is one of the most commonly used software packages for doing statistical analysis (see the numpy sketch after this bullet)
◦ can load a data table, calculate means and correlations, fit distributions, estimate
parameters, test hypotheses, generate graphs and histograms
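R is the tool the slides name; here is an equivalent least-squares fit of the housing-price model in Python/numpy, on invented toy data constructed to follow the slide's equation:

    import numpy as np

    # toy rows: [SqFt, DistanceToSchools, AverageOfNeighborhood]
    X = np.array([[1500, 2.0, 180000],
                  [2200, 0.5, 250000],
                  [1800, 1.0, 210000],
                  [2600, 3.0, 300000]], dtype=float)
    y = np.array([167988.0, 244997.0, 200994.0, 289982.0])  # sale prices

    # multivariate regression: coefficients minimizing squared error
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coeffs)  # ~[100, -6, 0.1], matching the slide's fitted model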
◦ clustering (see the sketch after this block)
◦ similar photos, documents, cases
◦ discovery of “structure” in the data
◦ example: accident database
◦ some clusters might be identified with “accidents involving a tractor trailer” or
“accidents at night”
◦ top-down vs. bottom-up clustering methods
◦ granularity: how many clusters?
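A small clustering sketch with scikit-learn's k-means (assuming it is installed; the two accident features are invented for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # toy accident records: [hour_of_day, vehicle_weight_tons]
    accidents = np.array([[2, 1.5], [3, 1.2], [23, 1.4],
                          [14, 20.0], [15, 18.5], [13, 22.0]])

    # granularity: k=2 is our choice here; picking k is part of the problem
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(accidents)
    print(labels)  # e.g. [0 0 0 1 1 1]: "night" vs. "tractor trailer" accidents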
◦ decision trees (classifiers)
◦ what factors, decisions, or treatments led to different
outcomes?
◦ recursive partitioning algorithms
◦ related methods
◦ “discriminant” analysis
◦ what factors lead to return of product?
◦ extract “association rules”
◦ boxer dogs tend to have congenital defects
◦ covers 5% of patients with 80% confidence
[Figure: Toy Sales; from Basic Statistics for Business and Economics, Lind et al. (2009), Ch. 16; credit: Frank Curriero]
UNIT 3: Introduction to Machine Learning
What is machine learning?
◦ A branch of artificial intelligence, concerned with
the design and development of algorithms that
allow computers to evolve behaviors based on
empirical data.
Training and testing
[Diagram: input samples feed a learning method that builds a system (training / data acquisition); the trained system is then applied to the universal set (unobserved) in testing / practical usage]
◦ Supervised learning
◦ Unsupervised learning
◦ Semi-supervised learning
Machine learning structure
◦ Supervised learning
Machine learning structure
◦ Unsupervised learning
What are we seeking?
◦ Supervised: low out-of-sample (E-out) error, or maximize probabilistic terms
◦ Aggregation
◦ Bagging (bootstrap + aggregation), AdaBoost, Random forest
Learning techniques
• Linear classifier: f(x) = sign(w · x), where w is a d-dim vector (learned)
◦ Techniques:
◦ Perceptron
◦ Logistic regression
◦ Support vector machine (SVM)
◦ Ada-line
◦ Multi-layer perceptron (MLP)
Learning techniques
Using the perceptron learning algorithm (PLA)
Training error rate: 0.10
Testing error rate: 0.156
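A from-scratch PLA sketch in Python (toy data; the error rates above come from the course's own dataset, not from this example):

    import numpy as np

    def pla(X, y, max_iters=1000):
        # X has a leading bias column; labels y are in {-1, +1}
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            mistakes = np.where(np.sign(X @ w) != y)[0]
            if len(mistakes) == 0:
                break                  # all points classified correctly
            i = mistakes[0]
            w += y[i] * X[i]           # PLA update: w <- w + y_i * x_i
        return w

    X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0],
                  [1, -1.5, -1.0], [1, -2.0, -2.5]])
    y = np.array([1, 1, -1, -1])
    w = pla(X, y)
    print(np.sign(X @ w))  # matches y on this separable toy set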
Learning techniques
Using logistic regression
Training error rate: 0.11
Testing error rate: 0.145
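The same kind of toy data through scikit-learn's logistic regression (a sketch; again, the slide's error rates are from the course dataset):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.5, -1.0], [-2.0, -2.5]])
    y = np.array([1, 1, 0, 0])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X))        # predicted class labels
    print(clf.predict_proba(X))  # class probabilities (the probabilistic terms)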
Learning techniques
• Non-linear case
[Plots: linear vs. rectangular decision boundaries]
◦ Descriptive Statistics
◦ Statistical Modeling
◦ Regressions: Linear and Logistic
◦ Probit, Tobit Models
◦ Time Series
◦ Multivariate Functions
◦ Inbuilt Packages, contributed packages
Descriptive Statistics