Course Introduction: Prof. Sourav Saha
Input -> Some mysterious processing -> Output
Pick your choice
What is a Program
• Data Structures + Algorithms
• Data Structure = A container that stores data
• Algorithm = Logic + Control
Data Types & Data Structures
• Applications/programs read data, store data temporarily, process it and
finally output results.
• What is data? Numbers, Characters, etc.
Examples: 21 (integer), 3.14 (real number), ‘a’ (character)
• Compound Data or Structured Data types: can be broken into component
parts. E.g. an object, array, set, file, etc. Example: a student object.
Name A H M A D
Age 20
Branch C S C
A Component part
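The student record above can be sketched in Python as follows. This is only an illustration: the class name `Student` and the field names are assumptions chosen to mirror the slide, not part of the original material.

```python
from dataclasses import dataclass


@dataclass
class Student:
    """A compound (structured) value built from component parts."""
    name: str    # itself compound: a sequence of characters
    age: int     # simple (atomic) component
    branch: str  # another sequence of characters


# Values taken from the slide's example record
student = Student(name="AHMAD", age=20, branch="CSC")

# Each field is a component part and can be accessed on its own
age = student.age

# A compound component can be decomposed further into atomic parts
name_chars = list(student.name)
```

Here `age` is an atomic component, while `name` can be broken down again into individual characters.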
• A data structure is a data type whose values
• (i) can be decomposed into a set of component elements each of which is
either simple (atomic) or another data structure
• (ii) include a structure involving the component parts.
More Data Structure
Possible Structures: Set, Linear, Tree, Graph.
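One possible way to picture the four kinds of structure, using only built-in Python containers. The variable names and sample values are made up for illustration.

```python
# Set: an unordered collection with no relationship between elements
branches = {"CSC", "ECE", "ME"}

# Linear: each element has one predecessor and one successor
steps = ["read input", "process", "write output"]

# Tree: each element has one parent and any number of children,
# represented here as nested dictionaries
tree = {"root": {"left": {}, "right": {"leaf": {}}}}

# Graph: any element may be connected to any other,
# represented here as an adjacency list
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}


def depth(node):
    """Return the number of levels below a nested-dictionary node."""
    if not node:
        return 0
    return 1 + max(depth(child) for child in node.values())


tree_depth = depth(tree)
```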
Functions of Data Structures
• Add
  • by Index
  • by Key
  • by Position
  • by Priority
• Retrieve
• Modify
• Delete
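The operations listed above can be sketched with Python's built-in containers. This is a minimal illustration under assumed sample data; the choice of `list`, `dict` and `heapq` is one option among many.

```python
import heapq

# Index / position: a list supports add, retrieve, modify, delete by position
marks = [70, 85]
marks.insert(1, 90)   # add at position 1 -> [70, 90, 85]
first = marks[0]      # retrieve by index
marks[0] = 75         # modify in place
del marks[2]          # delete by index -> [75, 90]

# Key: a dictionary supports the same operations by key
ages = {}
ages["AHMAD"] = 20            # add
ages["AHMAD"] = 21            # modify
removed = ages.pop("AHMAD")   # retrieve and delete

# Priority: a heap always hands back the smallest priority value first
tasks = []
heapq.heappush(tasks, (2, "process"))
heapq.heappush(tasks, (1, "read input"))
heapq.heappush(tasks, (3, "write output"))
next_task = heapq.heappop(tasks)
```

The same four operations appear in each case; what differs is how an element is addressed (position, key, or priority).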
Which Data Structure or Algorithm is better?
• Must Meet Requirement
• High Performance
• Low System footprint
• Easy to implement
Which Package to Use?
Agenda
• Present an overview of the packages and solutions available in the market for data analysis
• Understand what is popular today and what the trends are for tomorrow
• Give an overview of some individual software packages
• Assess their demand and a few of their features
Some Definitions
• SPSS: Statistical Package for the Social Sciences (IBM SPSS Statistics these
days)
• SAS: Statistical Analysis System (Just SAS these days)
• Minitab
• Excel
• SPSS, SAS, and Minitab are statistical packages while Excel is a spreadsheet
Available Options for Statistical Analysis
• Minitab
• Python
• SAS
• Weka
• EViews
• Gretl
• Stata
What people are using
Minitab
Con
• Costs money: $1,395 per single user
• Unsuitable for very complicated statistical computation and analysis
• Not often used in academic research
Pro
• Easy to learn and use
• Often taught in schools in introductory statistics courses
• Widely used in engineering for process improvement
SPSS
COST
• From $1000 to $12000 per license, depending on license type
PRO
• One of the most widely used statistical packages in academia and industry
• More powerful than Minitab, yet also easy to learn and use
• Has a command line interface in addition to the menu-driven user interface
CON
• Very expensive
• Not adequate for modeling and cutting-edge statistical analysis
• Complicated: too many options
SAS
COST
• Complicated pricing model
• $8,500 first year license fee
PRO
• Widely accepted as the leader in statistical analysis and modeling
• Widely used in industry and academia
• Very flexible and very powerful
CON
• Very, very expensive
• Not user friendly
• Steep learning curve
• Relatively poor graphics capabilities
R
COST
• Free / Open Source
PRO
• Widely used and accepted in industry and academia
• Very powerful and flexible
• Very large user base
• Lots of books and manuals
• Several User Interface Shells available
CON
• Not user friendly
• Steep learning curve
R brownies
Functionality:
• R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.
• R is easily extensible through functions and the R community is noted for its active contributions in terms of packages.
• Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.
Packages:
• R is highly extensible through the use of user-submitted packages for specific functions or specific features
• R has stronger object-oriented programming facilities than most statistical computing languages.
• Extending R is also eased by its permissive lexical scoping rules.
Graphics:
• Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols.
• Dynamic and interactive graphics are available through additional packages
Installing R
Screenshots
How Easy to Use?
• Excel very easy to use!
• SPSS and Minitab are relatively easy to use
• SAS a bit more difficult to use but made easier via Enterprise Guide (EG)
• R & Python are mainly command based and require active learning
• In general, most software is easy to use once you learn how to use it! Getting
Started Guides are available for R, SPSS, Minitab, Excel and SAS.
Descriptive (Summary) Statistics Options
[Screenshots: SPSS, Minitab, SAS EG, Excel]
Advanced Statistics: Model Building
[Screenshots: Excel, SPSS, Minitab, SAS]
[Figure: "Chart of Style" bar charts showing Percent by Style (CONDO, RANCH, SPLIT, TWOSTORY), as produced by SPSS, SAS, Minitab and Excel]
Package Assessment
Usage In Industry Vs Academia
• SPSS and Minitab heavily used in Academia, used in Industry but not a lot
• SAS not heavily used in Academia, heavily used in Industry (most clinical
trials use SAS)
• Excel heavily used in both Academia and Industry
• Use of both R & Python is picking up fast in Academia and Industry
Operating Systems
• R runs on Windows, Macintosh & Linux
• SPSS runs on Windows, Macintosh, and Unix
• Python runs on Linux, Macintosh, Windows and Unix
• Minitab runs mainly on Windows and Macintosh
• SAS runs on Windows, Macintosh, Unix, Linux
• Excel runs mainly on Windows and Macintosh
Pivot Tables
Very good for displaying information online. Can be very interactive
• Minitab: static, not interactive
• Excel: interactive
• SPSS: interactive
• SAS: interactive
• R: ??
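• Python: no built-in interactive pivot tables (static pivot tables are available through external libraries such as pandas)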
Text Analytics
• SPSS
  • SPSS Modeler (different from IBM SPSS Statistics)
• SAS
  • SAS Enterprise Miner (different from Base SAS and Enterprise Guide)
• R
  • With external packages
• Python
  • With external libraries
• Minitab
  • Not available
• Excel
  • With extension
Statistical Modelling
“A statistical model is a class of mathematical model, which embodies a set of assumptions
concerning the generation of some sample data, and similar data from a larger population.
A statistical model represents, often in considerably idealized form, the data-generating
process”
• There are three purposes for a statistical model:
  • Predictions
  • Extraction of information
  • Description of stochastic structures
Product (Software)  One-Way  Two-Way  MANOVA  GLM  Post-hoc Tests  Latin Squares Analysis
Product (Software)  OLS  WLS  2SLS  NLLS  Logistic  GLM  LAD  Stepwise
SAS                 Yes  Yes  Yes   Yes   Yes       Yes  Yes  Yes
Excel               Yes
SPSS                Yes  Yes  Yes   Yes   Yes       Yes  No   Yes
Ordinary Least Squares (OLS); Weighted Least Squares (WLS); Two-Stage Least Squares (2SLS);
Non-Linear Least Squares (NLLS); General Linear Model (GLM); Least Absolute Deviation regression (LAD)
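As a rough illustration of the simplest method in the table, ordinary least squares for a single predictor can be computed by hand in Python. The function name `ols_fit` and the sample data are hypothetical and chosen only so the result is easy to check; statistical packages use more general matrix-based routines.

```python
from statistics import fmean


def ols_fit(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Uses the closed-form solution for one predictor:
    slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x).
    """
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = ols_fit(xs, ys)

# Prediction, one of the purposes of a statistical model
predicted = intercept + slope * 5
```

With this data the fit recovers the line exactly, so the prediction at x = 5 is 11.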
Time Series Analysis
Product  ARIMA  GARCH  Unit root test  Cointegration test  VAR  Multivariate GARCH
Minitab Yes No No No No
R Yes Yes Yes Yes Yes
https://sites.google.com/site/r4statistics/popularity
http://en.freestatistics.info/
http://lib.stat.cmu.edu/
http://www.comfsm.fm/~dleeling/statistics/notes000.html
Questions