Course Introduction

Prof. Sourav Saha


Expectations from the Course?
• Learn tools
• Learn R, Python, etc
• Become programmers
• Become analysts
• Become Data Scientists
Tools you will expect to encounter

Tool name & Version     Functionality & Usability
Microsoft Excel 2016    Spreadsheet – formulae based
SPSS 23.0               Statistical Tool – GUI based
R version 3.4           Programming & Analytics Tool – command based
Python version 2.7      Programming & Analytics Tool – command based
SAS 18                  Enterprise Analytics Tool – Menu based
Tableau                 Visualization Tool – GUI based
Power BI                Visualization Tool – GUI based
Assessment
• Class Problems & Assignments: 20%
• Mid-Term (system): 20%
• Group Project: 20%
• End-Term Examination: 40%
Computer Program?

Input → Some mysterious processing → Output
Pick your choice
What is a Program
• Data Structures + Algorithms
• Data Structure = A container that stores data
• Algorithm = Logic + Control
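As a minimal sketch of the equation above (Python is used only for illustration, and the names `marks` and `linear_search` are made up for this example), a list can play the role of the container while a search loop supplies the logic and control:

```python
# Data structure: a list is a container that stores data.
marks = [72, 85, 91, 64]


def linear_search(items, target):
    """Algorithm: logic (compare) + control (loop over positions)."""
    for position, value in enumerate(items):
        if value == target:      # logic: is this the value we want?
            return position      # control: stop as soon as it is found
    return -1                    # convention used here: -1 means "not found"


print(linear_search(marks, 91))
print(linear_search(marks, 50))
```

Swapping the list for another container, or the loop for another search strategy, changes the program without changing the problem it solves.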
Data Types & Data Structures
• Applications/programs read data, store data temporarily, process it and
finally output results.
• What is data? Numbers, Characters, etc.

[Figure: Input Data → Programs / Applications / Deduction (Instructions / Logic / Algorithms) → Output Data]
Data Types & Data Structures
• Data is classified into data types. e.g. char, float, integer, etc.
• A data type is (i) a domain of allowed values and (ii) a set of
operations on these values.
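• The system signals an error if a wrong operation is performed on data
of a certain type. For example, given char x, y, z; the assignment
z = x*y is not allowed in a strictly typed language (C itself would
silently promote the chars to integers).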
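The same idea can be tried out in Python. This is only a sketch: Python has no separate char type, so one-character strings stand in for `x` and `y`, and the variable `error_message` is introduced just for this example.

```python
x, y = "a", "b"

# Comparison belongs to the set of operations defined for this type.
print(x < y)

# Multiplying two characters is outside that set of operations,
# so the interpreter signals an error instead of producing a value.
try:
    z = x * y
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```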
Data Types & Data Structures
• Examples
Data Type    Domain                Operations
boolean      0, 1                  and, or, =, etc.
char         ASCII                 =, <>, <, etc.
integer      -maxint to +maxint    +, -, =, ==, <>, <, etc.
Data Types & Data Structures
• int i, j; declares i and j, which can take only integer values, and only
integer operations can be carried out on them.
• Built-in types: defined within the language e.g. int, float, etc.
• User-defined types: defined and implemented by the user e.g. using
typedef or class.
Data Types & Data Structures
• Simple Data types: also known as Atomic data types; they have no
component parts. E.g. int, char, float, etc.

Examples: 21 (int), 3.14 (float), 'a' (char)
Data Types & Data Structures
• Compound Data or Structured Data types: can be broken into component
parts. E.g. an object, array, set, file, etc. Example: a student object.

Name:   AHMAD
Age:    20
Branch: CSC
Each field (and each character within Name) is a component part.
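The student object above can be written as a user-defined compound type. The sketch below uses a Python dataclass; the class name `Student` and its field names are illustrative choices, not something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class Student:
    """A compound (structured) type built from simpler component parts."""
    name: str    # itself decomposable into characters
    age: int     # atomic
    branch: str


s = Student(name="AHMAD", age=20, branch="CSC")
print(s.name, s.age, s.branch)
print(list(s.name))   # the Name component broken into its own parts
```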
Data Types & Data Structures
• A data structure is a data type whose values
• (i) can be decomposed into a set of component elements each of which is
either simple (atomic) or another data structure
• (ii) include a structure involving the component parts.
More Data Structure
Possible Structures: Set, Linear, Tree, Graph.

[Figure: example diagrams of the SET, LINEAR, TREE and GRAPH structures]
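One way to picture the four structures is with Python's built-in containers. This is a rough sketch: the contents are arbitrary, and the nested-dict tree and adjacency-list graph are just two of several common encodings.

```python
# SET: members with no order and no relationships between them.
tools = {"R", "Python", "SAS"}

# LINEAR: each element has at most one predecessor and one successor.
steps = ["read", "store", "process", "output"]

# TREE: each node has one parent and any number of children
# (a nested dict is one simple way to write this down).
tree = {"root": {"left": {}, "right": {"leaf": {}}}}

# GRAPH: any node may be linked to any other (adjacency lists).
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}


def depth(node):
    """Number of levels in the subtree below a node, counting the node itself."""
    return 1 + max((depth(child) for child in node.values()), default=0)


print(depth(tree["root"]))
print(sorted(graph["C"]))
```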
Functions of Data Structures
• Add
• Index
• Key
• Position
• Priority
• Retrieve
• Modify
• Delete
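The operations listed above can be sketched with three standard Python containers: a list (access by position or index), a dict (access by key) and a heap (access by priority). The variable names and the stored values are illustrative only.

```python
import heapq

# By position/index: a list.
queue = ["Excel", "SPSS"]
queue.append("R")            # add at the end
queue.insert(0, "SAS")       # add at a given position
first = queue[0]             # retrieve by index
queue[1] = "MS Excel"        # modify in place
del queue[2]                 # delete by position

# By key: a dict.
versions = {"R": "3.4"}
versions["Python"] = "2.7"   # add under a key
r_version = versions["R"]    # retrieve by key
del versions["Python"]       # delete by key

# By priority: a heap always hands back the smallest priority first.
tasks = []
heapq.heappush(tasks, (2, "group project"))
heapq.heappush(tasks, (1, "mid-term"))
next_task = heapq.heappop(tasks)

print(queue, versions, next_task)
```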
Which Data Structure or Algorithm is better?
• Must Meet Requirement
• High Performance
• Low System footprint
• Easy to implement
Which Package to Use?
Agenda
• Present an overview of the packages or solutions available in the
market for data analysis
• Understand what is popular today and what the trends are for
tomorrow
• Overview of some individual software packages
• Assess their demand and a few features
Some Definitions
• SPSS: Statistical Package for the Social Sciences (IBM SPSS Statistics these
days)
• SAS: Statistical Analysis System (Just SAS these days)
• Minitab
• Excel
• SPSS, SAS, and Minitab are statistical packages while Excel is a spreadsheet
Available Options for Statistical Analysis

Proprietary
• Excel
• SPSS
• MINITAB
• SAS
• Eviews
• Stata

Free Software
• R
• Python
• Weka
• Gretl
What people are using

[Charts: R (blue) vs. SAS (orange); R (blue) vs. Python (orange)]


Scholarly Articles / Research?
No. of Jobs

Python & SQL are in most demand


Microsoft Excel
COST
• Individual License for Microsoft Office Professional: $350
• Microsoft Office University Student License: $99
• Volume discounts available for large organizations and universities
• Free Starter Version available on some new PCs
PRO
• Nearly ubiquitous and often pre-installed on new computers
• User friendly
• Very good for basic descriptive statistics, charts and plots
CON
• Costs money
• Not sufficient for anything beyond the most basic statistical analysis
Minitab
CON
• Costs money: $1,395 per single user
• Unsuitable for very complicated statistical computation and analysis
• Not often used in academic research
PRO
• Easy to learn and use
• Often taught in schools in introductory statistics courses
• Widely used in engineering for process improvement
SPSS
COST
• From $1000 to $12000 per license depending on license type
PRO
• One of the most widely used statistical packages in academia and industry
• More powerful than Minitab, yet also easy to learn and use
• Has a command line interface in addition to the menu driven user interface
CON
• Very expensive
• Not adequate for modeling and cutting edge statistical analysis
• Complicated: too many options
SAS
COST
• Complicated pricing model
• $8,500 first year license fee
PRO
• Widely accepted as the leader in statistical analysis and modeling
• Widely used in the industry and academia
• Very flexible and very powerful
CON
• Very very expensive
• Not user friendly
• Steep learning curve
• Relatively poor graphics capabilities
R
PRO
• Widely used and accepted in industry and academia
• Very powerful and flexible
• Very large user base
• Lots of books and manuals
• Several User Interface Shells available
COST
• Free / Open Source
CON
• Not user friendly
• Steep learning curve
R brownies
Functionality:
• R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.
• R is easily extensible through functions and the R community is noted for its active contributions in terms of packages.
• Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.
Packages:
• R is highly extensible through the use of user-submitted packages for specific functions or specific features
• R has stronger object-oriented programming facilities than most statistical computing languages.
• Extending R is also eased by its permissive lexical scoping rules.
Graphics:
• Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols.
• Dynamic and interactive graphics are available through additional packages
Installing R
Screenshots
How Easy to Use?
• Excel very easy to use!
• SPSS and Minitab are relatively easy to use
• SAS a bit more difficult to use but made easier via Enterprise Guide (EG)
• R & Python are mainly command based and require active learning
• In general, most software is easy to use once you learn how to use it! Getting
Started Guides are available for R, SPSS, Minitab, Excel and SAS.
Descriptive (Summary) Statistics Options

[Screenshots: descriptive statistics options in SPSS, Minitab, SAS EG and Excel]
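In a command based tool the same kind of summary is a few lines of code. The sketch below uses Python's standard `statistics` module; the sample values are made up purely for illustration.

```python
import statistics

# Made-up sample values, for illustration only.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),   # sample standard deviation (n - 1)
    "min": min(values),
    "max": max(values),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```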
Advanced Statistics: Model Building

[Screenshots: model building options in Excel, SPSS, Minitab and SAS]

Fertility: Average number of kids.


Infant mortality: deaths per 1000 live births
Bar Charts
[Figure: the same bar chart of percent by house style (CONDO, RANCH, SPLIT, TWOSTORY; percent within all data) as drawn by SPSS, SAS, Minitab and Excel]
Package Assessment
Usage In Industry Vs Academia
• SPSS and Minitab heavily used in Academia, used in Industry but not a lot
• SAS not heavily used in Academia, heavily used in Industry (most clinical
trials use SAS)
• Excel heavily used in both Academia and Industry
• Usage of both R & Python is picking up fast in Academia & Industry
Operating Systems
• R runs on Windows, Macintosh & Linux
• SPSS runs on Windows, Macintosh, and Unix
• Python runs on Linux, Macintosh, Windows and Unix
• Minitab runs mainly on Windows and Macintosh
• SAS runs on Windows, Macintosh, Unix, Linux
• Excel runs mainly on Windows and Macintosh
Pivot Tables
Very good for displaying information online. Can be very interactive
• Minitab: static, not interactive
• Excel: interactive
• SPSS: interactive
• SAS: interactive
• R: ??
• Python: not built in; available through the pandas library (pivot_table)
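What a pivot table computes can be sketched in plain Python with the standard library (in practice the pandas library's `pivot_table` function is the usual route). The records below are hypothetical: the style names echo the bar chart example, while the regions and prices are made up.

```python
from collections import defaultdict

# Hypothetical records: (house style, region, price). Values are made up.
records = [
    ("CONDO", "North", 100),
    ("CONDO", "South", 120),
    ("RANCH", "North", 150),
    ("RANCH", "North", 170),
    ("SPLIT", "South", 200),
]

# Pivot: rows = style, columns = region, cell = mean price.
cells = defaultdict(list)
for style, region, price in records:
    cells[(style, region)].append(price)

pivot = {key: sum(prices) / len(prices) for key, prices in cells.items()}

rows = sorted({style for style, _, _ in records})
cols = sorted({region for _, region, _ in records})
print("style".ljust(8) + "".join(c.rjust(8) for c in cols))
for r in rows:
    line = r.ljust(8)
    for c in cols:
        value = pivot.get((r, c))
        line += (f"{value:.1f}" if value is not None else "-").rjust(8)
    print(line)
```

Empty combinations (for example SPLIT in North) are shown as "-" rather than zero, so a missing cell is not mistaken for a real value.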
Text Analytics
• SPSS: SPSS Modeler (different from IBM SPSS Statistics)
• SAS: SAS Enterprise Miner (different from SAS Base and Enterprise Guide)
• R: with external packages
• Python: with external libraries
• Minitab: not available
• Excel: with extension
Statistical Modelling
“A statistical model is a class of mathematical model, which embodies a set of assumptions
concerning the generation of some sample data, and similar data from a larger population.
A statistical model represents, often in considerably idealized form, the data-generating
process.”
• There are three purposes for a statistical model:
• Predictions
• Extraction of information
• Description of stochastic structures

R, SAS, SPSS and Minitab are good for statistical modelling
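The three purposes can be seen in a small simulation. This is a sketch under stated assumptions: the data-generating process (a straight line plus normal noise), its parameter values and all variable names are invented for the example, and the fit is ordinary least squares written out by hand.

```python
import random

random.seed(0)

# Assumed data-generating process (made up for this sketch):
#   y = 2 + 3*x + noise, with noise ~ Normal(0, 1)
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 3.0
xs = [i / 10 for i in range(100)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, 1) for x in xs]

# Extraction of information: estimate the parameters by least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


# Prediction: use the fitted model on a new x.
def predict(x):
    return intercept + slope * x


# Description of stochastic structure: spread of the residuals.
residuals = [y - predict(x) for x, y in zip(xs, ys)]
residual_sd = (sum(r * r for r in residuals) / (n - 2)) ** 0.5

print(f"slope={slope:.2f} intercept={intercept:.2f} sd={residual_sd:.2f}")
```

The estimates should land close to the assumed values of 3, 2 and 1, which is exactly the sense in which the model "represents the data-generating process".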


ANOVA

Product (Software)  One-Way  Two-Way  MANOVA  GLM     Post-hoc Tests  Latin Squares Analysis
Minitab             Yes      Yes      Yes     Yes     Yes             Yes
R                   Yes      Yes      Yes     Yes     Yes
SAS                 Yes      Yes      Yes     Yes     Yes             Yes
SPSS                Yes      Yes      Yes     Yes     Yes             Yes
Excel               Yes      Add on   Add on  Add on
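For orientation, the F statistic that a one-way ANOVA reports in any of these packages can be computed by hand. The sketch below uses made-up measurements for three hypothetical groups and plain Python; it stops at the F statistic and does not compute a p-value.

```python
# Made-up measurements for three groups, for illustration only.
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [5, 6, 7],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```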
Regression

Product (Software)  OLS  WLS  2SLS  NLLS  Logistic  GLM  LAD  Stepwise
Minitab             Yes  Yes  No    Yes   Yes       Yes  No   Yes
R                   Yes  Yes  Yes   Yes   Yes       Yes  Yes  Yes
SAS                 Yes  Yes  Yes   Yes   Yes       Yes  Yes  Yes
Excel               Yes
SPSS                Yes  Yes  Yes   Yes   Yes       Yes  No   Yes

Ordinary Least Squares (OLS); Weighted Least Squares (WLS); Two-Stage Least Squares (2SLS);
Non-Linear Least Squares (NLLS); General Linear Model (GLM); Least Absolute Deviations regression (LAD)
Time Series Analysis

Product  ARIMA  GARCH  Unit root test  Cointegration test  VAR  Multivariate GARCH
Minitab  Yes    No     No              No                  No
R        Yes    Yes    Yes             Yes                 Yes
SAS      Yes    Yes    Yes             Yes                 Yes  Yes
Excel    No     No     No              No                  No   No
SPSS     Yes    Yes    No              No                  No   No
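As a small aside on the first column: the "I" in ARIMA stands for "integrated", meaning the series is differenced until its trend disappears before the autoregressive and moving-average parts are fitted. The sketch below shows only that differencing step, on a made-up series; the helper name `difference` is an illustrative choice.

```python
def difference(series):
    """First difference: change from one observation to the next."""
    return [b - a for a, b in zip(series, series[1:])]


# Made-up series with a curved upward trend, for illustration only.
series = [1, 3, 6, 10, 15, 21]

first = difference(series)    # still trending
second = difference(first)    # constant, so the trend is gone
print(first)
print(second)
```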
Big Data Analytics
• Python
• SAS Enterprise Miner
• IBM SPSS Modeler
• R
Summary
• DOE: Minitab or SAS
• Power / Sample size calculation: Minitab or SAS
• Best way to store your data: Notepad
• Automatic Update of Output: Minitab or Excel
• SAS very popular in industry
• Pivot table: SAS, SPSS or Excel
• Modelling: SAS, SPSS or Minitab
• Summary statistics: SAS, SPSS or Minitab
Questions
