Python for Data Science
Sankalp Gabbita
Graduate Student, Data Science and Business Analytics
UNC Charlotte

How is Data used?
 Analytics is "the extensive use of data, statistical and quantitative analysis,
explanatory and predictive models, and fact-based management to drive decisions
and actions" (Davenport and Harris 2007).
Data → Analytical Tools → Actionable Knowledge

Collecting → Cleaning → Explore → Transform → Modelling → Evaluate → Inference

Agenda
 Anaconda – Spyder
 Review of NumPy and Pandas: basic data munging
 Using Matplotlib to make visualizations
 Regression concepts
 Regression application (Scikit-Learn)
 Clustering concepts
 Clustering application (k-means clustering using Scikit-Learn)

Spyder: Scientific Python Development Environment
 Spyder is an interactive development environment for the Python
language with advanced editing, live testing, and a numerical
computing environment
 Spyder works with the popular Python libraries NumPy for linear
algebra, Matplotlib for interactive 2D/3D graphs, Pandas for
dataset manipulation, and Scikit-Learn for machine learning, all of
which ship with Anaconda.
 Code line by line
 Interact and alter scripts
 Code directly in the console
 Spyder is accessible through Anaconda
 https://www.continuum.io/downloads

NumPy: Numerical Computing
 Array creation is similar to creating MATLAB array objects
 N-dimensional array objects
 Provides linear algebra, Fourier transform, and random number capabilities
 Capable of matrix operations, string operations, and binary operations
 Easy to install; imported with a single line
 import numpy as np
 The line above loads the numpy package so that it can be used through its alias np,
e.g., np.array([(2,3),(4,5)])
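
As a minimal sketch of what the np.array example above gives you, the following builds the same 2x2 array and applies a few of the linear algebra operations mentioned on this slide. The variable names are illustrative only.

```python
import numpy as np

# Build a 2x2 array from a list of tuples, as in the example above.
a = np.array([(2, 3), (4, 5)])

shape = a.shape                 # dimensions of the array
transposed = a.T                # rows and columns swapped
product = a @ a                 # matrix product of a with itself
determinant = np.linalg.det(a)  # determinant (a float)

print(shape)
print(transposed)
print(product)
print(determinant)
```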

Pandas: DataFrames
 Provides an efficient DataFrame object for data manipulation with integrated
indexing
 Reads input data in many formats: CSV, Excel, SQL databases
 Handles messy and missing data easily
 Slicing, dicing, and indexing of large datasets
 Very useful for cleaning the data before applying any algorithm
 Can be imported with a single line
 import pandas as pd
 E.g., pd.read_table('<file path on local machine>')
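
A short sketch of basic data munging with Pandas follows. The data is hypothetical and is read from an in-memory string so that the example is self-contained; in practice a file path would be passed to pd.read_table instead. Filling missing values with the column mean is just one possible cleaning choice, assumed here for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with two missing values.
raw = "name\tage\tcity\nAna\t34\tCharlotte\nBen\t\tRaleigh\nCy\t29\t\n"

# read_table expects tab-separated input by default.
df = pd.read_table(io.StringIO(raw))

n_missing = int(df["age"].isna().sum())          # count missing ages
df["age"] = df["age"].fillna(df["age"].mean())   # fill gaps with the column mean
over_30 = df[df["age"] > 30]                     # select rows with a boolean mask

print(n_missing)
print(over_30)
```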

Matplotlib: Visualization
 Python 2D plotting library for generating publication-quality figures
 Generates plots, histograms, bar charts, scatterplots, etc.
 Plots data held in NumPy ndarrays
 Full control of font styles, line properties, axes properties, etc.
 Easy to install; imported with a single line
 import matplotlib
 The pyplot module is used for simple plotting and provides a good interface when
combined with IPython
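
The sketch below shows the pyplot interface on synthetic data: a line plot with explicit line and axis properties, next to a histogram. It assumes the non-interactive "Agg" backend and writes the figure to an in-memory buffer so that it runs without a display; in Spyder or IPython you would normally skip the backend line and call plt.show(), or pass a filename to savefig.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with explicit line and axis properties.
ax_line.plot(x, np.sin(x), color="tab:blue", linewidth=2, label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("sin(x)")
ax_line.legend()

# Histogram of synthetic, normally distributed values.
rng = np.random.default_rng(0)
ax_hist.hist(rng.normal(size=500), bins=20)
ax_hist.set_title("Histogram of 500 normal draws")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # pass a filename instead to write to disk
```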

Regression
One dependent variable: Y
Independent variables: X1, X2, X3, ...
Y = β0 + β1 X1 + β2 X2 + β3 X3 + ... + βk Xk + ε
 Estimate the β's in multiple regression using least squares
 The sizes of the coefficients are not good indicators of the importance of the
X variables, because each coefficient depends on the units of its variable
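
The following sketch fits a multiple regression by least squares with Scikit-Learn's LinearRegression. The data is synthetic and noise-free (an assumption made so the recovered coefficients are easy to read), and the two predictors are deliberately put on very different scales. Multiplying each coefficient by the standard deviation of its variable is one simple, illustrative way to compare effects on a common scale; it is not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic predictors: X1 varies by about 1 unit, X2 by about 1000 units.
X = np.column_stack([
    rng.normal(0, 1, 200),
    rng.normal(0, 1000, 200),
])
# True model: beta0 = 1, beta1 = 2, beta2 = 0.005 (no noise, for clarity).
y = 1.0 + 2.0 * X[:, 0] + 0.005 * X[:, 1]

model = LinearRegression().fit(X, y)  # ordinary least squares

# The raw coefficient of X2 is tiny only because X2 is measured in large units.
raw_coefs = model.coef_

# Effect of a one-standard-deviation change in each predictor.
scaled_effects = model.coef_ * X.std(axis=0)

print(model.intercept_, raw_coefs)
print(scaled_effects)
```

Here X2 has the smaller raw coefficient but the larger effect per standard deviation, which is why coefficient size alone says little about importance.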

Simple Linear Regression Model

Key Assumptions for Linear Regression
 Linearity: the dependent variable is a linear combination of the independent variables
 Homoscedasticity: constant variance in the errors
 Normality of the errors
 Independence of the errors
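
Two of these assumptions can be checked informally from the residuals of a fitted line. The sketch below uses synthetic data with constant-variance noise, fits a line with np.polyfit, and then compares the residual spread in the lower and upper halves of x (homoscedasticity) and computes the lag-1 correlation of the residuals (independence). These are rough illustrative checks under those assumptions, not formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a straight line plus constant-variance noise.
x = np.linspace(0, 10, 200)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)

# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Homoscedasticity: residual spread should be similar across the range of x.
spread_low = residuals[x < 5].std()
spread_high = residuals[x >= 5].std()

# Independence: neighbouring residuals should be roughly uncorrelated.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(f"residual spread (low x, high x): {spread_low:.2f}, {spread_high:.2f}")
print(f"lag-1 correlation of residuals: {lag1_corr:.2f}")
```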

Logistic Regression
Binary target: linear regression does not work well because its predictions are
unbounded, while a probability must lie between 0 and 1
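
To illustrate the point, the sketch below fits both a linear and a logistic regression to a small, hypothetical binary data set using Scikit-Learn, then predicts at two points outside the training range. The linear predictions leave the interval [0, 1], while the logistic model returns valid probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical binary target: mostly 0 for small x, mostly 1 for large x.
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

# Two points well outside the training range.
x_new = np.array([[-5.0], [20.0]])

linear_pred = linear.predict(x_new)                   # not restricted to [0, 1]
logistic_prob = logistic.predict_proba(x_new)[:, 1]   # P(y = 1), within [0, 1]

print(linear_pred)
print(logistic_prob)
```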

Key Assumptions for Logistic Regression
 Linearity: yes, between the independent variables and the log odds
 Homoscedasticity: no
 Normality: no
 Highly skewed independent variables can still be problematic
 Independence of errors: yes
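
The linearity assumption refers to the log odds, not to the probability itself. The sketch below uses hypothetical coefficients b0 and b1 to show that the logistic model gives an S-shaped probability, while the log odds, log(p / (1 - p)), recover the straight line b0 + b1 x.

```python
import numpy as np

b0, b1 = -3.0, 0.8            # hypothetical coefficients
x = np.linspace(-2, 8, 6)     # evenly spaced inputs, 2 units apart

# Logistic model: P(y = 1 | x).
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# Logit transform: the log odds are linear in x.
log_odds = np.log(p / (1.0 - p))

# Equal steps in x give equal steps in log odds (b1 * 2), but not in p.
log_odds_steps = np.diff(log_odds)
p_steps = np.diff(p)

print(log_odds_steps)
print(p_steps)
```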

Clustering
 Cluster analysis is the generic name for a wide variety of procedures that can
be used to create a classification of entities/objects
 It has been referred to as Q analysis, typology construction, classification
analysis, unsupervised pattern recognition, and numerical taxonomy
 A deck of 52 cards can be grouped as:
 26 red and 26 black cards
 13 each of Spades, Hearts, Diamonds, and Clubs
 4 each of Aces, Kings, Queens, and Jacks

A Geometrical view of an ideal pattern
[Figure: scatter plot of Importance of Price against Importance of Quality]

Reality
[Figure: scatter plot of Importance of Price against Importance of Quality]

How to group them?
[Figures: three scatter plots of Importance of Price against Importance of Quality]

Similarity and Distance
 To identify natural groups, we must first define a measure of similarity
(proximity) between objects/entities.
 Assume variables (axes in space) are numeric.
 Then, if two things are similar, they should be close to each other in the
space.
That is, the distance between them should be small.
 But, if two things are dissimilar, they should be well separated from each
other in the space.
That is, the distance between them should be large.
 A collection of similar things would therefore likely result in more
cohesive (homogeneous) groups than a collection of dissimilar things.
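
As a small illustration, the sketch below computes the distance between points in a two-dimensional space. Euclidean distance is assumed here as one common choice; the slides do not name a specific metric. The points and the helper function euclidean are hypothetical.

```python
import numpy as np

# Hypothetical scores: (importance of price, importance of quality).
points = {
    "A": np.array([1.0, 2.0]),
    "B": np.array([2.0, 1.0]),
    "C": np.array([8.0, 9.0]),
}

def euclidean(p, q):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

d_ab = euclidean(points["A"], points["B"])  # similar objects: small distance
d_ac = euclidean(points["A"], points["C"])  # dissimilar objects: large distance

print(d_ab, d_ac)
```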

[Figure: points A through K plotted on Dimension 1 against Dimension 2]

K-Means Clustering
1. Select k cluster centers.
2. Assign cases to closest center.
3. Update cluster centers.
4. Re-assign cases.
5. Repeat steps 3 and 4 until convergence.
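
The steps above can be sketched with Scikit-Learn's KMeans, which carries out the select, assign, and update loop internally. The data below is hypothetical: two well-separated groups of three points each. The settings n_clusters=2, n_init=10, and random_state=0 are illustrative choices, not values taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two well-separated groups of points.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5],
])

# n_init=10 repeats the whole procedure from ten different starting centers
# and keeps the best result; random_state makes the run reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_            # cluster index assigned to each case
centers = kmeans.cluster_centers_  # final cluster centers

print(labels)
print(centers)
```

The numeric label given to each cluster (0 or 1) is arbitrary; only the grouping is meaningful.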
[Figures: three further plots of points A through K on Dimension 1 against Dimension 2]
