
Python for Data Science

Unit 5 : Data Wrangling

Dr Kruti Dangarwala
CSE & IT Department
SVMIT
Unit 5
 Wrangling Data: (Chapter 12)
 Exploring Data Analysis: (Chapter 13)

Book:
Python for Data Science For Dummies, 2nd Edition, by John Paul Mueller and Luca Massaron

E-book online reading URL: https://ebookreading.net/view/book/EB9781119547624_12.html#
Wrangling Data: (Chapter 12)
• Playing with Scikit-learn,
• Understanding classes in Scikit-learn,
• Defining applications for data science,
• Performing the Hashing Trick,
• Using hash functions,
• Demonstrating the hashing trick,
• Working with deterministic selection,
• Considering Timing and Performance,
• Benchmarking, with timeit,
• Working with the memory profiler,
• Running in Parallel on Multiple Cores,
• Performing multicore parallelism,
• Demonstrating multiprocessing.
Data Wrangling -Introduction
 Data wrangling—also called data cleaning, data remediation, or data
munging—refers to a variety of processes designed to transform raw data
into more readily used formats.
 The exact methods differ from project to project depending on the data
you’re leveraging and the goal you’re trying to achieve.

Some examples of data wrangling include:


• Merging multiple data sources into a single dataset for analysis
• Identifying gaps in data (for example, empty cells in a spreadsheet) and
either filling or deleting them
• Deleting data that’s either unnecessary or irrelevant to the project you’re
working on
• Identifying extreme outliers in data and either explaining the discrepancies
or removing them so that analysis can take place
Data wrangling in Python deals with the below functionalities:

• Data exploration: In this process, the data is studied, analyzed, and
understood by visualizing representations of the data.
• Dealing with missing values: Most large datasets contain missing (NaN)
values; these need to be handled by replacing them with the mean or the most
frequent value (mode) of the column, or simply by dropping the rows that
contain NaN values (see the sketch after this list).
• Reshaping data: In this process, data is manipulated according to the
requirements, where new data can be added or pre-existing data can be
modified.
• Filtering data: Sometimes datasets contain unwanted rows or columns,
which need to be removed or filtered out.
• Other: After handling the raw dataset with the above functionalities, we
obtain a clean dataset that fits our requirements, which can then be used for
purposes such as data analysis, machine learning, data visualization, or
model training.
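The following is a minimal pandas sketch of these functionalities (not from the original slides; the tiny DataFrame and its column names are invented for illustration):

import numpy as np
import pandas as pd

# Tiny example frame with a missing value in each numeric column
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40],
    "salary": [50000, 62000, np.nan, 71000],
    "dept":   ["IT", "HR", "IT", "Sales"],
})

# Data exploration: structure and summary statistics
df.info()
print(df.describe())

# Dealing with missing values: fill with the mean, or drop the row
df["age"] = df["age"].fillna(df["age"].mean())
df = df.dropna(subset=["salary"])

# Reshaping data: derive a new column
df["salary_k"] = df["salary"] / 1000

# Filtering data: keep only the rows and columns of interest
it_only = df.loc[df["dept"] == "IT", ["age", "salary_k"]]
print(it_only)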
Tasks of Data Wrangling/steps
1. Discovery
• Discovery refers to the process of familiarizing yourself with data so you can
conceptualize how you might use it. You can liken it to looking in your
refrigerator before cooking a meal to see what ingredients you have at your
disposal.
• During discovery, you may identify trends or patterns in the data, along with
obvious issues, such as missing or incomplete values that need to be
addressed. This is an important step, as it will inform every activity that
comes afterward.
2. Structuring
• Raw data is typically unusable in its raw state because it’s either incomplete
or misformatted for its intended application. Data structuring is the process of
taking raw data and transforming it to be more readily leveraged. The form
your data takes will depend on the analytical model you use to interpret it.
3. Cleaning
• Data cleaning is the process of removing inherent errors in data that
might distort your analysis or render it less valuable.
• Cleaning can come in different forms, including deleting empty cells or
rows, removing outliers, and standardizing inputs.
• The goal of data cleaning is to ensure there are no errors (or as few as
possible) that could influence your final analysis.

4. Enriching
• Once you understand your existing data and have transformed it into a
more usable state, you must determine whether you have all of the
data necessary for the project at hand.
• If not, you may choose to enrich or augment your data by
incorporating values from other datasets. For this reason, it’s important
to understand what other data is available for use.
• If you decide that enrichment is necessary, you need to repeat the
steps above for any new data.
5. Validating
• Data validation refers to the process of verifying that your data is both
consistent and of a high enough quality.
• During validation, you may discover issues you need to resolve or
conclude that your data is ready to be analyzed. Validation is typically
achieved through various automated processes and requires
programming.

6. Publishing
• Once your data has been validated, you can publish it. This involves
making it available to others within your organization for analysis.
• The format you use to share the information—such as a written report
or electronic file—will depend on your data and the organization’s
goals.
Scikit-learn-Introduction
• Scikit-learn is the package for machine learning and data science
experimentation favored by most data scientists.
• It contains a wide range of well-established learning algorithms, error
functions, and testing procedures.
• Scikit-learn (Sklearn) is the most useful and robust library for machine learning
in Python.
• It provides a selection of efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality
reduction, via a consistent interface in Python.
• This library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.

• Package:
• import sklearn
Install scikit-learn
• Using pip: the following command installs scikit-learn via pip:

• pip install -U scikit-learn

• Using conda: the following command installs scikit-learn via conda:

• conda install scikit-learn
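As a quick check (a minimal sketch, not part of the slides), the installation can be verified by importing the package and printing its version:

# Verify that scikit-learn is importable and report its version
import sklearn
print(sklearn.__version__)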
Understanding classes in scikit-learn

Four types of Classes:


• Classifying
• Regression
• Grouping by clusters

• Transforming data
Classification
• Classification is a process of categorizing data or objects into predefined
classes or categories based on their features or attributes. In machine learning,
classification is a type of supervised learning technique where an algorithm is
trained on a labeled dataset to predict the class or category of new, unseen
data.
• The main objective of classification is to build a model that can accurately
assign a label or category to a new observation based on its features. For
example, a classification model might be trained on a dataset of images labeled
as either dogs or cats and then used to predict the class of new, unseen images
of dogs or cats based on their features such as color, texture, and shape.

Example:
suppose we want to predict the possibility of the rain in some regions on the
basis of some parameters. Then there would be two labels rain and no rain
under which different regions can be classified.
Types of Classification
Classification is of two types:
• Binary Classification: In binary classification, the goal is to classify the
input into one of two classes or categories. Example – On the basis of
the given health conditions of a person, we have to determine whether
the person has a certain disease or not.
• Multiclass Classification: In multi-class classification, the goal is to
classify the input into one of several classes or categories. For Example –
On the basis of data about different species of flowers, we have to
determine which species our observation belongs to.
Types of classification algorithms
1) Linear Classifiers: Linear models create a linear decision boundary between classes. They are
simple and computationally efficient. Some of the linear classification models are as follows:
– Logistic Regression
– Support Vector Machines having kernel = ‘linear’
– Single-layer Perceptron
– Stochastic Gradient Descent (SGD) Classifier

2) Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes.
They can capture more complex relationships between the input features and the target variable.
Some of the non-linear classification models are as follows (a short sketch contrasting a linear and a non-linear classifier appears after this list):
– K-Nearest Neighbours
– Kernel SVM
– Naive Bayes
– Decision Tree Classification
– Ensemble learning classifiers:
• Random Forests,
• AdaBoost,
• Bagging Classifier,
• Voting Classifier,
• ExtraTrees Classifier
– Multi-layer Artificial Neural Networks
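A minimal sketch (not from the slides) contrasting one linear and one non-linear classifier from the lists above on the iris dataset; the exact accuracy values depend on the train/test split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Linear classifier: logistic regression draws a linear decision boundary
linear_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Logistic Regression accuracy:", linear_clf.score(X_test, y_test))

# Non-linear classifier: k-nearest neighbours adapts to local structure
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN accuracy:", knn_clf.score(X_test, y_test))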
Regression
• Process of finding a model or function for distinguishing the
data into continuous real values instead of using classes.
Mathematically, with a regression problem, one is trying to
find the function approximation with the minimum error
deviation. In regression, the data numeric dependency is
predicted to distinguish it.
• Let’s take the similar example in regression also, where we
are finding the possibility of rain in some particular regions
with the help of some parameters. In this case, there is a
probability associated with the rain. Here we are not
classifying the regions within rain and no rain labels instead
we are classifying them with their associated probability.
Classification model Evaluations
• Classification Accuracy: The proportion of correctly classified instances over the total number
of instances in the test set. It is a simple and intuitive metric but can be misleading in
imbalanced datasets where the majority class dominates the accuracy score.
• Confusion matrix: A table that shows the number of true positives, true negatives, false
positives, and false negatives for each class, which can be used to calculate various evaluation
metrics.
• Precision and Recall: Precision measures the proportion of true positives over the total number
of predicted positives, while recall measures the proportion of true positives over the total
number of actual positives. These metrics are useful in scenarios where one class is more
important than the other, or when there is a trade-off between false positives and false negatives.
• F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) /
(precision + recall). It is a useful metric for imbalanced datasets where both precision and recall
are important.
• ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the true
positive rate (recall) against the false positive rate (1-specificity) for different threshold values of
the classifier’s decision function. The Area Under the Curve (AUC) measures the overall
performance of the classifier, with values ranging from 0.5 (random guessing) to 1 (perfect
classification).
• Cross-validation: A technique that divides the data into multiple folds and trains the model on
each fold while testing on the others, to obtain a more robust estimate of the model's
performance. (A short sketch computing several of these metrics follows this list.)
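A minimal sketch (not from the slides) computing some of the measures listed above — a confusion matrix, ROC AUC, and cross-validated accuracy — on a binary problem; the breast-cancer dataset and the logistic-regression pipeline are chosen only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, clf.predict(X_test)))

# ROC AUC computed from the predicted probability of the positive class
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# 5-fold cross-validated accuracy on the full dataset
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())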
Dataset loading:
Boston dataset
The Boston Housing Dataset is a widely used dataset in machine learning and
predictive analytics. It contains housing information for various neighborhoods
in Boston.

from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print("x=%s y:%s" % (X.shape, y.shape))

Output:
x=(506, 13) y:(506,)

(Note: load_boston was removed in scikit-learn 1.2, so this example requires an
older version; a substitute sketch using a currently bundled dataset appears
after the Regression Example.)
Information :Boston Housing Dataset
• The Boston Housing Dataset is derived from information collected by the U.S. Census
Service concerning housing in the area of Boston, MA. The following describes the dataset
columns:

• CRIM - per capita crime rate by town
• ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS - proportion of non-retail business acres per town.
• CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
• NOX - nitric oxides concentration (parts per 10 million)
• RM - average number of rooms per dwelling
• AGE - proportion of owner-occupied units built prior to 1940
• DIS - weighted distances to five Boston employment centres
• RAD - index of accessibility to radial highways
• TAX - full-value property-tax rate per $10,000
• PTRATIO - pupil-teacher ratio by town
• B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT - % lower status of the population
• MEDV - Median value of owner-occupied homes in $1000's
Example- Classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn import datasets
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# import the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3,
random_state=1)
# GAUSSIAN NAIVE BAYES
gnb = GaussianNB()
# train the model
gnb.fit(X_train, y_train)
# make predictions
gnb_pred = gnb.predict(X_test)
# print the accuracy
print("Accuracy of Gaussian Naive Bayes: ",
accuracy_score(y_test, gnb_pred))
# print other performance metrics
print("Precision of Gaussian Naive Bayes: ",
precision_score(y_test, gnb_pred, average='weighted'))
print("Recall of Gaussian Naive Bayes: ",
recall_score(y_test, gnb_pred, average='weighted'))
print("F1-Score of Gaussian Naive Bayes: ",
f1_score(y_test, gnb_pred, average='weighted'))
Output
• Accuracy of Gaussian Naive Bayes:
0.9333333333333333
• Precision of Gaussian Naive Bayes:
0.9352007469654529
• Recall of Gaussian Naive Bayes:
0.9333333333333333
• F1-Score of Gaussian Naive Bayes:
0.933615520282187
Regression
• Regression is a statistical method used in
finance, investing, and other disciplines that
attempts to determine the strength and
character of the relationship between one
dependent variable (usually denoted by Y) and
a series of other variables (known as
independent variables).
What Is Regression?
• Regression searches for relationships among variables. For example,
you can observe several employees of some company and try to
understand how their salaries depend on their features, such as
experience, education level, role, city of employment, and so on.
• This is a regression problem where data related to each employee
represents one observation. The presumption is that the experience,
education, role, and city are the independent features, while the salary
depends on them.
• Similarly, you can try to establish the mathematical dependence of
housing prices on area, number of bedrooms, distance to the city
center, and so on.
• In other words, you need to find a function that maps some features
or variables to others sufficiently well.
• The dependent features are called the dependent variables, outputs,
or responses. The independent features are called the independent
variables, inputs, regressors, or predictors.
Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target
print("x=%s y:%s" % (X.shape, y.shape))
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X, y)
print(hypothesis.score(X, y))
print(hypothesis.coef_)
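Note that load_boston was removed from scikit-learn 1.2 and LinearRegression no longer accepts a normalize argument, so the example above only runs on older versions. A roughly equivalent sketch (an assumed substitute, not the book's code) uses the California housing dataset that ships with current scikit-learn and standardizes the inputs instead of normalize=True:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# fetch_california_housing downloads the data on first use
X, y = fetch_california_housing(return_X_y=True)
print("X=%s y=%s" % (X.shape, y.shape))

# Standardizing the inputs replaces the removed normalize=True option
hypothesis = make_pipeline(StandardScaler(), LinearRegression())
hypothesis.fit(X, y)
print(hypothesis.score(X, y))   # R^2 on the training data
print(hypothesis.named_steps["linearregression"].coef_)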
Hashing
• Hashing is a technique or process of mapping keys, and values into the
hash table by using a hash function. It is done for faster access to
elements. The efficiency of mapping depends on the efficiency of the hash
function used.
• Let a hash function H(x) maps the value x at the index x%10 in an Array.
For example if the list of values is [11,12,13,14,15] it will be stored at
positions {1,2,3,4,5} in the array or Hash table respectively.
Performing Hashing
Hash functions:
• Transform any input into an output whose characteristics are predictable.
• A hash function works like a secret code, transforming everything into a number.
Example:
print(hash('python'))
Output: 1142331976
(Note: the value returned by Python's built-in hash() for strings varies between
interpreter runs unless hash randomization is disabled.)
Hashing Trick
• In machine learning, feature hashing (also called hashing trick) is an
efficient way to encode categorical features. It is based on hashing
functions in computer science that map data of variable sizes to data of a
fixed (and usually smaller) size. It is easier to understand feature hashing
through an example.
• Let's say we have three categorical features—gender, site_domain,
and device_model (the original slide shows a small table of example rows for
these features).
• With one-hot encoding, they become feature vectors of size 9, which comes
from 2 (from gender) + 4 (from site_domain) + 3 (from device_model). With
feature hashing, we want to obtain a feature vector of size 4 instead (see the
sketch below).
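A minimal sketch of this idea with scikit-learn's FeatureHasher (the two example rows and their values are invented, since the slide's table of sample rows is not reproduced here); each row is hashed into a vector of size 4 instead of a 9-dimensional one-hot vector:

from sklearn.feature_extraction import FeatureHasher

# Two invented rows of categorical features
rows = [
    {"gender": "male",   "site_domain": "news.com", "device_model": "phone_a"},
    {"gender": "female", "site_domain": "shop.com", "device_model": "phone_b"},
]

# Hash each "key=value" pair into one of 4 positions
hasher = FeatureHasher(n_features=4, input_type="dict")
hashed = hasher.transform(rows)
print(hashed.toarray())   # two rows, each a length-4 hashed feature vector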
Creating hashing trick
1. Define the range of the hash function outputs.
All your feature vectors will use that range
(the example below uses values 0 to 19, i.e. vector_size=20).
2. Compute an index for each word in your
string using the hash function.
3. Assign a unit value to the vector's positions
according to the word indexes.
Example
str1 = 'Python for data science'
str2 = 'Python for machine learning'

def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
    return feature_vector

print(hashing_trick(str1, vector_size=20))
print(hashing_trick(str2, vector_size=20))
• When viewing the feature vectors, you should notice that:
• You don't know where each word is located. When it's important to be
able to reverse the process of assigning words to indexes, you must
store the relationship between words and their hashed value separately
(for example, you can use a dictionary where the keys are the hashed
values and the values are the words).
• For small values of the vector_size function parameter (for example,
vector_size=10), many words overlap in the same positions in the list
representing the feature vector. To keep the overlap to a minimum, you
must create hash function boundaries that are greater than the number
of elements you plan to index later.
• The feature vectors in this example are made mostly of zero entries,
representing a waste of memory when compared to the more memory-efficient
one-hot encoding. One of the ways in which you can solve this problem is to
rely on sparse matrices.
Sparse Matrix- Working with deterministic
selection
• Sparse Matrix – used when dealing with data that has few non-zero values,
i.e. most of the matrix values are zeros.
• A sparse matrix stores just the coordinates of the non-zero cells and their values.
• When an application requests data from an empty cell, the sparse matrix
returns a zero value after looking for the coordinates and not finding them.
Program:
from scipy.sparse import csc_matrix
print(csc_matrix([1,0,0,0,0,1]))
OUTPUT:
(0, 0) 1
(0, 5) 1
HashingVectorizer vs. CountVectorizer

• HashingVectorizer and CountVectorizer are meant to do the same thing,
which is to convert a collection of text documents to a matrix of token
occurrences. The difference is that HashingVectorizer does not store the
resulting vocabulary (i.e. the unique tokens).

Figure : How HashingVectorizer Works


How it all works:

1. Fix the number of dimensions n_features ('Divide by').
2. Apply a hash function to the text ('MurmurHash3').
3. Return a number from 0 to n_features-1, for example simply
h(MurmurHash3) mod n_features.
• If we feed the same input to a hash function, it will
always give the same output. Hash functions may
output the same value for different inputs
(collision).
Program
from sklearn.feature_extraction.text import HashingVectorizer
import pandas as pd

text = ['The sky is blue and beautiful',
        'The king is old and the queen is beautiful',
        'Love this beautiful blue sky',
        'The beautiful queen and the old king']

vectorizer = HashingVectorizer(n_features=8, norm=None, stop_words='english')
X = vectorizer.fit_transform(text)
matrix = pd.DataFrame(X.toarray())
print(matrix)
Example:Toy Dataset and Imports
from sklearn.feature_extraction.text import HashingVectorizer
# dataset
cat_in_the_hat_docs=[ "One Cent, Two Cents, Old Cent, New Cent: All About Money
(Cat in the Hat's Learning Library",
"Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
"Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in
the Hat's Learning Library)",
"On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
"There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning
Library)" ]
# Compute raw counts using hashing vectorizer
# Small numbers of n_features can cause hash collisions
hvectorizer = HashingVectorizer(n_features=10000,norm=None,alternate_sign=False)
# compute counts without any term frequency normalization
X = hvectorizer.fit_transform(cat_in_the_hat_docs)
# print populated columns of first document
# format: (doc id, pos_in_matrix) raw_count
print(X[0])
Output: HashingVectorizer
(0, 93)   3.0
(0, 689)  1.0
(0, 717)  1.0
(0, 1664) 1.0
(0, 2759) 1.0
(0, 3124) 1.0
(0, 4212) 1.0
(0, 4380) 1.0
(0, 5044) 1.0
(0, 7353) 1.0
(0, 8903) 1.0
(0, 8958) 1.0
(0, 9376) 1.0
(0, 9402) 1.0
(0, 9851) 1.0

NOTE:
Now if you check the shape, you should see (5, 10000): 5 documents and a
10,000-column matrix. In this example, most of the columns would be empty,
as the toy dataset is really small. There are 15 unique tokens, one with a
count of 3 and the rest all 1. Notice that the position ranges from 0 to 9999.
Example : CountVectorizer
#Countvectorizer
from sklearn.feature_extraction.text import
CountVectorizer
cvectorizer = CountVectorizer()
# compute counts without any term frequency
normalization
X = cvectorizer.fit_transform(cat_in_the_hat_docs)
# print populated columns of first document
# format: (doc id, pos_in_matrix) raw_count
print(X[0])
Output:
(0, 28) 1
(0, 8)  3
(0, 40) 1
(0, 9)  1
(0, 26) 1
(0, 23) 1
(0, 1)  1
(0, 0)  1
(0, 22) 1
(0, 7)  1
(0, 16) 1
(0, 37) 1
(0, 13) 1
(0, 19) 1
(0, 20) 1

If you print the shape, you will see (5, 43). Notice that instead of (5, 10000)
as in the HashingVectorizer example, you see (5, 43). This is because we did not
force a matrix size with CountVectorizer. The matrix size is based on how many
unique tokens were found in your vocabulary, which in this case is 43.
• If you are using a large dataset for your machine
learning tasks and you have no use for the resulting
dictionary of tokens, then HashingVectorizer would be a
good candidate.
• However, if you worry about hash collisions (which is
bound to happen if the size of your matrix is too small),
then you might want to stick to CountVectorizer until
you feel that you have maxed out your computing
resources and it’s time to optimize. Also, if you need
access to the actual tokens, then again CountVectorizer
is the more appropriate choice.
Scikit-learn offers HashingVectorizer, a class that rapidly transforms any
collection of texts into a sparse matrix using the hashing trick.

import sklearn.feature_extraction.text as txt

htrick = txt.HashingVectorizer(n_features=20, binary=True, norm=None)
hashed_text = htrick.transform(['Python for data science',
                                'Python for machine learning'])
print(hashed_text)

Output:
(0, 3)  1.0
(0, 5)  1.0
(0, 13) 1.0
(0, 15) 1.0
(1, 2)  1.0
(1, 3)  1.0
(1, 4)  1.0
(1, 5)  1.0

Note:
HashingVectorizer is the perfect function to use when your data cannot fit
into memory and its features are not fixed.
Considering Timing and Performance
 What is Benchmarking?
Benchmarking aims at evaluating something by comparison with a standard.
 However, the question that arises is: what is benchmarking, and why do we
need it in software programming? Benchmarking the code tells us how fast the
code executes and where the bottlenecks are. One major reason for benchmarking
is that it helps optimize the code.
 How does benchmarking work?
Benchmarking starts with measuring the whole program in its current state;
we can then combine micro-benchmarks by decomposing the program into smaller
pieces, in order to find the bottlenecks within the program and optimize it.
In other words, we break one big, hard problem into a series of smaller and
somewhat easier problems in order to optimize them.
Python module for benchmarking

• In Python, the default module for benchmarking is called timeit.
• With the help of the timeit module, we can measure the
performance of small bits of Python code within our main program.
• Magic functions:
• %timeit – calculates the best performance time for a single instruction.
• %%timeit – calculates the best time performance for all the
instructions in a cell, apart from the one placed on the same cell
line as the cell magic.
• Both magic functions report the best performance over r trials of n
loops each.
• When you omit the -r and -n parameters, the notebook chooses the
numbers automatically in order to provide a fast answer.
Example: Determine the time required to assign list
10**6 ordinal values by using list comprehension
import timeit
%timeit l =[k for k in range(10**6)]
%timeit -n 20 -r 5 l =[k for k in range(10**6)]

• OUTPUT:
231 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
267 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 20 loops each)
Cell magic function %%timeit
%%timeit
l = list()
for k in range(10**6):
    l.append(k)

OUTPUT:
198 ms ± 6.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
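The %timeit and %%timeit magics work only inside IPython/Jupyter; in a plain Python script you can call the timeit module directly. A minimal sketch of the same measurement:

import timeit

# Time the list comprehension: 20 loops per trial, 5 trials
times = timeit.repeat("l = [k for k in range(10**6)]", repeat=5, number=20)
print(min(times) / 20, "seconds per loop")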
Working with memory profiler
# Import the memory-profiler module
from memory_profiler import profile

# Notice the @profile decorator: any function decorated with it will be tracked.
@profile
def defFunc():
    # Some random variables
    var1 = [1] * (6 ** 4)
    var2 = [1] * (2 ** 3)
    var3 = [2] * (4 * 6 ** 3)
    # Operations on the variables
    del var3
    del var1
    return var2

# Print confirmation message
print("We have successfully inspected memory usage from the default function!")

defFunc()  # Call the function


OUTPUT:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
  1185     99.3 MiB     99.3 MiB           1   @wraps(wrapped=func)
  1186                                         def wrapper(*args, **kwargs):
  1187     99.3 MiB      0.0 MiB           1       prof = get_prof()
  1188     99.3 MiB      0.0 MiB           1       val = prof(func)(*args, **kwargs)
  1189     99.3 MiB      0.0 MiB           1       show_results_bound(prof)
  1190     99.3 MiB      0.0 MiB           1       return val
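Besides the @profile decorator, memory_profiler also provides a memory_usage() helper that samples memory while a callable runs; a minimal sketch (not from the slides, with an invented helper function):

from memory_profiler import memory_usage

def build_list():
    # Allocate a list of one million integers
    return [1] * (10**6)

# memory_usage((callable, args, kwargs)) returns memory samples in MiB
samples = memory_usage((build_list, (), {}))
print("peak memory: %.1f MiB" % max(samples))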
Multiprocessing
from sklearn.datasets import load_digits
digits=load_digits()
x,y=digits.data,digits.target
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
%timeit single_core = cross_val_score(SVC(), x, y, cv=20, n_jobs=1)
%timeit multi_core = cross_val_score(SVC(), x, y, cv=20, n_jobs=-1)   # n_jobs=-1 uses all available cores
OUTPUT:
7.49 s ± 465 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.43 s ± 492 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
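cross_val_score parallelizes internally through its n_jobs argument; the same multicore pattern can be applied to your own loops with joblib, the library scikit-learn uses under the hood. A minimal sketch, not from the slides:

from joblib import Parallel, delayed

def square(k):
    return k * k

# Run the function across all available cores (n_jobs=-1)
results = Parallel(n_jobs=-1)(delayed(square)(k) for k in range(10))
print(results)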
Chapter 13: Exploring Data Analysis
• The EDA Approach,
• Defining Descriptive Statistics for Numeric Data,
• Measuring central tendency,
• Measuring variance and range ,
• Working with percentiles,
• Defining measures of normality, Counting for Categorical Data, Understanding
frequencies,
• Creating contingency tables
• Creating Applied Visualization for EDA
• Inspecting boxplots, Performing t-tests after boxplots
• Observing parallel coordinates, Graphing distributions, Plotting
scatterplots ,Understanding Correlation, Using covariance and correlation
• Using nonparametric correlation, Considering the chi-square test for
tables ,Modifying Data Distributions,
• Using different statistical distributions, Creating a Z-score standardization,
Transforming other notable distributions.
Introduction-EDA
• EDA is a phenomenon under data analysis used for gaining a better
understanding of data aspects like:
– main features of data
– variables and relationships that hold between them
– identifying which variables are important for our problem

• Various exploratory data analysis methods include:
• Descriptive Statistics, which is a way of giving a brief overview of the dataset
we are dealing with, including some measures and features of the sample
• Grouping data [Basic grouping with group by]
• ANOVA, Analysis Of Variance, which is a computational method to divide
variations in an observations set into different components.
• Correlation and correlation methods
Descriptive Statistics
• Descriptive statistics is a helpful way to understand the
characteristics of your data and to get a quick summary of it.
Pandas in Python provides an interesting method, describe().
• The describe() function applies basic statistical computations
to the dataset, such as extreme values, count of data points,
standard deviation, etc.
• Any missing value or NaN value is automatically skipped.
describe() gives a good picture of the distribution of the data.
Central Tendency
• Mathematically central tendency means measuring the center or
distribution of location of values of a data set.
• It gives an idea of the average value of the data in the data set and also
an indication of how widely the values are spread in the data set.
• That in turn helps in evaluating the chances of a new input fitting into
the existing data set and hence probability of success.
• There are three main measures of central tendency which can be
calculated using the methods in pandas python library.
• Mean - It is the Average value of the data which is a division of sum of
the values with the number of values.
• Median - It is the middle value in distribution when the values are
arranged in ascending or descending order.
• Mode - It is the most commonly occurring value in a distribution
Calculating Central Tendency
• mean — the average of the given numeric values
• median — the middle-most value of the given values
• mode — the most frequently occurring value of the given numeric variable

data['A'].mean()
data['A'].median()
data['A'].mode()
Central Tendency- Mean, Mode, Median
#Program to find mean and median
import pandas as pd

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print("dataframe ")
df.describe()
print("Mean Values in the Distribution")
print(df.mean())
print("*******************************")
print("Median Values in the Distribution")
print(df.median())
print("Mode in the Distribution")
print(df.mode())
Output:
• Mean Values in the Distribution
Age 31.833333
Rating 3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age 29.50
Rating 3.79
dtype: float64
Calculating Mode:
Mode in the Distribution
Name Age Rating
0 Andres 23.0 2.56
1 Chanchal 25.0 2.98
2 Gasper 30.0 3.20
3 Jack NaN 3.24
4 James NaN 3.65
5 Lee NaN 3.78
6 Naviya NaN 3.80
7 Ricky NaN 3.98
8 Smith NaN 4.10
9 Steve NaN 4.23
Dispersion
• Dispersion is used to describe the variation present in a given variable.
Variation means how close to or far from the mean the values are.
• Variance — gives the average squared deviation from the mean value
• Standard Deviation — the square root of the variance
• Range — the difference between the max and min values
• InterQuartile Range (IQR) — the difference between Q3 and Q1,
where Q3 is the 3rd quartile value and Q1 is the 1st quartile value.

data['A'].var()
data['A'].std()
data['A'].max() - data['A'].min()
data['A'].quantile([.25, .5, .75])
Skewness
• Skewness is used to measure the symmetry of the data around the mean
value. Symmetry means an equal distribution of observations above and
below the mean.
• skewness = 0: the data is symmetric about the mean.
• skewness negative: the data is not symmetric and the left-side tail is
longer than the right-side tail of the density plot.
• skewness positive: the data is not symmetric and the right-side tail is
longer than the left-side tail of the density plot.
• We can find the skewness of a given variable with:
• data['A'].skew()
Practical-61-Central Tendency
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print("dataframe ")
df.describe()
print("Mean Values in the Distribution")
print(df.mean())
print("*******************************")
print("Median Values in the Distribution")
print(df.median())
print("Mode in the Distribution")
print(df.mode())
Continue..
#Calculate the standard deviation
print("standard Deviation")
print(df.std())
print("skewness")
print(df.skew())
print("variance")
print(df.var())
print("range")
print(df.max(numeric_only=True)-df.min(numeric_only=True))
OUTPUT:
Mean Values in the Distribution
Age       31.833333
Rating     3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age       29.50
Rating     3.79
dtype: float64
Mode in the Distribution
        Name   Age  Rating
0     Andres  23.0    2.56
1   Chanchal  25.0    2.98
2     Gasper  30.0    3.20
3       Jack   NaN    3.24
4      James   NaN    3.65
5        Lee   NaN    3.78
6     Naviya   NaN    3.80
7      Ricky   NaN    3.98
8      Smith   NaN    4.10
9      Steve   NaN    4.23
10       Tom   NaN    4.60
11       Vin   NaN    4.80
standard Deviation
Age       9.232682
Rating    0.661628
dtype: float64
skewness
Age       1.135089
Rating   -0.153629
dtype: float64
variance
Age       85.242424
Rating     0.437752
dtype: float64
range
Age       28.00
Rating     2.24
dtype: float64
Counting of Categorical data
• The Pandas cut and qcut functions are used to convert a numerical column
into a categorical one, perhaps to make it better suited for a machine
learning model (in the case of a fairly skewed numerical column), or just for
better analyzing the data at hand.
• Example :
• We’ll be using the CarDekho dataset, containing data about used cars
listed on the platform. - CAR DETAILS FROM CAR DEKHO.csv

• ‘Year’ is the year in which the car was purchased.


• ‘Selling_Price’ is the price the owner wants to sell the car at.
• ‘Present_Price’ is the current ex-showroom price of the car.
• ‘Owner’ defines the number of owners the car has previously had, before
this car was put up on the platform.
cut functions:
cut – expects a series of edge values to cut the
measurements, or an integer number of groups
used to cut the variable into equal-width bins.
import pandas as pd
pd.cut()
• We can use the ‘cut’ function in broadly 2 ways:
by specifying the number of bins directly and let
pandas do the work of calculating equal-sized
bins for us, or we can manually specify the bin
edges as we desire.
Example: https://www.geeksforgeeks.org/how-to-use-
pandas-cut-and-qcut/
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('E:\\CAR DETAILS FROM CAR DEKHO.csv')
print(df.head())
#print(df.info())
df.loc[0, 'year'] = np.nan
# these are the 'unique' years in
# the data
print(np.array(sorted(df.year.unique())))
print(pd.cut(df.year,bins=3, right=False).head())
print(pd.cut(df.year, bins=3,labels=['old', 'medium', 'new']).head())
df['Yr_cut'] = pd.cut(df.year, bins=3,
labels=['old', 'medium', 'new'])
print(df.head())
print(pd.cut(df.year, bins=3, labels=False).head())
#Understanding frequencies
print(df['Yr_cut'].value_counts())
name year ... transmission owner
0 Maruti 800 AC 2007 ... Manual First Owner
1 Maruti Wagon R LXI Minor 2007 ... Manual First Owner
2 Hyundai Verna 1.6 SX 2012 ... Manual First Owner
3 Datsun RediGO T Option 2017 ... Manual First Owner
4 Honda Amaze VX i-DTEC 2014 ... Manual Second Owner

[5 rows x 8 columns]
[ nan 1992. 1995. 1996. 1997. 1998. 1999. 2000. 2001. 2002. 2003. 2004.
2005. 2006. 2007. 2008. 2009. 2010. 2011. 2012. 2013. 2014. 2015. 2016.
2017. 2018. 2019. 2020.]
0 NaN
1 [2001.333, 2010.667)
2 [2010.667, 2020.028)
3 [2010.667, 2020.028)
4 [2010.667, 2020.028)
Name: year, dtype: category
Categories (3, interval[float64]): [[1992.0, 2001.333) < [2001.333, 2010.667) < [2010.667, 2020.028)]
0 NaN
1 medium
2 new
3 new
4 new
Name: year, dtype: category
Categories (3, object): ['old' < 'medium' < 'new']
name year ... owner Yr_cut
0 Maruti 800 AC NaN ... First Owner NaN
1 Maruti Wagon R LXI Minor 2007.0 ... First Owner medium
2 Hyundai Verna 1.6 SX 2012.0 ... First Owner new
3 Datsun RediGO T Option 2017.0 ... First Owner new
4 Honda Amaze VX i-DTEC 2014.0 ... Second Owner new
[5 rows x 9 columns]
0 NaN
1 1.0
2 2.0
3 2.0
4 2.0
Name: year, dtype: float64
new 3292
medium 986
old 61
Name: Yr_cut, dtype: int64
pd.qcut()

• Qcut (quantile-cut) differs from cut in the sense that, in qcut, the
number of elements in each bin will be roughly the same, but this will
come at the cost of differently sized interval widths
• On the other hand, in cut, the bin edges were equal sized (when we
specified bins=3) with uneven number of elements in each bin or
group. Also, cut is useful when you know for sure the interval ranges
and the bins,
• For example, if binning an ‘age’ column, we know infants are between
0 and 1 years old, 1-12 years are kids, 13-19 are teenagers, 20-60 are
working class grownups, and 60+ senior citizens. So we can
appropriately set bins=[0, 1, 12, 19, 60, 140] and labels=[‘infant’, ‘kid’,
‘teenager’, ‘grownup’, ‘senior citizen’]. In qcut, when we specify q=5,
we are telling pandas to cut the Year column into 5 equal quantiles, i.e.
0-20%, 20-40%, 40-60%, 60-80% and 80-100% buckets/bins.
• pd.qcut(df.year, q=5).head(7)
OUTPUT:
0                   NaN
1 (1991.999, 2010.0]
2 (2010.0, 2013.0]
3 (2015.0, 2017.0]
4 (2013.0, 2015.0]
5 (1991.999, 2010.0]
6 (2015.0, 2017.0]
Name: year, dtype: category
Categories (5, interval[float64]): [(1991.999, 2010.0] < (2010.0, 2013.0]
< (2013.0, 2015.0] < (2015.0, 2017.0] < (2017.0, 2020.0]]
Understanding frequencies
• Frequency for each categorical variable of the
dataset, both for the predictive variable and for
outcome, by using following code:
• print(df['Yr_cut'].value_counts())
• OUTPUT:
• new 3292
• medium 986
• old 61
• Name: Yr_cut, dtype: int64
Creating contingency table
• Contingency Table is one of the techniques for
exploring two or even more variables. It is
basically a tally of counts between two or
more categorical variables.
Example
import numpy as np
import pandas as pd
data = pd.read_csv("loan_status.csv")
print (data.head(10))
print(data.describe())
data_crosstab = pd.crosstab(data['grade'],
data['loan_status'],
margins = False)
print(data_crosstab)
Creating Applied Visualization for EDA – t-
test
• Understanding T-test
• The T-test is the test that compares two averages, also known as means, and tells us whether
they differ from each other or not. The T-test is also known as Student's T-test, and it also
tells us how significant the differences are. In other terms, it provides us knowledge of
whether those differences could have occurred by chance.
• Thus, we can conclude that the following:
• A large T-score implies that the groups are different from each other.
• A small T-score implies that the groups are similar.
• Understanding T-values and P-values
• Every T-value contains a P-value to work with it. A P-value is referred to as the probability
that the outcomes from the sample data happened coincidentally. P-values have values
starting from 0% to 100%. They are generally written as a decimal; for instance, a P-value of
10% is 0.1. It is good to have low P-values. Lower P-values indicate that the data did not
happen coincidentally. For instance, a P-value of 0.01 indicates that there is only a 1%
probability that the experiment's outcomes occurred coincidentally. Generally, in many cases,
a P-value of 5% or less, that is 0.05, is accepted as meaning the data is valid.
Example:
• Let us consider an example, we are given two-
sample data, each containing heights of 15
students of a class. We need to check whether
two different class students have the same
mean height. There are three ways to conduct
a two-sample T-Test in Python.
Method : Using Scipy library

• SciPy stands for Scientific Python; as the name implies, it is a
scientific Python library, and it uses NumPy under the cover. This library
provides a variety of functions that can be quite useful in data science.
First, let's create the sample data, and then perform the two sample T-test.
For this purpose, we have the ttest_ind() function in Python.
• Syntax: ttest_ind(data_group1, data_group2, equal_var=True/False)
• Here,
• data_group1: First data group
• data_group2: Second data group
• equal_var = “True”: The standard independent two sample t-test will be
conducted by taking into consideration the equal population variances.
• equal_var = “False”: The Welch’s t-test will be conducted by not taking
into consideration the equal population variances.
Example:
# Python program to demonstrate how to
# perform a two sample T-test

# Import the libraries
import numpy as np
import scipy.stats as stats

# Creating data groups
data_group1 = np.array([14, 15, 15, 16, 13, 8, 14, 17, 16, 14, 19, 20, 21, 15, 15, 16,
                        16, 13, 14, 12])

data_group2 = np.array([15, 17, 14, 17, 14, 8, 12, 19, 19, 14, 17, 22, 24, 16,
                        13, 16, 13, 18, 15, 13])

# Perform the two sample t-test with equal variances
print(stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True))
Analyzing the result:

Ttest_indResult(statistic=-0.6337397070250238, pvalue=0.5300471010405257)

• Two sample t-test has the following hypothesis:


• H0 => µ1 = µ2 (population mean of dataset1 is equal to
dataset2)
• HA => µ1 ≠µ2 (population mean of dataset1 is different from
dataset2)
• Here, since the p-value (0.53004) is greater than alpha = 0.05,
we cannot reject the null hypothesis of the test. We do not
have sufficient evidence to say that the mean height of
students between the two data groups is different.
Iris dataset –Description (used in book for
all examples)
• The Iris Dataset contains four features (length and width of sepals and petals) of 50
samples of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). These
measures were used to create a linear discriminant model to classify the species.
The dataset is often used in data mining, classification and clustering examples and
to test algorithms.
• sepal length (cm) sepal width (cm) ... petal width (cm) group
• 0 5.1 3.5 ... 0.2 setosa
• 1 4.9 3.0 ... 0.2 setosa
• 2 4.7 3.2 ... 0.2 setosa
• 3 4.6 3.1 ... 0.2 setosa
• 4 5.0 3.6 ... 0.2 setosa
• .. ... ... ... ... ...
• 51 7.0 3.2 4.7 1.4 versicolor
• 52 6.4 3.2 4.5 1.5 versicolor
• 145 6.7 3.0 ... 2.3 virginica
• 146 6.3 2.5 ... 1.9 virginica
• 147 6.5 3.0 ... 2.0 virginica
• 148 6.2 3.4 ... 2.3 virginica
• 149 5.9 3.0 ... 1.8 virginica
Example: Performing boxplots
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
iris=load_iris()
iris_nparray=iris.data
iris_dataframe=pd.DataFrame(iris.data,columns=iris.feature_names)
iris_dataframe['group']=pd.Series([iris.target_names[k] for k in iris.target])
print(iris_dataframe)
print(iris_dataframe.mean(numeric_only=True))
print(iris_dataframe.std(numeric_only=True))
print(iris_dataframe.max(numeric_only=True)-iris_dataframe.min(numeric_only=True))
boxplots=iris_dataframe.boxplot(fontsize=9)
plt.show()
boxplots=iris_dataframe.boxplot(column='petal length (cm)',by='group',fontsize=10)
plt.suptitle("")
plt.show()
OUTPUT:
• sepal length (cm) sepal width (cm) ... petal width (cm) group
• 0 5.1 3.5 ... 0.2 setosa
• 1 4.9 3.0 ... 0.2 setosa
• 2 4.7 3.2 ... 0.2 setosa
• 3 4.6 3.1 ... 0.2 setosa
• 4 5.0 3.6 ... 0.2 setosa
• .. ... ... ... ... ...
• 145 6.7 3.0 ... 2.3 virginica
• 146 6.3 2.5 ... 1.9 virginica
• 147 6.5 3.0 ... 2.0 virginica
• 148 6.2 3.4 ... 2.3 virginica
• 149 5.9 3.0 ... 1.8 virginica

• [150 rows x 5 columns]


• sepal length (cm) 5.843333
• sepal width (cm) 3.057333
• petal length (cm) 3.758000
• petal width (cm) 1.199333
• dtype: float64
• sepal length (cm) 0.828066
• sepal width (cm) 0.435866
• petal length (cm) 1.765298
• petal width (cm) 0.762238
• dtype: float64
• sepal length (cm) 3.6
• sepal width (cm) 2.4
• petal length (cm) 5.9
• petal width (cm) 2.4
• dtype: float64
Performing t-tests after boxplots
from scipy.stats import ttest_ind
group0=iris_dataframe['group']=='setosa'
group1=iris_dataframe['group']=='versicolor'
group2=iris_dataframe['group']=='virginica'
variable=iris_dataframe['petal length (cm)']
print('var1 %0.3f var2 %0.3f' %(variable[group1].var(),variable[group2].var()))
t,pvalue=ttest_ind(variable[group1],variable[group2],axis=0,equal_var=False)
print('t statistic %0.3f p-value %0.3f' %(t,pvalue))

OUTPUT:
var1 0.221 var2 0.305
t statistic -12.604 p-value 0.000
NOTE: when the p-value is below 0.05, we can confirm that the group means are
significantly different.
Observing parallel coordinates
• from pandas.plotting import parallel_coordinates
• iris_dataframe['group']=iris.target
• iris_dataframe['labels']=[iris.target_names[k] for k in iris_dataframe['group']]
• p11=parallel_coordinates(iris_dataframe,'labels')
Graphing distributions- Complete
distributions of values
• cols=iris_dataframe.columns[:4]
• densityplot=iris_dataframe[cols].plot(kind='density')
Understanding correlation
• Just as the relationship between variables is graphically representable, it is also
measurable by a statistical estimate. When working with numeric variables, the
estimate is a correlation, and the Pearson’s correlation is the most famous. The
Pearson’s correlation is the foundation for complex linear estimation models.
When you work with categorical variables, the estimate is an association, and
the chi‐square statistic is the most frequently used tool for
measuring association between features.
• Using covariance and correlation
• Covariance is the first measure of the relationship of two variables.
• It determines whether both variables have a coincident behavior with respect to
their mean. If the single values of two variables are usually above or below their
respective averages, the two variables have a positive association. It means that
they tend to agree, and you can figure out the behavior of one of the two by
looking at the other. In such a case, their covariance will be a positive number,
and the higher the number, the higher the agreement
Co-variance matrix
• print(iris_dataframe.cov(numeric_only=True))
Co-relation matrix
• print(iris_dataframe.corr(numeric_only=True))
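The chapter outline also lists nonparametric correlation; a minimal sketch (not from the slides) computes Spearman's rank correlation for two iris columns:

from scipy.stats import spearmanr

# Spearman's rho works on ranks, so it does not assume a linear relationship
rho, pvalue = spearmanr(iris_dataframe['sepal length (cm)'],
                        iris_dataframe['petal length (cm)'])
print('Spearman rho %0.3f p-value %0.3f' % (rho, pvalue))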
Considering the chi-square test for tables-
non parametric test
• A chi-square test is a statistical test used to
compare observed results with expected
results. The purpose of this test is to
determine if a difference between observed
data and expected data is due to chance, or if
it is due to a relationship between the
variables you are studying.
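A minimal sketch of the chi-square test using scipy (the observed counts below are invented for illustration; in practice the table would come from pd.crosstab, as in the earlier contingency-table example):

import numpy as np
from scipy.stats import chi2_contingency

# A small made-up 2x3 contingency table of observed counts
observed = np.array([[30, 14, 6],
                     [22, 18, 10]])

chi2, pvalue, dof, expected = chi2_contingency(observed)
print('chi-square %0.3f p-value %0.3f dof %d' % (chi2, pvalue, dof))
# A p-value below 0.05 suggests the two categorical variables are associated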
