Python For Data Science
Unit 5: Data Wrangling
Dr Kruti Dangarwala
CSE & IT Department
SVMIT
Wrangling Data: (Chapter 12)
Exploring Data Analysis: (Chapter 13)
URL: https://ebookreading.net/view/book/EB9781119547624_12.html#
Wrangling Data: (Chapter 12)
• Playing with Scikit-learn
• Understanding classes in Scikit-learn
• Defining applications for data science
• Performing the Hashing Trick
• Using hash functions
• Demonstrating the hashing trick
• Working with deterministic selection
• Considering Timing and Performance
• Benchmarking with timeit
• Working with the memory profiler
• Running in Parallel on Multiple Cores
• Performing multicore parallelism
• Demonstrating multiprocessing
Data Wrangling - Introduction
Data wrangling—also called data cleaning, data remediation, or data
munging—refers to a variety of processes designed to transform raw data
into more readily used formats.
The exact methods differ from project to project depending on the data
you’re leveraging and the goal you’re trying to achieve.
4. Enriching
• Once you understand your existing data and have transformed it into a
more usable state, you must determine whether you have all of the
data necessary for the project at hand.
• If not, you may choose to enrich or augment your data by
incorporating values from other datasets. For this reason, it’s important
to understand what other data is available for use.
• If you decide that enrichment is necessary, you need to repeat the
steps above for any new data.
5. Validating
• Data validation refers to the process of verifying that your data is both
consistent and of a high enough quality.
• During validation, you may discover issues you need to resolve or
conclude that your data is ready to be analyzed. Validation is typically
achieved through various automated processes and requires
programming.
6. Publishing
• Once your data has been validated, you can publish it. This involves
making it available to others within your organization for analysis.
• The format you use to share the information—such as a written report
or electronic file—will depend on your data and the organization’s
goals.
Scikit-learn-Introduction
• Scikit-learn is the package for machine learning and data science
experimentation favored by most data scientists.
• It contains a wide range of well-established learning algorithms, error
functions, and testing procedures.
• Scikit-learn (Sklearn) is the most useful and robust library for machine learning
in Python.
• It provides a selection of efficient tools for machine learning and statistical
modeling including classification, regression, clustering and dimensionality
reduction via a consistent interface in Python.
• This library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.
• Package: import sklearn
Install scikit-learn
• Using pip: the following command installs scikit-learn:
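pip install scikit-learn
(Run the command from a terminal or command prompt; inside a Jupyter notebook, prefix it with ! to run it as a shell command.)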
• Transforming data
Classification
• Classification is a process of categorizing data or objects into predefined
classes or categories based on their features or attributes. In machine learning,
classification is a type of supervised learning technique where an algorithm is
trained on a labeled dataset to predict the class or category of new, unseen
data.
• The main objective of classification is to build a model that can accurately
assign a label or category to a new observation based on its features. For
example, a classification model might be trained on a dataset of images labeled
as either dogs or cats and then used to predict the class of new, unseen images
of dogs or cats based on their features such as color, texture, and shape.
Example:
Suppose we want to predict the possibility of rain in some regions on the basis of some parameters. Then there would be two labels, rain and no rain, under which different regions can be classified.
Types of Classification
Classification is of two types:
• Binary Classification: In binary classification, the goal is to classify the
input into one of two classes or categories. Example – On the basis of
the given health conditions of a person, we have to determine whether
the person has a certain disease or not.
• Multiclass Classification: In multi-class classification, the goal is to
classify the input into one of several classes or categories. For example,
on the basis of data about different species of flowers, we have to
determine which species our observation belongs to.
Types of classification algorithms
1) Linear Classifiers: Linear models create a linear decision boundary between classes. They are
simple and computationally efficient. Some of the linear classification models are as follows:
– Logistic Regression
– Support Vector Machines having kernel = ‘linear’
– Single-layer Perceptron
– Stochastic Gradient Descent (SGD) Classifier
2) Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes.
They can capture more complex relationships between the input features and the target variable.
Some of the non-linear classification models are as follows:
– K-Nearest Neighbours
– Kernel SVM
– Naive Bayes
– Decision Tree Classification
– Ensemble learning classifiers:
• Random Forests,
• AdaBoost,
• Bagging Classifier,
• Voting Classifier,
• ExtraTrees Classifier
– Multi-layer Artificial Neural Networks
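As a minimal illustration of fitting one of these classifiers with Scikit-learn (a logistic regression trained on the built-in iris dataset; the split proportions and random_state below are arbitrary choices, not taken from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a small labeled dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a linear classifier, then predict the class of unseen observations
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))
print("Test accuracy:", clf.score(X_test, y_test))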
Regression
• Regression is the process of finding a model or function that maps the
data to continuous real values instead of discrete classes.
Mathematically, a regression problem tries to find the function
approximation with the minimum error deviation; the model predicts a
numeric value that depends on the input features.
• Let's take a similar example in regression, where we are finding the
possibility of rain in some particular regions with the help of some
parameters. In this case, there is a probability associated with the rain.
Here we are not classifying the regions under rain and no-rain labels;
instead, we associate each region with its probability of rain.
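A minimal Scikit-learn sketch of the regression setting; the synthetic data below is made up purely to show the fit/predict pattern and is not from the slides:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relationship between one feature and a continuous target
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 2.5 * X.ravel() + rng.randn(100)

# Fit a linear regression model and predict a continuous value for a new input
model = LinearRegression()
model.fit(X, y)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("Prediction for x=4.0:", model.predict([[4.0]])[0])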
Classification model Evaluations
• Classification Accuracy: The proportion of correctly classified instances over the total number
of instances in the test set. It is a simple and intuitive metric but can be misleading in
imbalanced datasets where the majority class dominates the accuracy score.
• Confusion matrix: A table that shows the number of true positives, true negatives, false
positives, and false negatives for each class, which can be used to calculate various evaluation
metrics.
• Precision and Recall: Precision measures the proportion of true positives over the total number
of predicted positives, while recall measures the proportion of true positives over the total
number of actual positives. These metrics are useful in scenarios where one class is more
important than the other, or when there is a trade-off between false positives and false negatives.
• F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) /
(precision + recall). It is a useful metric for imbalanced datasets where both precision and recall
are important.
• ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the true
positive rate (recall) against the false positive rate (1-specificity) for different threshold values of
the classifier’s decision function. The Area Under the Curve (AUC) measures the overall
performance of the classifier, with values ranging from 0.5 (random guessing) to 1 (perfect
classification).
• Cross-validation: A technique that divides the data into multiple folds and repeatedly trains the
model on all but one fold while testing on the held-out fold, to obtain a more robust estimate of the
model's performance.
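A minimal sketch of computing these metrics with scikit-learn; the arrays y_true, y_pred and y_score below are made-up labels and scores used purely for illustration:

from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical ground-truth labels, predicted labels and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3, 0.95, 0.85]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))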
Dataset loading: Boston dataset
• The Boston Housing Dataset is a widely used dataset in machine learning and predictive analytics. It contains housing information for various neighborhoods in Boston.

from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print("X:%s y:%s" % (X.shape, y.shape))

Output:
X:(506, 13) y:(506,)
Information: Boston Housing Dataset
• The Boston Housing Dataset is derived from information collected by the U.S. Census
Service concerning housing in the area of Boston, MA. The dataset columns are:
– CRIM: per-capita crime rate by town
– ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
– INDUS: proportion of non-retail business acres per town
– CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
– NOX: nitric oxides concentration (parts per 10 million)
– RM: average number of rooms per dwelling
– AGE: proportion of owner-occupied units built prior to 1940
– DIS: weighted distances to five Boston employment centres
– RAD: index of accessibility to radial highways
– TAX: full-value property-tax rate per $10,000
– PTRATIO: pupil-teacher ratio by town
– B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
– LSTAT: percentage of lower-status population
– MEDV: median value of owner-occupied homes in $1000s (the target variable)
Benchmarking with timeit
• The %timeit line magic times a statement and reports the mean and standard deviation across several runs.
OUTPUT:
231 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
267 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 20 loops each)
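For illustration, timings in this format come from calls like the following in IPython/Jupyter; the statements below are stand-ins, not necessarily the ones timed on the original slide:

# -r sets the number of runs, -n the number of loops per run
%timeit -r 7 -n 1 l = [k for k in range(10**6)]
%timeit -r 5 -n 20 sum(range(10**6))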
Cell magic function %%timeit
%%timeit
l = list()
for k in range(10**6):
    l.append(k)
OUTPUT:
198 ms ± 6.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Working with the memory profiler

# Import the profile decorator from the memory-profiler module
from memory_profiler import profile

# @profile is a decorator: any function decorated with it is tracked line by line
@profile
# A default function to check memory usage
def defFunc():
    # Some random variables
    var1 = [1] * (6 ** 4)
    var2 = [1] * (2 ** 3)
    var3 = [2] * (4 * 6 ** 3)
    # Operations on the variables
    del var3
    del var1
    return var2
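To actually see the per-line report, the decorated function must be called and the script run from the command line; a minimal usage sketch (the file name memory_example.py is just an illustration, not from the slides):

# At the end of memory_example.py, call the tracked function
defFunc()

# Then run from a terminal; -m memory_profiler prints the line-by-line memory report
# python -m memory_profiler memory_example.py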
Calculating Mean, Median and Mode

import pandas as pd

# Create a DataFrame (d is the dictionary of 'Name', 'Age' and 'Rating' series
# defined on an earlier slide)
df = pd.DataFrame(d)
print("dataframe")
print(df.describe())
print("Mean Values in the Distribution")
print(df.mean())
print("*******************************")
print("Median Values in the Distribution")
print(df.median())
print("Mode in the Distribution")
print(df.mode())
Output:
• Mean Values in the Distribution
Age 31.833333
Rating 3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age 29.50
Rating 3.79
dtype: float64
Calculating Mode:
Mode in the Distribution
Name Age Rating
0 Andres 23.0 2.56
1 Chanchal 25.0 2.98
2 Gasper 30.0 3.20
3 Jack NaN 3.24
4 James NaN 3.65
5 Lee NaN 3.78
6 Naviya NaN 3.80
7 Ricky NaN 3.98
8 Smith NaN 4.10
9 Steve NaN 4.23
Dispersion
(tail of a truncated DataFrame display)
[5 rows x 8 columns]

pd.cut() - binning the numeric 'year' column of the car dataset

Unique values of the 'year' column:
[ nan 1992. 1995. 1996. 1997. 1998. 1999. 2000. 2001. 2002. 2003. 2004.
 2005. 2006. 2007. 2008. 2009. 2010. 2011. 2012. 2013. 2014. 2015. 2016.
 2017. 2018. 2019. 2020.]

'year' cut into 3 equal-width bins (interval labels):
0                     NaN
1    [2001.333, 2010.667)
2    [2010.667, 2020.028)
3    [2010.667, 2020.028)
4    [2010.667, 2020.028)
Name: year, dtype: category
Categories (3, interval[float64]): [[1992.0, 2001.333) < [2001.333, 2010.667) < [2010.667, 2020.028)]

The same bins with the labels 'old', 'medium', 'new':
0       NaN
1    medium
2       new
3       new
4       new
Name: year, dtype: category
Categories (3, object): ['old' < 'medium' < 'new']

df.head() after storing the labels in a new 'Yr_cut' column:
                       name    year  ...         owner  Yr_cut
0             Maruti 800 AC     NaN  ...   First Owner     NaN
1  Maruti Wagon R LXI Minor  2007.0  ...   First Owner  medium
2      Hyundai Verna 1.6 SX  2012.0  ...   First Owner     new
3    Datsun RediGO T Option  2017.0  ...   First Owner     new
4     Honda Amaze VX i-DTEC  2014.0  ...  Second Owner     new
[5 rows x 9 columns]

The same bins with labels=False (integer bin codes):
0    NaN
1    1.0
2    2.0
3    2.0
4    2.0
Name: year, dtype: float64

Frequency of each bin (value_counts of 'Yr_cut'):
new       3292
medium     986
old         61
Name: Yr_cut, dtype: int64
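A minimal sketch of the pd.cut() steps that produce output like the above, using a small made-up DataFrame in place of the full car dataset:

import numpy as np
import pandas as pd

# Small stand-in for the car dataset (values are illustrative only)
df = pd.DataFrame({
    'name': ['Maruti 800 AC', 'Maruti Wagon R LXI Minor', 'Hyundai Verna 1.6 SX',
             'Datsun RediGO T Option', 'Honda Amaze VX i-DTEC'],
    'year': [np.nan, 2007, 2012, 2017, 2014],
})

print(df['year'].unique())                       # distinct year values
print(pd.cut(df['year'], bins=3))                # 3 equal-width interval bins
df['Yr_cut'] = pd.cut(df['year'], bins=3, labels=['old', 'medium', 'new'])
print(df.head())                                 # DataFrame with the new Yr_cut column
print(pd.cut(df['year'], bins=3, labels=False))  # integer bin codes instead of intervals
print(df['Yr_cut'].value_counts())               # frequency of each bin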
pd.qcut()
• Qcut (quantile-cut) differs from cut in the sense that, in qcut, the
number of elements in each bin will be roughly the same, but this will
come at the cost of differently sized interval widths
• On the other hand, in cut, the bin edges are equal-sized (when we
specify bins=3) with an uneven number of elements in each bin or
group. Also, cut is useful when you know for sure the interval ranges
and the bins.
• For example, if binning an 'age' column, we know infants are between
0 and 1 years old, 1-12 years are kids, 13-19 are teenagers, 20-60 are
working-class grownups, and 60+ are senior citizens. So we can
appropriately set bins=[0, 1, 12, 19, 60, 140] and labels=['infant', 'kid',
'teenager', 'grownup', 'senior citizen'] (see the sketch after this list). In qcut,
when we specify q=5, we are telling pandas to cut the year column into 5 equal
quantiles, i.e. 0-20%, 20-40%, 40-60%, 60-80% and 80-100% buckets/bins.
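A minimal sketch of the age-binning case just described; the ages Series is made up for illustration:

import pandas as pd

# Explicit, unequal-width bin edges with human-readable labels
ages = pd.Series([0.5, 7, 15, 22, 45, 67, 80])
groups = pd.cut(ages,
                bins=[0, 1, 12, 19, 60, 140],
                labels=['infant', 'kid', 'teenager', 'grownup', 'senior citizen'])
print(groups)
print(groups.value_counts())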
• pd.qcut(df.year, q=5).head(7)
OUTPUT:
0                   NaN
1 (1991.999, 2010.0]
2 (2010.0, 2013.0]
3 (2015.0, 2017.0]
4 (2013.0, 2015.0]
5 (1991.999, 2010.0]
6 (2015.0, 2017.0]
Name: year, dtype: category
Categories (5, interval[float64]): [(1991.999, 2010.0] < (2010.0, 2013.0]
< (2013.0, 2015.0] < (2015.0, 2017.0] < (2017.0, 2020.0]]
Understanding frequencies
• You can obtain the frequency of each level of a categorical variable in
the dataset, both for predictor variables and for the outcome, by using
the following code:
• print(df['Yr_cut'].value_counts())
• OUTPUT:
• new 3292
• medium 986
• old 61
• Name: Yr_cut, dtype: int64
Creating contingency table
• Contingency Table is one of the techniques for
exploring two or even more variables. It is
basically a tally of counts between two or
more categorical variables.
Example
import numpy as np
import pandas as pd
data = pd.read_csv("loan_status.csv")
print (data.head(10))
print(data.describe())
data_crosstab = pd.crosstab(data['grade'],
data['loan_status'],
margins = False)
print(data_crosstab)
Creating Applied Visualization for EDA – t-test
• Understanding T-test
• The T-test compares two averages (means) and tells us whether they differ from each other.
It is also known as Student's T-test, and it tells us how significant the difference is; in other
words, it tells us whether the difference could have occurred by chance.
• Thus, we can conclude the following:
• A large T-score implies that the groups are different from each other.
• A small T-score implies that the groups are similar.
• Understanding T-values and P-values
• Every T-value has an associated P-value. A P-value is the probability that the outcomes from
the sample data happened by chance. P-values range from 0% to 100% and are generally
written as a decimal; for instance, a P-value of 10% is 0.1. It is good to have low P-values:
lower P-values indicate that the data did not happen by chance. For instance, a P-value of 0.01
indicates that there is only a 1% probability that the experiment's outcomes occurred by
chance. In many cases, a P-value of 5% (that is, 0.05) is accepted as the threshold below which
a result is considered statistically significant.
Example:
• Let us consider an example: we are given two samples, each
containing the heights of students from two different classes,
and we need to check whether the two classes have the same
mean height. There are several ways to conduct a two-sample
T-test in Python; the method shown here uses the SciPy library.
Method: Using the SciPy library

import numpy as np
from scipy import stats

# data_group1 (the first sample of heights) is defined on an earlier slide
data_group2 = np.array([15, 17, 14, 17, 14, 8, 12, 19, 19, 14, 17, 22, 24, 16,
                        13, 16, 13, 18, 15, 13])

# Two-sample (independent) t-test
print(stats.ttest_ind(a=data_group1, b=data_group2))

OUTPUT:
Ttest_indResult(statistic=-0.6337397070250238, pvalue=0.5300471010405257)
OUTPUT (from another t-test comparison, showing the variances of the two groups, then the t statistic and p-value):
var1 0.221   var2 0.305
t statistic -12.604   p-value 0.000
NOTE: when the p-value is below 0.05, we can conclude that the group means are
significantly different.
Observing parallel coordinates
• from pandas.plotting import parallel_coordinates
• iris_dataframe['group']=iris.target
• iris_dataframe['labels']=[iris.target_names[k] for k in iris_dataframe['group']]
• p11=parallel_coordinates(iris_dataframe,'labels')
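For context, a self-contained sketch of this plot, assuming iris_dataframe is built from scikit-learn's iris data as on the earlier slides:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from pandas.plotting import parallel_coordinates

# Rebuild the DataFrame used on the slides: four feature columns plus label columns
iris = load_iris()
iris_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_dataframe['group'] = iris.target
iris_dataframe['labels'] = [iris.target_names[k] for k in iris_dataframe['group']]

# One line per observation, colored by class label
parallel_coordinates(iris_dataframe, 'labels')
plt.show()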
Graphing distributions - Complete distributions of values
• cols=iris_dataframe.columns[:4]
• densityplot=iris_dataframe[cols].plot(kind='density')
Understanding correlation
• Just as the relationship between variables is graphically representable, it is also
measurable by a statistical estimate. When working with numeric variables, the
estimate is a correlation, and the Pearson’s correlation is the most famous. The
Pearson’s correlation is the foundation for complex linear estimation models.
When you work with categorical variables, the estimate is an association, and
the chi‐square statistic is the most frequently used tool for
measuring association between features.
• Using covariance and correlation
• Covariance is the first measure of the relationship of two variables.
• It determines whether both variables have a coincident behavior with respect to
their mean. If the single values of two variables are usually above or below their
respective averages, the two variables have a positive association. It means that
they tend to agree, and you can figure out the behavior of one of the two by
looking at the other. In such a case, their covariance will be a positive number,
and the higher the number, the higher the agreement
Covariance matrix
• print(iris_dataframe.cov())
Correlation matrix
• print(iris_dataframe.corr())
Considering the chi-square test for tables - a non-parametric test
• A chi-square test is a statistical test used to
compare observed results with expected
results. The purpose of this test is to
determine if a difference between observed
data and expected data is due to chance, or if
it is due to a relationship between the
variables you are studying.
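A minimal sketch of running a chi-square test of independence on a contingency table with SciPy; the table below is made up (in practice it could be the pd.crosstab() result from the earlier slide):

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table of counts (e.g., grade vs. loan status)
table = pd.DataFrame([[30, 10],
                      [25, 15],
                      [10, 20]],
                     index=['A', 'B', 'C'],
                     columns=['Fully Paid', 'Charged Off'])

# Test whether the two categorical variables are independent
chi2, p, dof, expected = chi2_contingency(table)
print("chi-square statistic:", chi2)
print("p-value:", p)
print("degrees of freedom:", dof)
print("expected counts:\n", expected)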