Part A Assignment_No_1

Uploaded by

anaghasalunke44
SNJB’s Late Sau. K. B. Jain College of Engineering, Chandwad.

Department of Artificial Intelligence and Data Science


Subject: Software Laboratory III (317534)

Assignment No. 1 (Group A)

Title: Data Wrangling I, Perform the operations using Python on any open source dataset (e.g.,
data.csv)

Date of Completion: __________________ Date of Submission: __________________

S.N.  Criteria (ACVT)           Possible Marks  Obtained Marks
1     Answers (25%)             5
2     Coding Efficiency (25%)   5
3     Viva (25%)                5
4     Timely Completion (25%)   5
      Total                     20
      Total                     5

Prof. N.V. Sharma


(Subject Teacher)

Department of AI & DS Engineering 1


Assignment No. 1 (Group A)

Aim: Perform the following operations using Python on any open source dataset (e.g.,
data.csv)
1. Import all the required Python libraries.
2. Locate open source data from the web (e.g., https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., URL of the web site).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: Check for missing values in the data using pandas isnull(), and use the
describe() function to get some initial statistics. Provide variable descriptions, types of
variables, etc. Check the dimensions of the DataFrame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the
data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.

● Outcome: At the end of this experiment, the student will be able to perform data wrangling
on any open source dataset.

● Hardware Requirement: Computer System

● Software Requirement: Any Linux OS & PyCharm IDE

Theory: Data wrangling means processing data that arrives in various formats, using
operations such as merging, grouping, and concatenating, so that it can be analyzed or
combined with another set of data. The pandas library for Python provides functions for
each of these operations. This section looks at a few examples of these methods.

Merging Data
The pandas library in Python provides a single function, merge, as the entry point for all
standard database join operations between DataFrame objects:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False)

Let us now create two different DataFrames on which merge operations can be performed.
# import the pandas library
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# print() is a function in Python 3
print(left)
print(right)
Its output is as follows (columns appear in the order in which they were defined):

   id    Name subject_id
0   1    Alex       sub1
1   2     Amy       sub2
2   3   Allen       sub4
3   4   Alice       sub6
4   5  Ayoung       sub5

   id   Name subject_id
0   1  Billy       sub2
1   2  Brian       sub4
2   3   Bran       sub3
3   4  Bryce       sub6
4   5  Betty       sub5
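The two DataFrames above can then be passed to merge. The following is a minimal sketch, not part of the original listing, that joins them first on id and then on subject_id; the variable names by_id, by_subject, and left_join are chosen here for illustration only.

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join on 'id': every id occurs in both frames, so all 5 rows are kept.
# Non-key columns with the same name get the default suffixes _x and _y.
by_id = pd.merge(left, right, on='id')
print(by_id)

# Inner join on 'subject_id': only subjects present in both frames are kept.
by_subject = pd.merge(left, right, on='subject_id', how='inner')
print(by_subject)

# Left join on 'subject_id': all rows of 'left' are kept; subjects with no
# match in 'right' get NaN in the columns that come from 'right'.
left_join = pd.merge(left, right, on='subject_id', how='left')
print(left_join)
```

The how parameter selects the join type ('inner', 'left', 'right', or 'outer'), mirroring the corresponding SQL joins.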

Grouping Data
Grouping data sets is a frequent need in data analysis, where results are required for each
of the groups present in the data set. Pandas has built-in methods that can roll the data up
into groups.

In the example below, we group the data by year and then retrieve the rows for a specific year.

# import the pandas library
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# split the rows into one group per distinct Year value
grouped = df.groupby('Year')
# select only the rows that belong to the 2014 group
print(grouped.get_group(2014))
Its output is as follows:

     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
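Grouping is usually followed by an aggregation over each group. The sketch below is an illustrative addition that reuses the same data to compute per-year statistics; the names mean_points, summary, and team_counts are hypothetical. It also shows that the lowercase 'kings' entry in the data would be treated as a separate team unless the text is normalized first.

```python
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Average points scored in each year.
mean_points = df.groupby('Year')['Points'].mean()
print(mean_points)

# Several aggregations at once: one row per year, one column per statistic.
summary = df.groupby('Year')['Points'].agg(['count', 'min', 'max'])
print(summary)

# Group keys are case-sensitive, so 'kings' and 'Kings' would form two groups.
# Normalizing the capitalization first merges them into one.
team_counts = df['Team'].str.title().value_counts()
print(team_counts)
```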

Concatenating Data
Pandas provides various facilities for combining Series and DataFrame objects. In the
example below, the concat function concatenates objects along an axis (rows by default).
Let us create two objects and concatenate them.

import pandas as pd

one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [98, 90, 87, 69, 78]},
    index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [89, 80, 79, 97, 88]},
    index=[1, 2, 3, 4, 5])

# stack the rows of 'two' below the rows of 'one'
print(pd.concat([one, two]))

Its output is as follows (note that the index labels 1 to 5 are repeated):

     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88
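Because both inputs use the index labels 1 to 5, the concatenated result contains duplicate labels. The sketch below is an illustrative addition showing two common ways to handle this with concat: renumbering the rows, or tagging each row with its source. The names renumbered and labelled are hypothetical.

```python
import pandas as pd

one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [98, 90, 87, 69, 78]},
    index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [89, 80, 79, 97, 88]},
    index=[1, 2, 3, 4, 5])

# ignore_index=True discards the original labels and renumbers rows 0..9.
renumbered = pd.concat([one, two], ignore_index=True)
print(renumbered)

# keys=... adds an outer index level that records which frame a row came from.
labelled = pd.concat([one, two], keys=['one', 'two'])
print(labelled.loc['two'])  # only the rows that came from 'two'
```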
Conclusion: -
_________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________
_____________________________________________________________________________________________
-------------------------------------------------------------------------------------------------------------------------

Questions:

Q1. What is data wrangling?

Q2. What are the different methods of data wrangling?

Q3. What is the difference between data wrangling and ETL?

Q4. Can you explain what a dirty data record is in the context of data wrangling?

Q5. Can you explain what an outlier is? How do you deal with them?

Assignment No.1

1. Importing Required Python Libraries:


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Description of Open Source Data:


The dataset used for this task is the "Iris" dataset, which is available on the UCI Machine
Learning Repository website at https://archive.ics.uci.edu/ml/datasets/iris. This dataset
contains information about different species of Iris flowers, including measurements of the
length and width of their petals and sepals, as well as their species.

2. Loading the Dataset into pandas dataframe:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(url, names=column_names)

3. Data Preprocessing:
Checking for missing values and initial statistics using pandas isnull() and describe()
functions:

df.isnull().sum() # no missing values found


df.describe() # provides initial statistics of the dataset
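The Iris data has no missing values, so the check above returns zero for every column. To illustrate what the same check looks like when values are missing, the sketch below uses a small made-up DataFrame (the values in sample are hypothetical and are not taken from the Iris file) and shows two common remedies: dropping incomplete rows and filling numeric gaps with the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with deliberately missing entries (not real Iris rows).
sample = pd.DataFrame({
    'sepal_length': [5.1, np.nan, 4.7, 5.0],
    'sepal_width': [3.5, 3.0, np.nan, 3.6],
    'class': ['Iris-setosa', 'Iris-setosa', None, 'Iris-setosa']})

# Count the missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
dropped = sample.dropna()
print(dropped)

# Option 2: fill numeric gaps with the mean of the respective column.
numeric_cols = ['sepal_length', 'sepal_width']
filled = sample.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
print(filled)
```

Which option is appropriate depends on the dataset; dropping rows loses data, while mean imputation can distort the distribution.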

Variable Descriptions:

sepal_length: length of the sepal in centimeters


sepal_width: width of the sepal in centimeters

petal_length: length of the petal in centimeters
petal_width: width of the petal in centimeters
class: species of Iris flower (stored as Iris-setosa, Iris-versicolor, or Iris-virginica)

Checking the dimensions of the dataframe:

df.shape # (150, 5)

Data Formatting and Data Normalization:


Summarizing the types of variables by checking their data types:

df.dtypes

# Output:
# sepal_length float64
# sepal_width float64
# petal_length float64
# petal_width float64
# class object
# dtype: object
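All variables already have an appropriate data type: the four measurements are float64 and
class is stored as object (string), so no type conversion is required.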
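For datasets where a conversion is needed, or where the numeric columns should be normalized, the steps could look like the sketch below. It uses a small made-up DataFrame in which sepal_length is deliberately stored as text; the sample values and the choice of min-max scaling are assumptions made for illustration, not part of the original solution.

```python
import pandas as pd

# Hypothetical sample: sepal_length is stored as text to force a conversion.
sample = pd.DataFrame({
    'sepal_length': ['5.1', '4.9', '6.3', '5.8'],
    'petal_length': [1.4, 1.4, 6.0, 5.1],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica']})

# Type conversion: text to numbers, and object strings to the category dtype.
sample['sepal_length'] = pd.to_numeric(sample['sepal_length'])
sample['class'] = sample['class'].astype('category')
print(sample.dtypes)

# Min-max normalization: rescale each numeric column to the range [0, 1]
# using (x - min) / (max - min), computed column by column.
numeric_cols = ['sepal_length', 'petal_length']
col_min = sample[numeric_cols].min()
col_max = sample[numeric_cols].max()
normalized = (sample[numeric_cols] - col_min) / (col_max - col_min)
print(normalized)
```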

Turning Categorical Variables into Quantitative Variables:


In this dataset, the class variable is categorical. We can convert it into quantitative
variables using one-hot encoding, which creates one indicator column per species.

df = pd.get_dummies(df, columns=['class'])
df
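One-hot encoding is not the only option. The sketch below is an illustrative addition on a small made-up sample that contrasts it with label encoding, where each category is mapped to a single integer code; the names one_hot, codes, and labels are hypothetical. Passing dtype=int asks get_dummies for 0/1 integers, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical sample holding only the categorical column.
sample = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']})

# One-hot encoding: one 0/1 indicator column per distinct category.
one_hot = pd.get_dummies(sample, columns=['class'], dtype=int)
print(one_hot)

# Label encoding: each category is replaced by an integer code, assigned in
# order of first appearance; 'labels' maps each code back to its category.
codes, labels = pd.factorize(sample['class'])
sample['class_code'] = codes
print(sample)
print(list(labels))
```

Label encoding keeps a single column but implies an ordering among the codes, so one-hot encoding is often preferred when the categories have no natural order.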
