Part A Assignment_No_1
Title: Data Wrangling I, Perform the operations using Python on any open source dataset (e.g.,
data.csv)
Aim: Perform the following operations using Python on any open source dataset (e.g.,
data.csv)
1. Import all the required Python Libraries.
2. Locate open source data from the web (e.g., https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., URL of the web site).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: Check for missing values in the data using pandas isnull(), and use the
describe() function to get some initial statistics. Provide variable descriptions, types of
variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
Merging Data
The pandas library in Python provides a single function, merge, as the entry point for all
standard database join operations between DataFrame objects, for example between the
following two DataFrames:
     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5

    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5
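The joins described above can be sketched as follows. This is a minimal illustration, not the handout's own code: the two frames are rebuilt by hand from the printed tables, and the variable names `left` and `right` are assumptions.

```python
import pandas as pd

# The two frames printed above, rebuilt by hand. The names `left` and
# `right` are assumptions; the handout does not show how they were created.
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
})

# Inner join on a single key: every id occurs in both frames, so 5 rows.
# Overlapping column names get the default suffixes _x and _y.
by_id = pd.merge(left, right, on='id')
print(by_id)

# Inner join on subject_id keeps only subjects present in both frames.
by_subject = pd.merge(left, right, on='subject_id', how='inner')
print(by_subject)

# Left join keeps every row of `left`; rows without a match get NaN
# in the columns that come from `right`.
left_join = pd.merge(left, right, on='subject_id', how='left')
print(left_join)
```

The `how` argument also accepts 'right' and 'outer', which correspond to the remaining standard database joins.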
Grouping Data
Grouping a data set is a frequent need in data analysis, where results are required for each
of the groups present in the data. Pandas has built-in methods that can split the data
into such groups.
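In the example below, we group the data by year and then get the result for a specific year
(df is assumed to be a DataFrame that has a 'Year' column).
grouped = df.groupby('Year')
print(grouped.get_group(2014))
The call to get_group(2014) returns only the rows whose Year is 2014.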
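Because the snippet above relies on a df that this handout does not define, here is a self-contained sketch of the same grouping. The rows (team names, years, points) are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: these rows are invented only so that there is a
# 'Year' column to group on.
df = pd.DataFrame({
    'Team': ['Riders', 'Kings', 'Devils', 'Riders', 'Kings', 'Devils'],
    'Year': [2014, 2014, 2015, 2015, 2016, 2014],
    'Points': [876, 741, 673, 789, 756, 812],
})

grouped = df.groupby('Year')

# All rows that belong to one group
print(grouped.get_group(2014))

# One aggregate value per group
mean_points = grouped['Points'].mean()
print(mean_points)

# Iterating over the grouped object yields (key, sub-frame) pairs
for year, frame in grouped:
    print(year, len(frame))
```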
Concatenating Data
Pandas provides various facilities for combining Series and DataFrame objects. In the
example below, the concat function concatenates objects along an axis. Let us create two
DataFrames and concatenate them.
import pandas as pd

one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [98, 90, 87, 69, 78]},
    index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [89, 80, 79, 97, 88]},
    index=[1, 2, 3, 4, 5])

# Stack the two frames row-wise (axis=0 is the default)
print(pd.concat([one, two]))
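Plain row-wise concatenation keeps each frame's original index, so index labels can repeat in the result. The sketch below, using shortened two-row versions of the frames (an assumption made for brevity), shows a few common ways to control the index and the axis.

```python
import pandas as pd

# Shortened stand-ins for the frames above (two rows each, for brevity).
one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]},
                   index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]},
                   index=[1, 2])

# Default: rows are stacked and the original index labels are kept,
# so labels 1 and 2 each appear twice.
stacked = pd.concat([one, two])
print(stacked)

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([one, two], ignore_index=True)
print(renumbered)

# keys=... adds an outer index level that records the source frame.
labelled = pd.concat([one, two], keys=['x', 'y'])
print(labelled)

# axis=1 places the frames side by side, aligning on the index.
side_by_side = pd.concat([one, two], axis=1)
print(side_by_side)
```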
Questions:
Q4. Can you explain what a dirty data record is in context of data wrangling?
Q5. Can you explain what an outlier is? How do you deal with them?
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(url, names=column_names)
3. Data Preprocessing:
Checking for missing values and initial statistics using pandas isnull() and describe()
functions:
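The checks named above might look like the following sketch. To keep it runnable without network access, it reads a six-row stand-in in the same five-column layout instead of the full file; with the real data, the same calls would be applied to the df loaded earlier.

```python
import io
import pandas as pd

column_names = ['sepal_length', 'sepal_width', 'petal_length',
                'petal_width', 'class']

# Six-row stand-in in the same layout as the Iris file (an assumption made
# so that the example runs offline; the full file has 150 rows).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)
df = pd.read_csv(sample, names=column_names)

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Count, mean, std, min, quartiles and max of the numeric columns
summary = df.describe()
print(summary)

# Frequency of each value in the categorical column
print(df['class'].value_counts())
```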
Variable Descriptions:
df.shape # (150, 5)
df.dtypes
# Output:
# sepal_length float64
# sepal_width float64
# petal_length float64
# petal_width float64
# class object
# dtype: object
All variables are in the correct data type.
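No conversion is needed here, but for a dataset where a column does arrive with the wrong type, the conversion and a simple min-max normalization could be sketched as below. The frame `raw` is hypothetical: its sepal_length values are deliberately stored as text to have something to convert.

```python
import pandas as pd

# Hypothetical frame: sepal_length is stored as text on purpose. This is
# NOT how the Iris file loads; it only illustrates a type conversion.
raw = pd.DataFrame({
    'sepal_length': ['5.1', '4.9', '7.0', '6.3'],
    'petal_length': [1.4, 1.4, 4.7, 6.0],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
              'Iris-virginica'],
})

# Text to numeric, and free text to the pandas categorical type
raw['sepal_length'] = pd.to_numeric(raw['sepal_length'])
raw['class'] = raw['class'].astype('category')
print(raw.dtypes)

# Min-max normalization: rescale each numeric column to the range [0, 1]
numeric_cols = ['sepal_length', 'petal_length']
col_min = raw[numeric_cols].min()
col_max = raw[numeric_cols].max()
normalized = (raw[numeric_cols] - col_min) / (col_max - col_min)
print(normalized)
```

Min-max scaling is only one choice of normalization; standardizing each column to zero mean and unit variance is a common alternative.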
# Turn the categorical 'class' column into one indicator (dummy) column per class
df = pd.get_dummies(df, columns=['class'])
df
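Two common ways to make a categorical column quantitative are integer codes and one-hot columns. The sketch below shows both on a small hypothetical frame. Note that recent pandas versions (2.0 and later) return boolean dummy columns by default, so dtype=int is passed to obtain 0/1 values.

```python
import pandas as pd

# Small hypothetical frame with one categorical column
df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0, 1.5],
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
              'Iris-setosa'],
})

# Integer codes: each distinct label gets a number in order of appearance
codes, labels = pd.factorize(df['class'])
df_codes = df.assign(class_code=codes)
print(df_codes)

# One-hot encoding: one 0/1 column per label
one_hot = pd.get_dummies(df, columns=['class'], dtype=int)
print(one_hot)
```

Integer codes impose an artificial ordering on the labels, so one-hot columns are usually the safer choice when the categories have no natural order.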