Practical No - 1
Data Wrangling I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source dataset on the web (e.g., https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., the URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function,
and use the describe() function to get some initial statistics. Provide variable descriptions,
types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the
data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set.
If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python. In addition to the codes and
outputs, explain every operation that you do in the above steps and explain everything that you
do to import/read/scrape the data set.
Python Code:
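# 4. Data Preprocessing:
# iris_df is the DataFrame loaded in step 3.
# info() lists each column's dtype and non-null count. It prints its report
# directly and returns None, so it is not wrapped in print().
print("\nInformation about the dataset:")
iris_df.info()
# isnull().sum() counts the missing values in each column.
print(iris_df.isnull().sum())
# describe() gives summary statistics for the numeric columns.
print(iris_df.describe())
# Variable Descriptions:
# - Sepal Length, Sepal Width, Petal Length, Petal Width: Numeric variables.
# - Class: Categorical variable representing the species of iris flowers.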
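The iris data is usually complete, so the missing-value check in step 4 may report nothing. The following is a minimal sketch of what the same calls report when values are missing. The frame toy_df and its contents are hypothetical and typed in by hand for illustration; they are not part of the dataset used in this practical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the iris data) with two deliberately missing cells.
toy_df = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3],
    "petal_width": [0.2, 1.4, 2.5],
    "class": ["Iris-setosa", None, "Iris-virginica"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# shape gives the dimensions as (rows, columns).
print(toy_df.shape)

# info() shows the non-null count per column, which also reveals the gaps.
toy_df.info()

# describe() summarizes the numeric columns and skips missing values.
print(toy_df.describe())
```

In this sketch, sepal_length and class each have one missing value and petal_width has none.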
• The code starts by importing necessary libraries, including Pandas for data
manipulation and NumPy for numerical operations.
• The dataset URL is specified, and the read_csv function from Pandas is used to load the
dataset into a Pandas DataFrame.
• The info() and describe() functions are used to obtain initial statistics and check for
missing values.
• Variable descriptions are provided, and the dimensions of the DataFrame are printed.
• The data types of variables are displayed using dtypes.
• The 'class' variable is categorical, so one-hot encoding is applied using
pd.get_dummies() to convert it into quantitative variables.
• The updated DataFrame is displayed.
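The steps described above can be sketched end to end as follows. This is an illustrative sketch, not the exact code of this practical: the dataset URL is not reproduced here, so a few hand-typed rows in the iris format stand in for the downloaded file, and the column names (sepal_length, sepal_width, petal_length, petal_width, class) are assumed. Min-max scaling is used for the normalization in step 5 as one common choice; the practical does not say which method is intended.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file; column names are assumed.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# Step 3: load the data into a pandas DataFrame.
# With a real file, pass its path or URL to read_csv instead of the StringIO.
iris_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Steps 4 and 5: dimensions and data types of the variables.
print("Shape:", iris_df.shape)
print(iris_df.dtypes)

# Step 5: type conversion. read_csv stores text as 'object';
# the species label is better represented as a categorical type.
iris_df["class"] = iris_df["class"].astype("category")

# Step 5: min-max normalization of the numeric columns to the range [0, 1].
numeric_cols = iris_df.select_dtypes(include="number").columns
col_min = iris_df[numeric_cols].min()
col_range = iris_df[numeric_cols].max() - col_min
normalized_df = iris_df.copy()
normalized_df[numeric_cols] = (iris_df[numeric_cols] - col_min) / col_range

# Step 6: one-hot encoding turns 'class' into one indicator column per species.
encoded_df = pd.get_dummies(normalized_df, columns=["class"], prefix="class")
print(encoded_df.head())
```

With this sample, encoded_df has seven columns: the four numeric variables plus one indicator column for each of the three species. The column count for a real dataset depends on the columns in the file that was loaded.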
Output:
[5 rows x 6 columns]
Date:
Name & Signature of Instructor