DS FINAL

This document is a laboratory manual for a Data Science course (SSCA3021) in the Department of Computer Science and Application. It outlines practical exercises covering Python basics, data structures, libraries like Pandas and NumPy, exploratory data analysis, data munging, and building predictive models using various algorithms. The manual also includes a mini project for students to apply their learned skills in a group setting.


P P Savani School of Engineering

LABORATORY MANUAL

Department of Computer Science and Application

DATA SCIENCE
(SSCA3021) | BSCIT/BCA

Name of Student: ____________________________________________

Enrollment No.: _______________    Academic Year: _______________


P P Savani School of Engineering

This is to certify that

Mr./Ms. ____________________________________________

of the Department of Computer Science and Application, having Enrollment No. _______________,

has completed his/her Term work in the subject of

DATA SCIENCE (SSCA3021).

Marks Obtained: _________ out of _________

Sign of Faculty: ___________________        Sign of Head of Department: ___________________

Date: ___________________                   Date: ___________________
CONTENTS

SR. NO.   NAME OF PRACTICAL                                                        PAGE NO.   DATE   MARK   SIGN
1.        Basics of Python for Data Analysis
2.        Why learn Python for data analysis?
3.        Python 2.7 v/s 3.4
4.        How to install Python? Running a few simple programs in Python
5.        Python libraries and data structures
6.        Python Data Structures
7.        Python Iteration and Conditional Constructs, Python Libraries
8.        Exploratory analysis in Python using Pandas
9.        Introduction to series and data frames
10.       Analytics of dataset - Loan Prediction Problem
11.       Data Munging in Python using Pandas
12.       Building a Predictive Model in Python: Logistic Regression, Decision Tree, Random Forest
13.       Building a Predictive Model in Python: Logistic Regression, Decision Tree, Random Forest
14.       Mini Project
Practical - 1, 2, 3, 4

Aim : Introduction to Python Environment

Basics of Python for Data Science:

Understand the differences between Python 2.7 and Python 3.4

Learn how to install Python

Write and run simple Python programs

Installing Python:

Download Python 3.10 from the official website.


Follow the installation steps for your OS.
Verify the installation by running python --version in the terminal.
Creating Your First Python Project:
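The original screenshot for this step did not survive; a minimal first script along these lines (file name `first_program.py` is just an illustrative choice) verifies the installation and runs a few simple statements:

```python
# first_program.py - a minimal first script to verify the installation.
import sys

# Print the interpreter version, similar to running `python --version`.
print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}")

# A simple variable and print statement.
greeting = "Hello, Data Science!"
print(greeting)

# Basic arithmetic on a list of numbers.
numbers = [1, 2, 3, 4, 5]
total = sum(numbers)
print("Sum of", numbers, "is", total)
```

Save the file and run it with `python first_program.py` from the terminal.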

Conclusion:

You have successfully installed Python and executed basic Python code.
Practical - 5, 6

Aim: Python Libraries and Data Structures

Objectives:
Learn about Python's data structures.
Understand conditional constructs and iterations.
Explore important Python libraries.

Data Structures:

List:

Definition: A list is an ordered, mutable collection of items. It can contain elements of different types.

Syntax: Lists are defined by placing elements inside square brackets [].

Dictionaries:

Definition: A dictionary is a collection of key-value pairs. Each key is unique and maps to a value.

Dictionaries are mutable.


Syntax: Dictionaries are defined using curly braces {} with key-value pairs separated by a colon:

Sets:

Definition: A set is an unordered collection of unique elements. Sets are mutable but do not allow duplicate

elements.

Syntax: Sets are defined by placing elements inside curly braces {} or using the set() function.
Tuples:

Definition: A tuple is an ordered collection of items, but unlike lists, it is immutable, meaning it cannot be

changed after creation.

Syntax: Tuples are defined by placing elements inside parentheses ().
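The four data structures above can be sketched together in one short example (the values, such as the student name "Asha", are hypothetical):

```python
# List: ordered and mutable; may hold items of different types.
marks = [78, 85, 91]
marks.append(88)                     # lists can grow in place

# Dictionary: unique keys mapping to values; mutable.
student = {"name": "Asha", "enrollment": 101}
student["branch"] = "BCA"            # add a new key-value pair

# Set: unordered, no duplicates allowed.
branches = {"BCA", "BSCIT", "BCA"}   # duplicate "BCA" is stored only once

# Tuple: ordered but immutable; cannot be changed after creation.
point = (3, 4)
```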

Summary of Key Differences:


Practical - 7

Aim: Iteration and Conditional Constructs

Iteration:
Iteration refers to the repeated execution of a block of code. Common types of iteration constructs include:

For Loop: Used to repeat a block of code a specific number of times.

While Loop: Continues to execute as long as a specified condition is true.

Conditional Constructs:
Conditional constructs allow you to execute different blocks of code based on certain conditions. The main types
include:

If Statement: Executes a block of code if a condition is true.

If-Else Statement: Provides an alternative block of code if the condition is false.

Elif Statement: Allows checking multiple conditions.


Combining Iteration and Conditionals
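A short sketch of a for loop, a while loop, and if/else working together:

```python
# For loop + if/else: classify each number in a range as even or odd.
results = []
for n in range(1, 6):
    if n % 2 == 0:
        results.append(f"{n} is even")
    else:
        results.append(f"{n} is odd")

# While loop: repeats as long as the condition is true.
count = 3
while count > 0:
    count -= 1                       # loop stops once count reaches 0
```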

Key Python Libraries:


NumPy
NumPy (Numerical Python) is a powerful library for numerical operations in Python. It provides support for
large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on
these arrays.

Key Features:
N-dimensional arrays: Efficient storage and manipulation of large datasets.
Mathematical functions: Functions for element-wise operations, linear algebra, statistical operations, etc.
Broadcasting: A powerful mechanism for performing operations on arrays of different shapes.

Pandas
Pandas is a library primarily used for data manipulation and analysis. It provides data structures like Series and
DataFrame, which allow for easy handling of structured data.

Key Features:
DataFrame: Two-dimensional size-mutable, potentially heterogeneous tabular data.
Data manipulation: Easy indexing, filtering, grouping, and aggregating of data.
Handling missing data: Built-in functionality for managing missing values.

Matplotlib
Matplotlib is a plotting library that provides a flexible way to create static, animated, and interactive
visualizations in Python.

Key Features:
Versatile plotting: Support for line plots, bar plots, histograms, scatter plots, and more.
Customization: Extensive options for customizing plots (titles, labels, legends, etc.).
Integration: Can be easily integrated with NumPy and Pandas for plotting data.

Combining Them
Here’s a combined example using all three libraries:
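The original example was a screenshot; a minimal reconstruction in the same spirit, computing with NumPy, organizing with Pandas, and plotting with Matplotlib (the non-interactive `Agg` backend is used here so the script also runs without a display):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")                # headless backend for scripts
import matplotlib.pyplot as plt

# NumPy: build an array and apply an element-wise operation.
x = np.arange(1, 6)
squares = x ** 2                     # ** is broadcast over every element

# Pandas: wrap the arrays in a labelled DataFrame.
df = pd.DataFrame({"x": x, "x_squared": squares})
mean_sq = df["x_squared"].mean()

# Matplotlib: plot one DataFrame column against the other.
plt.plot(df["x"], df["x_squared"], marker="o")
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("Squares of 1 to 5")
plt.savefig("squares.png")           # use plt.show() in an interactive session
```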
Practical - 8, 9

Aim: Exploratory Data Analysis in Python Using Pandas


Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that helps summarize the main
characteristics of a dataset, often using visual methods.

Objective:
Learn to perform data exploration using Pandas, focusing on Series and DataFrames.

Import Libraries
First, you'll need to import the necessary libraries.

Load the Dataset


You can load a dataset from a CSV file or any other format supported by Pandas.

Data Overview
Get a basic understanding of the dataset.
Data Visualization
Visualizing data helps to identify patterns, trends, and anomalies.

a. Distribution of Age

b. Survival Rate by Gender

c. Box Plot for Fare by Class

Correlation Analysis
Examining relationships between numerical variables can provide insights.
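The steps above can be sketched end to end. The column names (Age, Fare, Sex, Pclass, Survived) suggest the Titanic dataset; since the original CSV is not included, a small in-memory DataFrame with made-up rows stands in for `pd.read_csv("titanic.csv")`:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# In practice: df = pd.read_csv("titanic.csv")
df = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 28],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05],
    "Sex":      ["male", "female", "female", "female", "male"],
    "Pclass":   [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

# Data overview: first rows and summary statistics.
print(df.head())
print(df.describe())

# a. Distribution of Age.
df["Age"].plot(kind="hist", title="Age distribution")
plt.savefig("age_hist.png")
plt.close()

# b. Survival rate by gender.
survival_by_sex = df.groupby("Sex")["Survived"].mean()

# c. Box plot for Fare by class.
df.boxplot(column="Fare", by="Pclass")
plt.savefig("fare_by_class.png")
plt.close()

# Correlation analysis between numerical variables.
corr = df[["Age", "Fare", "Survived"]].corr()
```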
Practical - 10

Aim: Loan Prediction Dataset Analytics: Perform basic operations on the Loan Prediction dataset
To perform basic analytics on a Loan Prediction dataset using Pandas, we’ll walk through the following steps:

• Import Libraries

• Load the Dataset

• Data Overview

• Data Cleaning

• Exploratory Data Analysis (EDA)

• Basic Operations and Insights

Step 1: Import Libraries

Step 2: Load the Dataset


Assuming you have a Loan Prediction dataset in CSV format, load it using Pandas. Here, we’ll use a
hypothetical URL for demonstration.
Step 3: Data Overview
Get a basic understanding of the dataset.

Step 4: Data Cleaning


Handle missing values and data types as necessary.

Step 5: Exploratory Data Analysis (EDA)


Visualize key aspects of the dataset.

a. Loan Status Distribution


b. Loan Amount Distribution

c. Correlation Analysis

Step 6: Basic Operations and Insights


Perform some basic analysis to derive insights.

a. Average Loan Amount by Gender

b. Loan Status by Education Level


c. Percentage of Loans Approved by Property Area
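The code screenshots for these steps are missing; a condensed sketch of steps 1-6 follows. A small hand-made DataFrame with the usual Loan Prediction column names stands in for the real CSV, so the numbers are illustrative only:

```python
import pandas as pd

# In practice: df = pd.read_csv("loan_prediction.csv")
df = pd.DataFrame({
    "Gender":        ["Male", "Female", "Male", "Female", "Male", None],
    "Education":     ["Graduate", "Graduate", "Not Graduate",
                      "Graduate", "Not Graduate", "Graduate"],
    "LoanAmount":    [120.0, None, 66.0, 158.0, 95.0, 110.0],
    "Property_Area": ["Urban", "Rural", "Urban",
                      "Semiurban", "Rural", "Urban"],
    "Loan_Status":   ["Y", "N", "Y", "Y", "N", "Y"],
})

# Step 4: data cleaning - fill missing values.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# Step 6a: average loan amount by gender.
avg_by_gender = df.groupby("Gender")["LoanAmount"].mean()

# Step 6b: loan status counts by education level.
status_by_edu = df.groupby("Education")["Loan_Status"].value_counts()

# Step 6c: percentage of loans approved by property area.
approved = df[df["Loan_Status"] == "Y"].groupby("Property_Area").size()
pct_approved = approved / df.groupby("Property_Area").size() * 100
```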
Practical- 11

Aim: Data Munging in Python Using Pandas

Data Munging (also called Data Wrangling) is the process of cleaning, transforming, and organizing raw data
into a format suitable for analysis. Pandas, being a powerful library, provides several functions and methods
to perform data munging efficiently.

Here’s a step-by-step guide on performing data munging using Pandas.

Steps for Data Munging

• Loading the Data

• Handling Missing Data

• Data Transformation

• Handling Duplicates

• Filtering and Sorting Data

• Feature Engineering

• Encoding Categorical Variables

• Combining DataFrames

1. Load Data

2. Handling Missing Data

Missing data is common in real-world datasets, and Pandas provides various methods to deal with it.
• Detecting Missing Data:

• Filling Missing Values: You can fill missing values with different strategies like the mean,
median, or mode.

• Dropping Rows with Missing Values: Sometimes it’s better to drop rows with too many
missing values.
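The three approaches above can be sketched on a tiny hypothetical frame (the column names and city values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25.0, np.nan, 32.0, np.nan],
    "city": ["Surat", "Rajkot", None, "Surat"],
})

# Detecting missing data: count NaNs per column.
missing_counts = df.isnull().sum()

# Filling missing values: mean for numbers, mode for categories.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Dropping rows: dropna() removes any row still containing NaN.
clean = df.dropna()
```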

3. Data Transformation: Data may need to be converted or transformed for consistency.

• Converting Data Types: You may need to change data types for analysis.

• Normalizing Numerical Data: Scaling numerical data to a common range.


4. Handling Duplicates: Duplicate data can skew analysis results, so it’s important to handle them.

• Identifying Duplicates:

• Removing Duplicates:

5. Filtering and Sorting Data: You often need to filter out unnecessary rows and sort the data for better
analysis.

• Filtering Rows:

• Sorting Data:

6. Feature Engineering: Feature engineering is the process of creating new features from the existing
data to improve the performance of models.

• Creating New Features:


7. Encoding Categorical Variables: Many machine learning algorithms require numerical input, so
converting categorical data to numerical form is often necessary.

• Label Encoding:

• One-Hot Encoding:
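Both encodings can be sketched in a few lines (the `grade` column is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "A", "C"]})

# Label encoding: map each category to an integer code.
df["grade_code"] = df["grade"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["grade"], prefix="grade")
```

Label encoding implies an ordering between codes, so one-hot encoding is usually safer for categories with no natural order.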

8. Combining DataFrames: Sometimes, you’ll need to combine multiple datasets for analysis.

• Concatenating DataFrames:

• Merging DataFrames:


Example Workflow: Putting It All Together
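The original worked example was lost in extraction; a compact sketch covering the remaining steps (duplicates, merging, filtering, sorting, feature engineering) on two hypothetical tables:

```python
import pandas as pd

# Two hypothetical tables: applicant details and their loan amounts.
applicants = pd.DataFrame({
    "id":     [1, 2, 3, 3],          # note the duplicated row for id 3
    "income": [4000, 6000, 2500, 2500],
})
loans = pd.DataFrame({"id": [1, 2, 3], "loan": [120, 200, 80]})

# Step 4: handling duplicates.
applicants = applicants.drop_duplicates()

# Step 8: merging DataFrames on a shared key.
df = applicants.merge(loans, on="id")

# Step 5: filtering and sorting.
df = df[df["income"] > 3000].sort_values("income", ascending=False)

# Step 6: feature engineering - a new ratio column from existing ones.
df["loan_to_income"] = df["loan"] / df["income"]
```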
Practical - 12,13

Aim: Building a Predictive Model in Python

• Logistic Regression

• Decision Tree

• Random Forest

Logistic Regression

Logistic Regression is a linear model used for binary classification problems. It predicts the
probability of the target variable belonging to a particular class.

Steps:

• Import the necessary libraries and dataset.

• Preprocess the data (handle missing values, encode categorical variables, etc.).

• Train a Logistic Regression model.

• Evaluate the model.


Logistic Regression is commonly used for binary classification problems like spam
detection, loan default prediction, and medical diagnosis.

Decision Tree

A Decision Tree is a non-linear model that splits the data based on certain conditions,
creating a tree-like structure. It can be used for both classification and regression tasks.

Steps:

• Load the dataset and preprocess it.

• Train a Decision Tree model.

• Evaluate its performance.


Decision Trees are used in tasks like fraud detection, customer churn prediction, and
diagnosing medical conditions.

Random Forest

A Random Forest is an ensemble method that builds multiple Decision Trees and merges
their results to improve accuracy and prevent overfitting.

Steps:

• Prepare the dataset.

• Train a Random Forest model.

• Evaluate the performance.


Random Forests are commonly used for tasks like image classification, risk assessment,
and sentiment analysis due to their robustness and ability to handle large datasets.
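The three models can be trained and compared with scikit-learn. Since the manual's dataset is not included here, `make_classification` generates a synthetic binary-classification problem as a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data stands in for a real dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scores = {}
for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("decision_tree", DecisionTreeClassifier(random_state=42)),
    ("random_forest", RandomForestClassifier(random_state=42)),
]:
    model.fit(X_train, y_train)                    # train
    preds = model.predict(X_test)                  # predict
    scores[name] = accuracy_score(y_test, preds)   # evaluate
```

On real data the same loop applies after the preprocessing steps from Practical 11 (missing values handled, categorical variables encoded).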
Practical - 14

Aim: Mini Project

• Students will create one mini project in a group.

-----------------------------------------------------------------------------------------------------------------------------
22SS02IT113
OM BALAR

Importing Required Libraries:-


Importing required libraries refers to including external code or modules at the
beginning of a program so that
their functions, methods, and classes can be used within the program. These
libraries provide pre-built solutions
to common tasks, making it easier to develop software without writing
everything from scratch.

Importing Dataset:-
Importing a dataset refers to the process of loading data from an external file
(such as CSV, Excel, or database) into a programming environment or software
(like Python, R, or Excel) for analysis or processing. It allows you to access and
manipulate the data for tasks like cleaning, visualization, or building models.

Data Pre-processing:-
Data preprocessing is a technique in machine learning and data science used to
prepare raw data for analysis or modeling. It involves cleaning, transforming, and
organizing data to ensure it's in the right format for algorithms to process.

Data Information:-

Data Describe:-

Data Shape:-
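The screenshots for these three inspections are missing; on any loaded DataFrame (a small made-up one is used here) they look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [21, 22, 21],
    "marks": [78.5, 91.0, 85.0],
})

df.info()                  # column names, dtypes, non-null counts
summary = df.describe()    # count, mean, std, min, quartiles, max
shape = df.shape           # (number of rows, number of columns)
```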

Data Visualization:-
Data visualization is the graphical representation of data using charts, graphs,
and maps. It helps to make complex data easier to understand by highlighting
patterns, trends, and insights.

Bar Chart:-
A bar chart is a data visualization tool that uses rectangular bars to represent
the values of different categories. Each bar's length is proportional to the
value it represents, making it easy to compare quantities across categories.
Bar charts are commonly used to display categorical data, such as sales
figures, survey results, or demographic information. They can be oriented
vertically or horizontally, with the x-axis typically representing the
categories and the y-axis showing the values. The clear visual distinction
between bars helps in quickly identifying trends, differences, and patterns in
the data.
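A minimal bar chart sketch with made-up sales figures (the `Agg` backend is used so the script also runs headless; use `plt.show()` interactively):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical categorical data: units sold per product.
categories = ["A", "B", "C"]
sales = [120, 90, 150]

fig, ax = plt.subplots()
ax.bar(categories, sales)            # one bar per category
ax.set_xlabel("Product")
ax.set_ylabel("Units sold")
ax.set_title("Sales by product")
fig.savefig("bar_chart.png")
```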

Scatterplot:-
A scatterplot is a data visualization technique that displays values for two
variables as points on a Cartesian coordinate system. Each point represents an
observation, with one variable plotted along the x-axis and the other along the y-
axis. Scatterplots are commonly used to identify relationships or correlations
between the two variables, such as trends, clusters, or outliers. The distribution
of points can indicate positive, negative, or no correlation, helping to reveal
patterns and insights in the data. They are particularly useful in exploratory data
analysis and can also be enhanced with additional elements like colors or sizes
to represent other dimensions of data.
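A minimal scatterplot sketch with hypothetical paired observations (study hours versus exam score):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 80]

fig, ax = plt.subplots()
ax.scatter(hours, scores)            # one point per observation
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("Hours vs score")
fig.savefig("scatterplot.png")
```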

Machine Learning Models Implementation:-


Machine learning model implementation involves the process of applying
algorithms and statistical techniques to build models that can learn from data
and make predictions or decisions without explicit programming. This typically
includes data preprocessing, where raw data is cleaned and transformed; model
selection, where the appropriate algorithm is chosen based on the problem;
training, where the model learns patterns from the data; validation, to ensure the
model's performance on unseen data; and finally, deployment, where the model
is integrated into an application or system for practical use.

LinearRegression:-
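The project's actual code was a screenshot; a minimal stand-in fits scikit-learn's `LinearRegression` to made-up data following the exact relation y = 2x + 1, so the learned slope and intercept are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data on the line y = 2x + 1.
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression()
model.fit(X, y)                      # learn slope and intercept

slope = model.coef_[0]
intercept = model.intercept_
pred = model.predict([[5]])[0]       # predict y for x = 5
```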

Chart:-

Decision Tree:-
Decision trees are a popular and versatile machine learning algorithm used for
classification and regression tasks. They model decisions and their possible
consequences, visualizing them as a tree-like structure of nodes and branches.

Chart:-

Machine Learning Algorithms Differences:-



Which Machine Learning Model Is Best For This Dataset?


