DS FINAL
Engineering
Department of Computer Science and Application
DATA SCIENCE
(SSCA3021) | BSCIT/BCA
Name of Student:
Mr./Ms.
Installing Python:
Conclusion:
You have successfully installed Python and executed basic Python code.
Practical - 5, 6
Objectives:
Learn about Python's data structures.
Understand conditional constructs and iterations.
Explore important Python libraries.
Data Structures:
List:
Definition: A list is an ordered, mutable collection of items. It can contain elements of different types.
Syntax: Lists are defined by placing elements inside square brackets [].
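A short sketch of list creation and mutation follows. The values are arbitrary and chosen only for illustration.

```python
# A list can hold elements of different types.
items = [1, "two", 3.0]

# Lists are mutable: elements can be added or replaced in place.
items.append(4)        # add an element at the end
items[0] = "one"       # replace the element at index 0

# Slicing returns a new list and leaves the original unchanged.
first_two = items[:2]

print(items)
print(first_two)
```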
Dictionaries:
Definition: A dictionary is a collection of key-value pairs. Each key is unique and maps to a value.
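As an illustration, the sketch below builds a small dictionary and reads from it. The keys and values (a hypothetical student record) are assumptions made for the example.

```python
# Keys are unique; each key maps to exactly one value.
student = {"name": "Asha", "roll_no": 12}

# Assigning to a new key adds a pair; assigning to an existing key overwrites it.
student["course"] = "BCA"
student["roll_no"] = 13

# get() returns a default instead of raising KeyError for a missing key.
course = student.get("course")
email = student.get("email", "not provided")

print(student)
print(course, email)
```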
Sets:
Definition: A set is an unordered collection of unique elements. Sets are mutable but do not allow duplicate
elements.
Syntax: Sets are defined by placing elements inside curly braces {} or using the set() function. An empty set must be created with set(), because empty braces {} create an empty dictionary.
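The following sketch shows how duplicates are discarded and how two sets can be combined. The numbers are arbitrary.

```python
# Duplicate elements are dropped automatically.
a = {1, 2, 2, 3}
b = set([3, 4])

# set() creates an empty set.
empty = set()

union = a | b      # elements in either set
common = a & b     # elements in both sets

print(a, union, common, len(empty))
```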
Tuples:
Definition: A tuple is an ordered collection of items, but unlike lists, it is immutable, meaning it cannot be changed after it is created.
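A minimal sketch of tuple behaviour, using an arbitrary pair of coordinates: unpacking works as with lists, while item assignment raises an error.

```python
point = (3, 4)

# Tuples support unpacking into separate names.
x, y = point

# Attempting to modify a tuple raises TypeError.
try:
    point[0] = 10
except TypeError as exc:
    error_message = str(exc)

print(x, y)
print(error_message)
```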
Iteration:
Iteration refers to the repeated execution of a block of code. Common types of iteration constructs include:
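In Python the two loop constructs are for and while. The sketch below shows one of each; the ranges and variable names are chosen only for illustration.

```python
# for loop: repeat once per item in a sequence.
squares = []
for n in range(1, 6):
    squares.append(n * n)

# while loop: repeat as long as a condition holds.
countdown = []
k = 3
while k > 0:
    countdown.append(k)
    k -= 1

print(squares)
print(countdown)
```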
Conditional Constructs:
Conditional constructs allow you to execute different blocks of code based on certain conditions. The main types
include:
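Python expresses these with if, elif, and else. A minimal sketch follows; the grade() function and its thresholds are hypothetical and serve only to show the branching.

```python
def grade(marks):
    """Return a result label for the given marks (thresholds are illustrative)."""
    if marks >= 75:
        return "distinction"
    elif marks >= 40:
        return "pass"
    else:
        return "fail"

results = [grade(m) for m in (82, 55, 20)]
print(results)
```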
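NumPy
NumPy is a library for numerical computing in Python, built around an N-dimensional array type.
Key Features: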
N-dimensional arrays: Efficient storage and manipulation of large datasets.
Mathematical functions: Functions for element-wise operations, linear algebra, statistical operations, etc.
Broadcasting: A powerful mechanism for performing operations on arrays of different shapes.
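To make broadcasting concrete, here is a small NumPy sketch with arbitrary values: a one-dimensional row is added to every row of a two-dimensional array without an explicit loop.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)

# Broadcasting stretches `row` across both rows of `matrix`.
shifted = matrix + row

# A statistical operation along an axis: mean of each column.
col_means = matrix.mean(axis=0)

print(shifted)
print(col_means)
```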
Pandas
Pandas is a library primarily used for data manipulation and analysis. It provides data structures like Series and
DataFrame, which allow for easy handling of structured data.
Key Features:
DataFrame: Two-dimensional size-mutable, potentially heterogeneous tabular data.
Data manipulation: Easy indexing, filtering, grouping, and aggregating of data.
Handling missing data: Built-in functionality for managing missing values.
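A brief sketch of these features on a made-up table; the column names and values are assumptions for illustration. It fills a missing value and then aggregates by group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Surat", "Surat", "Pune"],
    "sales": [100.0, np.nan, 50.0],
})

# Handle missing data: replace NaN with 0 (one of several possible strategies).
df["sales"] = df["sales"].fillna(0)

# Group and aggregate: total sales per city.
totals = df.groupby("city")["sales"].sum()

print(totals)
```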
Matplotlib for plotting
Matplotlib
Matplotlib is a plotting library that provides a flexible way to create static, animated, and interactive
visualizations in Python.
Key Features:
Versatile plotting: Support for line plots, bar plots, histograms, scatter plots, and more.
Customization: Extensive options for customizing plots (titles, labels, legends, etc.).
Integration: Can be easily integrated with NumPy and Pandas for plotting data.
Combining Them
Here’s a combined example using all three libraries:
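A minimal sketch of such a combination: NumPy generates the values, Pandas holds them in a DataFrame, and Matplotlib plots them. The data (a straight line y = 2x + 1), the output file name, and the use of the non-interactive "Agg" backend are all assumptions made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: generate evenly spaced x values.
x = np.linspace(0, 10, 50)

# Pandas: store the data in a DataFrame.
df = pd.DataFrame({"x": x, "y": 2 * x + 1})

# Matplotlib: plot the DataFrame columns.
fig, ax = plt.subplots()
ax.plot(df["x"], df["y"], label="y = 2x + 1")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Line plot from a DataFrame")
ax.legend()
fig.savefig("combined_example.png")  # hypothetical output file name
```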
Practical - 8, 9
Objective:
Learn to perform data exploration using Pandas, focusing on Series and DataFrames.
Import Libraries
First, you'll need to import the necessary libraries.
Data Overview
Get a basic understanding of the dataset.
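The usual first-look calls can be sketched as follows. The small DataFrame stands in for the real dataset, and its column names are assumptions.

```python
import pandas as pd

# Stand-in for the real dataset (hypothetical columns and values).
df = pd.DataFrame({
    "Age": [22, 35, 58],
    "Fare": [7.25, 53.10, 26.55],
})

print(df.head())          # first rows
df.info()                 # column types and non-null counts
summary = df.describe()   # count, mean, std, min, quartiles, max
shape = df.shape          # (rows, columns)

print(summary)
print(shape)
```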
Data Visualization
Visualizing data helps to identify patterns, trends, and anomalies.
a. Distribution of Age
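One way to view such a distribution is a histogram. The sketch below uses made-up ages and the non-interactive "Agg" backend; both are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen
import matplotlib.pyplot as plt

ages = [22, 25, 31, 35, 35, 42, 58, 60]  # hypothetical values

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn patches.
counts, bin_edges, _ = ax.hist(ages, bins=4)
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Age")
```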
Correlation Analysis
Examining relationships between numerical variables can provide insights.
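A small sketch of a correlation matrix in Pandas follows. The columns are hypothetical, and non-numeric columns are excluded explicitly before calling corr().

```python
import pandas as pd

df = pd.DataFrame({
    "Income": [1, 2, 3, 4],
    "LoanAmount": [2, 4, 6, 8],      # exactly 2 * Income, so correlation is 1
    "City": ["A", "B", "A", "B"],    # non-numeric, left out of the matrix
})

# Keep only numeric columns, then compute pairwise Pearson correlation.
corr = df.select_dtypes("number").corr()

print(corr)
```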
Practical - 10
Aim: Loan Prediction Dataset Analytics: Perform basic operations on the Loan Prediction dataset
To perform basic analytics on a Loan Prediction dataset using Pandas, we’ll walk through the following steps:
• Import Libraries
• Data Overview
• Data Cleaning
c. Correlation Analysis
Data Munging (also called Data Wrangling) is the process of cleaning, transforming, and organizing raw data
into a format suitable for analysis. Pandas, being a powerful library, provides several functions and methods
to perform data munging efficiently.
• Data Transformation
• Handling Duplicates
• Feature Engineering
• Combining DataFrames
1. Load Data
Missing data is common in real-world datasets, and Pandas provides various methods to deal with it.
• Detecting Missing Data:
• Filling Missing Values: You can fill missing values with different strategies like the mean,
median, or mode.
• Dropping Rows with Missing Values: Sometimes it’s better to drop rows with too many
missing values.
• Converting Data Types: You may need to change data types for analysis.
• Identifying Duplicates:
• Removing Duplicates:
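The missing-data, type-conversion, and duplicate steps listed above can be sketched on a toy table. The column names imitate a loan dataset but are assumptions, as are the chosen fill strategies (median for numbers, mode for categories).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 150.0, 150.0],
    "Gender": ["Male", None, "Female", "Female"],
    "Term": ["360", "180", "360", "360"],   # stored as text
})

# Detect missing data: count of missing values per column.
missing_counts = df.isnull().sum()

# Fill missing values: median for a numeric column, mode for a categorical one.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# Convert data types: text to integer.
df["Term"] = df["Term"].astype(int)

# Identify and remove duplicate rows.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

print(missing_counts)
print(df)
```

To drop incomplete rows instead of filling them, df.dropna() can be used in place of the fillna() calls.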
5. Filtering and Sorting Data: You often need to filter out unnecessary rows and sort the data for better
analysis.
• Filtering Rows:
• Sorting Data:
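Filtering and sorting might look like the following sketch. The column names and the income threshold are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome": [2500, 6000, 4000],
    "LoanAmount": [100, 200, 150],
})

# Filter rows with a boolean condition.
high_income = df[df["ApplicantIncome"] > 3000]

# Sort the filtered rows by loan amount, largest first.
sorted_df = high_income.sort_values("LoanAmount", ascending=False)

print(sorted_df)
```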
6. Feature Engineering: Feature engineering is the process of creating new features from the existing
data to improve the performance of models.
• Label Encoding:
• One-Hot Encoding:
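Both encodings can be sketched with Pandas alone. Here label encoding uses category codes (scikit-learn's LabelEncoder is a common alternative), and one-hot encoding uses get_dummies(). The column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Urban"],
})

# Label encoding: each category becomes an integer code.
# Categories are ordered alphabetically, so Female -> 0 and Male -> 1.
df["Gender_code"] = df["Gender"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Property_Area"], prefix="Area")

print(df)
print(one_hot)
```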
8. Combining DataFrames: Sometimes, you’ll need to combine multiple datasets for analysis.
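Two common ways to combine DataFrames are merge() (join on a key) and concat() (stack rows). The tables and the Loan_ID key in this sketch are hypothetical.

```python
import pandas as pd

applicants = pd.DataFrame({"Loan_ID": ["LP1", "LP2"], "Income": [2500, 4000]})
loans = pd.DataFrame({"Loan_ID": ["LP1", "LP2"], "LoanAmount": [100, 150]})

# merge: join columns from two tables on a shared key.
merged = applicants.merge(loans, on="Loan_ID", how="inner")

# concat: append rows from another table with the same columns.
more_applicants = pd.DataFrame({"Loan_ID": ["LP3"], "Income": [3000]})
stacked = pd.concat([applicants, more_applicants], ignore_index=True)

print(merged)
print(stacked)
```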
• Logistic Regression
• Decision Tree
• Random Forest
Logistic Regression
Logistic Regression is a linear model used for binary classification problems. It predicts the
probability of the target variable belonging to a particular class.
Steps:
• Preprocess the data (handle missing values, encode categorical variables, etc.).
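A minimal sketch of training a logistic regression classifier with scikit-learn follows. It assumes the data is already preprocessed and uses a synthetic dataset from make_classification() in place of the loan data; the split ratio and random seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the real dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba gives the probability of each class for every test row.
proba = model.predict_proba(X_test)
accuracy = accuracy_score(y_test, model.predict(X_test))

print(proba[:3])
print(accuracy)
```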
Decision Tree
A Decision Tree is a non-linear model that splits the data based on certain conditions,
creating a tree-like structure. It can be used for both classification and regression tasks.
Steps:
Random Forest
A Random Forest is an ensemble method that builds multiple Decision Trees and merges
their results to improve accuracy and prevent overfitting.
Steps:
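As a sketch, a single decision tree and a random forest can be trained side by side on the same split. The synthetic data, the number of trees, and the seeds are assumptions; the relative accuracy will vary with the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a preprocessed dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One tree versus an ensemble of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# score() returns the accuracy on the held-out rows.
tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)

print(tree_acc, forest_acc)
```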
-----------------------------------------------------------------------------------------------------------------------------
22SS02IT113
OMBALAR
Practical - 14
• Students will create one mini project in a group.
Importing Dataset:-
Importing a dataset refers to the process of loading data from an external file
(such as CSV, Excel, or database) into a programming environment or software
(like Python, R, or Excel) for analysis or processing. It allows you to access and
manipulate the data for tasks like cleaning, visualization, or building models.
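A minimal sketch of importing CSV data with Pandas follows. To keep it self-contained, the CSV content is an in-memory string with hypothetical columns; with a real file, the path would be passed to read_csv() instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical content).
csv_text = "Loan_ID,LoanAmount\nLP1,100\nLP2,150\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```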
Data Pre-processing:-
Data preprocessing is a technique in machine learning and data science used to
prepare raw data for analysis or modeling. It involves cleaning, transforming, and
organizing data to ensure it's in the right format for algorithms to process.
Data Information:-
Data Describe:-
Data Shape:-
Data Visualization:-
Data visualization is the graphical representation of data using charts, graphs,
and maps. It helps to make complex data easier to understand by highlighting
patterns, trends, and insights.
Bar Chart:-
A bar chart is a data visualization tool that uses rectangular bars to represent
the values of different categories. Each bar's length is proportional to the
value it represents, making it easy to compare quantities across categories.
Bar charts are commonly used to display categorical data, such as sales
figures, survey results, or demographic information. They can be oriented
vertically or horizontally, with the x-axis typically representing the
categories and the y-axis showing the values. The clear visual distinction
between bars helps in quickly identifying trends, differences, and patterns in
the data.
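A vertical bar chart can be sketched with Matplotlib as follows. The categories and values are made up, and the "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]   # hypothetical categories
values = [5, 12, 8]            # hypothetical values

fig, ax = plt.subplots()
# Categories go on the x-axis; bar height encodes the value.
bars = ax.bar(categories, values)
ax.set_xlabel("Category")
ax.set_ylabel("Value")
ax.set_title("Bar chart")
```

For a horizontal layout, ax.barh() takes the same arguments.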
Scatterplot:-
A scatterplot is a data visualization technique that displays values for two
variables as points on a Cartesian coordinate system. Each point represents an
observation, with one variable plotted along the x-axis and the other along the y-axis.
Scatterplots are commonly used to identify relationships or correlations
between the two variables, such as trends, clusters, or outliers. The distribution
of points can indicate positive, negative, or no correlation, helping to reveal
patterns and insights in the data. They are particularly useful in exploratory data
analysis and can also be enhanced with additional elements like colors or sizes
to represent other dimensions of data.
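The sketch below draws a scatterplot of two positively related variables and uses color as an additional dimension. The data is randomly generated with a fixed seed, and the "Agg" backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)   # y rises with x, plus noise

fig, ax = plt.subplots()
# Each point is one observation; color encodes the y value.
points = ax.scatter(x, y, c=y, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.colorbar(points, ax=ax, label="y value")
```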
LinearRegression:-
Chart:-
Decision Tree:-
Decision trees are a popular and versatile machine learning algorithm used for
classification and regression tasks. They model decisions and their possible
consequences, visualizing them as a tree-like structure of nodes and branches.
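To show the tree-like structure of nodes and branches, the sketch below fits a shallow tree and prints its rules as text. It uses scikit-learn's bundled Iris dataset rather than the project data, and the depth limit of 2 is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the printed tree stays small.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders each split (node) and leaf as an indented rule.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```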
Chart:-