Foundations of Data Science (BSc)
UNIT-1
Introduction to Data Science: Need for Data Science - What is Data Science - Evolution of Data
Science, Data Science Process - Business Intelligence and Data Science - Prerequisites for a Data
Scientist. Applications of Data Science in various fields - Data Security Issues
Data Collection Strategies: Data Pre-Processing Overview, Data Cleaning, Data Integration and
Transformation, Data Reduction, Data Discretization, Data Munging, Filtering
UNIT-II
Descriptive Statistics : Mean, Standard Deviation, Skewness and Kurtosis; Box Plots - Pivot Table
Heat Map - Correlation Statistics –ANOVA
NoSQL: Document Databases, Wide-Column Databases and Graph Databases
UNIT – III
Python for Data Science - Python Libraries, Python Integrated Development Environments (IDEs) for
Data Science
NumPy Basics: Arrays and vectorized computation- The NumPy ndarray
creating ndarrays- Data Types for ndarrays- Arithmetic with NumPy Arrays- Basic Indexing and
Slicing - Boolean Indexing-Transposing Arrays and Swapping Axes
Universal Functions: Fast Element-Wise Array Functions- Mathematical and Statistical Methods
Sorting
UNIT-IV
Introduction to pandas Data Structures: Series, Data Frame and Essential Functionality:
Dropping Entries- Indexing, Selection, and Filtering- Function Application and Mapping-
Sorting and Ranking. Summarizing and computing Descriptive Statistics- unique values, value
counts, and Membership. Reading and Writing Data in Text Format
UNIT-V
Data Cleaning and Preparation: Handling Missing Data - Data Transformation: Removing
Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Detecting and
Filtering Outliers.
Plotting with pandas: Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point Plots
UNIT 1
Introduction to Data Science: Data Science is an interdisciplinary field that uses
scientific methods, algorithms, processes, and systems to extract knowledge and insights from
structured and unstructured data. It combines elements of statistics, computer science, mathematics,
and domain expertise to analyse data and solve complex problems.
Key Components of Data Science:
Data Collection: Gathering data from various sources such as databases, sensors, or web scraping.
Data Cleaning: Preparing the data by handling missing values, removing duplicates, and correcting
errors.
Data Analysis: Using statistical methods and tools to identify patterns and trends in the data.
Data Visualization: Representing data through graphs, charts, and dashboards to make insights
more understandable.
Machine Learning: Employing algorithms that allow computers to learn from data and make
predictions or decisions.
Big Data Technologies: Working with tools and platforms designed for handling large and
complex data sets (Ex: Hadoop, Spark).
Need for Data Science:
ii. Business Growth: Companies leverage data science for competitive advantage, such as
personalized marketing (e.g., Netflix recommendations) or supply chain optimization (e.g.,
Amazon's logistics).
iii. Innovation: Data science fuels advancements in AI, machine learning, and automation.
Industries like healthcare use it for predictive diagnostics.
iv. Problem-Solving: It tackles complex challenges, from climate modeling to urban planning, by
processing vast datasets that humans can’t manually analyze.
v. Job Demand: The U.S. Bureau of Labor Statistics projects a 36% growth in data science jobs
from 2023 to 2033, far above average, with median earnings around $108,000 annually.
vi. Fraud Detection & Cybersecurity – Banks and organizations use data science to detect
anomalies and prevent fraud.
Data Collection: Gathering data from diverse sources, such as sensors, databases, and the web.
Data Processing: Cleaning, organizing, and preparing raw data for analysis.
Exploratory Data Analysis (EDA): Understanding data patterns, trends, and relationships.
Modelling and Algorithms: Employing machine learning, AI, or statistical techniques to create
predictive models.
Evolution of Data Science:
2. The Age of Statistics (1900s–1950s): The early 20th century saw significant advancements in
statistical theory and mathematical modelling, with pioneers like Karl Pearson and Ronald Fisher.
3. The Rise of Computing (1960s–1970s): The invention of computers transformed data analysis,
enabling faster computations and handling of larger datasets.
4. Birth of Modern Data Science (1980s–1990s): The term "data science" started emerging as
computing power and software capabilities expanded.
5. Big Data Era (2000s): The explosion of data from the internet, social media, and mobile
technologies led to the era of "big data." Technologies like Hadoop and Apache Spark were
developed to store and process these massive datasets.
6. AI and Deep Learning Revolution (2010s–Present): Advances in machine learning and deep
learning, together with cloud computing platforms like AWS, Google Cloud, and Azure, made it
easier to store and process data.
7.Future Directions: The future of data science includes ethical AI, explainable AI (XAI), and
real-time data analysis with technologies like IoT.
Data Science Process:
1. Problem Definition: Clearly define the problem or question that needs to be addressed.
Understand the objectives and desired outcomes.
2. Data Collection: Gather data from various sources, such as databases, APIs, sensors, or web
scraping. Ensure the data is relevant to the problem being addressed.
3. Data Cleaning: Handle missing values, remove duplicates, and correct errors in the data.
Convert data into a usable format, such as normalizing or encoding variables.
4. Exploratory Data Analysis (EDA): Analyse and visualize data to understand patterns, trends,
and relationships. Identify outliers or anomalies that may impact the analysis.
5. Feature Engineering: Select and transform relevant variables (features) to improve model
performance. Create new features that add value to the analysis.
6. Model Building: Choose appropriate algorithms based on the problem (e.g., regression,
classification, clustering). Train and validate the models using a subset of the data.
7. Model Evaluation: Assess the model's performance using metrics like accuracy, precision,
recall, or RMSE. Optimize and fine-tune the model if necessary.
8. Deployment: Deploy the model into production to make predictions or decisions based on live
data. Monitor its performance to ensure it meets the desired objectives.
9. Insights and Action: Interpret the results and translate them into actionable insights. Implement
decisions or strategies informed by the data analysis.
Business Intelligence (BI):
1.Definition: BI involves analysing historical and current data to generate insights, create reports,
and support day-to-day business decisions.
2.Focus: Primarily concerned with descriptive analytics—what happened and why. Focuses on
reporting, dashboards, and visualization of data for stakeholders.
3.Tools: Common BI tools include Tableau, Power BI, QlikView, and Excel.
4.Use Case: Used for tracking key performance indicators (KPIs), generating operational insights,
and ensuring effective business processes.
5.Audience: Designed for business users and decision-makers who require insights to monitor and
plan their operations.
Data Science:
1.Definition: Data Science is a more advanced, interdisciplinary field that uses statistics, machine
learning, and computational methods to extract deeper insights, predict outcomes, and solve
complex problems.
2.Focus: Goes beyond historical analysis to predictive and prescriptive analytics—what will
happen and what actions should be taken. Involves building models, algorithms, and performing
hypothesis testing.
4.Use Case: Applied in tasks like customer segmentation, fraud detection, personalized
recommendations, and predictive maintenance.
5.Audience: Typically caters to data scientists, analysts, and technical teams skilled in coding and
algorithm development.
BI focuses on understanding and reporting what has already happened, making it valuable
for day-to-day operational decision-making.
Data Science dives deeper by predicting trends, automating processes, and uncovering
insights that are not immediately obvious.
Together, they empower organizations to make well-informed decisions by combining
historical analysis with future-oriented predictions.
Prerequisites for a Data Scientist:
Skills Required:
3.Data Wrangling and Cleaning: Ability to clean, preprocess, and transform raw data into an
analysable format. Experience dealing with missing data, inconsistencies, and outliers.
4.Data Visualization: Skills in creating visual representations of data using tools like Matplotlib,
Seaborn, or Tableau. Ability to communicate insights through charts, graphs, and dashboards.
6.Domain Expertise: Understanding the specific industry or domain to provide context to data
analysis.
Tools Required:
2.Data Analysis and Visualization: Tools like Excel, Tableau, Power BI, Matplotlib, and Seaborn.
3.Big Data Technologies: Hadoop, Spark, and Hive for managing large-scale data.
5.Database Management: SQL-based systems (e.g., PostgreSQL, MySQL) and NoSQL databases
(e.g., MongoDB).
6.Version Control: Git and platforms like GitHub for collaborative coding and version
management.
7.Cloud Platforms: AWS, Google Cloud, and Azure for scalable computing and data storage.
Applications of Data Science in Various Fields:
1. Healthcare:
ii. Personalized medicine and treatment plans through genomic data analysis.
2. Finance:
3. Transportation:
4. Education:
5. Manufacturing:
6. Environment:
ii. Monitoring air and water quality through IoT data analysis.
Data Security Issues:
2.Privacy Concerns: Misuse of customer data collected for analysis can lead to ethical and legal
violations. Tracking and profiling users without their consent raises serious privacy issues.
3.Algorithm Bias: Poorly trained models may unintentionally expose sensitive data or favour
certain groups unfairly.
4.Cloud Security: Storing and processing data on cloud platforms introduces risks like hacking or
server vulnerabilities.
5.Regulatory Compliance: Ensuring compliance with data protection laws (e.g., GDPR, HIPAA)
can be challenging for global companies.
6.Data Ownership: Determining who owns the data and how it can be used is a complex issue.
7.Vulnerable Machine Learning Models: Attacks like adversarial inputs can manipulate models
to expose or misuse data.
8.Secure Data Sharing: Sharing data across platforms or teams can lead to unintended exposure if
not properly encrypted.
Data Collection Strategies:
5. Leverage Social Media and Online Platforms: Monitor social media activity using tools like
Hootsuite or Sprinklr to collect sentiment analysis data. Use APIs (e.g., Twitter, Instagram) to
access data from online platforms.
6. Integrate Big Data Technologies: Use distributed systems like Hadoop or Spark for real-time data
ingestion from large datasets. Process structured and unstructured data from multiple sources
simultaneously.
7. Ensure Data Privacy and Compliance: Anonymize sensitive information to safeguard personal
data.
Data Pre-Processing Overview:
1. Data Cleaning
Handling Missing Values: Filling missing data using techniques like mean, median, or mode
imputation, or removing incomplete rows/columns if necessary.
Dealing with Duplicates: Identifying and removing duplicate records to maintain data integrity.
Outlier Detection: Detecting and managing anomalies or extreme values that could skew the
analysis.
2. Data Integration
Combining data from multiple sources (databases, files, APIs) to create a unified dataset.
3. Data Transformation
Normalization: Scaling data to a range (e.g., 0–1) to ensure uniformity and improve model
performance.
Encoding Categorical Data: Converting categorical variables into numerical formats (e.g., one-hot
encoding or label encoding).
4. Feature Engineering
Feature Selection: Identifying and retaining the most relevant features to improve model
performance.
Feature Creation: Creating new variables that enhance predictive power (e.g., combining or
splitting existing features).
5. Data Reduction
Dimensionality Reduction: Reducing the number of features using techniques like Principal
Component Analysis (PCA).
Sampling: Reducing the dataset size for faster processing without compromising data quality.
6. Data Formatting
Converting data into a consistent format (e.g., date/time parsing, text processing).
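The following is a minimal pandas sketch of these pre-processing steps; the DataFrame, its column names, and its values are assumed purely for illustration.
import pandas as pd

# Hypothetical raw data with a missing value and a duplicate row
df = pd.DataFrame({
    "age": [25, 32, None, 32, 45],
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Chennai"],
})

# Data cleaning: impute missing values and drop duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())
df = df.drop_duplicates()

# Data transformation: min-max normalization to the 0-1 range
df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Encoding categorical data: one-hot encoding of the 'city' column
df = pd.get_dummies(df, columns=["city"])
print(df)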
Data Cleaning - Key Steps:
1.Handle Missing Values
i. Imputation: Fill missing values using mean, median, or mode for numerical data.
2.Remove Duplicates
3.Correct Errors
i. Fix typos, incorrect formatting, and inconsistencies in the dataset (e.g., “Male” vs. “M”).
ii. Use automated tools or scripts to detect and fix errors systematically.
i. Ensure consistent formats for dates, text, and numerical values across the dataset.
ii. Convert data units when necessary (e.g., centimeters to meters).
ii. Decide whether to remove, transform, or keep outliers based on their significance.
i. Apply encoding techniques like one-hot encoding or label encoding for categorical variables.
ii. Use Standardization (e.g., transforming to a mean of 0 and a standard deviation of 1) for better
modelling.
8.Validate Data
i. Perform sanity checks to confirm that the cleaned data aligns with expectations.
ETL Tools: Tools like Talend and Alteryx for large-scale data transformation and cleaning.
Data Integration: Data Integration combines data from multiple sources to create a unified dataset.
Steps in Data Integration:
1.Data Identification: Identify and locate relevant data sources, such as databases, APIs, files, or
cloud systems.
2.Data Extraction: Retrieve data from these sources using ETL (Extract, Transform, Load) tools or
manual methods.
3.Schema Alignment: Resolve schema conflicts by matching and standardizing data fields across
different datasets.
5.Data Validation: Check for inconsistencies, duplicates, or gaps during and after integration.
6.Data Storage: Store the unified dataset in a data warehouse or data lake for easy access.
Data Transformation involves converting raw data into a format suitable for analysis or modelling.
It ensures that data is clean, consistent, and ready for use.
Key Steps in Data Transformation:
1.Data Cleaning: Remove or correct errors, such as missing values and duplicates.
2.Data Normalization: Scale numerical data within a specific range (e.g., 0–1) for consistency.
3.Data Standardization: Ensure data follows a uniform format (e.g., standard date formats or units
of measurement).
4.Encoding Categorical Data: Convert categorical variables into numerical formats using one-hot
encoding, label encoding, or other methods.
6.Feature Engineering: Create new variables or modify existing ones to improve predictive
performance in modelling.
7.Data Reduction: Use methods like dimensionality reduction (e.g., PCA) to simplify large datasets.
Data Reduction: Data Reduction is the process of reducing the volume of data while preserving its
analytical value.
Methods:
2.Data Compression: Representing data in a more compact form using encoding or aggregation
methods.
Techniques:
3.Sampling:
Techniques:
Stratified sampling: Ensuring the sample reflects the data distribution across groups.
4.Aggregation: Summarizing data at a higher level; common in time-series data (e.g., aggregating
daily sales into monthly totals).
5.Cluster Analysis: Grouping similar data points into clusters and representing each cluster with a
single value or prototype.
6.Feature Selection: Identifying and keeping the most relevant variables for analysis.
Methods:
Embedded methods: Feature selection performed during model training (e.g., Lasso regression).
Data Discretization: Data Discretization converts continuous data into discrete bins or categories.
Benefits:
1.Simplification: Reducing complexity in the dataset by grouping continuous values into discrete
bins.
3.Enhanced Algorithm Performance: Some algorithms perform better with categorical or
discretized variables than with continuous values.
Techniques of Discretization:
1.Equal-Width Binning: Divides the value range into intervals of equal width.
Example: Splitting age data into intervals like 0–20, 21–40, 41–60, etc.
Simple but may result in uneven distribution of data points within bins.
2.Cluster-Based Discretization:
Uses clustering algorithms (e.g., K-Means) to group data points into clusters, which serve as
discrete categories.
3.Decision Tree-Based Discretization:
Constructs a decision tree model and uses thresholds determined by the splits as bins.
4.Supervised Discretization:
Uses a target variable to influence the binning process, ensuring that the intervals optimize
predictive relationships.
5.Manual Discretization:
Example: Categorizing body mass index (BMI) into "Underweight," "Normal," "Overweight," and
"Obese."
Challenges of Discretization:
Loss of Information: Continuous values are compressed, leading to potential loss of nuanced
information.
Determining the Optimal Number of Bins: Too few bins may oversimplify the data, while too
many may add complexity and retain noise.
Applications:
Rule-Based Models: Helps generate rules that are interpretable and easy to understand.
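Below is a short pandas sketch of equal-width and manual discretization using pd.cut(); the age values and bin edges are illustrative assumptions.
import pandas as pd

ages = pd.Series([5, 17, 23, 38, 45, 62, 71])

# Equal-width binning: split the value range into 4 bins of equal width
equal_width = pd.cut(ages, bins=4)

# Manual discretization: domain-chosen intervals and labels
manual = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                labels=["0-20", "21-40", "41-60", "60+"])

print(equal_width)
print(manual)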
What is Data Munging?
Data Munging (also called Data Wrangling) involves cleaning, organizing, and enriching raw data
so it can be effectively analysed or modelled in data science workflows.
1.Data Exploration: Understand the raw dataset by inspecting its structure, quality, and
content.Identify inconsistencies, errors, or gaps.
2.Data Cleaning: Handle missing values by imputing them or removing incomplete records.
Remove duplicates and correct formatting issues. Detect and manage outliers that could skew the
analysis.
3.Data Structuring: Organize data into a logical format, such as tables or arrays. Standardize
columns, headers, and types (e.g., numerical, categorical).
4.Data Transformation: Normalize, encode, or derive new values to enhance analysis.
5.Data Integration: Combine data from multiple sources or datasets to create a unified view.
Resolve schema conflicts and align data fields.
6.Data Reduction: Reduce the dataset size using sampling or dimensionality reduction techniques.
7.Validation and Enrichment: Verify data accuracy and consistency through validation checks.
Enrich data by adding external information, such as geographic details or metadata.
Programming Libraries: Python's Pandas and NumPy, or R's dplyr and tidyr.
ETL Tools: Talend, Alteryx, and Apache Nifi for automating wrangling processes.
Visualization: Tools like Tableau or Excel for inspecting and cleaning data interactively.
What is Filtering?
Filtering is a data processing technique used to extract specific subsets of data based on defined
criteria or conditions.
This ensures that only relevant and useful information is retained for further analysis.
Types of Filtering:
1.Value-Based Filtering: Selecting data rows or elements that match a specific value or fall within
a certain range.
2.Condition-Based Filtering: Using logical conditions (e.g., greater than, less than, equal to) to
filter data.
3.Column-Based Filtering: Choosing specific columns in a dataset that are relevant for analysis.
Example: Selecting only "Name" and "Sales" columns from a sales dataset.
Methods of Filtering:
1.Manual Filtering: Using tools like Excel to manually sort and filter data based on user-defined
criteria.
3.Query-Based Filtering: Using SQL queries to filter data directly from databases.
Applications of Filtering:
Real-Time Systems: Filtering streaming data for actionable insights (e.g., alerts based on criteria).
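A small pandas sketch of value-, condition-, and column-based filtering; the sales data below is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Region": ["North", "South", "North", "South"],
    "Sales": [300, 450, 120, 500],
})

# Condition-based filtering: rows where Sales exceeds 250
high_sales = df[df["Sales"] > 250]

# Value-based filtering: rows matching a specific value
north = df[df["Region"] == "North"]

# Column-based filtering: keep only the relevant columns
subset = df[["Name", "Sales"]]

print(high_sales)
print(north)
print(subset)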
UNIT II
Mean?
Mean is the sum of all the values in the data set divided by the number of values in the data set.
It is also called the Arithmetic Average.
The Mean is denoted as x̅ and is read as x bar.
Mean = Sum of observations / Number of observations
Mean (x̄) = Σxi / n
Example: A cricketer's scores in five ODI matches are as follows: 12, 34, 45, 50, 24. To find his
average score in a match, we calculate the Mean of the data.
Mean = Sum of all observations / Number of observations
     = (12 + 34 + 45 + 50 + 24) / 5
     = 165 / 5
     = 33
Types of Data:
Data can be present in Raw Form or Tabular Form.
1.Raw Data
Let x1, x2, x3, . . . , xn be n observations. We can find the arithmetic mean using the mean formula:
Mean (x̄) = (x1 + x2 + ... + xn) / n
Example: If the heights of 5 people are 142 cm, 150 cm, 149 cm, 156 cm, and 153 cm, find the
mean height.
Mean height (x̄) = (142 + 150 + 149 + 156 + 153) / 5 = 750 / 5 = 150 cm
2. Tabular Form or Frequency Distribution Form
Mean (x̄) = (x1f1 + x2f2 + ... + xnfn) / (f1 + f2 + ... + fn)
Example 1: Find the mean of the following distribution.
x: 4, 6, 9, 10, 15
f: 5, 10, 10, 7, 8
Solution:
Calculation table for arithmetic mean:
xi    fi    xi*fi
4     5     20
6     10    60
9     10    90
10    7     70
15    8     120
Σfi = 40 ; Σxifi = 360
Mean (x̄) = (Σxifi) / (Σfi) = 360/40 = 9
Example 2: The following table indicates the data on the number of patients visiting a hospital in a
month. The data is in the form of class intervals. Find the average number of patients visiting the
hospital in a day.
Solution:
Class mark = (lower limit + upper limit)/2
Let x1, x2, x3, . . ., xn be the class marks of the respective classes.
Σfi = 30 and Σfixi = 860
Mean (x̄) = (Σfixi) / (Σfi) = 860/30 ≈ 28.67
So, on average about 29 patients visit the hospital in a day.
Standard Deviation?
Standard Deviation measures how spread out the values in a dataset are around the mean.
Standard Deviation (σ) = √( Σ(xi − x̄)² / n ), where σ = Standard Deviation, x̄ = Mean, and n = number of observations.
Example: For a dataset with mean x̄ = 5 whose squared deviations from the mean are 9, 1, 1, 1, 0, 0, 4, 16:
Variance = (9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8 = 32 / 8 = 4
Standard deviation = √4 = 2
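The same calculation can be checked with NumPy. The data values below (2, 4, 4, 4, 5, 5, 7, 9) are an assumed dataset whose mean is 5 and whose squared deviations match the worked example above.
import numpy as np

# Assumed dataset consistent with the worked example
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(data))   # 5.0
print(np.var(data))    # 4.0  (population variance)
print(np.std(data))    # 2.0  (population standard deviation)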
Skewness?
Skewness is an important statistical technique that helps to determine the asymmetrical behaviour
of the frequency distribution.
A distribution or dataset is symmetric if it looks the same to the left and right of the centre point.
Types of Skewness:
1.Positive (Right) Skewness: The tail on the right side of the distribution is longer; the mean is greater than the median.
2.Negative (Left) Skewness: The tail on the left side of the distribution is longer; the mean is less than the median.
Kurtosis?
Kurtosis describes the shape of a frequency distribution, in particular how peaked or flat it is; it is
also a characteristic of the frequency distribution.
Types of Kurtosis
1.Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve,
there is too much concentration of items near the central value.
2.Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this curve, items are
distributed evenly around the central value.
3.Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is
less concentration of items around the central value.
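A short sketch using scipy.stats to measure skewness and kurtosis; the sample values are illustrative.
import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 20])   # a right-skewed sample

print(skew(data))       # > 0 indicates positive (right) skew
print(kurtosis(data))   # excess kurtosis; > 0 suggests a leptokurtic shape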
Box plot?
Box plot is also known as a whisker plot, box and whisker plot, or box and whisker diagram.
Box plot is a graphical representation of the distribution of a dataset.
Box plot displays key summary statistics such as the median, quartiles, and potential outliers in a
concise and visual manner.
Elements of Box Plot: A box plot gives a five-number summary of a set of data, which includes:
1.Minimum – It is the minimum value in the dataset excluding the outliers.
2.First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
3.Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half above.
4.Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
5.Maximum – It is the maximum value in the dataset excluding the outliers.
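A minimal matplotlib box plot sketch; the data values are assumed for illustration.
import matplotlib.pyplot as plt

data = [7, 15, 13, 12, 9, 10, 11, 14, 8, 30]   # 30 will typically show up as an outlier

plt.boxplot(data)
plt.title("Box Plot Example")
plt.ylabel("Value")
plt.show()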
Pivot Table?
A Pivot Table is a data summarization tool that allows users to extract meaningful insights by
grouping, aggregating, and filtering data without altering the original dataset.
Pivot tables are commonly used in spreadsheet software like Microsoft Excel, Google Sheets, and
data analysis tools like Python (pandas) or SQL.
Components of a Pivot Table:
i. Rows: Categories or values displayed along the rows of the pivot table.
ii. Columns: Categories or values displayed across the columns of the pivot table.
iii. Values: The data being summarized (e.g., sum, count, average) based on a numeric field.
iv. Filters: Criteria to include or exclude specific data points from the analysis.
Creating a Pivot Table:
i. Select the dataset to be summarized.
ii. Insert Pivot Table: In Excel, go to Insert > Pivot Table; in Google Sheets, go to Data >
Pivot Table.
iii. Configure: Place fields into the Rows, Columns, Values, and Filters areas.
Example: A pivot table of sales totals by Region:
North    300      0
South      0    450
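A pandas sketch that builds a similar Region-wise summary with pivot_table(); the underlying sales records and the Product column labels are assumptions.
import pandas as pd

sales = pd.DataFrame({
    "Region":  ["North", "North", "South", "South"],
    "Product": ["A", "B", "A", "B"],
    "Amount":  [300, 0, 0, 450],
})

# Summarize total Amount by Region (rows) and Product (columns)
pivot = pd.pivot_table(sales, values="Amount", index="Region",
                       columns="Product", aggfunc="sum")
print(pivot)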
Common Uses:
Benefits
Heat Map?
Heat Map data visualization is a powerful tool used to represent numerical data graphically,
where values are depicted using colours.
This method is particularly effective for identifying patterns and trends within large datasets.
Types of Heat Maps
1. Website Heat Maps: Website Heat Maps are used to visualize user behaviour on web pages.
Click Maps: Show where users click on a webpage, highlighting popular elements.
Scroll Maps: Indicate how far users scroll down a page, revealing engagement levels.
Mouse Tracking Heat Maps: Visualize mouse movements on a page, showing where
users are hovering and clicking.
Eye-Tracking Heat Maps: Track users' eye movements on a page, revealing attention
patterns.
2.Grid Heat Maps: Display data in a two-dimensional grid, representing relationships between
variables.
Clustered Heat Maps: Group similar rows and columns based on a chosen similarity
metric, revealing hierarchical structures in the data.
Correlogram Heat Maps: Show the correlation between different variables, often used in
statistical analysis.
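A grid/correlogram heat map can be sketched with seaborn; the numeric data below is randomly generated purely for illustration.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random data standing in for a real dataset
df = pd.DataFrame(np.random.randn(100, 4), columns=["A", "B", "C", "D"])

# Correlogram heat map: colour-code the correlation matrix
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heat Map")
plt.show()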
Correlation Statistics?
Correlation is a statistical measure that describes the direction and strength of a relationship
between two variables.
Types of Correlation:
1.Positive Correlation : When one variable increases, the other also increases.
2.Negative Correlation : When one variable increases, the other decreases.
3.Zero Correlation : No relationship between variables.
What it measures:
Strength of the relationship: Correlation measures how tightly two variables are connected,
ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0
indicating no linear correlation.
Direction of the relationship: A positive correlation means that as one variable increases, the
other tends to increase as well, while a negative correlation means that as one variable
increases, the other tends to decrease.
Linearity: Correlation is most appropriate for variables that have a linear relationship,
meaning the relationship is a straight line when graphed.
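A brief sketch of computing correlation with pandas and NumPy; the hours and scores are made-up values showing a positive relationship.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score":    [52, 58, 65, 70, 78],
})

print(df.corr())                                            # Pearson correlation matrix
print(np.corrcoef(df["hours_studied"], df["exam_score"]))   # equivalent NumPy call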
ANOVA?
ANOVA (Analysis of Variance) is a statistical test in data science used to determine if significant
differences exist between the means of two or more groups
Types of ANOVA
One-way ANOVA: Used when there is one independent variable (factor) with multiple levels
(groups).
Two-way ANOVA: Used when there are two or more independent variables (factors), allowing
for the analysis of interaction effects between factors.
When to Use ANOVA:
More than Two Groups: When you want to compare the means of more than two groups, ANOVA
is a suitable alternative to the t-test, which is used for comparing the means of two groups.
Independent Variables: ANOVA can be used to examine the impact of one or more independent
variables (factors) on a dependent variable (outcome).
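A one-way ANOVA can be run with scipy.stats.f_oneway(); the three groups below are illustrative samples.
from scipy.stats import f_oneway

# Illustrative scores for three independent groups
group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [90, 92, 94, 89, 91]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print("F-statistic:", f_stat)
print("p-value:", p_value)   # a small p-value suggests the group means differ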
Databases?
Database is a collection of structured data or information that is stored in a computer system and
can be accessed easily.
A database is usually managed by a Database Management System (DBMS ).
1.Document Databases
Instead of storing data in rows and columns (tables), a document database stores data as
documents, typically in JSON, BSON, or XML format, which makes the document data model quite
different from other data models.
Popular Document Databases & Use Cases
2. Wide-Column Databases
Wide-column databases store data in tables whose rows can have flexible, dynamically defined
columns grouped into column families (e.g., Cassandra, HBase).
3. Graph-Based Databases
Graph-Based Databases focus on the relationship between the elements.
It stores the data in the form of nodes in the database.
The connections between the nodes are called Links or Relationships.
Data is represented as nodes (objects) and edges (connections).
Popular Graph Databases & Use Cases
UNIT III
Python Libraries?
A Library is a collection of pre-written code modules that contains reusable sets of code, classes,
values, and templates. Some of the Python libraries used in data science are:
i.NumPy: NumPy is a fundamental library for numerical computing and provides support for
multidimensional arrays, matrices, statistics, linear algebra, etc.
Install commands: pip install numpy
ii.Pandas: Pandas provides high-performance data structures (Series and DataFrame) and tools for
cleaning, manipulating, and analysing data.
Install commands: pip install pandas
iii.Matplotlib: Matplotlib works with Python scripts, Jupyter Notebook, web applications, and other
graphic user interfaces (GUI) to generate plots, which makes it a versatile visualisation tool for data
scientists.
iv. Seaborn: Seaborn is a library built on top of the Matplotlib library and helps make statistical
graphics more straightforward.
Install commands: conda install seaborn or pip install seaborn
v.TensorFlow: Open-source library developed by Google, widely used for deep learning and other
machine learning tasks.
Install commands: pip install tensorflow
vi.Plotly:Interactive graphing library for creating web-based visualizations.
Install commands: pip install plotly
Python Integrated Development Environments (IDEs) for Data Science:
i. Jupyter Notebook: Jupyter Notebook is an open-source IDE that is used to create Jupyter
documents that can be created and shared with live codes. It is a web-based interactive
computational environment. It can support various languages in data science such as Python,
Julia, Scala, R, etc.
ii. Spyder: Spyder is an open-source Python IDE that was originally created and developed by
Pierre Raybaut in 2009. It can be integrated with many different Python packages such as
NumPy, SymPy, SciPy, pandas, IPython, etc.
iii. Sublime text: Sublime text is a proprietary code editor and it supports a Python API. Some of
the features of Sublime Text are project-specific preferences, quick navigation, supportive
plugins for cross-platform, etc.
iv. Visual Studio Code: Visual Studio Code is a code editor that was developed by
Microsoft. Some of the features of Visual Studio Code are embedded Git control, intelligent code
completion, support for debugging, syntax highlighting, code refactoring, etc.
v. Pycharm: Pycharm is a Python IDE developed by JetBrains and created specifically for
Python. Pycharm is particularly useful in machine learning because it supports libraries such as
Pandas, Matplotlib, Scikit-Learn, NumPy, etc.
NumPy Basics - Arrays and Vectorized Computation:
Defining a NumPy Array: Arrays are typically created using the np.array() function.
Vectorization in NumPy is a method of performing operations on entire arrays without explicit loops.
Example:
import numpy as np
a1 = np.array([2,4,6,8,10 ])
number= 2
result = a1 + number
print(result)
Output: [ 4 6 8 10 12]
Practical Examples of Vectorization
i. Adding two arrays together with Vectorization
Example:
import numpy as np
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
result = a1 + a2
print(result)
Output: [5 7 9]
ii. Element Wise Multiplication with array
Example:
import numpy as np
a1 = np.array([1, 2, 3, 4])
result = a1 * 2
print(result)
Output: [2 4 6 8]
iii. Logical Operations on Arrays: Logical operations such as comparisons can be applied
directly to arrays.
Example:
import numpy as np
a1 = np.array([10, 20, 30])
result = a1 > 15
print(result)
Output: [False True True]
iv.Matrix Operations Using Vectorization:NumPy supports vectorized matrix operations like
dot products and matrix multiplications using functions such as np.dot and @.
Example:
import numpy as np
a1= np.array([[1, 2], [3, 4]])
a2 = np.array([[5, 6], [7, 8]])
result = np.dot(a1, a2)
print(result)
Output: [[19 22]
[43 50]]
NumPy ndarray?
ndarray is short for N-dimensional array, which is an important component of NumPy.
It allows us to store and manipulate large amounts of data efficiently.
All elements in an ndarray must be of the same type, making it a homogeneous array.
Example:
import numpy as np
arr1 = np.array([1, 2, 3, 4, 5])                           # 1-D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])                    # 2-D array
arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])      # 3-D array
print(arr1)
print(arr2)
print(arr3)
Output: [1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[[[1 2]
  [3 4]]
 [[5 6]
  [7 8]]]
Attributes of ndarray:
i. ndarray.shape: Returns a tuple representing the shape (dimensions) of the array.
ii. ndarray.ndim: Returns the number of dimensions (axes) of the array.
iii. ndarray.size: Returns the total number of elements in the array.
iv. ndarray.dtype: Provides the data type of the array elements.
v. ndarray.itemsize: Returns the size in bytes of each element
Example:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape)
print("Dimensions:", arr.ndim)
print("Size:", arr.size)
print("Data type:", arr.dtype)
print("Item size:", arr.itemsize)
Output: Shape: (2, 3)
Dimensions: 2
Size: 6
Data type: int32
Item size: 4
Creating nd array?
The array object in NumPy is called ndarray.We can create a NumPy ndarray object by using
the array() function.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Output: [1 2 3 4 5]
<class 'numpy.ndarray'>
Note: type():This built in python function tells us the type of object passed to it.
To create an ndarray, we can pass a list, tuple object into the array() method, and it will be
converted into an ndarray.
Example:
import numpy as np
arr1 = np.array((1, 2, 3, 4, 5))
arr2 = np.array([1,2,3,4,5])
print(arr1)
print(arr2)
Output: [1 2 3 4 5]
[1 2 3 4 5]
Data Types for ndarrays:
i. Numeric Data Types: Integers (int8, int16, int32, int64) and floating-point numbers (float16, float32, float64).
ii. Boolean Data Type: bool_: Boolean value (True or False), stored as a single byte.
iii. String Data Types
Example:
import numpy as np
arr_int = np.array([1, 2, 3], dtype=np.int16)
arr_float = np.array([1.0, 2.0, 3.0], dtype=np.float32)
arr_str = np.array(['numpy', 'statistics'])
print(arr_int.dtype)
print(arr_float.dtype)
print(arr_str.dtype)
Output: int16
float32
<U10
ii. Subtraction of Arrays: This function subtracts each element of the second array from the
corresponding element in the first array using np.subtract() function.
iii.Multiplication of Arrays: Multiplication in NumPy can be done element-wise using
the np.multiply() function.
iv. Division of Arrays: This divides each element of the first array by the corresponding element
of the second array using the np.divide() function.
v. Power of Arrays: This raises each element of the first array to the power of the corresponding
element of the second array using the np.power() function.
vi. Modulus Operation: It finds the remainder when one number is divided by another using
the np.mod() function.
Example:
import numpy as np
a = np.array([7, 67, 17, 250])
b = np.array([2, 9, 30, 50])
add_ans = np.add(a, b)
sub_ans = np.subtract(a, b)
mul_ans = np.multiply(a, b)
div_ans = np.divide(a, b)
pow_ans = np.power(a, b)
mod_ans = np.mod(a, b)
print(add_ans)
print(sub_ans)
print(mul_ans)
print(div_ans)
print(pow_ans)
print(mod_ans)
Output:
[  9  76  47 300]
[  5  58 -13 200]
[   14   603   510 12500]
[3.5        7.44444444 0.56666667 5.        ]
[        49 -364726493  632059105          0]
(the power results overflow the integer type, so their exact values vary by platform)
[ 1  4 17  0]
Basic Indexing: Individual elements of an array are accessed with square brackets; negative indices
count from the end of the array.
Example:
import numpy as np
arr1 = np.array([10, 20, 30, 40])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr1[0])
print(arr1[-1])
print(arr2[0, 1])
print(arr2[1, -1])
Output: 10
40
2
6
Slicing: Slicing extracts sub arrays using the syntax array[start:stop:step], which can be applied to
each dimension of a multi-dimensional array.
Example:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50])
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr1[1:4])
print(arr1[:3])
print(arr1[::2])
print(arr1[::-1])
print(arr2[0:2, 1:3])
print(arr2[:, 1])
print(arr2[1:, :])
Output:
[20 30 40]
[10 20 30]
[10 30 50]
[50 40 30 20 10]
[[2 3]
 [5 6]]
[2 5 8]
[[4 5 6]
 [7 8 9]]
Boolean Indexing?
Boolean indexing in NumPy uses a Boolean array to select elements from an array that meets a
certain condition.
The boolean array acts as a filter, where True values indicate the elements to be selected,
and False values indicate the elements to be excluded.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
result = arr[arr > 5]   # keep only the elements greater than 5 (illustrative condition)
print(result)
Output: [ 6  7  8  9 10]
Transposing Arrays: Transposing flips an array over its diagonal, turning rows into columns, using
the .T attribute or the transpose() method.
i. Transposing a 2D Array
import numpy as np
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
transposed = array_2d.T
print(transposed)
Output:
[[1 4]
 [2 5]
 [3 6]]
ii.Transposing a 3D Array
import numpy as np
array_3d = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4); the values are illustrative
transposed_3d = array_3d.transpose()
print(transposed_3d.shape)
custom_transpose = array_3d.transpose(2, 1, 0)
print(custom_transpose.shape)
Output: (4, 3, 2)
(4, 3, 2)
Swapping Axis: It focuses on swapping only two specified axes while leaving the rest
unchanged, using the numpy swapaxes() function.
The syntax of numpy.swapaxes() is: numpy.swapaxes(array, axis1, axis2)
array : is the input array whose axes are to be swapped.
axis1 : The first axis to be swapped.
axis2 : The second axis to be swapped with axis1.
i.Swapping Axes in a 2D Array
Example:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original array:")
print(arr)
swapped = np.swapaxes(arr, 0, 1)
print(swapped)
Output:
Original array:
[[1 2 3]
 [4 5 6]]
[[1 4]
 [2 5]
 [3 6]]
ii.Swapping Axes in a 3D Array
Example:
import numpy as np
array_3d = np.random.rand(2, 3, 4)
swapped_axes = np.swapaxes(array_3d, 0, 1)
print(swapped_axes.shape)
print(array_3d.shape)
Output:
(3, 2, 4)
(2, 3, 4)
Universal Functions (ufuncs): Universal functions perform fast element-wise operations on ndarrays.
Mathematical Functions:
Arithmetic Operations: NumPy ufuncs support basic arithmetic like addition, subtraction,
multiplication, division, and exponentiation.
Rounding: Functions like round(), trunc(), and floor() are available for rounding values in arrays.
Trigonometric Functions: sin(), cos(), tan(), and their inverses are included for trigonometric
calculations.
Exponential and Logarithmic Functions: exp(), log(), log10() allow for exponential and
logarithmic operations.
Statistical Functions:
Probability Theory: NumPy can be used with probability theory to calculate probabilities,
distributions, and other statistical concepts.
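A short sketch of a few of these mathematical and statistical methods on an assumed array.
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(np.mean(arr))               # 3.5
print(np.median(arr))             # 3.5
print(np.std(arr))                # population standard deviation
print(arr.sum(axis=0))            # column-wise sums: [5 7 9]
print(np.cumsum(arr))             # running total of all elements
print(np.exp(np.array([0, 1])))   # element-wise exponential: [1. 2.71828183]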
i. Using np.sort() with ufuncs: The np.sort() function returns a sorted copy of an array.
Example:
import numpy as np
arr = np.array([5, 2, 8, 1, 9])
sorted_arr = np.sort(arr)
print(sorted_arr)
squared_arr = np.square(arr)
sorted_squared_arr = np.sort(squared_arr)
print(sorted_squared_arr)
Output:
[1 2 5 8 9]
[ 1 4 25 64 81]
Note: np.sort() can also sort multi-dimensional arrays; by default it sorts along the last axis, and an
axis argument (e.g., axis=0) selects a different one.
Set Operations on Arrays:
np.intersect1d(ar1, ar2, assume_unique=False): Returns the sorted, unique values that are in both
ar1 and ar2 (intersection).
np.setdiff1d(ar1, ar2, assume_unique=False): Returns elements in ar1 that are not in ar2
(difference).
np.setxor1d(ar1, ar2, assume_unique=False): Returns elements that are in either ar1 or ar2, but
not both (symmetric difference).
np.in1d(ar1, ar2, assume_unique=False): Tests whether each element of ar1 is in ar2 (returns a
boolean array).
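A quick sketch of these set operations; the two arrays are arbitrary examples.
import numpy as np

ar1 = np.array([1, 2, 3, 4, 5])
ar2 = np.array([4, 5, 6, 7])

print(np.intersect1d(ar1, ar2))   # [4 5]
print(np.setdiff1d(ar1, ar2))     # [1 2 3]
print(np.setxor1d(ar1, ar2))      # [1 2 3 6 7]
print(np.in1d(ar1, ar2))          # [False False False  True  True]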
UNIT IV
Series: Pandas Series is a fundamental data structure, it is a one-dimensional labelled array
capable of holding data of any type i.e., integers, strings, floating point numbers, Python objects,
etc.
Creating a Pandas Series:
Example: Creating a Series from a dictionary.
import pandas as pd
data = {'a': 1, 'b': 2, 'c': 3}
series = pd.Series(data)
print(series)
Output: a    1
b    2
c    3
dtype: int64
Example iv : Scalar values
import pandas as pd
series=pd.Series(10,index=['a','b','c'])
print(series)
Output: a 10
b 10
c 10
dtype: int64
Example: Creating a Series with a custom index.
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
Output: a 10
b 20
c 30
d 40
dtype: int64
Data Frame: Pandas Data Frame is a two-dimensional, size-mutable tabular data structure
with labelled axes (rows and columns).
Pandas Data Frame consists of three principal components: Data, Rows, and Columns.
Creating a Pandas Data Frame
Example i: Using a dictionary of lists.
import pandas as pd
data = {'Name': ['A', 'B', 'C'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df1 = pd.DataFrame(data)
print(df1)
Output:   Name  Age         City
0    A   25     New York
1    B   30  Los Angeles
2    C   35      Chicago
i. Column Selection: Access one or more columns by their names using square brackets
(e.g., df['column_name']).
Example: import pandas as pd
data = {'Name': ['A', 'B', 'C'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df= pd.DataFrame(data)
print(df['Name'])
Output: 0    A
1    B
2    C
Name: Name, dtype: object
Dropping entries: Dropping entries refers to removing rows or columns by index or column
name from a Data Frame in Pandas. This is typically done using the drop() method.
i. Identify the entries to be dropped: Determine which rows or columns you want to remove.
iv. Set inplace=True if you want to modify the original Data Frame: by default, drop() returns a
new Data Frame with the entries dropped.
v. Example:
import pandas as pd
data = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [9, 10, 11, 12]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
new_df = df.drop(index=['b', 'd'], columns=['col2', 'col3'])
print(new_df)
Output:    col1
a     1
c     3
Indexing in Pandas :Indexing refers to selecting specific rows and columns from a Data
Frame.
It allows you to subset data in various ways, such as selecting all rows with specific columns,
some rows with all columns, or a subset of both rows and columns.
This technique is also known as Subset Selection.
Methods for indexing:
i.Square bracket ([]) indexing: It is the most basic way to access columns.
Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df['col_1'])   # Returns the column named 'col_1'
Output: 0    3
1    2
2    1
3    0
Name: col_1, dtype: int64
ii.loc indexing: It is label-based, allowing access to rows and columns using their labels.
Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df.loc[0])           # Returns the row with index label 0
print(df.loc[:, 'col_2'])  # Returns the column with label 'col_2'
Output: col_1    3
col_2    a
Name: 0, dtype: object
0    a
1    b
2    c
3    d
Name: col_2, dtype: object
iii.iloc indexing: It is integer position-based, allowing rows and columns to be accessed by their
integer positions.
Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df['col_1']) # Returns the column named 'col_1'
print(df.iloc[0]) # Returns the first row
print(df.iloc[:, 1]) # Returns the second column
print(df.iloc[0, 0]) # Returns the element at the first row and first column
Output: 0 3
1 2
2 1
3 0
Name: col_1, dtype: int64
col_1 3
col_2 a
Name: 0, dtype: object
0 a
1 b
2 c
3 d
Name: col_2, dtype: object
3
iv.Boolean indexing: It selects the rows that satisfy a logical condition.
Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df[df['col_1'] > 1])   # Returns rows where 'col_1' is greater than 1
Output:    col_1 col_2
0      3     a
1      2     b
Sorting: Sorting means arranging data in a particular order, either ascending or descending.
Examples: Ordering a list of names alphabetically, sorting numerical data from smallest to largest.
Python offers several built-in functions and libraries like sorted(), sort(), and Pandas for sorting and
ranking data in different ways.
Sorting in Python:
sorted(): This function takes an iterable (like a list or tuple) and returns a new sorted list without
modifying the original.
sort(): This method is used to sort lists directly, modifying the original list in place.
Example: i. Sorting a list of numbers.
numbers = [5, 2, 8, 1, 9, 4]
sorted_numbers = sorted(numbers)
print(sorted_numbers)
numbers.sort()
print( numbers)
Output: [1, 2, 4, 5, 8, 9]
[1, 2, 4, 5, 8, 9]
Example: ii. Sorting a list of strings.
strings = ["apple", "banana", "orange", "grape"]
sorted_strings = sorted(strings)
print(sorted_strings)
strings.sort()
print(strings)
Output: ['apple', 'banana', 'grape', 'orange']
['apple', 'banana', 'grape', 'orange']
Example: iii. Sorting by a custom key: This will arrange the words from shortest to longest.
words = ["banana", "kiwi", "apple", "fig"]   # illustrative word list
sorted_words = sorted(words, key=len)
print(sorted_words)
Output: ['fig', 'kiwi', 'apple', 'banana']
Ranking: Ranking assigns a position (rank) to each value based on its order within the data.
Examples: Ranking search results based on relevance, ranking products based on sales volume.
Ranking in Python:
rank(): This Pandas method assigns ranks to values in a Series; the method parameter (e.g., 'min',
'max', 'first') controls how ties are handled.
import pandas as pd
df = pd.DataFrame({ 'A': [120, 40, 40, 30, 60], 'B': [15, 35, 55, 75, 95], 'C': [10, 30, 50, 70, 90],
'D': [5, 25, 45, 65, 85], 'E': [1, 21, 41, 61, 81] })
df['ranked_df_A_min'] = df['A'].rank(method='min')
df['ranked_df_A_max'] = df['A'].rank(method='max')
df['ranked_df_A_first'] = df['A'].rank(method='first')
print(df)
Output:
A B C D E ranked_df_A_min ranked_df_A_max ranked_df_A_first
0 120 15 10 5 1 5.0 5.0 5.0
1 40 35 30 25 21 2.0 3.0 2.0
2 40 55 50 45 41 2.0 3.0 3.0
3 30 75 70 65 61 1.0 1.0 1.0
4 60 95 90 85 81 4.0 4.0 4.0
Unique Values: Unique Values refer to the distinct entries present in a dataset, column, or
array.
unique(): This method, applied to a Series, returns a NumPy array of unique values in the order
they appear. It is efficient and includes NaN values if present.
nunique(): This method counts the number of unique values in a Series or each column of a Data
Frame. It's useful for summarizing data.
Example:
import pandas as pd
data = {'col1': [1, 2, 2, 3, 4, 4, None]}
df = pd.DataFrame(data)
unique_values = df['col1'].unique()
print(unique_values)
num_unique = df['col1'].nunique()
print(num_unique)
Output: [1. 2. 3. 4. nan]
4
Value Counts: The value_counts() method returns how many times each distinct value appears in a
Series.
Example:
import pandas as pd
sr = pd.Series(['Apple', 'Cat', 'Lion', 'Rabbit', 'Cat', 'Lion'])
print(sr.value_counts())
Output: Cat 2
Lion 2
Apple 1
Rabbit 1
Name: count, dtype: int64
Membership: Membership testing involves checking if a specific value exists within a dataset
or collection.
The primary method for this is .isin().
It returns a boolean Series indicating whether each element in the original Series/DataFrame is
found in the specified set of values.
Example:
import pandas as pd
# Create a sample Series
s = pd.Series(['apple', 'banana', 'cherry', 'guava'])
mask = s.isin(['apple', 'guava'])
print(mask)
Output: 0 True
1 False
2 False
3 True
dtype: bool
Reading and Writing Data in Text Format: Python's built-in open() function (and Pandas functions
such as read_csv() and to_csv()) can be used to read and write data in text files.
Example:
with open('example.txt', 'w') as file:
    file.write('Hello, World!\n')
    file.write('Hello, World!\n')
with open('example.txt', 'r') as file:
    print(file.read())
Output:
Hello, World!
Hello, World!
UNIT 5
Data Cleaning and Data Preparation?
Data Cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies
in datasets to ensure data quality and reliability.
Data Preparation is the process of making raw data ready for further processing and analysis.
Key Steps in Data Cleaning:
i. Remove Duplicates: Detect and delete duplicate rows that may skew analysis.
ii. Handle Missing Values: Impute missing entries (e.g., with the mean, median, or mode) or drop
incomplete records.
iii. Fix Structural Errors: Standardize formats (e.g., “USA” vs. “U.S.A”).
iv. Filter Outliers: Identify and handle extreme values using statistical methods (e.g., Z-score,
IQR).
v. Validate Data Types: Ensure correct data types (e.g., date fields are in date time format,
numeric columns are not stored as strings).
ii. Encode Categorical Variables: Convert categories into numerical form using:
One-hot encoding.
Label encoding.
iii. Feature Engineering: Create new features from existing ones to better represent the underlying
problem.
iv. Split Data: Train/test split for ML workflows (e.g., 80/20 or 70/30 split).
v. Scaling and Transformation: Apply log, square root, or Box-Cox transformations if data is
skewed.
Data Transformation?
Data Transformation refers to converting data into a suitable format or structure to improve
analysis, modelling, and interpretation.
Normalization – Standardizing data to ensure uniformity in scale, format, or unit (e.g., converting
currency values to a standard denomination).
Filtering – Removing unwanted or irrelevant data to enhance quality and reduce noise.
Joining – Merging multiple datasets based on a common key to create enriched datasets.
Sorting – Organizing data in a specific order (ascending or descending) for better readability and
processing.
Encoding – Converting data into different formats such as categorical data to numerical
representations.
Removing Duplicates?
Removing Duplicates is the process of identifying and eliminating redundant records from a
dataset.
How to Remove Duplicates?
Several methods can be used to remove duplicates, depending on the data format and the tool used:
Python Pandas: The drop_duplicates() function in Pandas can be used to remove duplicate rows
from a Data Frame.
SQL: SQL databases offer various methods for removing duplicates, including
using DELETE statements with JOIN or GROUP BY clauses.
Excel: Excel's "Remove Duplicates" feature allows users to identify and eliminate duplicates
based on specified columns.
Manual Review and Editing: For small datasets or specific scenarios, manual inspection and
removal of duplicates may be necessary .
Efficiency and Resource Optimization: Removing duplicates reduces dataset size, leading to
faster processing times and lower computational costs.
Transforming data using a function or mapping is a powerful technique in data science to clean,
standardize, or create new features.
Function: This approach involves defining a function that takes an input, processes it, and returns a
transformed output.
Implementation: Can be implemented in various programming languages (e.g., Python, Java) and
can be very flexible for complex transformations.
Mapping: Mapping refers to the process of applying a function to each element in a collection,
producing a new collection of the same size.
Implementation: Often involves tools and interfaces for visual mapping and configuration.
Functions: When you need to apply a specific transformation rule or calculation to each data
element (e.g., data cleaning, normalization, feature engineering).
Mappings: When you need to align data between different systems or data sources, particularly
when dealing with data integration and ETL processes.
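A minimal pandas sketch of transforming data with a function (apply) and a mapping (map); the price data, tax rate, and mapping dictionary are assumptions.
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY"], "price": [100.0, 250.0, 80.0]})

# Function: apply a transformation rule to each value
df["price_with_tax"] = df["price"].apply(lambda p: round(p * 1.08, 2))

# Mapping: replace each code with its full name using a dictionary
city_names = {"NY": "New York", "LA": "Los Angeles"}
df["city_full"] = df["city"].map(city_names)

print(df)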
Replacing values?
Replacing values refers to modifying specific data points within a dataset whether due to errors,
missing values, inconsistencies, or the need for standardization.
Handling Missing Data: Filling in gaps using methods like mean, median, mode, or predictive
techniques.
Encoding Categories: Replacing text labels with numerical values for machine learning models.
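A pandas sketch of replacing values; the sentinel error code and the mean-fill strategy are assumed for illustration.
import pandas as pd
import numpy as np

df = pd.DataFrame({"score": [88, -999, 92, np.nan, 75]})

# Replace an assumed sentinel error code with NaN, then fill missing values with the mean
df["score"] = df["score"].replace(-999, np.nan)
df["score"] = df["score"].fillna(df["score"].mean())
print(df)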
Detecting and Filtering Outliers?
An Outlier is a data item/object that deviates significantly from the rest of the objects.
How Outliers are Caused?
Measurement errors : Errors in data collection or measurement processes can lead to
outliers.
Sampling errors : In some cases, outliers can arise due to issues with the sampling process.
Data entry errors : Human errors during data entry can introduce outliers.
i. Detection Methods
a. Statistical Methods
Z-Score (Standard Score): Measures how many standard deviations a data point is from the mean.
Interquartile Range (IQR): Uses the spread of the middle 50% of data.
Box plots: Visually identify outliers as points beyond the whiskers (Q1 - 1.5·IQR or Q3 +
1.5·IQR).
ii. Handling Strategies
Removal: Delete outliers if they are errors or irrelevant (e.g., data entry mistakes).
Transformation: Apply log, square root, or other transformations to reduce the effect of extreme
values.
Treat as Separate Class: Model outliers separately if they represent a meaningful subgroup (e.g.,
fraud detection).
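A sketch of detecting and filtering outliers with the IQR rule; the data values are illustrative.
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
filtered = s[(s >= lower) & (s <= upper)]

print(outliers)   # the detected extreme values
print(filtered)   # the data with outliers removed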
Line Plots?
A Line Plot is a graphical representation of data in which individual data points are plotted along
a line to display the relationship between two variables.
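A minimal matplotlib line plot sketch; the months and sales figures are assumed sample values.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 150, 145, 170]

plt.plot(months, sales, marker="o")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Line Plot Example")
plt.show()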
Bar Plots?
Bar plots are used to visualize categorical data, comparing quantities across different categories
with rectangular bars.
Example:
import matplotlib.pyplot as plt
categories = ["A", "B", "C", "D"]   # illustrative category labels
values = [25, 40, 30, 55]
plt.bar(categories, values)
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Plot Example")
plt.show()
Histograms?
A Histogram shows the distribution of a numeric variable by grouping values into bins and plotting
the count of values in each bin.
Example:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, color="skyblue", edgecolor="black")
plt.xlabel("Value")
plt.title("Histogram of Random Data")
plt.show()
Density Plot: Density plot, also known as a kernel density estimate (KDE) plot, provides a
smoothed representation of the data distribution. It estimates the probability density function of the
data, showing the likelihood of different values occurring.
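A density (KDE) plot can be sketched with pandas' built-in plotting (which uses SciPy for the kernel density estimate); random data stands in for a real column.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.Series(np.random.randn(1000))

# Kernel density estimate of the distribution
data.plot(kind="kde")
plt.xlabel("Value")
plt.title("Density Plot Example")
plt.show()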
Scatter Plot: A Scatter Plot displays the relationship between two numeric variables by plotting
individual data points.
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]   # illustrative data
y = [2, 4, 5, 4, 7]
plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()
Output: (displays a scatter plot of the sample points)
Point Plot: A Point Plot, often created using libraries like Seaborn, represents an estimate of central
tendency for a numeric variable by the position of the dot and provides an indication of the
uncertainty around that estimate using error bars.
Create a basic Point Plot using Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'day': ['Mon', 'Tue', 'Wed', 'Thu'],   # illustrative data
                   'sales': [10, 15, 12, 18]})
sns.pointplot(x='day', y='sales', data=df)
plt.title('Point Plot')
plt.show()
Output: (displays a point plot of the sample data)