
ZENEX VISION DEGREE COLLEGE BSC(CS) FOUNDATIONS OF DATA SCIENCE

UNIT-1
Introduction to Data Science: Need for Data Science - What is Data Science - Evolution of Data
Science, Data Science Process - Business Intelligence and Data Science - Prerequisites for a Data
Scientist. Applications of Data Science in various fields - Data Security Issues
Data Collection Strategies: Data Pre-Processing Overview, Data Cleaning, Data Integration and
Transformation, Data Reduction, Data Discretization, Data Munging, Filtering

UNIT-II
Descriptive Statistics: Mean, Standard Deviation, Skewness and Kurtosis; Box Plots - Pivot Table -
Heat Map - Correlation Statistics - ANOVA
NoSQL: Document Databases, Wide-column Databases and Graph Databases

UNIT – III
Python for Data Science - Python Libraries, Python Integrated Development Environments (IDEs) for
Data Science
NumPy Basics: Arrays and vectorized computation - The NumPy ndarray - Creating ndarrays - Data
Types for ndarrays - Arithmetic with NumPy Arrays - Basic Indexing and Slicing - Boolean
Indexing - Transposing Arrays and Swapping Axes
Universal Functions: Fast Element-Wise Array Functions- Mathematical and Statistical Methods
Sorting

UNIT-IV
Introduction to pandas Data Structures: Series, DataFrame and Essential Functionality:
Dropping Entries - Indexing, Selection, and Filtering - Function Application and Mapping -
Sorting and Ranking. Summarizing and Computing Descriptive Statistics - Unique Values, Value
Counts, and Membership. Reading and Writing Data in Text Format

UNIT-V
Data Cleaning and Preparation: Handling Missing Data - Data Transformation: Removing
Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Detecting and
Filtering Outliers.
Plotting with pandas: Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point Plots


UNIT 1
Introduction to Data Science: Data Science is an interdisciplinary field that uses
scientific methods, algorithms, processes, and systems to extract knowledge and insights from
structured and unstructured data. It combines elements of statistics, computer science, mathematics,
and domain expertise to analyse data and solve complex problems.
Key Components of Data Science:

Data Collection: Gathering data from various sources such as databases, sensors, or web scraping.

Data Cleaning: Preparing the data by handling missing values, removing duplicates, and correcting
errors.

Data Analysis: Using statistical methods and tools to identify patterns and trends in the data.

Data Visualization: Representing data through graphs, charts, and dashboards to make insights
more understandable.

Machine Learning: Employing algorithms that allow computers to learn from data and make
predictions or decisions.

Big Data Technologies: Working with tools and platforms designed for handling large and
complex data sets (Ex: Hadoop, Spark).

Need for Data Science?


Data science is critical in today’s world due to the explosion of data and the need to extract
actionable insights from it.

i. Decision-Making: Data science enables evidence-based decisions by analyzing large datasets to uncover patterns, trends, and correlations. Businesses use this to optimize strategies, like targeting customers or improving operations.

ii. Business Growth: Companies leverage data science for competitive advantage, such as personalized marketing (e.g., Netflix recommendations) or supply chain optimization (e.g., Amazon's logistics).

iii. Innovation: Data science fuels advancements in AI, machine learning, and automation.
Industries like healthcare use it for predictive diagnostics.

iv. Problem-Solving: It tackles complex challenges, from climate modeling to urban planning, by
processing vast datasets that humans can’t manually analyze.


v. Job Demand: The U.S. Bureau of Labor Statistics projects a 36% growth in data science jobs
from 2023 to 2033, far above average, with median earnings around $108,000 annually.

vi. Fraud Detection & Cyber security – Banks and organizations use data science to detect
anomalies and prevent fraud.

What is Data Science?


Data science is an interdisciplinary field that combines elements of computer science, statistics,
mathematics, and domain expertise to process, model, and interpret data for decision-making.
Core Components of Data Science:

Data Collection: Gathering data from diverse sources, such as sensors, databases, and the web.

Data Processing: Cleaning, organizing, and preparing raw data for analysis.

Exploratory Data Analysis (EDA): Understanding data patterns, trends, and relationships.

Modelling and Algorithms: Employing machine learning, AI, or statistical techniques to create
predictive models.

Data Visualization: Representing results through graphs, charts, or dashboards to communicate findings effectively.

Evolution of Data Science?


1. Early Data Analysis (Before 1900s): Early statistical techniques emerged in the 17th and 18th
centuries, laying the foundation for modern data analysis.

2. The Age of Statistics (1900s–1950s): The early 20th century saw significant advancements in
statistical theory and mathematical modelling, with pioneers like Karl Pearson and Ronald Fisher.

3. The Rise of Computing (1960s–1970s): The invention of computers transformed data analysis,
enabling faster computations and handling of larger datasets.

4. Birth of Modern Data Science (1980s–1990s): The term "data science" started emerging as
computing power and software capabilities expanded.

5. Big Data Era (2000s): The explosion of data from the internet, social media, and mobile technologies led to the era of "big data." Technologies like Hadoop and Apache Spark were developed to process and analyse massive datasets.


6. AI and Deep Learning Revolution (2010s–Present): Advances in machine learning and deep learning, together with cloud computing platforms like AWS, Google Cloud, and Azure, made it easier to store, process, and model data.
7. Future Directions: The future of data science includes ethical AI, explainable AI (XAI), and real-time data analysis with technologies like IoT.

What is Data Science Process?


The Data Science Process is a structured workflow used to solve problems and extract actionable
insights from data.

1. Problem Definition: Clearly define the problem or question that needs to be addressed.
Understand the objectives and desired outcomes.

2. Data Collection: Gather data from various sources, such as databases, APIs, sensors, or web
scraping. Ensure the data is relevant to the problem being addressed.

3. Data Cleaning: Handle missing values, remove duplicates, and correct errors in the data.
Convert data into a usable format, such as normalizing or encoding variables.

4. Exploratory Data Analysis (EDA): Analyse and visualize data to understand patterns, trends,
and relationships. Identify outliers or anomalies that may impact the analysis.

5. Feature Engineering: Select and transform relevant variables (features) to improve model
performance. Create new features that add value to the analysis.

6. Model Building: Choose appropriate algorithms based on the problem (e.g., regression,
classification, clustering). Train and validate the models using a subset of the data.
7. Model Evaluation: Assess the model's performance using metrics like accuracy, precision,
recall, or RMSE. Optimize and fine-tune the model if necessary.
8. Deployment: Deploy the model into production to make predictions or decisions based on live
data. Monitor its performance to ensure it meets the desired objectives.
9. Insights and Action: Interpret the results and translate them into actionable insights. Implement
decisions or strategies informed by the data analysis.

What is Business Intelligence and Data Science?


Business Intelligence (BI):

1.Definition: BI involves analysing historical and current data to generate insights, create reports, and support decision-making within organizations.


2.Focus: Primarily concerned with descriptive analytics—what happened and why. Focuses on
reporting, dashboards, and visualization of data for stakeholders.

3.Tools: Common BI tools include Tableau, Power BI, QlikView, and Excel.

4.Use Case: Used for tracking key performance indicators (KPIs), generating operational insights,
and ensuring effective business processes.

5.Audience: Designed for business users and decision-makers who require insights to monitor and
plan their operations.

Data Science:

1.Definition: Data Science is a more advanced, interdisciplinary field that uses statistics, machine
learning, and computational methods to extract deeper insights, predict outcomes, and solve
complex problems.

2.Focus: Goes beyond historical analysis to predictive and prescriptive analytics—what will
happen and what actions should be taken. Involves building models, algorithms, and performing
hypothesis testing.

3.Tools: Data Science tools include Python, R, Apache Spark, TensorFlow.

4.Use Case: Applied in tasks like customer segmentation, fraud detection, personalized
recommendations, and predictive maintenance.

5.Audience: Typically caters to data scientists, analysts, and technical teams skilled in coding and
algorithm development.

How They Work Together:

 BI focuses on understanding and reporting what has already happened, making it valuable
for day-to-day operational decision-making.
 Data Science dives deeper by predicting trends, automating processes, and uncovering
insights that are not immediately obvious.
 Together, they empower organizations to make well-informed decisions by combining
historical analysis with future-oriented predictions.

Prerequisites for a Data Scientist – Tools and skills required?


Skills Required:

1.Programming Skills: Proficiency in programming languages such as Python and R is essential for data manipulation, analysis, and modelling. Knowledge of SQL is vital for querying and managing databases.

2.Statistics and Mathematics: A strong understanding of statistics (e.g., probability, hypothesis testing, and regression). Familiarity with linear algebra and calculus for machine learning algorithms.

3.Data Wrangling and Cleaning: Ability to clean, preprocess, and transform raw data into an analysable format. Experience dealing with missing data, inconsistencies, and outliers.

4.Data Visualization: Skills in creating visual representations of data using tools like Matplotlib,
Seaborn, or Tableau. Ability to communicate insights through charts, graphs, and dashboards.

5.Machine Learning: Knowledge of supervised and unsupervised learning algorithms such as decision trees, support vector machines, clustering, and neural networks.

6.Domain Expertise: Understanding the specific industry or domain to provide context to data
analysis.

7.Soft Skills: Strong problem-solving, critical-thinking, and communication skills. Collaboration and teamwork abilities to work effectively with cross-functional teams.

Tools Required:

1.Programming and Scripting: Python, R, SQL.

2.Data Analysis and Visualization: Tools like Excel, Tableau, Power BI, Matplotlib, and Seaborn.

3.Big Data Technologies: Hadoop, Spark, and Hive for managing large-scale data.

4.Machine Learning Libraries: Scikit-learn, TensorFlow, Keras, and PyTorch.

5.Database Management: SQL-based systems (e.g., PostgreSQL, MySQL) and NoSQL databases
(e.g., MongoDB).

6.Version Control: Git and platforms like GitHub for collaborative coding and version
management.

7.Cloud Platforms: AWS, Google Cloud, and Azure for scalable computing and data storage.


Applications of Data Science in various fields – Data Security Issues?


1. Healthcare:

i. Predicting diseases and outcomes using patient data.

ii. Personalized medicine and treatment plans through genomic data analysis.

2. Finance:

i. Risk assessment and credit scoring models.

ii. Predictive analytics for stock market trends.

3. Transportation:

i. Real-time traffic management and route optimization.

ii. Predictive maintenance for vehicles and infrastructure.

4. Education:

i. Adaptive learning systems tailored to students' progress and needs.

ii. Analysing education trends to improve teaching methodologies.

5. Manufacturing:

i. Quality control and defect detection using image processing.

ii. Supply chain optimization based on historical and real-time data.

6. Environment:

i. Predicting weather patterns and climate change impact.

ii. Monitoring air and water quality through IoT data analysis.


Data Security Issues in Data Science?


1.Data Breaches: Unauthorized access to sensitive data can lead to exposure of personal or
financial information.

2.Privacy Concerns: Misuse of customer data collected for analysis can lead to ethical and legal violations. Tracking and profiling users without their consent raises serious privacy issues.

3.Algorithm Bias: Poorly trained models may unintentionally expose sensitive data or favour
certain groups unfairly.

4.Cloud Security: Storing and processing data on cloud platforms introduces risks like hacking or
server vulnerabilities.

5.Regulatory Compliance: Ensuring compliance with data protection laws (e.g., GDPR, HIPAA)
can be challenging for global companies.

6.Data Ownership: Determining who owns the data and how it can be used is a complex issue.

7.Vulnerable Machine Learning Models: Attacks like adversarial inputs can manipulate models
to expose or misuse data.

8.Secure Data Sharing: Sharing data across platforms or teams can lead to unintended exposure if
not properly encrypted.

Data Collection Strategies?


Data Collection refers to the systematic process of gathering, measuring, and analysing
information from various sources to get a complete and accurate picture of an area of interest.
1. Define Objectives and Goals: Clearly outline what information is required and why it is being
collected. Align data collection efforts with business or research objectives.
2. Identify Data Sources: Primary Data: Data collected directly from surveys, interviews,
experiments, or observations. Secondary Data: Existing data obtained from published sources,
databases, or research studies.
3. Ensure Data Sampling: Use sampling techniques (e.g., random, stratified) to collect
representative subsets of data, especially for large populations. Avoid sampling bias to maintain
data integrity.
4. Optimize Survey and Feedback Methods: Design clear and concise survey questionnaires to minimize confusion. Use online tools like Google Forms, Qualtrics, or Typeform for digital surveys. Provide incentives to encourage responses and improve participation rates.


5. Leverage Social Media and Online Platforms: Monitor social media activity using tools like
Hootsuite or Sprinklr to collect sentiment analysis data. Use APIs (e.g., Twitter, Instagram) to
access data from online platforms.

6. Integrate Big Data Technologies: Use distributed systems like Hadoop or Spark for real-time data
ingestion from large datasets. Process structured and unstructured data from multiple sources
simultaneously.

7. Ensure Data Privacy and Compliance: Anonymize sensitive information to safeguard personal
data.

Data Pre-Processing Overview?


It involves preparing and transforming raw data into a clean, consistent, and usable format for
analysis and modelling.
1. Data Cleaning

Handling Missing Values: Filling missing data using techniques like mean, median, or mode
imputation, or removing incomplete rows/columns if necessary.

Dealing with Duplicates: Identifying and removing duplicate records to maintain data integrity.

Outlier Detection: Detecting and managing anomalies or extreme values that could skew the
analysis.

2. Data Integration

Combining data from multiple sources (databases, files, APIs) to create a unified dataset.

Ensuring consistency in format, structure, and schema.

3. Data Transformation

Normalization: Scaling data to a range (e.g., 0–1) to ensure uniformity and improve model
performance.

Standardization: Transforming data to have a mean of 0 and a standard deviation of 1.

Encoding Categorical Data: Converting categorical variables into numerical formats (e.g., one-hot
encoding or label encoding).


4. Feature Engineering

Feature Selection: Identifying and retaining the most relevant features to improve model
performance.

Feature Creation: Creating new variables that enhance predictive power (e.g., combining or transforming existing variables).

5. Data Reduction

Dimensionality Reduction: Reducing the number of features using techniques like Principal
Component Analysis (PCA).

Sampling: Reducing the dataset size for faster processing without compromising data quality.

6. Data Formatting

Converting data into a consistent format (e.g., date/time parsing, text processing).

Structuring data as required for the chosen analysis or model.

What is Data Cleaning?


It involves identifying and rectifying inaccuracies, inconsistencies, and missing values in datasets to
ensure they are accurate, complete, and formatted correctly.
Key Steps in Data Cleaning:

1.Handle Missing Data

Imputation: Fill missing values using mean, median, or mode for numerical data.

Deletion: Remove rows or columns with excessive missing values, if justified.

Use domain knowledge to replace missing values with reasonable estimates.

2.Remove Duplicates

i. Identify duplicate rows or entries in the dataset.

ii. Eliminate redundant data to avoid skewing analysis.

3.Correct Errors

i. Fix typos, incorrect formatting, and inconsistencies in the dataset (e.g., “Male” vs. “M”).

ii. Use automated tools or scripts to detect and fix errors systematically.

4.Standardize Data Formats

i. Ensure consistent formats for dates, text, and numerical values across the dataset.

ii. Convert data units when necessary (e.g., centimeters to meters).

5.Detect and Address Outliers

i. Use statistical methods like Z-scores or IQR to detect anomalies.

ii. Decide whether to remove, transform, or keep outliers based on their significance.

6.Transform Categorical Data

i. Apply encoding techniques like one-hot encoding or label encoding for categorical variables.

ii. Ensure consistent representation of categories.

7.Normalize or Scale Numerical Data

i. Use normalization (e.g., scaling values between 0 and 1) or

ii. Use Standardization (e.g., transforming to a mean of 0 and a standard deviation of 1) for better
modelling.

8.Validate Data

i. Perform sanity checks to confirm that the cleaned data aligns with expectations.

ii. Use validation techniques to spot any remaining inconsistencies.

Tools for Data Cleaning:


Excel: Basic cleaning tasks like removing duplicates and correcting errors.
Python Libraries:
Pandas: Ideal for handling missing values, formatting, and transformations.
NumPy: Useful for numerical manipulations and handling large datasets.
OpenRefine: Specialized tool for cleaning messy datasets.

ETL Tools: Tools like Talend and Alteryx for large-scale data transformation and cleaning.
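
To tie these steps together, here is a minimal pandas sketch of a cleaning pass. The small DataFrame and column names are hypothetical, chosen only for illustration; each call corresponds to one of the steps listed above.

import pandas as pd
import numpy as np

# Hypothetical toy dataset with missing values and a duplicate row
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "age": [25, np.nan, np.nan, 32],
    "city": ["Hyderabad", "Chennai", "Chennai", None],
})

df = df.drop_duplicates()                        # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())   # impute missing ages with the mean
df = df.dropna(subset=["city"])                  # drop rows still missing a city
print(df)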


What is Data Integration and Data Transformation?


Data Integration is the process of combining data from multiple sources into a single, unified view.
Key Steps in Data Integration:

1.Data Identification: Identify and locate relevant data sources, such as databases, APIs, files, or
cloud systems.

2.Data Extraction: Retrieve data from these sources using ETL (Extract, Transform, Load) tools or
manual methods.

3.Schema Alignment: Resolve schema conflicts by matching and standardizing data fields across
different datasets.

4.Data Consolidation: Merge datasets by joining, appending, or consolidating information, while ensuring there's no loss of key data.

5.Data Validation: Check for inconsistencies, duplicates, or gaps during and after integration.

6.Data Storage: Store the unified dataset in a data warehouse or data lake for easy access.

Data Transformation involves converting raw data into a format suitable for analysis or modelling.
It ensures that data is clean, consistent, and ready for use.
Key Steps in Data Transformation:

1.Data Cleaning: Remove or correct errors, such as missing values and duplicates.

2.Data Normalization: Scale numerical data within a specific range (e.g., 0–1) for consistency.

3.Data Standardization: Ensure data follows a uniform format (e.g., standard date formats or units
of measurement).

4.Encoding Categorical Data: Convert categorical variables into numerical formats using one-hot
encoding, label encoding, or other methods.

5.Aggregation and Summarization: Combine data to create meaningful summaries or metrics (e.g., total sales, average values).

6.Feature Engineering: Create new variables or modify existing ones to improve predictive
performance in modelling.


7.Data Reduction: Use methods like dimensionality reduction (e.g., PCA) to simplify large datasets without losing essential information.

Tools for Integration and Transformation

ETL Tools: Talend, Informatica, Apache NiFi, Microsoft SSIS.

Data Integration Platforms: MuleSoft, Dell Boomi, Apache Kafka.

Data Transformation Libraries: Pandas, NumPy (Python), dplyr (R).
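
As a brief illustration of the two ideas together, the following pandas sketch merges two hypothetical sources and then applies min-max normalization and one-hot encoding; the tables and column names are made up for the example.

import pandas as pd

# Hypothetical data from two different sources
customers = pd.DataFrame({"cust_id": [1, 2, 3], "region": ["North", "South", "North"]})
orders = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [100.0, 250.0, 80.0, 40.0]})

# Integration: combine the two sources on a common key
merged = orders.merge(customers, on="cust_id", how="left")

# Transformation: min-max normalize the amount column to the 0-1 range
merged["amount_norm"] = (merged["amount"] - merged["amount"].min()) / (
    merged["amount"].max() - merged["amount"].min()
)

# Transformation: one-hot encode the categorical region column
merged = pd.get_dummies(merged, columns=["region"])
print(merged)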

What is Data Reduction?


Data Reduction is the process of reducing the volume of data while retaining its essential
characteristics, ensuring that it remains meaningful and useful for analysis.

Techniques for Data Reduction


1.Dimensionality Reduction: Reducing the number of features or variables in the dataset while
preserving important information.

Methods:

Principal Component Analysis (PCA): Transforms data into fewer dimensions.

Singular Value Decomposition (SVD): A matrix factorization technique for dimensionality reduction.

t-SNE (t-Distributed Stochastic Neighbour Embedding): Used for visualizing high-dimensional data.

2.Data Compression: Representing data in a more compact form using encoding or aggregation
methods.

Techniques:

Lossless compression (e.g., Huffman encoding, run-length encoding).

Lossy compression (e.g., JPEG for images, MP3 for audio).

3.Sampling:

Selecting a representative subset of data points from a larger dataset.



Techniques:

Random sampling: Choosing data points randomly.

Stratified sampling: Ensuring the sample reflects the data distribution across groups.

4.Aggregation: Combining data points to form summaries, such as averages, totals, or medians.

Common in time-series data (e.g., aggregating daily sales into monthly totals).

5.Cluster Analysis: Grouping similar data points into clusters and representing each cluster with a
single value or prototype.

Clustering algorithms like K-Means can be used for this purpose.

6.Feature Selection: Identifying and keeping the most relevant variables for analysis.

Methods:

Filter methods: Based on statistical measures like correlation.

Wrapper methods: Using machine learning models to test variable subsets.

Embedded methods: Feature selection performed during model training (e.g., Lasso regression).

7.Numerosity Reduction: Representing data in a reduced form, such as by using histograms, decision trees, or clustering techniques.

Benefits of Data Reduction

i. Reduces storage and processing requirements.

ii. Improves the efficiency and speed of machine learning algorithms.

iii. Simplifies data visualization and interpretation.
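
As a small illustration of dimensionality reduction, here is a sketch using scikit-learn's PCA on synthetic data; the array shapes and random values are assumptions made purely for the example.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 100 samples, 5 features, but only 2 independent directions
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

pca = PCA(n_components=2)              # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component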

What is Data Discretization?


Data Discretization is the process of converting continuous data into discrete intervals or categories.

Key Objectives of Data Discretization

1.Simplification: Reducing complexity in the dataset by grouping continuous values into discrete
bins.


2.Improved Interpretability: Making data easier to understand and analyse by transforming numerical data into meaningful categories.

3.Enhanced Algorithm Performance: Some algorithms perform better with categorical variables rather than continuous ones.

Methods of Data Discretization

1.Equal Width Binning:

Divides the range of data into intervals of equal width.

Example: Splitting age data into intervals like 0–20, 21–40, 41–60, etc.

Simple but may result in uneven distribution of data points within bins.

2.Cluster-Based Discretization:

Uses clustering algorithms (e.g., K-Means) to group data points into clusters, which serve as
discrete categories.

Useful for identifying natural groupings in the data.

3.Decision Tree Discretization:

Constructs a decision tree model and uses thresholds determined by the splits as bins.

Effective for data with non-linear relationships.

4.Supervised Discretization:

Uses a target variable to influence the binning process, ensuring that the intervals optimize
predictive relationships.

5.Manual Discretization:

Defining custom intervals based on domain knowledge or specific requirements.

Example: Categorizing body mass index (BMI) into "Underweight," "Normal," "Overweight," and
"Obese."


Challenges in Data Discretization:

Loss of Information: Continuous values are compressed, leading to potential loss of nuanced
information.

Determining the Optimal Number of Bins: Too few bins may oversimplify the data, while too many bins may cause overfitting or loss of interpretability.

Applications:

Decision Trees: Discretized variables simplify splitting criteria.

Rule-Based Models: Helps generate rules that are interpretable and easy to understand.

Visualization: Discretized data can be effectively represented in charts and graphs.

What is Data Munging?


Data Munging, also referred to as Data Wrangling, is the process of transforming and preparing raw data into a structured and usable format.

It involves cleaning, organizing, and enriching the data so it can be effectively analysed or
modelled in data science workflows.

Key Steps in Data Munging:

1.Data Exploration: Understand the raw dataset by inspecting its structure, quality, and content. Identify inconsistencies, errors, or gaps.

2.Data Cleaning: Handle missing values by imputing them or removing incomplete records.
Remove duplicates and correct formatting issues. Detect and manage outliers that could skew the
analysis.

3.Data Structuring: Organize data into a logical format, such as tables or arrays. Standardize
columns, headers, and types (e.g., numerical, categorical).

4.Data Transformation: Normalize or standardize numerical values. Convert categorical variables into numerical representations using encoding techniques. Aggregate or derive new features to enhance analysis.


5.Data Integration: Combine data from multiple sources or datasets to create a unified view.
Resolve schema conflicts and align data fields.

6.Data Reduction: Reduce the dataset size using sampling or dimensionality reduction techniques. Focus on the most relevant data points or features.

7.Validation and Enrichment: Verify data accuracy and consistency through validation checks.
Enrich data by adding external information, such as geographic details or metadata.

Tools for Data Munging:

Programming Libraries: Python's Pandas and NumPy, or R's dplyr and tidyr.

ETL Tools: Talend, Alteryx, and Apache NiFi for automating wrangling processes.

Visualization: Tools like Tableau or Excel for inspecting and cleaning data interactively.

What is Filtering?
Filtering is a data processing technique used to extract specific subsets of data based on defined
criteria or conditions.
This ensures that only relevant and useful information is retained for further analysis.
Types of Filtering:

1.Value-Based Filtering: Selecting data rows or elements that match a specific value or fall within
a certain range.

Example: Filtering products with a price greater than ₹1000.

2.Condition-Based Filtering: Using logical conditions (e.g., greater than, less than, equal to) to
filter data.

Example: Filtering customer records where age is above 30.

3.Column-Based Filtering: Choosing specific columns in a dataset that are relevant for analysis.

Example: Selecting only "Name" and "Sales" columns from a sales dataset.

4.Group-Based Filtering: Filtering data based on group properties or categories.

Example: Filtering transactions based on "Region" or "Department."


5.Keyword-Based Filtering: Extracting data containing specific keywords or phrases. Commonly used in text data or unstructured datasets (e.g., filtering tweets containing the word "AI").

6.Time-Based Filtering: Filtering data based on time intervals or dates.

Example: Extracting transactions recorded in the last month.

Methods of Filtering:

1.Manual Filtering: Using tools like Excel to manually sort and filter data based on user-defined criteria.

2.Automated Filtering: Employing programming languages (e.g., Python, R) to automate filtering tasks with libraries like Pandas or dplyr.

3.Query-Based Filtering: Using SQL queries to filter data directly from databases.

Example: SELECT * FROM sales WHERE amount > 1000;
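
For comparison, here is a pandas sketch of the same kind of filtering; the small sales table below is hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "name": ["A", "B", "C"],
    "region": ["North", "South", "North"],
    "amount": [500, 1500, 2500],
})

# Condition-based filtering: pandas equivalent of the SQL query above
high_value = sales[sales["amount"] > 1000]

# Column-based filtering: keep only the columns needed for analysis
subset = sales[["name", "amount"]]

print(high_value)
print(subset)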

Applications of Filtering:

Data Pre-Processing: Removing irrelevant data before analysis.

Data Visualization: Filtering data for a focused and clearer representation.

Real-Time Systems: Filtering streaming data for actionable insights (e.g., alerts based on criteria).

UNIT-II


Mean?
Mean is the sum of all the values in the data set divided by the number of values in the data set.
It is also called the Arithmetic Average.
The Mean is denoted as x̅ and is read as x bar.
Mean = (Sum of observations) / (Number of observations)
Mean (x̄) = (Σ xi) / n
Example: A cricketer's scores in five ODI matches are as follows: 12, 34, 45, 50, 24. To find his average score in a match, we calculate the Mean of the data.
Mean = Sum of all observations / Number of observations
     = (12 + 34 + 45 + 50 + 24) / 5 = 165 / 5 = 33
Types of Data:
Data can be present in Raw Form or Tabular Form.
1.Raw Data
Let x1, x2, x3, . . . , xn be n observations.
We can find the arithmetic mean using the mean formula:
Mean (x̄) = (x1 + x2 + ... + xn) / n
Example: If the heights of 5 people are 142 cm, 150 cm, 149 cm, 156 cm, and 153 cm, find the mean height.
Mean height (x̄) = (142 + 150 + 149 + 156 + 153) / 5 = 750 / 5 = 150 cm
2. Tabular Form or Frequency Distribution Form
Mean (x̄) = (x1 f1 + x2 f2 + ... + xn fn) / (f1 + f2 + ... + fn)
Example 1: Find the mean of the following distribution:

x    f
4    5
6    10
9    10
10   7
15   8
Solution:
Calculation table for arithmetic mean:

xi fi xi*fi
4 5 20
6 10 60
9 10 90
10 7 70
15 8 120

Σ fi = 40 ; Σ xi fi = 360
Mean (x̄) = (Σ xi fi) / (Σ fi) = 360 / 40 = 9
Example 2: The following table indicates the data on the number of patients visiting a hospital in a
month. The data is in the form of class intervals. Find the average number of patients visiting the
hospital in a day.

Number of Patients visiting hospital Number of days


0 - 10 2
10 – 20 6
20 - 30 9
30 - 40 7
40 - 50 4
50 - 60 2

Solution:
Class mark = (lower limit + upper limit)/2
x1, x2, x3, . . ., xn be the class marks of the respective classes.

Class mark(xi) Frequency(fi) xi * fi


5 2 10
15 6 90
25 9 225
35 7 245
45 4 180
55 2 110

Σ fi = 30
Σ fixi = 860

Mean (x̄) = (Σ xi fi) / (Σ fi) = 860 / 30 = 28.67
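
The same calculations can be reproduced in NumPy; the sketch below reuses the numbers from the examples above, with np.average handling the frequency-distribution case through weights.

import numpy as np

scores = np.array([12, 34, 45, 50, 24])   # the cricketer's ODI scores from the example
print(np.mean(scores))                    # 33.0

# Mean of a frequency distribution: weighted average of the class marks
x = np.array([5, 15, 25, 35, 45, 55])     # class marks
f = np.array([2, 6, 9, 7, 4, 2])          # frequencies (number of days)
print(np.average(x, weights=f))           # 28.666...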
Standard Deviation?
Standard Deviation is defined as the square root of the Variance:

σ = √[ Σ (xi - x̄)² / n ]

where
σ = Standard Deviation
xi = Terms Given in the Data
x̄ = Mean
n = Total number of Terms

Example: Calculate Standard Deviation for this data 2, 4, 4, 4, 5, 5, 7, 9

Solution: Mean = (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5

Differences from mean: -3, -1, -1, -1, 0, 0, 2, 4

Squared differences: 9, 1, 1, 1, 0, 0, 4, 16

Variance = (9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8 = 32 / 8 = 4

Standard deviation = 2
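
A short NumPy check of the same example; np.var and np.std use the population formula (dividing by n), which matches the calculation above.

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(np.mean(data))   # 5.0
print(np.var(data))    # 4.0 (population variance, dividing by n)
print(np.std(data))    # 2.0 (population standard deviation)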

Skewness?
Skewness is an important statistical technique that helps to determine the asymmetrical behaviour
of the frequency distribution.
A distribution or dataset is symmetric if it looks the same to the left and right of the centre point.

Types of Skewness:

1.Symmetrical Skewness: A perfectly symmetric distribution is one in which the frequency distribution is the same on both sides of the centre point of the frequency curve.
In this, Mean = Median = Mode.


There is no skewness in a perfectly symmetrical distribution.

2.Asymmetric Skewness: An asymmetrical distribution is one in which the spread of the frequencies is different on the two sides of the centre point, or the frequency curve is more stretched towards one side; the values of Mean, Median and Mode fall at different points.
Positive Skewness: The concentration of frequencies is more towards higher values of the
variable i.e. the right tail is longer than the left tail.
Negative Skewness: In this, the concentration of frequencies is more towards the lower values of
the variable i.e. the left tail is longer than the right tail.

Kurtosis?
Kurtosis defines the idea about the shape of a frequency distribution. It is also a characteristic of the frequency distribution.
Types of Kurtosis

1.Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
2.Mesokurtic: A mesokurtic curve has a peak similar to that of the normal curve. In this curve, there is an even distribution of items around the central value.
3.Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.
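
As an illustration, pandas can compute sample skewness and excess kurtosis directly; the data below is a hypothetical right-skewed series.

import pandas as pd

data = pd.Series([2, 3, 3, 4, 4, 4, 5, 5, 9, 15])   # hypothetical right-skewed data

print(data.skew())   # positive value, indicating positive (right) skew
print(data.kurt())   # excess kurtosis: > 0 leptokurtic, < 0 platykurtic, near 0 mesokurtic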
Box plot?

Box plot is also known as a whisker plot, box and whisker plot, or box and whisker diagram.
Box plot is a graphical representation of the distribution of a dataset.
Box plot displays key summary statistics such as the median, quartiles, and potential outliers in a
concise and visual manner.
Elements of Box Plot: A box plot gives a five-number summary of a set of data which is
1.Minimum – It is the minimum value in the dataset excluding the outliers.
2.First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
3.Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half above.
4.Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
5.Maximum – It is the maximum value in the dataset excluding the outliers.
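
A minimal Matplotlib sketch that draws this five-number summary; the scores, including the deliberately large value 75, are hypothetical.

import matplotlib.pyplot as plt

scores = [12, 18, 22, 25, 27, 30, 31, 35, 38, 42, 75]   # hypothetical data with one high outlier

plt.boxplot(scores)   # draws the median, quartiles, whiskers, and flags 75 as an outlier
plt.title("Box plot of scores")
plt.ylabel("Score")
plt.show()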

What are Pivot Tables?


A Pivot Table is a powerful data analysis tool used to summarize, organize, and analyze large
datasets by rearranging data based on specific columns or rows.

It allows users to extract meaningful insights by grouping, aggregating, and filtering data without
altering the original dataset.

Pivot tables are commonly used in spreadsheet software like Microsoft Excel, Google Sheets, and
data analysis tools like Python (pandas) or SQL.

Key Components of a Pivot Table

i. Rows: Categories or values displayed along the rows of the pivot table.

ii. Columns: Categories or values displayed across the columns.

iii. Values: The data being summarized (e.g., sum, count, average) based on a numeric field.

iv. Filters: Criteria to include or exclude specific data points from the analysis.


Steps to Create a Pivot Table (in Excel/Google Sheets)

i. Select Data: Highlight the dataset.

ii. Insert Pivot Table: In Excel, go to Insert > Pivot Table; in Google Sheets, go to Data > Pivot Table.

iii. Configure:

Drag fields to Rows, Columns, Values, or Filters.

Choose aggregation (e.g., sum, count, average).

iv. Customize: Apply filters, sort, or format for clarity.

v. Analyze: Interpret the summarized data.

Example: Dataset:

Date         Region   Product   Sales
2025-01-01   North    A         100
2025-01-01   South    B         150
2025-01-02   North    A         200
2025-01-02   South    B         300

Pivot Table (Summing Sales by Region and Product):

Region   Product A   Product B
North    300         0
South    0           450

Common Uses:

Summarizing Data: Aggregate sales, counts, or averages by categories.

Comparing Groups: Analyze performance across regions, time periods, or products.

Filtering: Focus on specific subsets of data (e.g., sales in a specific year).



Benefits

Speed: Quickly summarize large datasets.


No Data Modification: Original data remains unchanged.
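
The same summary can be produced programmatically; the following pandas sketch reuses the example dataset above with pd.pivot_table.

import pandas as pd

data = pd.DataFrame({
    "Date": ["2025-01-01", "2025-01-01", "2025-01-02", "2025-01-02"],
    "Region": ["North", "South", "North", "South"],
    "Product": ["A", "B", "A", "B"],
    "Sales": [100, 150, 200, 300],
})

# Sum of Sales by Region (rows) and Product (columns), as in the table above
pivot = pd.pivot_table(data, values="Sales", index="Region",
                       columns="Product", aggfunc="sum", fill_value=0)
print(pivot)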

Heat Map?
Heat Map data visualization is a powerful tool used to represent numerical data graphically,
where values are depicted using colours.
This method is particularly effective for identifying patterns and trends within large datasets.
Types of Heat Maps
1. Website Heat Maps: Website Heat Maps are used to visualize user behaviour on web pages.
 Click Maps: Show where users click on a webpage, highlighting popular elements.

 Scroll Maps: Indicate how far users scroll down a page, revealing engagement levels.
 Mouse Tracking Heat Maps: Visualize mouse movements on a page, showing where
users are hovering and clicking.
 Eye-Tracking Heat Maps: Track users' eye movements on a page, revealing attention
patterns.
2.Grid Heat Maps: Display data in a two-dimensional grid, representing relationships between
variables.
 Clustered Heat Maps: Group similar rows and columns based on a chosen similarity
metric, revealing hierarchical structures in the data.
 Correlogram Heat Maps: Show the correlation between different variables, often used in
statistical analysis.
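
A Seaborn sketch of a correlogram-style grid heat map; the random DataFrame stands in for real measurements and is purely illustrative.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random data standing in for real measurements
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["A", "B", "C", "D"])

# Grid heat map of the correlation matrix, with values annotated in each cell
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.show()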
Correlation Statistics?
Correlation is a statistical measure that describes the direction and strength of a relationship
between two variables.
Types of Correlation:
1.Positive Correlation : When one variable increases, the other also increases.
2.Negative Correlation : When one variable increases, the other decreases.
3.Zero Correlation : No relationship between variables.


What it measures:
 Strength of the relationship: Correlation measures how tightly two variables are connected,
ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0
indicating no linear correlation.
 Direction of the relationship: A positive correlation means that as one variable increases, the
other tends to increase as well, while a negative correlation means that as one variable
increases, the other tends to decrease.
 Linearity: Correlation is most appropriate for variables that have a linear relationship,
meaning the relationship is a straight line when graphed.
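
A small NumPy sketch of computing a correlation coefficient; the study-hours data is hypothetical.

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score = np.array([40, 50, 55, 65, 70, 80])

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(r)   # close to +1, i.e. a strong positive linear correlation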

ANOVA?
ANOVA (Analysis of Variance) is a statistical test in data science used to determine whether significant differences exist between the means of two or more groups.
Types of ANOVA
One-way ANOVA: Used when there is one independent variable (factor) with multiple levels
(groups).
Two-way ANOVA: Used when there are two or more independent variables (factors), allowing
for the analysis of interaction effects between factors.
When to Use ANOVA:
More than Two Groups: When you want to compare the means of more than two groups, ANOVA is a suitable alternative to the t-test, which is used for comparing the means of two groups.


Independent Variables: ANOVA can be used to examine the impact of one or more independent
variables (factors) on a dependent variable (outcome).

Hypothesis Testing: ANOVA is a hypothesis testing procedure, allowing researchers to test hypotheses about the relationships between variables.
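
As an illustration, a one-way ANOVA can be run with SciPy's f_oneway; SciPy is assumed here as the library choice, and the three groups of scores are hypothetical.

from scipy import stats

# Hypothetical exam scores under three different teaching methods
group_a = [85, 86, 88, 75, 78, 94]
group_b = [81, 83, 87, 84, 79, 82]
group_c = [70, 72, 68, 75, 74, 71]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value (< 0.05) suggests at least one group mean differs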

Databases?
Database is a collection of structured data or information that is stored in a computer system and
can be accessed easily.
A database is usually managed by a Database Management System (DBMS).

1.Document Databases

The Document Database is a Non Relational Database.

 Instead of storing the data in rows and columns (tables), it uses the documents to store the
data in the database.
 A document database stores data in JSON, BSON, or XML documents.
 A Document Data Model is a lot different than other data models because it stores data in
JSON, BSON, or XML documents.
 Popular Document Databases & Use Cases

Database             Use Case
MongoDB              Content management, product catalogs, user profiles
CouchDB              Offline applications, mobile synchronization
Firebase Firestore   Real-time apps, chat applications
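
To make the idea concrete, a single document might look like the following (shown as a Python dictionary; all field names and values are hypothetical).

# One "document" as it might be stored in a document database such as MongoDB
user_profile = {
    "_id": "u1001",
    "name": "Asha",
    "email": "asha@example.com",
    "orders": [
        {"order_id": 1, "item": "Laptop", "amount": 55000},
        {"order_id": 2, "item": "Mouse", "amount": 700},
    ],
    "preferences": {"newsletter": True, "language": "en"},
}
print(user_profile["orders"][0]["item"])   # nested data lives inside the same document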

2.Column Oriented Databases


 A Column Oriented Database is a Non-Relational Database.
 It stores the data in columns instead of rows.
 This approach is particularly beneficial for analytical queries where data is processed
column-wise.
 Popular Column Oriented Databases & Use Cases


Database Use Case

Apache Cassandra Real-time analytics, IoT applications

Google Bigtable Large-scale machine learning, time-series data

HBase Hadoop ecosystem, distributed storage

3. Graph-Based Databases
 Graph-Based Databases focus on the relationship between the elements.
 It stores the data in the form of nodes in the database.
 The connections between the nodes are called Links or Relationships.
 Data is represented as nodes (objects) and edges (connections).
 Popular Graph Databases & Use Cases
Database Use Case

Neo4j Fraud detection, social networks

Amazon Neptune Knowledge graphs, AI recommendations

ArangoDB Multi-model database, cybersecurity

UNIT III
Python Libraries?


A Library is a collection of pre-written code modules that contains sets of code, classes, values, and
templates. Some of the Python libraries in data science are:

i.NumPy: NumPy is a fundamental library for numerical computing, with support for multidimensional arrays, matrices, statistics, linear algebra, etc.

Install commands: pip install numpy


ii.Pandas: Pandas library is used for representing missing data, allowing insertion or deletion of
data, and converting data into data frames.

Install commands: pip install pandas

iii.Matplotlib: Matplotlib works with Python scripts, Jupyter Notebook, web applications, and other
graphic user interfaces (GUI) to generate plots, which makes it a versatile visualisation tool for data
scientists.

Install commands: pip install matplotlib

iv. Seaborn: Seaborn is a library built on top of the Matplotlib library and helps make statistical
graphics more straightforward.
Install commands: conda install seaborn or pip install seaborn
v.TensorFlow: Open-source library developed by Google, widely used for deep learning and other
machine learning tasks.
Install commands: pip install tensorflow
vi.Plotly: Interactive graphing library for creating web-based visualizations.
Install commands: pip install plotly

Python Integrated Development Environment (IDE) for Data Science?


An integrated development environment (IDE) is a software application that helps programmers
develop software code efficiently.
Some of the Python IDE's that are used for Data Science are:

i. Jupyter Notebook: Jupyter Notebook is an open-source IDE that is used to create Jupyter
documents that can be created and shared with live codes. It is a web-based interactive
computational environment. It can support various languages in data science such as Python,
Julia, Scala, R, etc.

ii. Spyder: Spyder is an open-source Python IDE that was originally created and developed by
Pierre Raybaut in 2009. It can be integrated with many different Python packages such as
NumPy, SymPy, SciPy, pandas, IPython, etc.


iii. Sublime text: Sublime text is a proprietary code editor and it supports a Python API. Some of
the features of Sublime Text are project-specific preferences, quick navigation, supportive
plugins for cross-platform, etc.

iv. Visual Studio Code: Visual Studio Code is a code editor that was developed by
Microsoft. Some of the features of Visual Studio Code are embedded Git control, intelligent code
completion, support for debugging, syntax highlighting, code refactoring, etc.
v. PyCharm: PyCharm is a Python IDE developed by JetBrains and created specifically for Python. PyCharm is particularly useful in machine learning because it supports libraries such as Pandas, Matplotlib, Scikit-Learn, NumPy, etc.

Arrays and Vectorized Computation in Numpy?


An array is a multi-dimensional data structure (called ndarray) used for storing and manipulating homogeneous data.

Defining a NumPy Array: Arrays are typically created using the np.array() function. Vectorization in NumPy is a method of performing operations on entire arrays without explicit loops.
Example:
import numpy as np
a1 = np.array([2,4,6,8,10 ])
number= 2
result = a1 + number
print(result)
Output: [ 4 6 8 10 12]
Practical Examples of Vectorization
i. Adding two arrays together with Vectorization
Example:
import numpy as np
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
result = a1 + a2
print(result)
Output: [5 7 9]
ii. Element Wise Multiplication with array
Example:
import numpy as np


a1 = np.array([1, 2, 3, 4])
result = a1 * 2
print(result)
Output: [2 4 6 8]
iii. Logical Operations on Arrays: Logical operations such as comparisons can be applied
directly to arrays.
Example:
import numpy as np
a1 = np.array([10, 20, 30])
result = a1 > 15
print(result)
Output: [False True True]
iv.Matrix Operations Using Vectorization:NumPy supports vectorized matrix operations like
dot products and matrix multiplications using functions such as np.dot and @.
Example:
import numpy as np
a1= np.array([[1, 2], [3, 4]])
a2 = np.array([[5, 6], [7, 8]])
result = np.dot(a1, a2)
print(result)
Output: [[19 22]
[43 50]]

Numpy ndarray?
ndarray is short for N-dimensional array, which is an important component of NumPy.
It allows us to store and manipulate large amounts of data efficiently.
All elements in an ndarray must be of the same type, making it a homogeneous array.
Example:
import numpy as np
arr1 = np.array([1, 2, 3, 4, 5])

arr2 = np.array([[1, 2, 3], [4, 5, 6]])


arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])


print(arr1)
print(arr2)
print(arr3)
Output:
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
Attributes of ndarray:
i. ndarray.shape: Returns a tuple representing the shape (dimensions) of the array.
ii. ndarray.ndim: Returns the number of dimensions (axes) of the array.
iii. ndarray.size: Returns the total number of elements in the array.
iv. ndarray.dtype: Provides the data type of the array elements.
v. ndarray.itemsize: Returns the size in bytes of each element
Example:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape)
print("Dimensions:", arr.ndim)
print("Size:", arr.size)
print("Data type:", arr.dtype)
print("Item size:", arr.itemsize)
Output: Shape: (2, 3)
Dimensions: 2
Size: 6
Data type: int32
Item size: 4

Creating nd array?
The array object in NumPy is called ndarray.We can create a NumPy ndarray object by using
the array() function.

Example:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

Output: [1 2 3 4 5]

<class 'numpy.ndarray'>

Note: type(): This built-in Python function tells us the type of the object passed to it.

To create an ndarray, we can pass a list or tuple object into the array() function, and it will be converted into an ndarray.

Example:

import numpy as np
arr1 = np.array((1, 2, 3, 4, 5))
arr2 = np.array([1,2,3,4,5])
print(arr1)
print(arr2)
Output: [1 2 3 4 5]
[1 2 3 4 5]

Data Types for ndarrays?


In NumPy, the data type of an ndarray is defined by its dtype, which specifies the type of elements
stored in the array. The dtype determines the size, range, and behavior of the data.

i.Numeric Data Types: Integer Types:

int8: 8-bit signed integer (-128 to 127)

int16: 16-bit signed integer (-32,768 to 32,767)

int32: 32-bit signed integer (-2,147,483,648 to 2,147,483,647)

int64: 64-bit signed integer (-2^63 to 2^63 - 1)

Numeric Data Types: Floating Point Types:

float16: Half-precision floating-point

float32: Single-precision floating-point

float64: Double-precision floating-point


ii. Boolean Data Type: bool_: Boolean value (True or False), stored as a single byte.
iii. String Data Types

string_: Fixed-length string type

unicode_: Fixed-length Unicode string type

Example:

import numpy as np

arr_int = np.array([1, 2, 3], dtype=np.int16)

print(arr_int.dtype)

arr_float = np.array([1.5, 2.7, 3.2], dtype=np.float32)

print(arr_float.dtype)

arr_str = np.array(['apple', 'banana'], dtype='U10')

print(arr_str.dtype)

Output: int16

float32

<U10

Arithmetic with numpy array?


Arithmetic operations are used for numerical computation: to add, subtract, multiply, divide, and take powers of elements in an array.

i. Addition of Arrays: Addition is an arithmetic operation where the corresponding elements of two arrays are added together using the np.add() function.
ii. Subtraction of Arrays: This function subtracts each element of the second array from the corresponding element in the first array using the np.subtract() function.
iii. Multiplication of Arrays: Multiplication in NumPy can be done element-wise using the np.multiply() function.
iv. Division of Arrays: This divides each element of the first array by the corresponding element in the second array using the np.divide() function.
v. Exponentiation (Power): It allows us to raise each element in an array to a specified power using the np.power() function.


vi. Modulus Operation: It finds the remainder when one number is divided by another using
the np.mod() function.

Example:

import numpy as np

a = np.array([5, 72, 13, 100])

b = np.array([2, 5, 10, 30])

add_ans = np.add(a, b)

sub_ans = np.subtract(a,b)

mul_ans = np.multiply(a, b)

div_ans = np.divide(a, b)

pow_ans = np.power(a, b)

mod_ans = np.mod(a, b)

print(add_ans)

print(sub_ans)

print(mul_ans)

print(div_ans)

print(pow_ans)

print(mod_ans)

Output:

[  7  77  23 130]

[ 3 67  3 70]

[  10  360  130 3000]

[ 2.5 14.4 1.3 3.33333333]

(pow_ans: the first two elements are 25 and 1934917632; 13**10 and 100**30 can overflow the array's integer dtype, so the exact printed line is platform-dependent)

[ 1  2  3 10]


Indexing and Slicing?


Indexing: Indexing retrieves a single element from a sequence using its position (index). NumPy
arrays support:

 Zero-based indexing: The first element is at index 0.


 Negative indexing: -1 refers to the last element, -2 to the second-to-last, etc.
 Multi-dimensional indexing: For n-dimensional arrays, use comma-separated indices:
array[i,j,...].
 Out-of-range indices raise an IndexError.
 Single element indexing returns a scalar for 1D arrays or a value for higher dimensions.

Example:

import numpy as np

arr1 = np.array([10, 20, 30, 40])

print(arr1[0])

print(arr1[-1])

arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2[0, 1])

print(arr2[1, -1])

Output: 10

40

2

6

Slicing: Slicing extracts sub arrays using the syntax array[start:stop:step], which can be applied to
each dimension of a multi-dimensional array.

start: Starting index (inclusive). Defaults to 0.

stop: Ending index (exclusive). Defaults to the dimension's length.

step: Increment. Defaults to 1. Negative steps reverse the order.

Use commas to slice multiple dimensions: array[start:stop:step, start:stop:step, ...]

Example:

import numpy as np

arr1 = np.array([10, 20, 30, 40, 50])

print(arr1[1:4])

print(arr1[:3])

print(arr1[::2])

print(arr1[::-1])

arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr2[0:2, 1:3])

print(arr2[:, 1])

print(arr2[1:, :])

Output:

[20 30 40]

[10 20 30]

[10 30 50]

[50 40 30 20 10]

[[2 3]
 [5 6]]

[2 5 8]

[[4 5 6]
 [7 8 9]]

Boolean Indexing?
Boolean indexing in NumPy uses a Boolean array to select elements from an array that meet a certain condition.

The boolean array acts as a filter, where True values indicate the elements to be selected,
and False values indicate the elements to be excluded.

Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])


mask = arr > 5


print("Boolean mask:", mask)
result = arr[mask]
print("Result:", result)
Output:
Boolean mask: [False False False False False True True True True True]
Result: [ 6 7 8 9 10]
How it works:
Create a boolean array (mask): This array has the same shape as the array you want to
index and contains boolean values (True or False).
Use the boolean array to index: Pass the boolean array inside the square brackets [] of the
array you want to index. NumPy will return a new array containing only the elements where
the boolean array is True.
Transposing Arrays and Swapping axes?
Transposing Arrays: Transposing an array flips its axes, swapping rows into columns and columns into rows for a 2D array (matrix). For higher-dimensional arrays, it reverses the order of the axes unless specified otherwise. NumPy provides the .T attribute and the numpy.transpose() function for this.

i. Transposing a 2D Array

import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6]])

transposed = array_2d.T   # equivalently: transposed = np.transpose(array_2d)

print(transposed)

Output:

[[1 4]

[2 5]

[3 6]]

ii.Transposing a 3D Array

import numpy as np


array_3d = np.ones((2, 3, 4))

transposed_3d = array_3d.transpose()

print(transposed_3d.shape)

custom_transpose = array_3d.transpose(2, 1, 0)

print(custom_transpose.shape)

Output: (4, 3, 2)

(4, 3, 2)

Swapping Axes: Swapping axes exchanges only two specified axes while leaving the rest unchanged, using the numpy.swapaxes() function.
The syntax of numpy.swapaxes() is: numpy.swapaxes(array, axis1, axis2)
array : is the input array whose axes are to be swapped.
axis1 : The first axis to be swapped.
axis2 : The second axis to be swapped with axis1.
i.Swapping Axes in a 2D Array
Example:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

result = np.swapaxes(arr, axis1=0, axis2=1)

print("Original array:\n", arr)

print("Array after swapping axes:\n", result)

Output:

Original array:

[[1 2 3]

[4 5 6]]

Array after swapping axes:

[[1 4]


[2 5]

[3 6]]

ii.Swapping Axes in a 3D Array


Example:
import numpy as np

arr = np.random.rand(2, 3, 4)

result = np.swapaxes(arr, axis1=0, axis2=2)

print("Original shape:", arr.shape)

print("New shape:", result.shape)

Output:

Original shape: (2, 3, 4)

New shape: (4, 3, 2)

Example:

import numpy as np

array_3d = np.ones((2, 3, 4))

swapped_axes = np.swapaxes(array_3d, 0, 1)

print(swapped_axes.shape)

print(array_3d.shape)

Output:

(3, 2, 4)

(2, 3, 4)

Mathematical and Statistical Methods?


NumPy provides mathematical and statistical methods that operate element-wise, or over whole axes, on arrays, giving efficient ways to perform calculations.

Mathematical Functions:


Arithmetic Operations: NumPy ufuncs support basic arithmetic like addition, subtraction,
multiplication, division, and exponentiation.
Rounding: Functions like round(), trunc(), and floor() are available for rounding values in arrays.
Trigonometric Functions: sin(), cos(), tan(), and their inverses are included for trigonometric
calculations.
Exponential and Logarithmic Functions: exp(), log(), log10() allow for exponential and
logarithmic operations.
Statistical Functions:

Descriptive Statistics: Functions like sum(), mean(), std(), var(), min(), max(), and median() calculate descriptive statistics on arrays (the mode is not built into NumPy itself; it is available through scipy.stats.mode()).

Statistical Analysis: NumPy provides functions for more complex statistical calculations, including corrcoef() for correlation, quantile() for quantiles, and percentile() for percentiles.

Probability Theory: NumPy can be used with probability theory to calculate probabilities,
distributions, and other statistical concepts.
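
Example: a minimal sketch of a few of these methods on a small array (the values are made up for illustration):

import numpy as np
data = np.array([4, 8, 15, 16, 23, 42])
print(np.sum(data))              # 108
print(np.mean(data))             # 18.0
print(np.median(data))           # 15.5
print(np.std(data))              # standard deviation of the six values
print(np.percentile(data, 75))   # 75th percentile
print(np.exp(np.array([0, 1])))  # element-wise exponential: [1. 2.71828183]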

Sorting in Universal Functions?


Sorting refers to rearranging the elements of an array in a specific order, typically ascending or
descending.

i. Using np.sort() with ufuncs: The np.sort() function returns a sorted copy of an array.

Example:
import numpy as np
arr = np.array([5, 2, 8, 1, 9])
sorted_arr = np.sort(arr)
print(sorted_arr)
squared_arr = np.square(arr)
sorted_squared_arr = np.sort(squared_arr)
print(sorted_squared_arr)

Output:
[1 2 5 8 9]
[ 1 4 25 64 81]
Note: np.sort () for Multi Dimensional Arrays:
import numpy as np


arr_2d = np.array([[5, 2, 8], [1, 9, 4]])


sorted_arr_2d_axis0 = np.sort(arr_2d, axis=0)
sorted_arr_2d_axis1 = np.sort(arr_2d, axis=1)
print(sorted_arr_2d_axis0)
print(sorted_arr_2d_axis1)
Output: [[1 2 4]
[5 9 8]]
[[2 5 8]
[1 4 9]]
ii. Indirect sorting with np.argsort(): The np.argsort() function returns the indices that would sort
an array. This is useful for sorting one array based on the order of another.
Example:
import numpy as np
arr = np.array([5, 2, 8, 1, 9])
indices = np.argsort(arr)
print(indices)
names = np.array(['E', 'B', 'H', 'A', 'I'])
sorted_names = names[indices]
print(sorted_names)
Output: [3 1 0 2 4]
['A' 'B' 'E' 'H' 'I']
Unique and Other Set Logic?
The np.unique() function extracts the unique elements of an array, which is useful for identifying distinct categories or removing duplicates.
Set Operations in NumPy: NumPy provides functions for set operations like union, intersection,
difference, and symmetric difference, which are useful for comparing datasets, filtering data, or
merging information.
Key Functions:

np.union1d(ar1, ar2): Returns unique elements from both arrays (union).

np.intersect1d(ar1, ar2, assume_unique=False): Returns common elements between two arrays

(intersection).

np.setdiff1d(ar1, ar2, assume_unique=False): Returns elements in ar1 that are not in ar2
(difference).


np.setxor1d(ar1, ar2, assume_unique=False): Returns elements that are in either ar1 or ar2, but
not both (symmetric difference).

np.in1d(ar1, ar2, assume_unique=False): Tests whether each element of ar1 is in ar2 (returns a boolean array); newer NumPy versions recommend np.isin() for the same purpose.
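
Example: a short illustration of these set functions (the array contents are made up):

import numpy as np
a = np.array([1, 2, 3, 4, 4])
b = np.array([3, 4, 5, 6])
print(np.unique(a))          # [1 2 3 4]
print(np.union1d(a, b))      # [1 2 3 4 5 6]
print(np.intersect1d(a, b))  # [3 4]
print(np.setdiff1d(a, b))    # [1 2]
print(np.setxor1d(a, b))     # [1 2 5 6]
print(np.in1d(a, b))         # [False False  True  True  True]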

UNIT IV
Series: A Pandas Series is a fundamental data structure: a one-dimensional labelled array capable of holding data of any type, e.g., integers, strings, floating-point numbers, Python objects, etc.


Creating a Pandas Series


Example i: From a List
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output: 0 10
1 20
2 30
3 40
dtype: int64
Example ii: NumPy arrays
import pandas as pd
import numpy as np
data = np.array([1, 2, 3, 4, 5])
series = pd.Series(data)
print(series)
Output: 0 1
1 2
2 3
3 4
4 5
dtype: int32
Example iii: Dictionaries
import pandas as pd
data={'a':1,'b':2,'c':3}

series = pd.Series(data)
print(series)
Output: a 1
b 2

c 3
dtype: int64
Example iv : Scalar values
import pandas as pd
series=pd.Series(10,index=['a','b','c'])
print(series)
Output: a 10
b 10
c 10
dtype: int64

Example v:Custom index

import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series)
Output: a 10
b 20
c 30
d 40

dtype: int64

Data Frame: A Pandas Data Frame is a two-dimensional, size-mutable, tabular data structure with labelled axes (rows and columns).
A Pandas Data Frame consists of three principal components: the data, the rows, and the columns.
Creating a Pandas Data Frame
Example i: From a dictionary of lists.
import pandas as pd

data = {'Name': ['A', 'B', 'C'],

'Age': [25, 30, 35],


'City': ['New York', 'Los Angeles', 'Chicago']}

df1 = pd.DataFrame(data)
print(df1)
Output: Name Age City
0 A 25 New York
1 B 30 Los Angeles
2 C 35 Chicago

Example ii: From a dict of ndarrays/lists.

import pandas as pd
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}
df=pd.DataFrame(data)
print(df)
Output: Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18

Accessing and Manipulating Data:

i. Column Selection: Access one or more columns by their names using square brackets
(e.g., df['column_name']).
Example: import pandas as pd
data = {'Name': ['A', 'B', 'C'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df= pd.DataFrame(data)
print(df['Name'])
Output: 0 A
1 B
2 C

Name: Name, dtype: object


ii. Row Selection: Use .loc[] or .iloc[] to select rows by label or integer position, respectively.
Example: import pandas as pd


data = {'Name': ['A', 'B', 'C'],


'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df= pd.DataFrame(data)
print(df.iloc[1])
Output: Name B

Age 30

City Los Angeles

Name: 1, dtype: object

Dropping entries: Dropping entries refers to removing rows or columns by index or column
name from a Data Frame in Pandas. This is typically done using the drop() method.

How to Drop Entries:

i. Identify the entries to be dropped: Determine which rows or columns you want to remove.

ii. Use the drop() method:

o For rows, set axis=0 (or axis='index').

o For columns, set axis=1 (or axis='columns').


iii. Specify the labels parameter: Provide a list of labels (index labels for rows or column names
for columns) to be removed.

iv. Set inplace=True if you want to modify the original Data Frame: by default, drop() returns a new Data Frame with the entries dropped.

v. Example:
import pandas as pd
data = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [9, 10, 11, 12]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# Drop rows with index labels 'b' and 'd'


df = df.drop(['b', 'd'])
# Drop columns 'col2' and 'col3'
df = df.drop(['col2', 'col3'], axis=1)

print(df)
Output: col1
a 1
c 3

Indexing in Pandas :Indexing refers to selecting specific rows and columns from a Data
Frame.
It allows you to subset data in various ways, such as selecting all rows with specific columns,
some rows with all columns, or a subset of both rows and columns.
This technique is also known as Subset Selection.
Methods for indexing:
i.Square bracket ([]) indexing: It is the most basic way to access columns.

Example: import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}

df = pd.DataFrame(data)
print(df['col_1']) # Returns the column named 'col_1'

Output: 0 3
1 2
2 1
3 0
Name: col_1, dtype: int64
ii.loc indexing: It is label-based, allowing access to rows and columns using their labels.
Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df.loc[0]) # Returns the row with index label 0
print(df.loc[:, 'col_2']) # Returns the column with label 'col_2'


print(df.loc[0, 'col_1']) # Returns the element at row 0 and column 'col_1'


Output: col_1 3
col_2 a
Name: 0, dtype: object
0 a
1 b
2 c
3 d
Name: col_2, dtype: object
3

iii. .iloc indexing: It is integer-based, enabling access using numerical positions.

Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df['col_1']) # Returns the column named 'col_1'
print(df.iloc[0]) # Returns the first row
print(df.iloc[:, 1]) # Returns the second column
print(df.iloc[0, 0]) # Returns the element at the first row and first column
Output: 0 3
1 2
2 1
3 0
Name: col_1, dtype: int64
col_1 3
col_2 a
Name: 0, dtype: object
0 a
1 b
2 c
3 d
Name: col_2, dtype: object
3

iv.Boolean indexing: It uses boolean arrays to filter data based on conditions.

Example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df[df['col_1'] > 1]) # Returns rows where 'col_1' is greater than 1

Output: col_1 col_2



0 3 a

1 2 b

Function Application and Mapping: Function application involves using a function on a value or set of values to produce a result.
Mapping in the context of programming and data manipulation extends this concept by applying a
function to each element of a collection (like a list or array), creating a new collection with the
transformed elements.
Common implementations of mapping:
map() in Python: Applies a function to every item in an iterable and returns an iterator that yields
the results.
Example:
def square(x):
return x * x
numbers = [1, 2, 3, 4, 5]
squared_numbers = map(square, numbers)
print(list(squared_numbers))
Output: [1, 4, 9, 16, 25]
map() in JavaScript: Creates a new array by applying a provided function to each element in the
calling array.
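
Pandas offers the same idea through Series.map() and apply(); a minimal sketch with a made-up Data Frame:

import pandas as pd
df = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Marks': [45, 67, 89]})
df['Name'] = df['Name'].map(str.upper)             # map a function over one column
df['Marks'] = df['Marks'].apply(lambda m: m + 5)   # apply a lambda to each value
print(df)
Output: Name Marks
0 A 50
1 B 72
2 C 94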

Sorting and Ranking?


Sorting: Sorting involves arranging data in a specific order based on a particular attribute or
value.

Examples: Ordering a list of names alphabetically, sorting numerical data from smallest to largest.
Python offers several built-in functions and libraries like sorted(), sort(), and Pandas for sorting and
ranking data in different ways.
Sorting in Python:
sorted(): This function takes an iterable (like a list or tuple) and returns a new sorted list without
modifying the original.
sort(): This method is used to sort lists directly, modifying the original list in place.
Example: i. Sorting a list of numbers.
numbers = [5, 2, 8, 1, 9, 4]
sorted_numbers = sorted(numbers)


print(sorted_numbers)
numbers.sort()

print( numbers)
Output: [1, 2, 4, 5, 8, 9]
[1, 2, 4, 5, 8, 9]
Example: ii. Sorting a list of strings.
strings = ["apple", "banana", "orange", "grape"]
sorted_strings = sorted(strings)
print(sorted_strings)
strings.sort()
print(strings)
Output: ['apple', 'banana', 'grape', 'orange']
['apple', 'banana', 'grape', 'orange']

Example: iii. Sorting by a custom key: This will arrange the words from shortest to longest.

a = ["apple", "banana", "kiwi", "cherry"]


a.sort(key=len)
print(a)
Output: ['kiwi', 'apple', 'banana', 'cherry']
Ranking: Ranking assigns a numerical or ordinal value to each item in a dataset.

Examples: Ranking search results based on relevance, ranking products based on sales volume.

Ranking in Python:
rank(): This Pandas method assigns ranks to the values in a Series; the method parameter controls how tied values are handled:

average (default): Assigns the average rank to tied values.


min: Assigns the minimum rank to tied values.
max: Assigns the maximum rank to tied values.
first: Assigns ranks in the order the values appear.
dense: Like 'min', but ranks are consecutive (no gaps)
Examples:


import pandas as pd
df = pd.DataFrame({ 'A': [120, 40, 40, 30, 60], 'B': [15, 35, 55, 75, 95], 'C': [10, 30, 50, 70, 90],
'D': [5, 25, 45, 65, 85], 'E': [1, 21, 41, 61, 81] })
df['ranked_df_A_min'] = df['A'].rank(method='min')
df['ranked_df_A_max'] = df['A'].rank(method='max')
df['ranked_df_A_first'] = df['A'].rank(method='first')
print(df)
Output:
A B C D E ranked_df_A_min ranked_df_A_max ranked_df_A_first
0 120 15 10 5 1 5.0 5.0 5.0
1 40 35 30 25 21 2.0 3.0 2.0
2 40 55 50 45 41 2.0 3.0 3.0
3 30 75 70 65 61 1.0 1.0 1.0
4 60 95 90 85 81 4.0 4.0 4.0

Unique Values: Unique Values refer to the distinct entries present in a dataset, column, or
array.
unique(): This method, applied to a Series, returns a NumPy array of unique values in the order
they appear. It is efficient and includes NaN values if present.
nunique(): This method counts the number of unique values in a Series or each column of a Data
Frame. It's useful for summarizing data.
Example:
import pandas as pd
data = {'col1': [1, 2, 2, 3, 4, 4, None]}
df = pd.DataFrame(data)
unique_values = df['col1'].unique()
print(unique_values)
num_unique = df['col1'].nunique()
print(num_unique)
Output: [1. 2. 3. 4. nan]
4

Value Counts: Value count provides how many times each distinct value appears in the
series.
Example:
import pandas as pd
sr = pd.Series(['Apple', 'Cat', 'Lion', 'Rabbit', 'Cat', 'Lion'])
print(sr.value_counts())
Output: Cat 2
Lion 2
Apple 1
Rabbit 1
Name: count, dtype: int64

Membership: Membership testing involves checking if a specific value exists within a dataset
or collection.
The primary method for this is .isin().
It returns a boolean Series indicating whether each element in the original Series/DataFrame is
found in the specified set of values.
Example:
import pandas as pd
# Create a sample Series
s = pd.Series(['apple', 'banana', 'cherry', 'guava'])
mask = s.isin(['apple', 'guava'])
print(mask)
Output: 0 True
1 False
2 False
3 True
dtype: bool


Reading and Writing to text files in Python?


Writing to a Text File: To write to a text file, you can use the open() function with modes like 'w'
(write, overwrites the file) or 'a' (append, adds to the end). Always close the file or use a with
statement to handle it automatically.

Example:

with open('example.txt', 'w') as file:

file.write('Hello, World!\n')

file.write('This is a new line.')

with open('example.txt', 'a') as file:

file.write('Appending this line.\n')

Note: 'w': Creates a new file or overwrites an existing one.

'a': Appends to the end of the file; creates it if it doesn’t exist.


The with statement ensures the file is properly closed after writing.
Python Read Text File: There are three ways to read a text file in Python:
Using read() : Reads the entire file as a single string.
Using readline() : Reads one line at a time.
Using readlines() : Reads all lines into a list.
Example:
with open('example.txt', 'r') as file:
content = file.read()
print(content)
with open('example.txt', 'r') as file:
for line in file:
print(line.strip())
with open('example.txt', 'r') as file:
lines = file.readlines()
print(lines)
Output:

Hello, World!

This is a new line.Appending this line.



Hello, World!

This is a new line.Appending this line.

['Hello, World!\n', 'This is a new line.Appending this line.\n']

UNIT 5
Data Cleaning and Data Preparation?
Data Cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies
in datasets to ensure data quality and reliability.


Data Preparation is the process of making raw data ready for further processing and analysis.

Data Cleaning Steps:

i. Remove Duplicates: Detect and delete duplicate rows that may skew analysis.

ii. Handle Missing Values: Options include

 Removing rows/columns with too many missing values.


 Imputing values using mean, median, mode, or more complex methods.

iii. Fix Structural Errors: Standardize formats (e.g., “USA” vs. “U.S.A”).

iv. Filter Outliers: Identify and handle extreme values using statistical methods (e.g., Z-score,
IQR).

v. Validate Data Types: Ensure correct data types (e.g., date fields are stored as datetime values, numeric columns are not stored as strings). A short cleaning sketch in pandas follows this list.
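
A minimal cleaning sketch with pandas; the small Data Frame and its column names are invented for illustration:

import pandas as pd
df = pd.DataFrame({'country': ['USA', 'U.S.A', 'India', 'India'],
                   'sales': ['100', None, '150', '150']})
df = df.drop_duplicates()                                  # i. remove duplicate rows
df['country'] = df['country'].replace({'U.S.A': 'USA'})    # iii. fix structural errors
df['sales'] = pd.to_numeric(df['sales'])                   # v. validate the data type
df['sales'] = df['sales'].fillna(df['sales'].mean())       # ii. impute missing values
print(df)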

Data Preparation Steps:

i. Normalize or Standardize Data

 Normalization: Scales data between 0 and 1.


 Standardization: Transforms to a distribution with mean 0 and std dev 1.

ii. Encoding Categorical Variables: Convert categories to numbers using:

 One-hot encoding.
 Label encoding.

iii. Feature Engineering: Create new features from existing ones to better represent the underlying
problem.

iv. Split Data: Train/test split for ML workflows (e.g., 80/20 or 70/30 split).

v. Scaling and Transformation: Apply log, square root, or Box-Cox transformations if data is skewed. A brief preparation sketch follows this list.
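
A brief preparation sketch in pandas; the columns and the 80/20 split are illustrative assumptions:

import pandas as pd
df = pd.DataFrame({'age': [18, 25, 40, 60, 33],
                   'city': ['Delhi', 'Pune', 'Delhi', 'Goa', 'Pune']})
df['age_norm'] = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())   # i. normalization
df = pd.get_dummies(df, columns=['city'])        # ii. one-hot encoding
train = df.sample(frac=0.8, random_state=1)      # iv. simple 80/20 split
test = df.drop(train.index)
print(train.shape, test.shape)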

Data Transformation?
Data Transformation refers to converting data into a suitable format or structure to improve
analysis, modelling, and interpretation.

Common Types of Data Transformation



Normalization – Standardizing data to ensure uniformity in scale, format, or unit (e.g., converting
currency values to a standard denomination).

Aggregation – Summarizing data by grouping or combining multiple records into meaningful


values (e.g., computing total sales per region).

Filtering – Removing unwanted or irrelevant data to enhance quality and reduce noise.

Joining – Merging multiple datasets based on a common key to create enriched datasets.

Sorting – Organizing data in a specific order (ascending or descending) for better readability and
processing.

Encoding – Converting data into different formats such as categorical data to numerical
representations.
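
For example, aggregation, joining, and sorting can each be expressed in a line of pandas (the sales data below is invented):

import pandas as pd
sales = pd.DataFrame({'region': ['East', 'East', 'West'], 'amount': [100, 150, 200]})
regions = pd.DataFrame({'region': ['East', 'West'], 'manager': ['Asha', 'Ravi']})
total = sales.groupby('region', as_index=False)['amount'].sum()   # aggregation
report = total.merge(regions, on='region')                        # joining
report = report.sort_values('amount', ascending=False)            # sorting
print(report)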

Removing Duplicates?
Removing Duplicates is the process of identifying and eliminating redundant records from a
dataset.
How to Remove Duplicates?
Several methods can be used to remove duplicates, depending on the data format and the tool used:
Python Pandas: The drop_duplicates() function in Pandas can be used to remove duplicate rows from a Data Frame.
SQL: SQL databases offer various methods for removing duplicates, including
using DELETE statements with JOIN or GROUP BY clauses.
Excel: Excel's "Remove Duplicates" feature allows users to identify and eliminate duplicates
based on specified columns.
Manual Review and Editing: For small datasets or specific scenarios, manual inspection and removal of duplicates may be necessary.

Why Remove Duplicates?


Data Accuracy and Integrity: Duplicates can lead to inaccurate representations of the data,
potentially skewing results and hindering the understanding of patterns.

Efficiency and Resource Optimization: Removing duplicates reduces dataset size, leading to
faster processing times and lower computational costs.
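
For instance, with a made-up Data Frame in pandas:

import pandas as pd
df = pd.DataFrame({'id': [1, 2, 2, 3], 'name': ['A', 'B', 'B', 'C']})
no_dups = df.drop_duplicates()                            # drop fully identical rows
by_id = df.drop_duplicates(subset=['id'], keep='first')   # keep the first row per id
print(no_dups)
print(by_id)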

Transforming Data Using a Function or Mapping?


Transforming data using a function or mapping is a powerful technique in data science to clean,
standardize, or create new features.

Function: This approach involves defining a function that takes an input, processes it, and returns a
transformed output.

Purpose: To apply a predefined transformation to each data element.

Example: Using a function to convert all names in a list to uppercase.

Implementation: Can be implemented in various programming languages (e.g., Python, Java) and
can be very flexible for complex transformations.
Mapping: Mapping refers to the process of applying a function to each element in a collection,
producing a new collection of the same size.

Purpose: To align data fields between different systems or data sources.

Example: Mapping data fields in an Excel spreadsheet to fields in a database.

Implementation: Often involves tools and interfaces for visual mapping and configuration.

When to use which:

Functions: When you need to apply a specific transformation rule or calculation to each data
element (e.g., data cleaning, normalization, feature engineering).

Mappings: When you need to align data between different systems or data sources, particularly
when dealing with data integration and ETL processes.
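
A small sketch of both ideas in plain Python (the names and department codes are invented):

# Function: apply a transformation rule to every element
names = ['asha', 'ravi', 'meena']
upper_names = list(map(str.upper, names))
# Mapping: align values from one representation to another via a lookup
dept_codes = {'CS': 'Computer Science', 'DS': 'Data Science'}
records = ['CS', 'DS', 'CS']
full_names = [dept_codes[code] for code in records]
print(upper_names)
print(full_names)
Output: ['ASHA', 'RAVI', 'MEENA']
['Computer Science', 'Data Science', 'Computer Science']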

Replacing values?
Replacing values refers to modifying specific data points within a dataset, whether because of errors, missing values, inconsistencies, or the need for standardization.

Why Replace Values?

Handling Missing Data: Filling in gaps using methods like mean, median, mode, or predictive
techniques.

Correcting Errors: Fixing incorrect entries due to typos or data corruption.

Standardization: Converting data to a uniform format (e.g., changing "Yes"/"No" to 1/0).

Encoding Categories: Replacing text labels with numerical values for machine learning

models.

Outlier Treatment: Modifying extreme values to prevent skewed analysis.
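
In pandas this is typically done with replace() and fillna(); a small sketch with invented data:

import pandas as pd
s = pd.Series(['Yes', 'No', 'Yes', None])
s = s.replace({'Yes': 1, 'No': 0})   # standardize / encode the categories
s = s.fillna(s.mode()[0])            # fill the missing entry with the most common value
print(s.tolist())
Output: [1, 0, 1, 1]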

Detecting and Filtering Outliers?


Detecting and filtering outliers involves identifying data points that deviate significantly from the
norm and then removing or adjusting them.

An outlier is a data item/object that deviates significantly from the rest of the objects.
How Outliers are Caused?
 Measurement errors : Errors in data collection or measurement processes can lead to
outliers.
 Sampling errors : In some cases, outliers can arise due to issues with the sampling process.
 Data entry errors : Human errors during data entry can introduce outliers.

i. Detection Methods
a. Statistical Methods

Z-Score (Standard Score): Measures how many standard deviations a data point is from the mean.

Interquartile Range (IQR): Uses the spread of the middle 50% of data.

b. Visualization Based Methods

Box plots: Visually identify outliers as points beyond the whiskers (Q1 - 1.5·IQR or Q3 +
1.5·IQR).

Scatter Plots: Spot anomalies in 2D data distributions.

Histograms: Highlight data points in low-density regions.

ii. Filtering Outliers:

Removal: Delete outliers if they are errors or irrelevant (e.g., data entry mistakes).

Capping/ Winsorizing: Replace outliers with a threshold value.

Transformation: Apply log, square root, or other transformations to reduce the effect of extreme
values.

Imputation: Replace outliers with mean, median, or interpolated values.


Treat as Separate Class: Model outliers separately if they represent a meaningful subgroup (e.g.,
fraud detection).
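
A compact sketch of IQR-based detection and filtering with pandas (the numbers are made up; 1.5 x IQR is the conventional threshold):

import pandas as pd
s = pd.Series([10, 12, 11, 13, 12, 95])    # 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]    # detection
filtered = s[(s >= lower) & (s <= upper)]  # removal
capped = s.clip(lower, upper)              # capping / winsorizing
print(outliers.tolist())                   # [95]
print(filtered.tolist())                   # [10, 12, 11, 13, 12]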

Line Plots?
A Line Plot is a graphical representation of data in which individual data points are plotted along
a line to display the relationship between two variables.

How to Create a basic Line Plot?


Collect Data: Gather the data you want to visualize, ensuring you have both an independent and
a dependent variable.
Choose a Tool: Select a data visualization tool or software, such as Excel, R, Python or
specialized data visualization tools like Tableau.
Input Data: Enter your data into the chosen tool, specifying the independent and dependent
variables.
Create the Plot: Generate the line plot, customize it as needed (e.g., labels, titles, colours), and
interpret the resulting graph.
Example:
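
A minimal line-plot sketch using Matplotlib (the data points are invented):

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]            # independent variable
y = [2, 4, 3, 6, 5]            # dependent variable
plt.plot(x, y, marker='o')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.show()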

Bar Plots?
Bar plots are used to visualize categorical data, comparing quantities across different categories
with rectangular bars.

Create a basic bar plot using Matplotlib (the output is a bar chart comparing the four categories):


import matplotlib.pyplot as plt
import numpy as np
categories = ["A","B","C","D"]


values = [25,40,30,55]
plt.bar(categories, values)
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Plot Example")
plt.show()

Histogram and Density Plot?


Histogram: A histogram divides the data into bins and displays the frequency or count of data points falling into each bin as bars.

Create a basic histogram using Matplotlib (the output is a histogram of 1000 random values):

import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, color="skyblue", edgecolor="black")
plt.xlabel("Value")
plt.title("Histogram of Random Data")
plt.show()

Density Plot: Density plot, also known as a kernel density estimate (KDE) plot, provides a
smoothed representation of the data distribution. It estimates the probability density function of the
data, showing the likelihood of different values occurring.


Create a basic density plot using Seaborn:


import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
sns.kdeplot(data, color="skyblue", fill=True)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Density of Random Data")
plt.show()
Output: a smooth density curve of the random data.

Scatter Plot and Point Plot?
Scatter Plot: A scatter plot displays individual data points as dots on a graph, with one variable
plotted against the other.

Create a basic scatter plot using Matplotlib:

import matplotlib.pyplot as plt


x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 3, 5]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')


plt.title('Scatter Plot')
plt.show()
Output: a scatter of the five (x, y) points.

Point Plot: A point plot, often created using libraries like Seaborn, represents an estimate of central tendency for a numeric variable by the position of the dot and indicates the uncertainty around that estimate using error bars.
Create a basic point plot using Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

data={"Category":["A", "B", "A", "B", "A"], "Value":[1,2,3,4,5]}


df= pd.DataFrame(data)
sns.pointplot(x="Category", y="Value", data=df)
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Point Plot')
plt.show()


Output: a point plot of the mean Value for categories A and B with error bars.
