
Foundation of Data Science

1. Introduction to Data Science


Introduction to Data Science
• Definition: Data Science is a multidisciplinary field that uses scientific methods, algorithms,
processes, and systems to extract knowledge and insights from structured and unstructured
data.
• Key Components: It involves the integration of statistics, computer science, machine
learning, data mining, and domain knowledge.
• The 3 V’s of Data:
• Volume: Refers to the vast amount of data generated every second from various
sources (e.g., social media, sensors, transactions).
• Velocity: The speed at which data is generated, processed, and analyzed. In today’s
fast-paced world, data needs to be processed in real-time or near real-time.
• Variety: The different forms and types of data, including structured (e.g., databases),
semi-structured (e.g., XML, JSON), and unstructured data (e.g., text, images,
videos).

Why Learn Data Science?


• Demand for Data Scientists: The demand for data scientists is high across various
industries, as businesses increasingly rely on data-driven decision-making.
• Versatility: Data Science skills are applicable in numerous fields such as healthcare,
finance, marketing, and technology.
• Problem Solving: Data Science enables professionals to solve complex problems, improve
business processes, and innovate.
• Career Growth: Offers lucrative career opportunities with high earning potential and job
security.

Applications of Data Science


• Healthcare: Predictive analytics for patient outcomes, personalized medicine, and medical
image analysis.
• Finance: Fraud detection, risk management, algorithmic trading, and customer
segmentation.
• Marketing: Customer behavior analysis, targeted advertising, sentiment analysis, and
recommendation systems.
• Retail: Inventory management, demand forecasting, and personalized shopping experiences.
• Transportation: Route optimization, autonomous vehicles, and predictive maintenance.

The Data Science Lifecycle


• Data Collection: Gathering data from various sources such as databases, sensors, or the
web.
• Data Cleaning: Preprocessing the data to handle missing values, outliers, and errors to
ensure quality.
• Data Exploration: Analyzing the data to discover patterns, trends, and relationships using
statistical methods.
• Data Modeling: Building predictive models using machine learning algorithms to make
forecasts or decisions.
• Data Interpretation: Interpreting the results to gain insights and inform decision-making.
• Model Deployment: Implementing the model in a production environment where it can be
used to make real-time decisions.
• Monitoring & Maintenance: Continuously monitoring the model’s performance and
updating it as needed.

Data Scientist’s Toolbox


• Programming Languages: Python, R, and SQL are essential for data manipulation,
analysis, and modeling.
• Libraries & Frameworks:
• Pandas: Data manipulation and analysis.
• NumPy: Numerical computing.
• Scikit-learn: Machine learning algorithms.
• TensorFlow & PyTorch: Deep learning frameworks.
• Data Visualization Tools: Matplotlib, Seaborn, and Tableau for creating visual
representations of data.
• Big Data Technologies: Hadoop and Spark for processing and analyzing large datasets.
• Database Management: SQL databases (e.g., MySQL, PostgreSQL) and NoSQL databases
(e.g., MongoDB, Cassandra).
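
A minimal Python sketch of these tools working together (assuming pandas, NumPy, and scikit-learn are installed; the advertising-spend numbers are made up for illustration):

```python
# pandas for tabular data, NumPy for numerics, scikit-learn for modeling.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataset: advertising spend vs. sales (illustrative values only)
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [25, 41, 62, 79, 103],
})

X = df[["ad_spend"]].to_numpy()   # feature matrix (2-D)
y = df["sales"].to_numpy()        # target vector (1-D)

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for spend=60:", model.predict(np.array([[60]]))[0])
```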

Types of Data
• Structured Data:
• Definition: Data that is organized in a specific format, often in rows and columns,
making it easily searchable in databases.
• Examples: Excel sheets, SQL databases.
• Semi-structured Data:
• Definition: Data that doesn’t have a fixed format but includes tags or markers to
separate elements.
• Examples: XML, JSON files.
• Unstructured Data:
• Definition: Data that lacks a specific format or structure, making it more challenging
to process and analyze.
• Examples: Text documents, images, videos, emails.
• Problems with Unstructured Data:
• Storage Issues: Requires more space and advanced storage solutions.
• Processing Complexity: Difficult to process and analyze due to its lack of
structure.
• Interpretation Challenges: Requires advanced techniques like natural
language processing (NLP) or image recognition.
Data Sources
• Open Data: Publicly available data that can be freely used and shared. Examples include
government datasets, public health data, and environmental data.
• Social Media Data: Data generated from social media platforms, such as posts, likes,
shares, and comments. Useful for sentiment analysis and trend prediction.
• Multimodal Data: Data that combines multiple types of information, such as text, images,
and audio. Examples include video files with subtitles or annotated images.
• Standard Datasets: Widely used datasets in Data Science for benchmarking algorithms and
models. Examples include the Iris dataset, MNIST dataset, and ImageNet.
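
Standard datasets can usually be loaded straight from a library; a short sketch using scikit-learn's built-in Iris loader (the as_frame option assumes scikit-learn 0.23 or newer):

```python
# Loading the Iris dataset, one of the standard benchmark datasets,
# via scikit-learn's built-in dataset loaders.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)     # returns a Bunch containing a pandas DataFrame
print(iris.frame.head())            # first rows: 4 measurements + target class
print(iris.target_names)            # ['setosa' 'versicolor' 'virginica']
```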

Data Formats
• Integers and Floats:
• Integers: Whole numbers used for counting or indexing.
• Floats: Numbers with decimal points, used for representing continuous data.
• Text Data:
• Plain Text: Simple text data stored without any formatting (e.g., .txt files).
• Text Files:
• CSV Files: Comma-separated values, often used for storing tabular data.
• JSON Files: JavaScript Object Notation, used for storing and exchanging data.
• XML Files: Extensible Markup Language, used for encoding documents in a format
that is both human-readable and machine-readable.
• HTML Files: Hypertext Markup Language, used for creating web pages.
• Dense Numerical Arrays: Arrays containing numerical data, typically used in scientific
computing and data analysis (e.g., NumPy arrays).
• Compressed or Archived Data:
• Tar Files: Archive files that can contain multiple files and directories.
• GZip Files: Compressed files that reduce storage space and transfer time.
• Zip Files: Archive files that can contain multiple files in a compressed format.
• Image Files:
• Rasterized Images: Images made up of pixels (e.g., JPEG, PNG).
• Vectorized Images: Images made up of paths and curves, scalable without losing
quality (e.g., SVG files).
• Compressed Images: Images that have been compressed to reduce file size (e.g.,
JPEG).
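
A hedged sketch of reading a few of these formats in Python with pandas and the standard library (all file names below are placeholders):

```python
import json
import gzip
import pandas as pd

df_csv  = pd.read_csv("data.csv")          # comma-separated tabular data
df_json = pd.read_json("records.json")     # JSON records into a DataFrame

with open("config.json") as f:             # plain JSON via the standard library
    config = json.load(f)

with gzip.open("logs.txt.gz", "rt") as f:  # gzip-compressed text file
    first_line = f.readline()

arr = df_csv.to_numpy()                    # dense numerical array (NumPy)
```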

2. Statistical Data Analysis


Role of Statistics in Data Science
• Definition: Statistics is the branch of mathematics that deals with the collection, analysis,
interpretation, presentation, and organization of data.
• Importance in Data Science:
• Data Collection: Statistics provides methods to design surveys and experiments to
collect data efficiently.
• Data Analysis: Statistical techniques are essential for analyzing and interpreting
complex data sets.
• Inference: Statistics helps in making inferences about a population based on sample
data.
• Decision Making: Statistical methods enable data-driven decision-making by
providing a quantitative basis for assessing the reliability and significance of results.

Descriptive Statistics (6 Lectures)


• Definition: Descriptive statistics involves summarizing and organizing data to understand
its main characteristics, typically through numerical summaries, graphs, and tables.
• Key Components:
• Measuring the Frequency:
• Definition: Frequency refers to how often a data point occurs in a dataset.
• Tools: Frequency distributions, histograms, and bar charts are used to
visualize frequency.
• Measuring the Central Tendency:
• Mean: The arithmetic average of a set of numbers.
• Median: The middle value in a dataset when arranged in ascending or
descending order.
• Mode: The value that appears most frequently in a dataset.
• Measuring the Dispersion:
• Range: The difference between the highest and lowest values in a dataset.
• Standard Deviation: A measure of the amount of variation or dispersion in a
set of values.
• Variance: The square of the standard deviation, representing the spread of a
dataset.
• Interquartile Range (IQR): The difference between the third quartile (Q3)
and the first quartile (Q1), i.e. IQR = Q3 − Q1, representing the middle 50% of the data.
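
A short Python sketch computing these descriptive statistics for a small made-up sample with pandas:

```python
import numpy as np
import pandas as pd

data = pd.Series([12, 15, 12, 18, 20, 22, 22, 22, 25, 30])

print("mean:", data.mean())                       # central tendency
print("median:", data.median())
print("mode:", data.mode().tolist())
print("range:", data.max() - data.min())          # dispersion
print("variance:", data.var(ddof=1))              # sample variance
print("std dev:", data.std(ddof=1))
q1, q3 = data.quantile([0.25, 0.75])
print("IQR:", q3 - q1)
print(data.value_counts())                        # frequency table
```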

Inferential Statistics (10 Lectures)


• Definition: Inferential statistics involves making predictions or inferences about a
population based on a sample of data drawn from that population.
• Key Concepts:
• Hypothesis Testing:
• Definition: A method used to determine if there is enough evidence to reject
a null hypothesis in favor of an alternative hypothesis.
• Steps:
1. Formulate Hypotheses: Define the null hypothesis (H0) and
alternative hypothesis (H1).
2. Choose Significance Level (α): Commonly used levels are 0.05 or
0.01.
3. Calculate Test Statistic: Based on the sample data.
4. Determine p-value: Compare the p-value with the significance level
to make a decision.
5. Draw a Conclusion: Reject the null hypothesis if the p-value is below α;
otherwise fail to reject it.
• Multiple Hypothesis Testing:
• Definition: Testing several hypotheses simultaneously, often using
adjustments like the Bonferroni correction to control the overall error rate.
• Parameter Estimation Methods:
• Point Estimation: Estimating an unknown parameter using a single value
(e.g., sample mean for population mean).
• Interval Estimation: Providing a range within which the parameter is
expected to lie, with a certain level of confidence (e.g., confidence intervals).
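
A hedged sketch of the hypothesis-testing steps and interval estimation using SciPy (the two sample groups are invented for illustration; a two-sample t-test stands in for whichever test the data calls for):

```python
import numpy as np
from scipy import stats

group_a = np.array([5.1, 5.5, 4.9, 5.8, 5.3, 5.6])
group_b = np.array([4.6, 4.9, 4.8, 5.0, 4.7, 4.5])

alpha = 0.05                                          # significance level
t_stat, p_value = stats.ttest_ind(group_a, group_b)  # test statistic and p-value

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")

# 95% confidence interval for the mean of group_a (interval estimation)
ci = stats.t.interval(0.95, len(group_a) - 1,
                      loc=group_a.mean(),
                      scale=stats.sem(group_a))
print("95% CI:", ci)
```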

Measuring Data Similarity and Dissimilarity


• Definition: Similarity and dissimilarity measures are used to compare data points or objects,
which is essential for clustering, classification, and other data analysis tasks.
• Key Concepts:
• Data Matrix versus Dissimilarity Matrix:
• Data Matrix: Represents data with rows as objects and columns as attributes.
• Dissimilarity Matrix: Represents pairwise dissimilarities between objects,
with values indicating how different two objects are.
• Proximity Measures for Nominal Attributes:
• Definition: Nominal attributes are categorical attributes with no intrinsic
ordering (e.g., color, gender).
• Proximity Measures: Jaccard coefficient, Simple Matching Coefficient
(SMC).
• Proximity Measures for Binary Attributes:
• Definition: Binary attributes take on two values (e.g., 0 or 1).
• Proximity Measures: Hamming distance, Jaccard coefficient for binary data.
• Dissimilarity of Numeric Data:
• Euclidean Distance: The straight-line distance between two points in a
multi-dimensional space.
• Manhattan Distance: The sum of absolute differences between the
coordinates of two points (also known as L1 distance).
• Minkowski Distance: A generalization of Euclidean and Manhattan
distances, parameterized by a value 'p' that determines the specific distance
measure (p=1 for Manhattan, p=2 for Euclidean).
• Proximity Measures for Ordinal Attributes:
• Definition: Ordinal attributes have a clear, ordered relationship between
values (e.g., rankings).
• Proximity Measures: Can use rank correlation coefficients like Spearman's
rank correlation or Kendall's tau.
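
A short sketch of these distance measures using SciPy's spatial distance functions (the example vectors are arbitrary; note that SciPy's hamming returns the fraction of mismatching positions):

```python
from scipy.spatial import distance

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

print("Euclidean:", distance.euclidean(x, y))         # L2, straight-line distance
print("Manhattan:", distance.cityblock(x, y))         # L1, sum of |differences|
print("Minkowski p=3:", distance.minkowski(x, y, p=3))
print("Minkowski p=2 equals Euclidean:", distance.minkowski(x, y, p=2))

# Comparing two binary attribute vectors
a = [1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0]
print("Hamming (fraction differing):", distance.hamming(a, b))
print("Jaccard dissimilarity:", distance.jaccard(a, b))
```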

Concept of Outlier
• Definition: An outlier is a data point that significantly differs from other observations in a
dataset.
• Types of Outliers:
• Univariate Outliers: Outliers that occur in a single variable.
• Multivariate Outliers: Outliers that occur in a combination of variables, not
apparent when looking at individual variables.
• Contextual Outliers: Outliers that are only considered abnormal in a specific
context (e.g., temperature readings that are normal in summer but outliers in winter).
• Outlier Detection Methods:
• Z-Score Method: Calculates how many standard deviations a data point is from the
mean. Data points with a Z-score beyond a certain threshold (e.g., ±3) are considered
outliers.
• IQR Method: Outliers are identified as data points that fall below Q1 − 1.5 × IQR or
above Q3 + 1.5 × IQR.
• Machine Learning Methods: Techniques like clustering, isolation forests, and one-
class SVMs can be used to detect outliers in more complex datasets.
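
A sketch of the Z-score and IQR rules in pandas (the 21 sample values are made up, with 120 inserted as the obvious anomaly):

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 11, 12, 13, 12, 11, 10, 13, 12, 11,
                    12, 13, 14, 12, 11, 10, 13, 12, 11, 12, 120])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std(ddof=0)
print("Z-score outliers:", values[z.abs() > 3].tolist())

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print("IQR outliers:", values[mask].tolist())
```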

3. Data Preprocessing
Data Objects and Attribute Types
• What is an Attribute?
• Definition: An attribute (or feature) is a property or characteristic of an object or
data point. In a dataset, attributes are the columns that describe different aspects of
the data objects (rows).
• Types of Attributes:
• Nominal Attributes:
• Definition: Categorical attributes with no inherent order or ranking among
the values.
• Examples: Colors (red, blue, green), gender (male, female).
• Binary Attributes:
• Definition: Attributes that have two possible states or values.
• Types:
• Symmetric Binary: Both outcomes are equally important (e.g., heads or tails
of a coin toss).
• Asymmetric Binary: One outcome is more significant than the other
(e.g., success/failure, where success is more critical).
• Ordinal Attributes:
• Definition: Categorical attributes with a meaningful order or ranking
between values.
• Examples: Education levels (high school, bachelor's, master's), customer
satisfaction ratings (poor, fair, good, excellent).
• Numeric Attributes:
• Definition: Attributes that are quantifiable and expressible in numbers.
• Types:
• Discrete Attributes: Attributes that take on a countable number of
distinct values.
• Examples: Number of students in a class, number of cars in a
parking lot.
• Continuous Attributes: Attributes that can take on any value within a
range.
• Examples: Temperature, height, weight.
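
Ordinal attributes can be represented so that comparisons respect the ranking rather than alphabetical order; a small sketch using pandas' ordered Categorical type:

```python
import pandas as pd

# Customer satisfaction ratings as an ordered (ordinal) categorical attribute
ratings = pd.Categorical(
    ["good", "poor", "excellent", "fair", "good"],
    categories=["poor", "fair", "good", "excellent"],
    ordered=True,
)
s = pd.Series(ratings)
print(s.min(), "->", s.max())       # poor -> excellent (respects the ordering)
print((s >= "good").tolist())       # element-wise ordinal comparison
```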
Data Quality: Why Preprocess the Data?
• Importance of Data Preprocessing:
• Accuracy: Ensures the accuracy and reliability of the analysis by addressing issues
such as missing data, noise, and inconsistencies.
• Efficiency: Reduces the complexity of data, making it easier to process and analyze.
• Consistency: Aligns data from different sources or formats, ensuring that it is
coherent and uniform.
• Improves Model Performance: Clean and well-preprocessed data lead to better
model performance and more accurate predictions.

Data Munging/Wrangling Operations


• Definition: Data munging or wrangling refers to the process of transforming raw data into a
clean, structured format suitable for analysis.
• Common Operations:
• Data Parsing: Converting raw data into a structured format.
• Data Filtering: Removing irrelevant or redundant data.
• Data Aggregation: Summarizing or combining data from multiple sources.
• Data Enrichment: Enhancing data with additional relevant information.
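
A pandas sketch of these wrangling operations (file names and column names such as orders.csv, customer_id, and amount are hypothetical):

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])    # parsing

recent = orders[orders["order_date"] >= "2024-01-01"]             # filtering

per_customer = (recent
                .groupby("customer_id", as_index=False)
                .agg(total_spent=("amount", "sum"),
                     n_orders=("order_id", "count")))              # aggregation

customers = pd.read_csv("customers.csv")
enriched = per_customer.merge(customers, on="customer_id",
                              how="left")                          # enrichment
```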

Cleaning Data
• Definition: Data cleaning is the process of identifying and correcting (or removing) errors
and inconsistencies in data to improve its quality.
• Common Data Cleaning Issues:
• Missing Values: Data points where information is absent.
• Handling Methods: Imputation (filling in missing values), deletion, or using
algorithms that can handle missing data.
• Noisy Data: Data that contains errors, inconsistencies, or irrelevant information.
• Types of Noisy Data:
• Duplicate Entries: Multiple records for the same entity.
• Multiple Entries for a Single Entity: Different entries representing
the same entity with slight variations.
• Missing Entries: Partial data missing for certain records.
• NULLs: Missing values represented as NULL.
• Huge Outliers: Data points that are significantly different from other
observations.
• Out-of-Date Data: Data that is no longer accurate or relevant.
• Artificial Entries: Data that is not genuine or was created for testing
purposes.
• Irregular Spacings: Inconsistent spacing within text data.
• Formatting Issues: Different formatting styles used across tables or
columns.
• Extra Whitespace: Unnecessary spaces that can cause parsing issues.
• Irregular Capitalization: Inconsistent use of uppercase and
lowercase letters.
• Inconsistent Delimiters: Different delimiters used to separate data
fields.
• Irregular NULL Format: Inconsistent representation of missing
data.
• Invalid Characters: Characters that do not belong in the dataset.
• Incompatible Datetimes: Different date and time formats that need
standardization.
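
A hedged pandas sketch addressing several of the issues above on a tiny made-up table (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": [" Alice", "bob ", "Alice", None, "N/A"],
    "age":  [25, np.nan, 25, 31, 200],
    "city": ["Pune", "pune", "Pune", "Mumbai", "Mumbai"],
})

df["name"] = df["name"].replace("N/A", np.nan)     # irregular NULL format
df["name"] = df["name"].str.strip()                # extra whitespace
df["city"] = df["city"].str.title()                # irregular capitalization
df = df.drop_duplicates()                          # duplicate entries
df["age"] = df["age"].where(df["age"] <= 120)      # huge outlier -> treat as missing
df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
df = df.dropna(subset=["name"])                    # drop rows still missing a name
```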

Data Transformation
• Definition: Data transformation involves converting data into a suitable format or structure
for analysis.
• Common Data Transformation Techniques:
• Rescaling: Adjusting the range of data values to a specific scale, often to bring all
variables into the same range.
• Example: Rescaling data to a range of 0 to 1.
• Normalizing: Adjusting values to a common scale without distorting relative
differences, for example scaling each sample vector to unit length.
• Example: Scaling each record so that the sum of its squared values equals 1.
• Binarizing: Converting numerical data into binary form (e.g., 0 or 1).
• Example: Converting a continuous attribute into a binary attribute based on a
threshold.
• Standardizing (Z-score normalization): Transforming data so that it has a mean of
0 and a standard deviation of 1.
• Example: Standardizing features measured on different scales before training a model.
• Label Encoding: Converting categorical attributes into numerical form by assigning
a unique integer to each category.
• One-Hot Encoding: Converting categorical attributes into binary vectors where each
category is represented by a binary variable (0 or 1).
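
A sketch of these transformations with scikit-learn and pandas (toy input values only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   Binarizer, LabelEncoder)

X = np.array([[1.0], [5.0], [10.0], [20.0]])

print(MinMaxScaler().fit_transform(X).ravel())            # rescaling to [0, 1]
print(StandardScaler().fit_transform(X).ravel())          # standardizing: mean 0, std 1
print(Binarizer(threshold=7.5).fit_transform(X).ravel())  # binarizing on a threshold

colors = pd.Series(["red", "green", "blue", "green"])
print(LabelEncoder().fit_transform(colors))        # label encoding -> integers
print(pd.get_dummies(colors, prefix="color"))      # one-hot encoding -> binary columns
```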

Data Reduction
• Definition: Data reduction involves reducing the volume of data while maintaining its
integrity and meaning, making it easier to analyze.
• Techniques:
• Dimensionality Reduction: Reducing the number of attributes or features while
retaining essential information (e.g., PCA, LDA).
• Numerosity Reduction: Reducing the number of data points or records through
techniques like clustering, sampling, or aggregation.
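
A minimal dimensionality-reduction sketch using PCA from scikit-learn on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 150 samples x 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)             # 150 samples x 2 components

print(X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```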

Data Discretization
• Definition: Data discretization involves converting continuous data into discrete intervals or
categories.
• Importance: Useful for transforming continuous attributes into categorical attributes, which
can simplify analysis and improve model performance.
• Methods:
• Binning: Dividing data into intervals, or "bins," and assigning a categorical label to
each bin.
• Histogram Analysis: Using histograms to define intervals based on data distribution.
• Cluster Analysis: Grouping similar data points and assigning them to discrete
categories.
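
A short sketch of binning with pandas, using cut() for equal-width or labeled bins and qcut() for equal-frequency bins (the ages and bin edges are illustrative):

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 45, 51, 63, 78])

equal_width = pd.cut(ages, bins=4)                     # 4 equal-width intervals
labeled = pd.cut(ages, bins=[0, 18, 35, 60, 100],      # custom intervals with labels
                 labels=["child", "young", "middle-aged", "senior"])
equal_freq = pd.qcut(ages, q=4)                        # quartile-based bins

print(labeled.tolist())
print(equal_width.value_counts().sort_index())
```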

4. Data Visualization
Introduction to Exploratory Data Analysis (EDA)
• Definition: EDA is an approach to analyzing data sets to summarize their main
characteristics, often using visual methods.
• Purpose of EDA:
• Identifying Patterns: Detecting trends, correlations, and relationships in data.
• Spotting Anomalies: Finding outliers or irregularities in the data.
• Checking Assumptions: Verifying the validity of assumptions made about the data.
• Guiding Further Analysis: Informing the choice of statistical models or algorithms
to apply.
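
A typical first-pass EDA sketch in pandas, here run on the Iris data for concreteness:

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

df.info()                              # column types and missing values
print(df.describe())                   # count, mean, std, min/max, quartiles
print(df.corr())                       # pairwise correlations (all columns numeric here)
print(df["target"].value_counts())     # class balance
```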

Data Visualization and Visual Encoding


• Definition: Data visualization is the graphical representation of data to make complex data
more accessible and understandable.
• Visual Encoding: The process of mapping data attributes (e.g., numbers, categories) to
visual elements like color, shape, size, or position in a chart.
• Examples of Visual Encoding:
• Position: The location of data points on a plot (e.g., x and y axes in a scatter
plot).
• Color: Used to distinguish different categories or indicate data intensity (e.g.,
heat maps).
• Size: Represents the magnitude of data points (e.g., bubble size in bubble
plots).
• Shape: Differentiates between categories (e.g., different marker shapes in a
scatter plot).
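
A single scatter plot can carry several encodings at once; a Matplotlib sketch using the Iris data, where position, color, and size each encode a different attribute:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame

plt.scatter(iris["sepal length (cm)"], iris["sepal width (cm)"],
            c=iris["target"],                     # color encodes the class
            s=iris["petal width (cm)"] * 40,      # size encodes petal width
            cmap="viridis", alpha=0.7)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.title("Position, color, and size as visual encodings")
plt.show()
```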

Data Visualization Libraries


• Definition: Libraries or software packages that provide tools and functions for creating
visual representations of data.
• Popular Libraries:
• Matplotlib: A widely used Python library for creating static, animated, and
interactive visualizations.
• Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for
creating attractive and informative statistical graphics.
• Plotly: An interactive graphing library that enables complex, web-based
visualizations.
• ggplot2: A popular data visualization package in R, based on the Grammar of
Graphics.
• D3.js: A JavaScript library for producing dynamic, interactive data visualizations in
web browsers.
Basic Data Visualization Tools
• Histograms:
• Definition: A graphical representation of the distribution of a dataset. It shows the
frequency of data points in specified ranges (bins).
• Use: Ideal for displaying the distribution of a single continuous variable.
• Bar Charts/Graphs:
• Definition: A chart that presents categorical data with rectangular bars. The length of
each bar is proportional to the value it represents.
• Use: Best for comparing the frequency or count of different categories.
• Scatter Plots:
• Definition: A plot that shows the relationship between two numerical variables. Each
point represents an observation in the dataset.
• Use: Useful for identifying correlations or patterns between variables.
• Line Charts:
• Definition: A type of chart that displays data points connected by a line. It shows
trends over time or ordered categories.
• Use: Commonly used to track changes over time.
• Area Plots:
• Definition: Similar to line charts, but the area under the line is filled with color or
shading.
• Use: Good for visualizing cumulative data or comparing multiple variables.
• Pie Charts:
• Definition: A circular chart divided into sectors, each representing a proportion of
the whole.
• Use: Ideal for showing the relative proportions of categories in a dataset.
• Donut Charts:
• Definition: A variation of the pie chart with a central hole, often used to provide
additional information in the center.
• Use: Similar to pie charts, with the central hole leaving room for a total or
summary label.
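
A Matplotlib sketch of four of these basic chart types on randomly generated sample data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].hist(values, bins=20)                    # histogram: distribution
axes[0, 0].set_title("Histogram")

axes[0, 1].bar(["A", "B", "C"], [30, 45, 12])       # bar chart: categories
axes[0, 1].set_title("Bar chart")

axes[1, 0].scatter(values[:100], values[100:])      # scatter: two variables
axes[1, 0].set_title("Scatter plot")

axes[1, 1].plot(np.arange(12),
                rng.integers(10, 60, size=12))      # line chart: trend over time
axes[1, 1].set_title("Line chart")

plt.tight_layout()
plt.show()
```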

Specialized Data Visualization Tools


• Boxplots:
• Definition: A graphical representation of the distribution of a dataset based on five
summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum.
• Use: Effective for identifying outliers and understanding the spread and skewness of
data.
• Bubble Plots:
• Definition: A variation of a scatter plot where each point is replaced by a bubble, and
the size of the bubble represents a third variable.
• Use: Useful for visualizing three dimensions of data on a two-dimensional plane.
• Heat Maps:
• Definition: A graphical representation of data where individual values are
represented by colors.
• Use: Ideal for displaying the intensity or density of data points across a matrix.
• Dendrogram:
• Definition: A tree-like diagram used to illustrate the arrangement of clusters
produced by hierarchical clustering.
• Use: Useful for visualizing the structure and hierarchy of data clusters.
• Venn Diagram:
• Definition: A diagram that shows all possible logical relations between a finite
collection of sets.
• Use: Effective for illustrating set relationships, such as intersections and unions.
• Treemap:
• Definition: A hierarchical structure represented as nested rectangles, where each
rectangle's size is proportional to the data value.
• Use: Useful for visualizing large amounts of hierarchical data.
• 3D Scatter Plots:
• Definition: An extension of the scatter plot into three dimensions, where each point
is defined by three numerical coordinates.
• Use: Ideal for visualizing the relationship between three continuous variables.
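
A Seaborn sketch of a boxplot and a correlation heat map, again using the Iris data for concreteness:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
df["species"] = df["target"].astype(str)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

sns.boxplot(x="species", y="petal length (cm)", data=df, ax=ax1)  # spread & outliers
ax1.set_title("Boxplot per class")

sns.heatmap(df.drop(columns=["species"]).corr(),                  # correlation matrix
            annot=True, cmap="coolwarm", ax=ax2)
ax2.set_title("Correlation heat map")

plt.tight_layout()
plt.show()
```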

Advanced Data Visualization Tools - Wordclouds


• Definition: A visual representation of text data where the size of each word reflects its
frequency or importance.
• Use: Effective for quickly identifying the most prominent words or themes in a text dataset.
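
A minimal sketch assuming the third-party wordcloud package is installed (pip install wordcloud); the sample text is made up:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = ("data science statistics python visualization data analysis "
        "machine learning data model data insight")

wc = WordCloud(width=600, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")   # word size reflects word frequency
plt.axis("off")
plt.show()
```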

Visualization of Geospatial Data


• Definition: The process of visualizing data that includes geographical or spatial
components.
• Tools and Techniques:
• Choropleth Maps: Maps where areas are shaded or patterned in proportion to the
data value.
• Point Maps: Maps that represent individual data points as symbols, such as dots.
• Heat Maps: Geographical maps that use color to represent the density of data points
in a given area.
• Interactive Maps: Maps that allow users to interact with data by zooming, clicking,
or filtering.
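
A point-map sketch assuming the third-party folium package is installed (pip install folium); the city coordinates are approximate and purely illustrative:

```python
import folium

m = folium.Map(location=[18.52, 73.86], zoom_start=5)   # centered near Pune

cities = {"Pune": (18.52, 73.86), "Mumbai": (19.08, 72.88), "Delhi": (28.61, 77.21)}
for name, (lat, lon) in cities.items():
    folium.CircleMarker(location=[lat, lon], radius=6, popup=name).add_to(m)

m.save("cities_map.html")   # interactive map viewable in a web browser
```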

Data Visualization Types


• Categorical Data Visualization:
• Tools: Bar charts, pie charts, donut charts.
• Purpose: Comparing different categories or understanding the distribution of
categorical data.
• Numerical Data Visualization:
• Tools: Histograms, boxplots, scatter plots.
• Purpose: Understanding the distribution, trends, and relationships between
numerical variables.
• Hierarchical Data Visualization:
• Tools: Treemaps, dendrograms.
• Purpose: Displaying the structure and relationships within hierarchical datasets.
• Network Data Visualization:
• Tools: Network graphs, node-link diagrams.
• Purpose: Visualizing relationships and interactions between entities within a
network.
