FDSNotes
Types of Data
• Structured Data:
• Definition: Data that is organized in a specific format, often in rows and columns,
making it easily searchable in databases.
• Examples: Excel sheets, SQL databases.
• Semi-structured Data:
• Definition: Data that doesn’t have a fixed format but includes tags or markers to
separate elements.
• Examples: XML, JSON files.
• Unstructured Data:
• Definition: Data that lacks a specific format or structure, making it more challenging
to process and analyze.
• Examples: Text documents, images, videos, emails.
• Problems with Unstructured Data:
• Storage Issues: Requires more space and advanced storage solutions.
• Processing Complexity: Difficult to process and analyze due to its lack of
structure.
• Interpretation Challenges: Requires advanced techniques like natural
language processing (NLP) or image recognition.
Data Sources
• Open Data: Publicly available data that can be freely used and shared. Examples include
government datasets, public health data, and environmental data.
• Social Media Data: Data generated from social media platforms, such as posts, likes,
shares, and comments. Useful for sentiment analysis and trend prediction.
• Multimodal Data: Data that combines multiple types of information, such as text, images,
and audio. Examples include video files with subtitles or annotated images.
• Standard Datasets: Widely used datasets in Data Science for benchmarking algorithms and
models. Examples include the Iris dataset, MNIST dataset, and ImageNet.
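Loading a standard dataset is usually a one-liner. A minimal sketch, assuming scikit-learn is installed, of loading the Iris dataset mentioned above:

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target    # 150 samples, 4 numeric features, 3 classes
print(iris.feature_names)        # 'sepal length (cm)', 'sepal width (cm)', ...
print(X.shape, y.shape)          # (150, 4) (150,)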
Data Formats
• Integers and Floats:
• Integers: Whole numbers used for counting or indexing.
• Floats: Numbers with decimal points, used for representing continuous data.
• Text Data:
• Plain Text: Simple text data stored without any formatting (e.g., .txt files).
• Text Files:
• CSV Files: Comma-separated values, often used for storing tabular data.
• JSON Files: JavaScript Object Notation, used for storing and exchanging data.
• XML Files: Extensible Markup Language, used for encoding documents in a format
that is both human-readable and machine-readable.
• HTML Files: Hypertext Markup Language, used for creating web pages.
• Dense Numerical Arrays: Arrays containing numerical data, typically used in scientific
computing and data analysis (e.g., NumPy arrays).
• Compressed or Archived Data:
• Tar Files: Archive files that bundle multiple files and directories (tar itself does not
compress; it is often combined with GZip as .tar.gz).
• GZip Files: Compressed files that reduce storage space and transfer time.
• Zip Files: Archive files that can contain multiple files in a compressed format.
• Image Files:
• Rasterized Images: Images made up of pixels (e.g., JPEG, PNG).
• Vectorized Images: Images made up of paths and curves, scalable without losing
quality (e.g., SVG files).
• Compressed Images: Images compressed to reduce file size, either lossily (e.g.,
JPEG) or losslessly (e.g., PNG).
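Most of the text-based formats above can be read in a few lines. A minimal sketch using pandas and the standard library; the file names are hypothetical:

import json
import pandas as pd

df_csv = pd.read_csv("data.csv")      # tabular, comma-separated values
df_json = pd.read_json("data.json")   # JSON records into a DataFrame

with open("settings.json") as f:      # plain standard-library JSON parsing
    settings = json.load(f)

# XML documents and HTML tables can also be parsed (both need lxml installed):
# df_xml = pd.read_xml("data.xml")
# tables = pd.read_html("page.html")  # returns one DataFrame per <table>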
Concept of Outlier
• Definition: An outlier is a data point that significantly differs from other observations in a
dataset.
• Types of Outliers:
• Univariate Outliers: Outliers that occur in a single variable.
• Multivariate Outliers: Outliers that occur in a combination of variables, not
apparent when looking at individual variables.
• Contextual Outliers: Outliers that are only considered abnormal in a specific
context (e.g., temperature readings that are normal in summer but outliers in winter).
• Outlier Detection Methods:
• Z-Score Method: Calculates how many standard deviations a data point lies from the
mean, z = (x - μ) / σ. Data points with a Z-score beyond a certain threshold (e.g., ±3)
are considered outliers.
• IQR Method: Outliers are identified as data points that fall below Q1 - 1.5 × IQR or
above Q3 + 1.5 × IQR, where IQR = Q3 - Q1 (see the sketch below).
• Machine Learning Methods: Techniques like clustering, isolation forests, and one-
class SVMs can be used to detect outliers in more complex datasets.
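A minimal sketch of the Z-score and IQR methods with NumPy; the synthetic data and the ±3 threshold are illustrative:

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 200), [120.0]])  # 120 is an injected outlier

# Z-score method: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers)    # flags 120.0
print(iqr_outliers)  # flags 120.0, and possibly a few mild tail values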
3. Data Preprocessing
Data Objects and Attribute Types
• What is an Attribute?
• Definition: An attribute (or feature) is a property or characteristic of an object or
data point. In a dataset, attributes are the columns that describe different aspects of
the data objects (rows).
• Types of Attributes:
• Nominal Attributes:
• Definition: Categorical attributes with no inherent order or ranking among
the values.
• Examples: Colors (red, blue, green), gender (male, female).
• Binary Attributes:
• Definition: Attributes that have two possible states or values.
• Types:
• Symmetric Binary: Both outcomes are equally important (e.g., heads/tails,
where neither state is privileged).
• Asymmetric Binary: One outcome is more significant than the other
(e.g., success/failure, where success is more critical).
• Ordinal Attributes:
• Definition: Categorical attributes with a meaningful order or ranking
between values.
• Examples: Education levels (high school, bachelor's, master's), customer
satisfaction ratings (poor, fair, good, excellent).
• Numeric Attributes:
• Definition: Attributes that are quantifiable and expressible in numbers.
• Types:
• Discrete Attributes: Attributes that take on a countable number of
distinct values.
• Examples: Number of students in a class, number of cars in a
parking lot.
• Continuous Attributes: Attributes that can take on any value within a
range.
• Examples: Temperature, height, weight.
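A brief sketch of how these attribute types map onto pandas; the column names and values are illustrative:

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green"],                       # nominal
    "passed": [1, 0, 1],                                     # binary
    "education": ["high school", "bachelor's", "master's"],  # ordinal
    "num_cars": [2, 1, 3],                                   # discrete numeric
    "height_cm": [172.5, 160.1, 181.3],                      # continuous numeric
})

# Mark the ordinal column so comparisons respect the ranking.
df["education"] = pd.Categorical(
    df["education"],
    categories=["high school", "bachelor's", "master's"],
    ordered=True,
)
print(df["education"] >= "bachelor's")   # ordered comparison now works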
Data Quality: Why Preprocess the Data?
• Importance of Data Preprocessing:
• Accuracy: Ensures the accuracy and reliability of the analysis by addressing issues
such as missing data, noise, and inconsistencies.
• Efficiency: Reduces the complexity of data, making it easier to process and analyze.
• Consistency: Aligns data from different sources or formats, ensuring that it is
coherent and uniform.
• Improves Model Performance: Clean and well-preprocessed data lead to better
model performance and more accurate predictions.
Cleaning Data
• Definition: Data cleaning is the process of identifying and correcting (or removing) errors
and inconsistencies in data to improve its quality.
• Common Data Cleaning Issues:
• Missing Values: Data points where information is absent.
• Handling Methods: Imputation (filling in missing values), deletion, or using
algorithms that can handle missing data.
• Noisy Data: Data that contains errors, inconsistencies, or irrelevant information.
• Types of Noisy Data:
• Duplicate Entries: Multiple records for the same entity.
• Multiple Entries for a Single Entity: Different entries representing
the same entity with slight variations.
• Missing Entries: Partial data missing for certain records.
• NULLs: Missing values represented as NULL.
• Huge Outliers: Data points that are significantly different from other
observations.
• Out-of-Date Data: Data that is no longer accurate or relevant.
• Artificial Entries: Data that is not genuine or was created for testing
purposes.
• Irregular Spacings: Inconsistent spacing within text data.
• Formatting Issues: Different formatting styles used across tables or
columns.
• Extra Whitespace: Unnecessary spaces that can cause parsing issues.
• Irregular Capitalization: Inconsistent use of uppercase and
lowercase letters.
• Inconsistent Delimiters: Different delimiters used to separate data
fields.
• Irregular NULL Format: Inconsistent representation of missing
data.
• Invalid Characters: Characters that do not belong in the dataset.
• Incompatible Datetimes: Different date and time formats that need
standardization.
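A minimal pandas sketch addressing a few of these issues (extra whitespace, irregular capitalization, irregular NULL format, missing values, duplicates); the toy data is illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["  Alice ", "bob", "Bob", None],
    "city": ["NYC", "nyc", "N/A", "Boston"],
    "age":  [29, 35, 35, np.nan],
})

df["name"] = df["name"].str.strip().str.title()             # whitespace, capitalization
df["city"] = df["city"].str.upper().replace("N/A", np.nan)  # irregular NULL format
df["age"] = df["age"].fillna(df["age"].median())            # impute missing values
df = df.drop_duplicates(subset=["name", "age"])             # duplicate entries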
Data Transformation
• Definition: Data transformation involves converting data into a suitable format or structure
for analysis.
• Common Data Transformation Techniques:
• Rescaling: Adjusting the range of data values to a specific scale, often to bring all
variables into the same range.
• Example: Rescaling data to a range of 0 to 1.
• Normalizing: Scaling values onto a common scale, such as unit norm per sample or the
range 0 to 1; the term is also used loosely for Z-score scaling.
• Example: Z-score normalization (the same transform as standardizing, below).
• Binarizing: Converting numerical data into binary form (e.g., 0 or 1).
• Example: Converting a continuous attribute into a binary attribute based on a
threshold.
• Standardizing: Rescaling data to have a mean of 0 and a standard deviation of 1 (the
Z-score transform); this centers and scales the data but does not change the shape of its
distribution.
• Example: Standardizing features so that attributes measured on different scales
become comparable.
• Label Encoding: Converting categorical attributes into numerical form by assigning
a unique integer to each category.
• One-Hot Encoding: Converting categorical attributes into binary vectors where each
category is represented by a binary variable (0 or 1).
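A minimal scikit-learn sketch of these transformations; the toy arrays are illustrative, and sparse_output requires scikit-learn 1.2 or newer:

import numpy as np
from sklearn.preprocessing import (
    Binarizer, LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler,
)

X = np.array([[1.0], [5.0], [10.0]])

rescaled = MinMaxScaler().fit_transform(X)             # values mapped into [0, 1]
standardized = StandardScaler().fit_transform(X)       # mean 0, standard deviation 1
binarized = Binarizer(threshold=5.0).fit_transform(X)  # 1 if value > 5.0, else 0

colors = np.array([["red"], ["blue"], ["red"]])
labels = LabelEncoder().fit_transform(colors.ravel())  # blue -> 0, red -> 1
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)  # one column per category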
Data Reduction
• Definition: Data reduction involves reducing the volume of data while maintaining its
integrity and meaning, making it easier to analyze.
• Techniques:
• Dimensionality Reduction: Reducing the number of attributes or features while
retaining essential information (e.g., PCA, LDA).
• Numerosity Reduction: Reducing the number of data points or records through
techniques like clustering, sampling, or aggregation.
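A minimal sketch of dimensionality reduction with PCA from scikit-learn; the random data is illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 samples, 10 features

pca = PCA(n_components=2)                # keep the 2 strongest components
X_reduced = pca.fit_transform(X)         # shape (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance each component retains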
Data Discretization
• Definition: Data discretization involves converting continuous data into discrete intervals or
categories.
• Importance: Useful for transforming continuous attributes into categorical attributes, which
can simplify analysis and improve model performance.
• Methods:
• Binning: Dividing data into intervals, or "bins," and assigning a categorical label to
each bin.
• Histogram Analysis: Using histograms to define intervals based on data distribution.
• Cluster Analysis: Grouping similar data points and assigning them to discrete
categories.
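A minimal pandas sketch of binning; the age values, bin edges, and labels are illustrative:

import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# Equal-width style binning with explicit edges and categorical labels.
age_groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=["child", "young adult", "adult", "senior"])

# Equal-frequency (quantile) binning: each bin gets roughly the same count.
quartiles = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])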
4. Data Visualization
Introduction to Exploratory Data Analysis (EDA)
• Definition: EDA is an approach to analyzing data sets to summarize their main
characteristics, often using visual methods.
• Purpose of EDA:
• Identifying Patterns: Detecting trends, correlations, and relationships in data.
• Spotting Anomalies: Finding outliers or irregularities in the data.
• Checking Assumptions: Verifying the validity of assumptions made about the data.
• Guiding Further Analysis: Informing the choice of statistical models or algorithms
to apply.
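A minimal first-pass EDA sketch with pandas and Matplotlib; the file name is hypothetical, and numeric_only requires pandas 1.5 or newer:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")           # hypothetical dataset

print(df.describe())                   # summary statistics per numeric column
print(df.isna().sum())                 # missing values per column
print(df.corr(numeric_only=True))      # pairwise correlations

df.hist(figsize=(10, 8))               # distribution of each numeric column
plt.show()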