DWH m2p2
1. Data Objects: Various names include record, point, vector, pattern, event, case,
sample, observation, or entity.
2. Attributes: Descriptive characteristics that define data objects, such as GPA, ID
number, or temperature.
5. Nominal Attributes:
o Represented by names or labels (e.g., eye color, gender).
o Operations: Equality and inequality (e.g., mode, contingency tables).
6. Ordinal Attributes:
o Values have a meaningful order (e.g., grades, rankings).
o Operations: Order comparison (e.g., median, rank correlation).
7. Interval Attributes:
o Meaningful differences between values but no true zero point (e.g., Celsius
temperature).
o Operations: Addition and subtraction (e.g., mean, standard deviation).
8. Ratio Attributes:
o Meaningful differences and ratios between values (e.g., age, length).
o Operations: Multiplication and division (e.g., geometric mean, percent
variation).
9. Discrete Attributes:
o Finite or countably infinite set of values (e.g., zip codes, counts).
o Can be categorical (e.g., binary attributes like true/false).
10. Continuous Attributes:
o Real-number values (e.g., temperature, height).
o Typically represented as floating-point variables for precision.
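The operations listed for the attribute types above determine which summary statistics are meaningful. A minimal sketch in Python (standard library only; the sample values are made up for illustration):

# Sketch: which summary statistics are meaningful for each attribute type.
import statistics

eye_color = ["brown", "blue", "brown", "green"]   # nominal
grades    = [1, 2, 2, 3, 5]                       # ordinal (rank-coded)
celsius   = [20.5, 22.0, 19.5, 21.0]              # interval
lengths   = [1.2, 2.4, 4.8]                       # ratio

print(statistics.mode(eye_color))          # mode: the only "average" valid for nominal data
print(statistics.median(grades))           # median: meaningful once values have an order
print(statistics.mean(celsius),            # mean and standard deviation require meaningful
      statistics.stdev(celsius))           # differences (interval scale)
print(statistics.geometric_mean(lengths))  # geometric mean requires a true zero (ratio scale)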
Record Data
Record data is structured as a collection of records (or data objects), where each record
consists of a fixed set of data fields (attributes). Common variations include transaction
(market-basket) data, the data matrix, and the sparse (document-term) matrix.
Graph-Based Data
Graphs are used to represent data where relationships between objects are crucial:
4. Data with Relationships Among Objects:
o Definition: Objects are nodes, relationships are edges with properties (e.g., direction, weight).
o Example: Web pages linked by hyperlinks; social networks with users as nodes and friendships as edges.
o Attributes: Nodes and edges can have additional properties.
o Representation: Enables network analysis and algorithms like PageRank for search engines (see the sketch after this list).
5. Data with Objects That Are Graphs:
o Definition: Objects themselves have internal structure represented as graphs.
o Example: Chemical compounds where atoms are nodes and bonds are edges.
o Attributes: Atomic properties and bond types.
o Representation: Used in chemical informatics for predicting compound properties.
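As referenced in item 4, a minimal sketch of graph-based data stored as an adjacency list, with a few simplified PageRank iterations (the link graph and damping factor are illustrative, not a production implementation):

# Sketch: a tiny link graph and a simplified PageRank computation.
links = {                      # node -> list of nodes it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                       # damping factor
rank = {page: 1 / len(links) for page in links}

for _ in range(20):            # fixed number of iterations for illustration
    new_rank = {}
    for page in links:
        incoming = sum(rank[p] / len(out) for p, out in links.items() if page in out)
        new_rank[page] = (1 - d) / len(links) + d * incoming
    rank = new_rank

print(rank)                    # higher rank ~ more "important" node in the network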
Ordered Data
6. Sequential Data:
o Definition: Extends record data with time-stamped records.
o Example: Retail transactions with timestamps; clickstream data on websites.
o Attributes: Includes time stamps or sequence positions.
o Representation: Analyzed for sequential patterns and temporal correlations.
7. Sequence Data:
o Definition: Similar to sequential data but without explicit timestamps, focusing on ordered sequences.
o Example: Genetic sequences represented by nucleotide sequences.
o Attributes: DNA or RNA bases (A, T, C, G).
o Representation: Used in bioinformatics for genome analysis and sequence alignment.
8. Time Series Data:
o Definition: Each record is a series of measurements taken over time.
o Example: Stock prices, temperature readings over months or years.
o Attributes: Time-indexed measurements.
o Representation: Analyzed for trends, seasonality, and temporal correlations (see the sketch after this list).
9. Spatial Data:
o Definition: Data with spatial attributes (positions or areas) and possibly other attributes.
o Example: Geospatial data like weather maps, satellite imagery.
o Attributes: Spatial coordinates and environmental variables (temperature, precipitation).
o Representation: Analyzed for spatial autocorrelation and geographical patterns.
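As referenced in item 8, a minimal sketch of smoothing a time series with a simple moving average so that the underlying trend stands out from the noise (the readings are synthetic):

# Sketch: a noisy upward trend smoothed with a 7-point moving average.
import random

readings = [20 + 0.1 * day + random.gauss(0, 1.5) for day in range(60)]

window = 7
smoothed = [
    sum(readings[i:i + window]) / window
    for i in range(len(readings) - window + 1)
]

print(readings[:5])   # raw, noisy values
print(smoothed[:5])   # smoothed values follow the trend more closely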
Data mining often deals with data collected for purposes other than mining itself, so
addressing data quality issues at the source is usually not feasible. Instead, data mining
focuses on detecting and correcting data quality problems and on using algorithms that can
tolerate poor data quality.
Data imperfections are common due to human error, limitations of measuring devices, or
flaws in the data collection process. Issues include missing values, duplicate data objects, and
inconsistent data. Data cleaning involves detecting and correcting these issues.
Measurement error refers to inaccuracies in recorded values compared to true values, while
data collection errors include omission or inappropriate inclusion of data objects. Errors can
be systematic or random.
Noise refers to random disturbances in data, while artifacts are deterministic distortions.
Techniques from signal or image processing are used to reduce noise, preserving underlying
patterns.
Precision is the closeness of repeated measurements to each other, bias is a systematic
deviation of the measurements from the true value, and accuracy is the closeness of
measurements to the true value. The number of significant digits reported should match the
precision of the data.
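A minimal sketch of how these three quantities could be computed from repeated measurements of a quantity whose true value is known (the measurements are made up):

# Sketch: precision, bias, and accuracy of repeated measurements.
import statistics

true_value = 1.000
measurements = [1.015, 1.012, 1.013, 1.014, 1.011]   # a scale that reads slightly high

precision = statistics.stdev(measurements)               # spread of repeated measurements
bias = statistics.mean(measurements) - true_value        # systematic deviation from the true value
mean_abs_error = statistics.mean(abs(m - true_value) for m in measurements)  # (in)accuracy

print(f"precision={precision:.4f}  bias={bias:+.4f}  mean abs error={mean_abs_error:.4f}")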
Outliers
Outliers are data objects or values significantly different from others in a dataset. They can be
of interest in anomaly detection tasks like fraud detection.
Missing Values
Some data objects may lack attribute values, impacting analysis. Strategies include
eliminating data with missing values, estimating missing values, or modifying analysis
methods to ignore missing values.
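A minimal sketch of two of these strategies, eliminating incomplete objects versus estimating (imputing) the missing value (the records are made up; None marks a missing value):

# Sketch: handling missing attribute values.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
]

# Strategy 1: eliminate objects with any missing value.
complete = [r for r in records if None not in r.values()]

# Strategy 2: estimate (impute) missing values, e.g. with the attribute mean.
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)
imputed = [{**r, "age": r["age"] if r["age"] is not None else mean_age} for r in records]

print(complete)
print(imputed)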
Inconsistent Values
Inconsistencies arise when data values conflict with expected norms (e.g., negative height).
Detecting and correcting such issues often requires external validation or additional
redundant information.
Duplicate Data
Duplicates or almost duplicates of data objects may exist in datasets, requiring identification
and resolution to avoid inaccuracies in analysis results. Deduplication processes help manage
these issues effectively.
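A minimal sketch of one simple deduplication approach, keying records on a normalized identifying attribute (the records, and the choice of email as the key, are illustrative):

# Sketch: deduplication by a normalized key.
records = [
    {"email": "a@example.com", "name": "Ann"},
    {"email": "A@Example.com", "name": "Ann Smith"},   # same person, different formatting
    {"email": "b@example.com", "name": "Bob"},
]

deduped = {}
for r in records:
    key = r["email"].strip().lower()     # normalize before comparing
    deduped.setdefault(key, r)           # keep the first record seen for each key

print(list(deduped.values()))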
Conclusion
Understanding and addressing data quality issues are crucial for effective data mining and
analysis. Techniques for detecting, correcting, and managing data imperfections ensure
reliable results from data-driven applications.
Data quality is often defined by its suitability for the intended use, a perspective particularly
valuable in business and industry, as well as in statistics and experimental sciences where
data collection is tailored to specific hypotheses.
Timeliness
Data can lose relevance over time, especially when it reflects dynamic processes like
customer purchasing behavior or web browsing patterns. Outdated data leads to outdated
models and patterns.
Relevance
For effective modeling, data must include all necessary information. Omissions, such as
excluding driver age and gender from a model predicting accident rates, can severely impact
model accuracy unless indirect replacements exist.
Sampling Bias
Sampling bias arises when a sample doesn't accurately represent the full population, skewing
analysis results. For example, survey data may only reflect respondents' views, not the entire
population's views.
Data quality isn't just about accuracy and completeness; it's also about suitability for specific
applications. Timeliness, relevance, absence of bias, and comprehensive documentation are
crucial aspects that ensure data meets its intended purpose effectively.
2.3.2 Sampling
Sampling Approaches
Sampling is a widely used technique in both statistics and data mining for selecting a subset
of data objects to analyze. While statisticians and data miners have different motivations for
sampling, the goal remains consistent: to efficiently obtain insights from a subset of data that
represents the larger population.
Simple Random Sampling
Simple random sampling involves selecting data objects from a population where each object
has an equal probability of being chosen. There are two variations: sampling without
replacement, where each selected object is removed from the population, and sampling with
replacement, where selected objects stay in the population and may be chosen more than once.
In practical applications, the differences between these methods are minor when the sample
size is small relative to the population size. Sampling with replacement is often simpler to
implement and analyze statistically due to its consistent probabilities throughout the selection
process.
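A minimal sketch of both variations using the Python standard library (the population and sample size are illustrative):

# Sketch: simple random sampling without and with replacement.
import random

population = list(range(1000))
n = 10

without_replacement = random.sample(population, n)     # each object picked at most once
with_replacement = random.choices(population, k=n)     # objects may repeat

print(without_replacement)
print(with_replacement)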
Stratified Sampling
Stratified sampling is particularly useful when the population consists of distinct groups or
strata. Instead of sampling directly from the entire population, stratified sampling divides the
population into homogeneous subgroups called strata and samples proportionately from each
stratum. This approach ensures that each stratum is adequately represented in the sample,
which is crucial for maintaining the integrity of rare classes or groups within the data.
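A minimal sketch of proportionate stratified sampling, grouping objects by class label and drawing the same fraction from each stratum so that the rare class is still represented (the data and the sampling fraction are made up):

# Sketch: proportionate stratified sampling.
import random
from collections import defaultdict

data = [("fraud", i) for i in range(20)] + [("normal", i) for i in range(980)]
fraction = 0.05

strata = defaultdict(list)
for label, obj in data:
    strata[label].append((label, obj))

sample = []
for label, objects in strata.items():
    k = max(1, round(fraction * len(objects)))   # at least one object per stratum
    sample.extend(random.sample(objects, k))

print(len(sample), [s[0] for s in sample].count("fraud"))  # rare class still represented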
Progressive Sampling
Progressive sampling is an adaptive technique where the sample size increases gradually until
a representative sample is achieved. This method avoids the need to pre-determine an exact
sample size, which can be challenging in practice. The decision to stop sampling is often
based on achieving a desired level of representativeness or accuracy in the model being built.
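A minimal sketch of the idea, where evaluate() is a hypothetical stand-in for training and scoring a model on the current sample; the starting size, growth factor, and tolerance are illustrative:

# Sketch: progressive sampling that grows the sample until the score plateaus.
import random

def progressive_sample(population, evaluate, start=100, growth=2, tol=0.01):
    size = start
    prev_score = float("-inf")
    while True:
        size = min(size, len(population))
        sample = random.sample(population, size)
        score = evaluate(sample)
        # stop when the score stops improving or the whole population is used
        if score - prev_score < tol or size == len(population):
            return sample
        prev_score = score
        size *= growth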
Conclusion
Sampling plays a pivotal role in data mining by enabling efficient analysis of large datasets.
Whether through simple random sampling, stratified sampling, or adaptive progressive
sampling, the goal is to obtain a subset that accurately represents the characteristics of the
entire dataset. Each sampling method has its advantages and is chosen based on specific
objectives and constraints in data mining applications.
• Feature Extraction:
• Definition: Creating a new set of features from raw data to make it suitable for
classification algorithms.
• Example: Extracting features like edges or areas correlated with human faces from
photographs to classify images.
• Domain-specific: Techniques vary greatly across different domains (e.g., image
processing vs. financial data).
• Purpose: Transforming data to reveal hidden patterns (e.g., Fourier transform for time
series to expose frequency information).
• Example: Using Fourier transform to identify periodic patterns in time series data
despite noise.
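A minimal sketch of the Fourier-transform example above, assuming NumPy is available; the signal's period and noise level are made up:

# Sketch: exposing the dominant frequency of a noisy periodic signal with an FFT.
import numpy as np

n, period = 512, 32
t = np.arange(n)
signal = np.sin(2 * np.pi * t / period) + np.random.normal(0, 0.5, n)  # periodic + noise

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n, d=1.0)

dominant = freqs[np.argmax(spectrum[1:]) + 1]   # skip the zero-frequency (mean) bin
print(f"dominant period is about {1 / dominant:.1f} samples")  # close to 32 despite the noise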
• Feature Construction:
• Purpose: Creating new features that enhance the effectiveness of data mining
algorithms.
• Example: Constructing a density feature from mass and volume attributes of artifacts
to classify them by material type.
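A minimal sketch of constructing such a density feature (the artifact measurements are made up):

# Sketch: building a density feature from mass and volume.
artifacts = [
    {"mass_g": 540.0, "volume_cm3": 200.0},
    {"mass_g": 157.0, "volume_cm3": 20.0},
]

for a in artifacts:
    a["density_g_cm3"] = a["mass_g"] / a["volume_cm3"]   # new, more discriminative feature

print(artifacts)   # density separates materials (e.g. wood vs. metal) better than mass alone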
• Variable Transformation:
• Normalization or Standardization: Rescaling attribute values so they are comparable across
attributes, e.g., to zero mean and unit variance (z-score) or to a fixed range such as [0, 1].
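A minimal sketch of z-score standardization of a single attribute (the values are illustrative):

# Sketch: z-score standardization.
import statistics

incomes = [32000, 45000, 51000, 120000]
mu, sigma = statistics.mean(incomes), statistics.stdev(incomes)

standardized = [(x - mu) / sigma for x in incomes]   # roughly mean 0, standard deviation 1
print(standardized)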