DWH m2p2
Data Warehouse - Data Mining

Types of Data Objects and Attributes

1. Data Objects: Various names include record, point, vector, pattern, event, case,
sample, observation, or entity.
2. Attributes: Descriptive characteristics that define data objects, such as GPA, ID
number, or temperature.

Attributes and Measurement

3. Definition of Attribute: A property or characteristic of an object that can vary.


4. Measurement Scale: Defines how attributes are quantified (numerically or
symbolically).

Types of Attributes and Operations

5. Nominal Attributes:
o Represented by names or labels (e.g., eye color, gender).
o Operations: Equality and inequality (e.g., mode, contingency tables).
6. Ordinal Attributes:
o Values have a meaningful order (e.g., grades, rankings).
o Operations: Order comparison (e.g., median, rank correlation).
7. Interval Attributes:
o Meaningful differences between values but no true zero point (e.g., Celsius
temperature).
o Operations: Addition and subtraction (e.g., mean, standard deviation).
8. Ratio Attributes:
o Meaningful differences and ratios between values (e.g., age, length).
o Operations: Multiplication and division (e.g., geometric mean, percent variation); a sketch contrasting the operations valid for each attribute type follows this list.
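To make the distinction concrete, here is a minimal sketch in plain Python (using only the standard `statistics` module) that computes just the summary statistics that are meaningful for each attribute type. The attribute names and values are invented for illustration.

```python
import statistics

# Hypothetical attribute values for a handful of data objects
eye_color = ["brown", "blue", "brown", "green"]   # nominal
grade = ["C", "B", "A", "B"]                      # ordinal
temp_c = [20.5, 22.0, 19.5, 21.0]                 # interval (Celsius)
age = [23, 35, 41, 29]                            # ratio

# Nominal: only equality tests are meaningful -> mode / frequency counts
print("mode of eye_color:", statistics.mode(eye_color))

# Ordinal: order is meaningful -> median (after mapping grades to ranks)
grade_rank = {"A": 3, "B": 2, "C": 1}
print("median grade rank:", statistics.median(grade_rank[g] for g in grade))

# Interval: differences are meaningful -> mean and standard deviation
print("mean temperature:", statistics.mean(temp_c))
print("std dev of temperature:", statistics.stdev(temp_c))

# Ratio: ratios are also meaningful -> geometric mean becomes valid
print("geometric mean of age:", statistics.geometric_mean(age))
```

Note that the geometric mean is only computed for the ratio attribute: applying it to Celsius temperatures would be meaningless because the zero point is arbitrary.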

Attribute Characteristics and Representation

9. Discrete Attributes:
o Finite or countably infinite set of values (e.g., zip codes, counts).
o Can be categorical (e.g., binary attributes like true/false).
10. Continuous Attributes:
o Real-number values (e.g., temperature, height).
o Typically represented as floating-point variables for precision.

Special Attribute Types

11. Binary Attributes:
o Specific case of discrete attributes with only two possible values (e.g., yes/no, male/female).
12. Asymmetric Attributes:
o Focus on presence (non-zero values) rather than absence (e.g., courses taken
by students).

Practical Applications and Considerations


13. Data Analysis Techniques:
o Selection of appropriate techniques based on attribute type (e.g., mean for
interval and ratio attributes).
14. Measurement and Transformation:
o Transformations (e.g., converting temperature scales) must preserve the
attribute's meaning.
15. Statistical Validity:
o Ensure statistical operations align with the type of attribute to yield
meaningful results (e.g., correlation tests for quantitative attributes).

Types of Data Sets

Record Data

Record data is structured as a collection of records (or data objects), where each record
consists of a fixed set of data fields (attributes). Here are some key variations:

1. Transaction or Market Basket Data:
o Definition: Each record represents a transaction involving a set of items.
o Example: Shopping records where each transaction lists the purchased items.
o Attributes: Typically asymmetric (0-1 entries) indicating presence or absence of items.
o Representation: Often stored as sparse matrices due to the large number of potential items and the sparsity of transactions.
2. Data Matrix:
o Definition: Data objects have a fixed set of numeric attributes, forming a multidimensional space.
o Example: Statistical datasets where each row represents an observation and each column represents a numeric attribute.
o Attributes: Numeric, allowing standard matrix operations (e.g., mean, covariance).
o Representation: Used in statistical analysis and machine learning for modeling relationships between attributes.
3. Sparse Data Matrix:
o Definition: A special case of the data matrix in which most attribute values are zero.
o Example: Document-term matrices in text mining, where terms are sparse across documents.
o Attributes: Asymmetric (0-1 entries), indicating term presence.
o Representation: Efficient storage and processing due to sparsity; a sketch of such a sparse 0-1 representation follows this list.
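As an illustration of the sparse 0-1 representation described above, the following sketch builds a tiny market basket dataset as a sparse matrix. It assumes `numpy` and `scipy` are available; the item names and transactions are invented.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical market basket data: each transaction is a set of purchased items
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]
items = sorted(set().union(*transactions))           # one column per distinct item
item_index = {item: j for j, item in enumerate(items)}

# Build a 0-1 matrix: rows are transactions, columns are items
rows, cols = [], []
for i, basket in enumerate(transactions):
    for item in basket:
        rows.append(i)
        cols.append(item_index[item])
data = np.ones(len(rows), dtype=np.int8)
basket_matrix = csr_matrix((data, (rows, cols)), shape=(len(transactions), len(items)))

print(items)
print(basket_matrix.toarray())   # dense view, only sensible for this tiny example
```

Only the non-zero entries are stored, which is what makes this representation practical when there are thousands of potential items and each transaction contains only a few of them.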

Graph-Based Data

Graphs are used to represent data where relationships between objects are crucial:
4. Data with Relationships Among Objects:
o Definition: Objects are nodes; relationships are edges with properties (e.g., direction, weight).
o Example: Web pages linked by hyperlinks; social networks with users as nodes and friendships as edges.
o Attributes: Nodes and edges can have additional properties.
o Representation: Enables network analysis and algorithms such as PageRank for search engines (a sketch follows this list).
5. Data with Objects That Are Graphs:
o Definition: The objects themselves have internal structure represented as graphs.
o Example: Chemical compounds where atoms are nodes and bonds are edges.
o Attributes: Atomic properties and bond types.
o Representation: Used in chemical informatics for predicting compound properties.
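For graph-based data, a minimal sketch (assuming the `networkx` library; the pages, links, and attribute names are invented) builds a small hyperlink graph, attaches properties to nodes and edges, and ranks the pages with PageRank as mentioned above.

```python
import networkx as nx

# Hypothetical hyperlink graph: web pages are nodes, links are directed edges
G = nx.DiGraph()
G.add_edges_from([
    ("home", "products"),
    ("home", "blog"),
    ("blog", "products"),
    ("products", "home"),
])

# Nodes and edges can carry additional properties
G.nodes["home"]["language"] = "en"
G.edges["home", "blog"]["anchor_text"] = "read our blog"

# PageRank scores each page based on the link structure
scores = nx.pagerank(G, alpha=0.85)
print(sorted(scores.items(), key=lambda kv: -kv[1]))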

Ordered Data

Data where attributes have an inherent order or relationship in time or space:

6. Sequential Data:
o Definition: Extends record data with time-stamped records.
o Example: Retail transactions with timestamps; clickstream data on websites.
o Attributes: Includes time stamps or sequence positions.
o Representation: Analyzed for sequential patterns and temporal correlations.
7. Sequence Data:
o Definition: Similar to sequential data but without explicit timestamps, focusing on ordered sequences.
o Example: Genetic sequences represented as nucleotide sequences.
o Attributes: DNA or RNA bases (A, T, C, G).
o Representation: Used in bioinformatics for genome analysis and sequence alignment.
8. Time Series Data:
o Definition: Each record is a series of measurements taken over time.
o Example: Stock prices or temperature readings over months or years.
o Attributes: Time-indexed measurements.
o Representation: Analyzed for trends, seasonality, and temporal correlations (a sketch follows this list).
9. Spatial Data:
o Definition: Data with spatial attributes (positions or areas) and possibly other attributes.
o Example: Geospatial data such as weather maps and satellite imagery.
o Attributes: Spatial coordinates and environmental variables (temperature, precipitation).
o Representation: Analyzed for spatial autocorrelation and geographical patterns.
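To illustrate time series data, a short sketch (assuming `pandas`; the dates and prices are invented) creates a time-indexed series and applies the kind of smoothing and resampling operations used to study trends.

```python
import pandas as pd

# Hypothetical daily closing prices indexed by timestamp
dates = pd.date_range("2024-01-01", periods=7, freq="D")
prices = pd.Series([101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1], index=dates)

# A 3-day rolling mean smooths short-term noise and highlights the trend
print(prices.rolling(window=3).mean())

# Resampling aggregates the measurements at a coarser time resolution
print(prices.resample("W").mean())
```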

Characteristics Impacting Data Mining Techniques


• Dimensionality: The number of attributes, which affects complexity and gives rise to the curse of dimensionality.
• Sparsity: The presence of mostly zero values, which affects storage and computational efficiency.
• Resolution: The level of detail, which affects which patterns are visible and how much noise appears in the analysis.

Summary of Section 2.2 Data Quality

Data Quality Issues in Data Mining Applications

Data mining often deals with data collected for purposes other than mining itself. Therefore,
addressing data quality issues at the source is usually not feasible. Instead, data mining
focuses on detecting and correcting data quality problems and using algorithms tolerant to
poor data quality.

Measurement and Data Collection Issues

Data imperfections are common due to human error, limitations of measuring devices, or
flaws in the data collection process. Issues include missing values, duplicate data objects, and
inconsistent data. Data cleaning involves detecting and correcting these issues.

Measurement and Data Collection Errors

Measurement error refers to inaccuracies in recorded values compared to true values, while
data collection errors include omission or inappropriate inclusion of data objects. Errors can
be systematic or random.

Noise and Artifacts

Noise refers to random disturbances in data, while artifacts are deterministic distortions.
Techniques from signal or image processing are used to reduce noise, preserving underlying
patterns.

Precision, Bias, and Accuracy

Precision is the closeness of repeated measurements to each other, bias is systematic variation
from the true value, and accuracy is the closeness to the true value. Significant digits should
match data precision.

Outliers

Outliers are data objects or values significantly different from others in a dataset. They can be
of interest in anomaly detection tasks like fraud detection.

Missing Values

Some data objects may lack attribute values, impacting analysis. Strategies include
eliminating data with missing values, estimating missing values, or modifying analysis
methods to ignore missing values.
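The strategies listed above can be sketched with `pandas` (an assumption; the small table and the "-9999" sentinel value are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical records where -9999 is a sentinel code for "not recorded"
df = pd.DataFrame({
    "age": [25, -9999, 42, 31],
    "income": [50000, 62000, np.nan, 58000],
})

# First recognize sentinel codes as missing values
df = df.replace(-9999, np.nan)

dropped = df.dropna()                               # eliminate objects with missing values
estimated = df.fillna(df.mean(numeric_only=True))   # estimate (impute) with the column mean

print(dropped)
print(estimated)
```
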
Inconsistent Values

Inconsistencies arise when data values conflict with expected norms (e.g., negative height).
Detecting and correcting such issues often requires external validation or additional
redundant information.

Duplicate Data

Duplicates or almost duplicates of data objects may exist in datasets, requiring identification
and resolution to avoid inaccuracies in analysis results. Deduplication processes help manage
these issues effectively.
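A minimal deduplication sketch using `pandas` (an assumption; the customer records are invented) shows both exact and near-duplicate handling:

```python
import pandas as pd

# Hypothetical customer records with an exact duplicate and a near-duplicate
customers = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "ann lee", "Bo Chen"],
    "email": ["ann@x.com", "ann@x.com", "ann@x.com", "bo@y.com"],
})

# Exact duplicates can be removed directly
exact = customers.drop_duplicates()

# Near-duplicates usually need normalization (e.g., case-folding) before matching
customers["name_norm"] = customers["name"].str.lower()
deduplicated = customers.drop_duplicates(subset=["name_norm", "email"])
print(deduplicated)
```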

Conclusion

Understanding and addressing data quality issues are crucial for effective data mining and
analysis. Techniques for detecting, correcting, and managing data imperfections ensure
reliable results from data-driven applications.

Summary of Section 2.2.2: Issues Related to Applications

Data Quality from an Application Viewpoint

Data quality is often defined by its suitability for the intended use, a perspective particularly
valuable in business and industry, as well as in statistics and experimental sciences where
data collection is tailored to specific hypotheses.

Timeliness

Data can lose relevance over time, especially when it reflects dynamic processes like
customer purchasing behavior or web browsing patterns. Outdated data leads to outdated
models and patterns.

Relevance

For effective modeling, data must include all necessary information. Omissions, such as
excluding driver age and gender from a model predicting accident rates, can severely impact
model accuracy unless indirect replacements exist.

Sampling Bias

Sampling bias arises when a sample doesn't accurately represent the full population, skewing
analysis results. For example, survey data may only reflect respondents' views, not the entire
population's views.

Documentation and Knowledge

Well-documented datasets enhance analysis quality by providing insights into data characteristics like attribute relationships or missing value indicators (e.g., "-9999"). Lack of documentation can lead to flawed analyses due to misinterpretation or ignorance of crucial data aspects.
Conclusion

Data quality isn't just about accuracy and completeness; it's also about suitability for specific
applications. Timeliness, relevance, absence of bias, and comprehensive documentation are
crucial aspects that ensure data meets its intended purpose effectively.

Measures of Similarity and Dissimilarity

1. Importance of Similarity and Dissimilarity:
o Used in data mining techniques like clustering, classification, and anomaly
detection.
o Data can be transformed into a similarity or dissimilarity space for analysis.
2. Definitions:
o Similarity: Measures how alike two objects are. Usually ranges from 0 (no
similarity) to 1 (complete similarity).
o Dissimilarity: Measures how different two objects are. Often used
interchangeably with distance, which has specific properties.
3. Transformations:
o Converting similarities to dissimilarities or vice versa, often to fit specific
ranges like [0, 1].
o Example transformations include linear and non-linear mappings.
4. Types of Proximity Measures:
o Simple Attributes:
§ Nominal: Similarity is 1 if values match, 0 otherwise; dissimilarity is
opposite.
§ Ordinal: Takes into account order; mapped to integers to quantify
differences.
§ Interval or Ratio: Uses absolute differences between attribute values.
5. Complex Proximity Measures:
o Multiple Attributes: Combines individual attribute proximities into overall
object proximity.
o Distance Measures: Euclidean, Manhattan, Minkowski distances for numeric
data.
6. Specific Examples:
o Similarity Measures: Simple Matching Coefficient (SMC), Jaccard Coefficient, Cosine Similarity (for documents); see the sketch after this list.
7. Properties:
o Metrics satisfy positivity, symmetry, and the triangle inequality.
o Non-metric measures like set differences or time intervals can lack these
properties.
8. Applications:
o Used in various contexts from document similarity to binary data
comparisons.
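The measures named in item 6 above can be written out directly. The following sketch uses plain Python and the `math` module; the binary vectors and document term-frequency vectors are invented.

```python
import math

def smc(x, y):
    """Simple Matching Coefficient: matching positions (both 0 or both 1) over all attributes."""
    matches = sum(a == b for a, b in zip(x, y))
    return matches / len(x)

def jaccard(x, y):
    """Jaccard Coefficient: 1-1 matches over attributes that are non-zero in either object."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    non_zero = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return m11 / non_zero if non_zero else 0.0

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Binary attribute vectors (e.g., items purchased) -- invented values
p = [1, 0, 0, 1, 1, 0]
q = [1, 0, 1, 1, 0, 0]
print("SMC:", smc(p, q), "Jaccard:", jaccard(p, q))

# Term-frequency vectors for two documents -- invented values
d1 = [3, 0, 1, 0, 2]
d2 = [1, 0, 0, 0, 4]
print("Cosine:", cosine(d1, d2))
```

Note how SMC counts 0-0 matches while Jaccard ignores them, which is why Jaccard is preferred for asymmetric binary attributes.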

2.3.2 Sampling

Sampling Approaches

Sampling is a widely used technique in both statistics and data mining for selecting a subset
of data objects to analyze. While statisticians and data miners have different motivations for
sampling, the goal remains consistent: to efficiently obtain insights from a subset of data that
represents the larger population.

Simple Random Sampling

Simple random sampling involves selecting data objects from a population where each object
has an equal probability of being chosen. There are two variations:

• Sampling without replacement: Once an object is selected, it is removed from the population.
• Sampling with replacement: Objects remain in the population after being selected, allowing them to be chosen more than once in the sample.

In practical applications, the differences between these methods are minor when the sample
size is small relative to the population size. Sampling with replacement is often simpler to
implement and analyze statistically due to its consistent probabilities throughout the selection
process.
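A short sketch of both variations using `numpy` (an assumption; the population size, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1000)        # hypothetical data object identifiers

# Sampling without replacement: each object can appear at most once
without = rng.choice(population, size=50, replace=False)

# Sampling with replacement: objects may be selected more than once
with_repl = rng.choice(population, size=50, replace=True)

print(len(set(without)), "distinct objects (always 50)")
print(len(set(with_repl)), "distinct objects (may be fewer than 50)")
```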

Stratified Sampling

Stratified sampling is particularly useful when the population consists of distinct groups or
strata. Instead of sampling directly from the entire population, stratified sampling divides the
population into homogeneous subgroups called strata and samples proportionately from each
stratum. This approach ensures that each stratum is adequately represented in the sample,
which is crucial for maintaining the integrity of rare classes or groups within the data.
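A sketch of proportionate stratified sampling with `pandas` (an assumption; the class labels, sizes, and the 10% sampling fraction are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical dataset with a rare class: 950 "normal" vs 50 "fraud" records
df = pd.DataFrame({
    "amount": rng.normal(100, 20, size=1000),
    "label": ["normal"] * 950 + ["fraud"] * 50,
})

# Sample 10% from each stratum so the rare class stays represented
sample = df.groupby("label").sample(frac=0.1, random_state=0)
print(sample["label"].value_counts())   # roughly 95 normal and 5 fraud records
```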

Progressive Sampling

Progressive sampling is an adaptive technique where the sample size increases gradually until
a representative sample is achieved. This method avoids the need to pre-determine an exact
sample size, which can be challenging in practice. The decision to stop sampling is often
based on achieving a desired level of representativeness or accuracy in the model being built.
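A schematic sketch of the progressive sampling loop in plain Python. The `accuracy_of_model_on` function, the growth factor, and the stopping threshold are placeholders invented for illustration, not a real API.

```python
import random

def accuracy_of_model_on(sample):
    """Placeholder: train a model on the sample and return its evaluation score."""
    # Stand-in behaviour so the sketch runs: accuracy improves with size, then plateaus.
    return min(0.9, 0.5 + 0.0001 * len(sample))

def progressive_sample(data, start_size=100, growth=2.0, tolerance=0.005):
    """Grow the sample until the model's score stops improving noticeably."""
    size = start_size
    previous_score = float("-inf")
    while size <= len(data):
        sample = random.sample(data, size)
        score = accuracy_of_model_on(sample)
        if score - previous_score < tolerance:   # improvement has levelled off
            return sample
        previous_score = score
        size = int(size * growth)                # e.g., double the sample size
    return data                                  # fall back to the full dataset

data = list(range(100_000))
print(len(progressive_sample(data)))
```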

Determining Sample Size

Choosing an appropriate sample size is critical in sampling. Larger samples generally increase the representativeness of the sample but come with higher computational costs. Conversely, smaller samples may miss important patterns or introduce bias. Figure 2.9 in the source text illustrates how different sample sizes affect the representation of data structure, highlighting the trade-offs involved in sample size selection.

Conclusion

Sampling plays a pivotal role in data mining by enabling efficient analysis of large datasets.
Whether through simple random sampling, stratified sampling, or adaptive progressive
sampling, the goal is to obtain a subset that accurately represents the characteristics of the
entire dataset. Each sampling method has its advantages and is chosen based on specific
objectives and constraints in data mining applications.

Feature Creation and Data Transformation

• Feature Extraction:

• Definition: Creating a new set of features from raw data to make it suitable for
classification algorithms.
• Example: Extracting features like edges or areas correlated with human faces from
photographs to classify images.
• Domain-specific: Techniques vary greatly across different domains (e.g., image
processing vs. financial data).

• Mapping Data to a New Space:

• Purpose: Transforming data to reveal hidden patterns (e.g., Fourier transform for time
series to expose frequency information).
• Example: Using Fourier transform to identify periodic patterns in time series data
despite noise.
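Continuing the Fourier example above, a minimal sketch with `numpy` (an assumption; the synthetic signal mixes a 5 Hz periodic component with random noise):

```python
import numpy as np

# Synthetic time series: a 5 Hz sine wave buried in random noise
rng = np.random.default_rng(1)
fs = 100                                # sampling rate in Hz
t = np.arange(0, 2, 1 / fs)             # 2 seconds of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * rng.normal(size=t.size)

# Map the data into frequency space with the FFT
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The dominant frequency shows up near 5 Hz despite the noise
print("dominant frequency:", freqs[np.argmax(spectrum[1:]) + 1], "Hz")
```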

• Feature Construction:

• Purpose: Creating new features that enhance the effectiveness of data mining
algorithms.
• Example: Constructing a density feature from mass and volume attributes of artifacts
to classify them by material type.

• Discretization and Binarization:

• Purpose: Transforming continuous attributes into categorical or binary forms required by certain algorithms.
• Methods: Equal width and equal frequency discretization; binarization for association pattern algorithms (see the sketch after this list).
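A sketch of equal-width and equal-frequency discretization followed by binarization, using `pandas` (an assumption; the ages, bin counts, and labels are invented):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 58, 63, 70])   # hypothetical continuous attribute

# Equal-width discretization: bins of equal value range
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency discretization: bins with (roughly) equal numbers of objects
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Binarization: one 0-1 attribute per category, as association algorithms expect
binary = pd.get_dummies(equal_width, prefix="age")

print(equal_width.tolist())
print(equal_freq.tolist())
print(binary.head())
```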

• Variable Transformation:

• Purpose: Applying transformations to entire variables to modify their distribution or scale.
• Examples: Using a logarithmic transformation to compress large value ranges (e.g., byte transfer sizes).

• Normalization or Standardization:

• Purpose: Adjusting variables to a common scale or distribution to avoid dominance by certain attributes.
• Methods: Standardizing variables to have a mean of 0 and a standard deviation of 1, with adjustments for outliers (the sketch below also covers the logarithmic transformation described above).
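A combined sketch of the logarithmic transformation and z-score standardization described in the last two items, assuming `numpy`; the byte-transfer values are invented.

```python
import numpy as np

# Hypothetical byte-transfer sizes spanning several orders of magnitude
bytes_transferred = np.array([120, 3_400, 58_000, 2_100_000, 45_000_000], dtype=float)

# Logarithmic transformation compresses the huge range of values
log_bytes = np.log10(bytes_transferred)

# Z-score standardization: mean 0 and standard deviation 1
standardized = (log_bytes - log_bytes.mean()) / log_bytes.std()

# Median/IQR-based scaling is less sensitive to outliers than mean/std
median = np.median(log_bytes)
iqr = np.percentile(log_bytes, 75) - np.percentile(log_bytes, 25)
robust = (log_bytes - median) / iqr

print(standardized.round(2))
print(robust.round(2))
```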
