Lecture 1
2. How does Data Mining fit into the knowledge discovery process?
Data Mining is an essential step in the knowledge discovery process, which
includes data preparation (cleaning, integration, transformation, selection), data
mining (applying intelligent methods to extract patterns), pattern evaluation (to
identify interesting patterns), and knowledge presentation (visualization and
representation).
1. What are the types of data that can be mined in Data Mining, and what
are the corresponding Data Mining techniques for each type?
Answer:
A diversity of data types can be mined, including structured, semi-structured,
and unstructured data, each with its own Data Mining techniques:
- Structured data: sequential pattern mining, relational data mining
- Semi-structured data: graph pattern mining, information network mining
- Unstructured data: text mining, image and video recognition (deep learning)
4. How does the concept of similarity and distance measures fit into the
overall knowledge discovery process in Data Mining?
The lecture situates the discussion of similarity and distance measures within the
broader context of the knowledge discovery process in Data Mining. It highlights
how understanding the types of data, their characteristics, and the appropriate
similarity measures is a crucial step in the data preparation and preprocessing
phase, which then enables the effective application of Data Mining techniques to
extract useful patterns and insights from the data.
Chapter 2: Similarity and Data Processing
Questions and Answers on Data, Measurements, and Preprocessing
1. What are similarity measures, and why are they important in data
mining?
Answer: Similarity measures quantify how alike two objects are, often based on
their attributes. They are crucial in data mining for tasks such as clustering,
classification, and recommendation systems, as they help identify patterns and
relationships within data.
6. What methods can be used to handle missing values during data cleaning?
Answer: Methods include:
- Ignore the tuple: Exclude incomplete records.
- Fill in missing values:
  - Manually: Enter values based on domain knowledge.
  - Global constant: Use a fixed value for all missing entries.
  - Central tendency: Use mean, median, or mode to fill gaps.
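Below is a minimal pandas sketch of these fill strategies; the DataFrame and column names are hypothetical and only for illustration:
```python
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, None, 40, 35], "city": ["A", "B", None, "B"]})

# Ignore the tuple: drop rows that contain any missing value
dropped = df.dropna()

# Global constant: use a fixed placeholder for all missing entries
constant_filled = df.fillna({"age": -1, "city": "Unknown"})

# Central tendency: fill numeric gaps with the mean (or median),
# and categorical gaps with the mode
central_filled = df.copy()
central_filled["age"] = central_filled["age"].fillna(df["age"].mean())
central_filled["city"] = central_filled["city"].fillna(df["city"].mode()[0])

print(central_filled)
```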
7. What is the entity identification problem, and how does it affect data
integration?
- Answer: The entity identification problem occurs when the same
entity is represented differently across data sources (e.g., customer_id
vs. cust_number). This can lead to confusion and inaccuracies during
data integration, making it difficult to create a unified dataset.
12. In what scenarios would you prefer using distance measures over
similarity measures, or vice versa?
- Answer: Distance measures are preferred when the magnitude of
differences is crucial (e.g., Euclidean distance in spatial data). Similarity
measures are preferred in cases where relationships and patterns are more
important than absolute differences (e.g., cosine similarity in text
analysis).
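A rough NumPy sketch of the contrast (the vectors are made up for illustration):
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, larger magnitude

# Euclidean distance: sensitive to the magnitude of the differences
euclidean = np.linalg.norm(a - b)

# Cosine similarity: sensitive only to orientation, not magnitude
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # > 0: the points are far apart in space
print(cosine)     # 1.0: the vectors point in exactly the same direction
```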
General Understanding
Answer:
Step 1: Identify Unique Words
- First, let's list the unique words in each document.
- Document 1: "The cat chased the mouse."
- Unique words: [the, cat, chased, mouse]
- Document 2: "The dog barked at the cat."
- Unique words: [the, dog, barked, at, cat]
- Document 3: "The bird sang in the tree."
- Unique words: [the, bird, sang, in, tree]
- Document 4: "The fish swam in the pond."
- Unique words: [the, fish, swam, in, pond]
- Step 2: Create Word Vectors
- Combined vocabulary (14 words): [the, cat, chased, mouse, dog, barked, at, bird, sang, in, tree, fish, swam, pond]
- Doc 1: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- Doc 2: [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
- Doc 3: [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
- Doc 4: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
Step 3: Dot Product
Dot(Doc 1, Doc 2) = (1×1)+(1×1)+(1×0)+(1×0)+(0×1)+(0×1)+(0×1)+(0×0)+(0×0)+(0×0)+(0×0)+(0×0)+(0×0)+(0×0) = 2
Step 4: Magnitudes and Cosine Similarity
||Doc 1|| = √4 = 2, ||Doc 2|| = √5 ≈ 2.236
Cosine(Doc 1, Doc 2) = 2 / (2 × 2.236) ≈ 0.447
Step 5: Interpretation
The cosine similarity between Document 1 and Document 2 is
approximately 0.447. This value indicates a moderate level of similarity
between the two documents, meaning they share some common words
but are not identical.
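A short NumPy check that reproduces this calculation from the Doc 1 and Doc 2 vectors above:
```python
import numpy as np

doc1 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
doc2 = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

dot = np.dot(doc1, doc2)                                      # 2
cosine = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cosine, 3))                                       # ~0.447
```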
- Summary
The operation captures the relationships between gender and power, demonstrating
how semantic meanings can be dynamically derived and visualized within a
conceptual framework. The 3D model helps illustrate these relationships and
transformations effectively.
Pattern Mining Ch.3&4
Example: Chi-Square Test
Scenario
Suppose we want to investigate whether there is a relationship between gender
(Male, Female) and preference for a type of beverage (Coffee, Tea). We collect
the following data:
Observed counts (expected counts in parentheses):
Gender   Coffee       Tea          Total
Male     30 (22.22)   10 (17.78)   40
Female   20 (27.78)   30 (22.22)   50
Total    50           40           90
Expected frequencies use E = (row total × column total) / grand total, for example:
E(Female, Coffee) = (50 × 50) / 90 = 2500 / 90 ≈ 27.78
E(Female, Tea) = (50 × 40) / 90 = 2000 / 90 ≈ 22.22
Chi-square statistic:
χ² = Σ (O − E)² / E ≈ 2.73 + 3.41 + 2.18 + 2.73 ≈ 11.05
- Step 5: Calculate Degrees of Freedom
- Using the formula:
df = (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1
- Step 6: Conclusion
- Since χ² ≈ 11.05 exceeds the critical value of 3.84 for df = 1 at α = 0.05, we reject the null hypothesis and conclude that gender and beverage preference are related.
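A quick way to verify this result, assuming SciPy is available (Yates' continuity correction is turned off so the statistic matches the hand calculation):
```python
from scipy.stats import chi2_contingency

observed = [[30, 10],   # Male: Coffee, Tea
            [20, 30]]   # Female: Coffee, Tea

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)    # ~11.0 with 1 degree of freedom
print(expected)     # expected counts, e.g. ~22.22 for Male/Coffee
print(p < 0.05)     # True: reject independence
```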
- Answer
- Closed Patterns:
- {B, C}
- {A, D}
- {B, D}
- {C, D}
- {A, B, C}
- {A, B, D}
- {B, C, D}
- Max Patterns (frequent itemsets with no frequent superset):
- {A, B, C}
- {A, B, D}
- {B, C, D}
Part 2:
1. Explain the significance of the minsup threshold in frequent itemset mining.
- Answer: The minsup (minimum support) threshold determines the minimum
frequency an itemset must have to be considered frequent. It greatly affects the
number of itemsets generated; a low minsup can lead to an exponential number of
frequent itemsets.
Part 3:
2. How does the Apriori algorithm utilize candidate generation and testing?
Provide an example.
- Answer: The Apriori algorithm generates candidate itemsets by self-joining the
frequent itemsets from the previous iteration. For example, if L2 = {AB, AC, BC},
then C3 can be generated by combining these itemsets to form candidates like
{ABC}. Candidates are then tested against the transaction database to determine
their frequency.
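A minimal sketch of the self-join and prune step under these assumptions (itemsets as sorted tuples of item names); this is only the candidate-generation piece, not a full Apriori implementation:
```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Self-join frequent k-itemsets, then prune candidates whose
    k-subsets are not all frequent."""
    frequent_k = [tuple(sorted(s)) for s in frequent_k]
    k = len(frequent_k[0])
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            # join itemsets that share the first k-1 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidate = a + (b[-1],)
                # prune: every k-subset of the candidate must be frequent
                if all(sub in frequent_k for sub in combinations(candidate, k)):
                    candidates.add(candidate)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("B", "C")]
print(apriori_gen(L2))   # {('A', 'B', 'C')}
```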
3. Explain the concept of partitioning in the context of improving the Apriori
algorithm.
- Answer: Partitioning involves dividing the transaction database into smaller
subsets or partitions. The algorithm scans each partition to find local frequent
patterns in two passes. The first pass identifies local frequent itemsets, and the
second pass consolidates these to find global frequent patterns, thus reducing the
number of scans required.
Transaction Database
Summary of Results
- Frequent Items: A, B, C, D
- Frequent Pairs: {A, B}, {A, D}, {B, C}, {B, D}, {C, D}
- Frequent Triplets: {A, B, D}, {B, C, D}
Conclusion
This corrected example now accurately reflects the frequent itemsets using the
Apriori algorithm with three scans.
Classification Ch.6+7
Medium Questions
3. Explain the concept of decision tree pruning and its
importance.
- Answer: Decision tree pruning is the process of removing sections of a
decision tree that provide little predictive power. It is important because it helps
prevent overfitting, improves the model's generalization to new data, and makes
the tree smaller and faster to use.
- Answer: You should prefer to split on X1 because it results in a pure split (all
instances are Y=t), whereas X2 leaves some impurity in the classification. This
indicates that X1 provides significantly more information for classification
compared to X2 and is the better choice for splitting.
- Answer:
Using Bayes' theorem:
P(A | Purchase) = [P(Purchase | A) × P(A)] / P(Purchase)
Where:
Thus, the probability of a customer being from City A given they made a
purchase is approximately 70.59%.
2. What is hard clustering and soft (fuzzy) clustering? How do they differ?
Answer: Hard clustering assigns each data point exclusively to one cluster; no
overlapping occurs between clusters. Soft (fuzzy) clustering allows a single
data point to belong to multiple clusters with varying degrees of membership
ranging from 0 (not belonging) to 1 (fully belonging).
3. Describe the k-means algorithm's basic steps.
Answer: The k-means algorithm involves four main steps:
- Initialization: Partition objects into (k) non-empty subsets
randomly.
- Centroid Calculation: Compute new centroids by calculating the
mean of all points assigned to each cluster.
- Assignment Step: Assign each object/data point to its nearest
centroid's cluster.
- Iteration: Repeat steps 2 and 3 until there’s no change in
assignments or centroids stabilize.
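A compact NumPy sketch of these steps (the sample points and k are made up, and the empty-cluster edge case is ignored for brevity):
```python
import numpy as np

def kmeans(points, k, max_iter=100):
    # Initialization: partition the points into k non-empty subsets
    # (a simple round-robin assignment here; random partitions are also common)
    labels = np.arange(len(points)) % k
    for _ in range(max_iter):
        # Centroid calculation: mean of all points assigned to each cluster
        centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Assignment: move each point to the cluster of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Iteration: stop when assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels, centroids = kmeans(points, k=2)
print(labels)      # e.g. [0 0 1 1]
print(centroids)
```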
1. Hold-Out Method:
- Split your dataset into two parts: one for training and one for testing.
- Pros: Simple to implement and fast.
- Cons: High variance; results can change based on how you split the data.
4. Bootstrap Method:
- Involves randomly sampling from your dataset with replacement to create
multiple “bootstrap” datasets, which are then used to estimate model accuracy.
- Pros: Helps in understanding variability and can provide confidence intervals
around model performance estimates.
Overview of Cross-Validation
Cross-validation is a technique used to assess how well your model performs on
unseen data. It helps ensure that your model generalizes well beyond just the
training data. Here, we'll illustrate four common methods: Hold-Out, K-Fold,
Leave-One-Out (LOOCV), and Bootstrap.
1. Hold-Out Method
Concept: You split your dataset into two parts: one for training and one for
testing.
Example Dataset:
Imagine we have a small dataset with 6 samples:
Sample   Feature 1   Feature 2   Label
1        2           3           A
2        4           5           B
3        5           7           A
4        6           -1          B
5        -1          -2          A
6        -3          -4          B
In this method:
- Train your model using the training set.
- Test its performance using the testing set.
Example of Hold-Out:
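A minimal sketch of a hold-out split, assuming scikit-learn and the six-sample dataset above:
```python
from sklearn.model_selection import train_test_split

# Features and labels from the six-sample dataset above
X = [[2, 3], [4, 5], [5, 7], [6, -1], [-1, -2], [-3, -4]]
y = ["A", "B", "A", "B", "A", "B"]

# Hold out one third of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

print(len(X_train), "training samples,", len(X_test), "test samples")
```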
2. K-Fold Cross-Validation
Concept: The dataset is divided into (k) equal parts or "folds". The model is
trained on (k-1) folds and tested on the remaining fold. This process repeats
until each fold has been used as the test data once.
Example:
Iteration #1:
- Train on Folds B & C → Test on Fold A
- Train with Samples {3, 4, 5, 6}, Test with Samples {1, 2}
Iteration #2:
- Train on Folds A & C → Test on Fold B
- Train with Samples {1, 2, 5, 6}, Test with Samples {3, 4}
Iteration #3:
- Train on Folds A & B → Test on Fold C
- Train with Samples {1, 2, 3, 4}, Test with Samples {5, 6}
After all iterations are complete you average out performance metrics like
accuracy from each iteration's result to get an overall assessment.
Example of K-Fold
If in each iteration you got accuracies like 70%,85%,90%, you'll average them
giving an estimated overall accuracy around:
1. Average Percentage = (∑ of Scores) / (Number of Scores)
2. Substituting the values: (70 + 85 + 90) / 3 = 245 / 3
3. Simplify: ≈ 81.67%
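A short sketch of 3-fold splitting on the same six samples, assuming scikit-learn (the accuracies above would come from whatever model is trained in each iteration):
```python
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[2, 3], [4, 5], [5, 7], [6, -1], [-1, -2], [-3, -4]])
y = np.array(["A", "B", "A", "B", "A", "B"])

kf = KFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # train a model on X[train_idx], evaluate it on X[test_idx]
    print(f"Iteration #{fold}: train on samples {train_idx + 1}, test on samples {test_idx + 1}")
```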
3. Leave-One-Out Cross-Validation (LOOCV)
Concept: In LOOCV you leave out one sample from your dataset during each
iteration while using all other samples for training.
Example:
For our original sample size of six (n = 6), you'd perform these iterations:
Iterate over all samples leaving one out at a time until every sample has been
tested once.
For instance,
Iteration #1: Exclude Sample 1, train using Samples {2, 3, 4, 5, 6}; test against Sample 1.
Next,
Iteration #2: Exclude Sample 2, train using Samples {1, 3, 4, 5, 6}; test against Sample 2.
And so forth until every sample has been processed!
Averaging the results across all six iterations gives the final performance metric:
Average Accuracy = (∑ of Accuracies) / (Number of Measurements)
Substituting the values: (85 + 75 + 90 + 80 + 95 + 60) / 6 = 485 / 6
Simplify: Average Accuracy ≈ 80.8%
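A sketch of the same idea with scikit-learn's LeaveOneOut; each split simply shows which sample is held out (the per-iteration accuracies above are given values, not computed here):
```python
from sklearn.model_selection import LeaveOneOut
import numpy as np

X = np.array([[2, 3], [4, 5], [5, 7], [6, -1], [-1, -2], [-3, -4]])

loo = LeaveOneOut()
for i, (train_idx, test_idx) in enumerate(loo.split(X), start=1):
    # train on the five remaining samples, test on the excluded one
    print(f"Iteration #{i}: exclude sample {test_idx[0] + 1}, train on samples {train_idx + 1}")
```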
4. Bootstrap Method
- Example:
A single bootstrap sample is drawn by randomly sampling from the original data
with replacement, so some points appear multiple times while others are left out.
You repeat this process numerous times (say, up to 1,000 bootstrap samples).
Computing metrics from models built on these sampled datasets provides an
estimate of the confidence interval and variation around those metrics.
This helps gauge robustness across random variations instead of relying solely
on a single fixed validation split.
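A minimal NumPy sketch of bootstrap resampling; the scores array is a stand-in for whatever per-resample metric you would actually compute:
```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([85, 75, 90, 80, 95, 60])   # illustrative per-sample scores

n_bootstraps = 1000
means = []
for _ in range(n_bootstraps):
    # sample with replacement: some points repeat, others are left out
    resample = rng.choice(data, size=len(data), replace=True)
    means.append(resample.mean())

# 95% confidence interval around the metric
low, high = np.percentile(means, [2.5, 97.5])
print(f"Bootstrap 95% CI: [{low:.1f}, {high:.1f}]")
```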
Points Coordinates
A (1,3)
B (2,5)
C (4,8)
D (7,9)
The initialization of the membership table might look like this:
Let's recalculate the centroid for C1, including point D with coordinates (7, 9)
and an assumed membership value for C1. For now, let's assume γ_D1 = 0.5.
The membership value for a point is calculated based on its distance from a
reference point. The formula is:
Membership value = 1 / (1 + Distance)
Where:
Distance is the Euclidean distance between the point and the reference point
(0.5, 0.5), and √(0.5² + 0.5²) ≈ 0.707.
γ_A1 = 1 − √((1 − 0.5)² + (3 − 0.5)²) / 0.707 = 0.8
γ_B1 = 1 − √((2 − 0.5)² + (5 − 0.5)²) / 0.707 = 0.7
γ_C1 = 1 − √((4 − 0.5)² + (8 − 0.5)²) / 0.707 = 0.2
γ_D1 = 1 − √((7 − 0.5)² + (9 − 0.5)²) / 0.707 = 0.5
Summary of membership values for C1:
γ_A1 = 0.8, γ_B1 = 0.7, γ_C1 = 0.2, γ_D1 = 0.5
Fuzziness Parameter: m = 2
Coordinates of Points:
A = (2, 3), B = (4, 1), C = (1, 5), D = (7, 9)
Using the fuzzy centroid formula C1 = Σ (γ_j1)^m · x_j / Σ (γ_j1)^m with the membership values above:
Σ (γ_j1)² = 0.8² + 0.7² + 0.2² + 0.5² = 0.64 + 0.49 + 0.04 + 0.25 = 1.42
C1_x = (0.64·2 + 0.49·4 + 0.04·1 + 0.25·7) / 1.42 = 5.03 / 1.42 ≈ 3.54
C1_y = (0.64·3 + 0.49·1 + 0.04·5 + 0.25·9) / 1.42 = 4.86 / 1.42 ≈ 3.42
C1 ≈ (3.54, 3.42)
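A quick NumPy check of this centroid calculation:
```python
import numpy as np

points = np.array([[2, 3], [4, 1], [1, 5], [7, 9]], dtype=float)
memberships = np.array([0.8, 0.7, 0.2, 0.5])   # gamma values for cluster C1
m = 2                                          # fuzziness parameter

weights = memberships ** m
centroid = (weights[:, None] * points).sum(axis=0) / weights.sum()
print(centroid.round(2))   # [3.54 3.42]
```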
Conclusion:
If this data represents customer locations, the centroid could indicate the
approximate "central location" of customers most closely aligned with Cluster
C 1. Businesses could use this information to determine where to allocate
resources, such as opening a new store or focusing marketing efforts.
The hold-out method is a simple and widely used approach for evaluating the
performance of a machine learning model. It involves splitting the available
dataset into two (or sometimes three) parts:
Key Steps:
1. Dataset Splitting:
o Training Set: A portion of the data is used to train the machine
learning model. This is where the model learns patterns from the input
data.
o Test Set: The remaining portion of the data is used to test the model's
performance on unseen data. This helps evaluate how well the model
generalizes to new, unseen data.
o Optionally, a validation set can be used if hyperparameter tuning or
model selection is involved.
2. Model Training:
o The model is trained only on the training set.
o During this phase, the model learns to fit the data based on its
algorithm.
3. Model Evaluation:
o Once the training is complete, the model is tested on the test set.
o Performance metrics (e.g., accuracy, precision, recall, F1-score, etc.)
are calculated to determine how well the model performs.
Dataset Split Ratio: A common split is 70-80% of the data for training and 20-30% for testing.
Advantages:
1. Simplicity:
o Easy to implement and interpret.
2. Fast:
o Works well when you have a large dataset where splitting doesn’t
significantly reduce the available training data.
Disadvantages:
1. Dependency on Split:
o Results can vary depending on how the data is split.
o If the test set isn’t representative, the evaluation may not be reliable.
2. Wasted Data:
o A portion of the data is left out during training, which could
potentially reduce the model’s ability to learn better.
Example in Context:
Visualization:
The plot shows the alignment between true labels (blue points) and predicted
labels (red points), confirming all predictions are accurate.
Step 1: Initialize Membership Values
We start by initializing a membership matrix with random values. Each value
represents the degree of membership of each data point in each cluster.
The centroids are then computed as:
C_i = Σ_j (u_ij)^m · x_j / Σ_j (u_ij)^m
Where:
C_i = centroid of cluster i
u_ij = membership degree of data point j in cluster i
m = fuzziness parameter (usually set to 2)
x_j = data point j
n = number of data points
Calculation of Centroids
For Cluster 1:
Example Calculation
For data point (1, 3) to centroid C1 (1.57, 4.05), the Euclidean distance is:
d = √((1 − 1.57)² + (3 − 4.05)²) ≈ 1.19
Final Iteration
Repeat Steps 2-4 until the membership values stabilize (i.e., change is less than
a predefined tolerance).
This example illustrates how Fuzzy C-Means works step by step, using a simple dataset.
Each step involves calculations that help refine the clustering based on the membership
values and distances to centroids.
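A condensed sketch of these steps for two clusters, assuming NumPy and m = 2; the initial memberships are random, so the exact numbers will differ from the worked example above:
```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2, tol=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the membership matrix with random values (rows sum to 1)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids are membership-weighted means of the data points
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to each centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = 1.0 / dist ** (2 / (m - 1))
        new_U = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop once the membership values stabilize
        if np.abs(new_U - U).max() < tol:
            U = new_U
            break
        U = new_U
    return U, centroids

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]], dtype=float)
U, centroids = fuzzy_c_means(X)
print(U.round(2))          # membership of each point in each cluster
print(centroids.round(2))  # final cluster centres
```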
BIDA311: Data Mining Ch.12: DM
Applications Lecture 8
Question 1: (3 Marks)
Define sentiment analysis and explain its primary objective. How is it
related to opinion mining?
Answer:
Sentiment analysis, also known as opinion mining, is the process of
determining the attitude, polarity, or emotions expressed in a piece of text. The
primary objective of sentiment analysis is to answer the question, "What do
people feel about a certain topic?" by analyzing data related to opinions using
various automated tools. It identifies the sentiment polarity (positive, negative,
or neutral) and emotions (angry, sad, happy, etc.) expressed in the text.
Sentiment analysis is closely related to opinion mining because both aim to
extract subjective information from data, such as beliefs, views, and opinions,
to understand public sentiment on a given topic.
Question 2: (4 Marks)
Discuss the applications of sentiment analysis in business. Provide
examples of how it can be used in brand monitoring and customer service.
Answer:
Sentiment analysis has several applications in business, including brand
monitoring, customer service, and market research. For example, in brand
monitoring it is used to track social media mentions and gauge public
perception of a brand, while in customer service it helps flag frustrated
customers so their issues can be prioritized.
Question 3: (5 Marks)
Explain the four steps involved in the sentiment analysis process. Provide
an example for each step.
Answer:
The sentiment analysis process consists of four main steps:
Question 4: (4 Marks)
What are the challenges faced in sentiment analysis? How do these
challenges affect the accuracy of the analysis?
Answer:
Sentiment analysis faces several challenges that can affect its accuracy:
2. Rhetorical Devices: The use of sarcasm, irony, and implied meanings can
mislead sentiment analysis tools. For example, the sentence "Oh great, another
delay!" may seem positive but is actually negative due to sarcasm.
Question 5: (5 Marks)
Describe how sentiment analysis can be implemented using the VADER
Sentiment Analyzer in Python. Provide an example of how text polarity is
classified.
Answer:
Sentiment analysis can be implemented using the VADER (Valence Aware
Dictionary and sEntiment Reasoner) Sentiment Analyzer in Python. VADER is
a pre-trained model that provides sentiment scores for text, including positive,
negative, neutral, and a compound score.
Example:
For the text "I love this product! It works perfectly and makes my life easier,"
VADER would calculate a compound score greater than 0.05, classifying it as
"Positive." Similarly, for "This was a terrible experience," the compound score
would be less than -0.05, classifying it as "Negative."
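A short sketch of this classification rule, assuming the vaderSentiment package is installed:
```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

for text in ["I love this product! It works perfectly and makes my life easier.",
             "This was a terrible experience."]:
    compound = analyzer.polarity_scores(text)["compound"]
    if compound > 0.05:
        label = "Positive"
    elif compound < -0.05:
        label = "Negative"
    else:
        label = "Neutral"
    print(label, compound)
```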
1. What is sentiment analysis, and why is it important?
Answer:
Sentiment analysis is the process of determining the attitude, polarity, or
emotions expressed in a piece of text. It is important because it helps
businesses and organizations understand public opinion, customer feedback,
and emotions, enabling them to make informed decisions. Sentiment analysis is
used to gain insights into how people feel about a topic, product, or service. It
helps in applications like customer service, brand monitoring, and market
research.
Explanation:
Sentiment analysis helps businesses monitor public perception of their brand,
improve customer service, and gain insights into market trends and consumer
behavior.
Explanation:
By understanding customer sentiment, businesses can improve their offerings
and respond to feedback effectively.
Explanation:
Sentiment analysis can be misleading without proper context, and choosing the
right tools or models is critical for accuracy.
Example:
Text: "Oh great, another delay. Just what I needed!"
- Without context, this might be classified as positive due to the word "great,"
but the true sentiment is negative.
Explanation:
These models provide ready-to-use tools for sentiment analysis, saving time
and effort in training custom models.
12. How does sentiment detection work in the sentiment analysis process?
Answer:
Sentiment detection identifies whether a piece of text is objective (fact) or
subjective (opinion).
Example Code:
text = "The sky is blue."
# Naive keyword check: text containing the word "opinion" is treated as subjective
if "opinion" in text.lower():
    print("Subjective")
else:
    print("Objective")
The code uses a simple keyword check to decide whether a given text expresses
an opinion or a fact. In this case, it labels the statement as objective.
**Example:**
Text: "The camera quality of this phone is amazing."
- Target: "camera quality"
**Explanation:**
Accurate identification of the target ensures that the sentiment is correctly
attributed to the intended subject.
---
Example Code:
```python
# Average the per-sentence scores to get a document-level sentiment
sentiments = [0.5, -0.3, 0.2]
overall_sentiment = sum(sentiments) / len(sentiments)
print(f"Overall Sentiment: {overall_sentiment}")
```
**Explanation:**
The code calculates the average sentiment score from a list of sentiment values,
providing a holistic view of the sentiment expressed.
---
**Example:**
Text: "The stock market is showing positive growth."
- Sentiment: Positive
**Explanation:**
The sentiment analysis identifies positive sentiment, which could indicate
optimism in the financial market.
---
**Explanation:**
It ensures that the model is tested on unseen data, providing a better estimate of
its accuracy.
---
### **18. What are some tools or libraries used for sentiment analysis?**
**Answer:**
1. NLTK
2. TextBlob
3. VADER
4. Syuzhet
**Explanation:**
These tools provide functionalities for sentiment analysis, ranging from
lexicon-based to machine learning approaches.
---
### **19. How does machine learning enhance sentiment analysis for customer
service?**
**Answer:**
Machine learning automates the sorting of user emails, identifies frustrated
users, and prioritizes their issues.
**Example Code:**
```python
emails = ["I need help now!", "This is fine."]

# Simple keyword rule: flag emails that mention "help" as urgent
for email in emails:
    if "help" in email.lower():
        print("Urgent")
    else:
        print("Non-Urgent")
```
**Explanation:**
The code detects urgency in emails based on keywords, helping prioritize
customer service tasks.
---
**Example:**
Text: "What a wonderful day to get stuck in traffic!"
- True Sentiment: Negative
- Predicted Sentiment (without context): Positive
**Explanation:**
Understanding context is crucial to accurately interpret sarcasm in sentiment
analysis.
---