Lecture 1
2. How does Data Mining fit into the knowledge discovery process?
Data Mining is an essential step in the knowledge discovery process, which
includes data preparation (cleaning, integration, transformation, selection), data
mining (applying intelligent methods to extract patterns), pattern evaluation (to
identify interesting patterns), and knowledge presentation (visualization and
representation).
1. What are the types of data that can be mined in Data Mining, and what
are the corresponding Data Mining techniques for each type?
Answer:
A diversity of data types can be mined, including structured, semi-structured,
and unstructured data, each with its own Data Mining techniques:
- Structured data: sequential pattern mining, relational data mining
- Semi-structured data: graph pattern mining, information network mining
- Unstructured data: text mining, image and video recognition (deep learning)
4. How does the concept of similarity and distance measures fit into the
overall knowledge discovery process in Data Mining?
The lecture situates the discussion of similarity and distance measures within the
broader context of the knowledge discovery process in Data Mining. It highlights
how understanding the types of data, their characteristics, and the appropriate
similarity measures is a crucial step in the data preparation and preprocessing
phase, which then enables the effective application of Data Mining techniques to
extract useful patterns and insights from the data.
Chapter 2: Similarity and Data Processing
Questions and Answers on Data, Measurements, and Preprocessing
1. What are similarity measures, and why are they important in data
mining?
Answer: Similarity measures quantify how alike two objects are, often based on
their attributes. They are crucial in data mining for tasks such as clustering,
classification, and recommendation systems, as they help identify patterns and
relationships within data.
6. What methods can be used to handle missing values during data cleaning?
Answer: Methods include:
- Ignore the tuple: Exclude incomplete records.
- Fill in missing values:
  - Manually: Enter values based on domain knowledge.
  - Global constant: Use a fixed value for all missing entries.
  - Central tendency: Use mean, median, or mode to fill gaps.
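Below is a minimal pandas sketch of these fill strategies; the DataFrame and column names are hypothetical and only for illustration:
```python
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, None, 40, 35], "city": ["A", "B", None, "B"]})

# Ignore the tuple: drop rows that contain any missing value
dropped = df.dropna()

# Global constant: use a fixed placeholder for all missing entries
constant_filled = df.fillna({"age": -1, "city": "Unknown"})

# Central tendency: fill numeric gaps with the mean (or median),
# and categorical gaps with the mode
central_filled = df.copy()
central_filled["age"] = central_filled["age"].fillna(df["age"].mean())
central_filled["city"] = central_filled["city"].fillna(df["city"].mode()[0])

print(central_filled)
```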
7. What is the entity identification problem, and how does it affect data
integration?
- Answer: The entity identification problem occurs when the same
entity is represented differently across data sources (e.g., customer_id
vs. cust_number). This can lead to confusion and inaccuracies during
data integration, making it difficult to create a unified dataset.
12. In what scenarios would you prefer using distance measures over
similarity measures, or vice versa?
- Answer: Distance measures are preferred when the magnitude of
differences is crucial (e.g., Euclidean distance in spatial data). Similarity
measures are preferred in cases where relationships and patterns are more
important than absolute differences (e.g., cosine similarity in text
analysis).
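A rough NumPy sketch of the contrast (the vectors are made up for illustration):
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, larger magnitude

# Euclidean distance: sensitive to the magnitude of the differences
euclidean = np.linalg.norm(a - b)

# Cosine similarity: sensitive only to orientation, not magnitude
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # > 0: the points are far apart in space
print(cosine)     # 1.0: the vectors point in exactly the same direction
```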
General Understanding
Answer:
Step 1: Identify Unique Words
- First, let's list the unique words in each document.
- Document 1: "The cat chased the mouse."
- Unique words: [the, cat, chased, mouse]
- Document 2: "The dog barked at the cat."
- Unique words: [the, dog, barked, at, cat]
- Document 3: "The bird sang in the tree."
- Unique words: [the, bird, sang, in, tree]
- Document 4: "The fish swam in the pond."
- Unique words: [the, fish, swam, in, pond]
- Step 2: Create Word Vectors
- Combined vocabulary (14 words): [the, cat, chased, mouse, dog, barked, at, bird, sang, in, tree, fish, swam, pond]
- Doc 1: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- Doc 2: [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
- Doc 3: [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
- Doc 4: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
Step 3: Dot Product
Dot(Doc 1, Doc 2) = (1×1)+(1×1)+(1×0)+(1×0)+(0×1)+(0×1)+(0×1)+(0×0)+(0×0)+(0×0)+(0×0)+(0×0)+(0×0)+(0×0) = 2
Step 4: Magnitudes and Cosine Similarity
||Doc 1|| = √4 = 2, ||Doc 2|| = √5 ≈ 2.236
Cosine(Doc 1, Doc 2) = 2 / (2 × 2.236) ≈ 0.447
Step 5: Interpretation
The cosine similarity between Document 1 and Document 2 is
approximately 0.447. This value indicates a moderate level of similarity
between the two documents, meaning they share some common words
but are not identical.
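A short NumPy check that reproduces this calculation from the Doc 1 and Doc 2 vectors above:
```python
import numpy as np

doc1 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
doc2 = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

dot = np.dot(doc1, doc2)                                      # 2
cosine = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cosine, 3))                                       # ~0.447
```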
- Summary
The operation captures the relationships between gender and power, demonstrating
how semantic meanings can be dynamically derived and visualized within a
conceptual framework. The 3D model helps illustrate these relationships and
transformations effectively.
Pattern Mining Ch.3&4
Example: Chi-Square Test
Scenario
Suppose we want to investigate whether there is a relationship between gender
(Male, Female) and preference for a type of beverage (Coffee, Tea). We collect
the following data:
Observed counts (expected counts in parentheses):
Gender   Coffee       Tea          Total
Male     30 (22.22)   10 (17.78)   40
Female   20 (27.78)   30 (22.22)   50
Total    50           40           90
Expected frequencies use E = (row total × column total) / grand total, for example:
E(Female, Coffee) = (50 × 50) / 90 = 2500 / 90 ≈ 27.78
E(Female, Tea) = (50 × 40) / 90 = 2000 / 90 ≈ 22.22
Chi-square statistic:
χ² = Σ (O − E)² / E ≈ 2.73 + 3.41 + 2.18 + 2.73 ≈ 11.05
- Step 5: Calculate Degrees of Freedom
- Using the formula:
df = (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1
- Step 6: Conclusion
- Since χ² ≈ 11.05 exceeds the critical value of 3.84 for df = 1 at α = 0.05, we reject the null hypothesis and conclude that gender and beverage preference are related.
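A quick way to verify this result, assuming SciPy is available (Yates' continuity correction is turned off so the statistic matches the hand calculation):
```python
from scipy.stats import chi2_contingency

observed = [[30, 10],   # Male: Coffee, Tea
            [20, 30]]   # Female: Coffee, Tea

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)    # ~11.0 with 1 degree of freedom
print(expected)     # expected counts, e.g. ~22.22 for Male/Coffee
print(p < 0.05)     # True: reject independence
```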
- Answer
- Closed Patterns:
- {B, C}
- {A, D}
- {B, D}
- {C, D}
- {A, B, C}
- {A, B, D}
- {B, C, D}
- Max Patterns (frequent itemsets with no frequent superset):
- {A, B, C}
- {A, B, D}
- {B, C, D}
Part 2:
1. Explain the significance of the minsup threshold in frequent itemset mining.
- Answer: The minsup (minimum support) threshold determines the minimum
frequency an itemset must have to be considered frequent. It greatly affects the
number of itemsets generated; a low minsup can lead to an exponential number of
frequent itemsets.
Part 3:
2. How does the Apriori algorithm utilize candidate generation and testing?
Provide an example.
- Answer: The Apriori algorithm generates candidate itemsets by self-joining the
frequent itemsets from the previous iteration. For example, if L2 = {AB, AC, BC},
then C3 can be generated by combining these itemsets to form candidates like
{ABC}. Candidates are then tested against the transaction database to determine
their frequency.
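A minimal sketch of the self-join and prune step under these assumptions (itemsets as sorted tuples of item names); this is only the candidate-generation piece, not a full Apriori implementation:
```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Self-join frequent k-itemsets, then prune candidates whose
    k-subsets are not all frequent."""
    frequent_k = [tuple(sorted(s)) for s in frequent_k]
    k = len(frequent_k[0])
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            # join itemsets that share the first k-1 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidate = a + (b[-1],)
                # prune: every k-subset of the candidate must be frequent
                if all(sub in frequent_k for sub in combinations(candidate, k)):
                    candidates.add(candidate)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("B", "C")]
print(apriori_gen(L2))   # {('A', 'B', 'C')}
```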
3. Explain the concept of partitioning in the context of improving the Apriori
algorithm.
- Answer: Partitioning involves dividing the transaction database into smaller
subsets or partitions. The algorithm scans each partition to find local frequent
patterns in two passes. The first pass identifies local frequent itemsets, and the
second pass consolidates these to find global frequent patterns, thus reducing the
number of scans required.
Transaction Database
Summary of Results
- Frequent Items: A, B, C, D
- Frequent Pairs: {A, B}, {A, D}, {B, C}, {B, D}, {C, D}
- Frequent Triplets: {A, B, D}, {B, C, D}
Conclusion
This corrected example now accurately reflects the frequent itemsets using the
Apriori algorithm with three scans.
Classification Ch.6+7
Medium Questions
3. Explain the concept of decision tree pruning and its
importance.
- Answer: Decision tree pruning is the process of removing sections of a
decision tree that provide little predictive power. It is important because it helps
prevent overfitting, improves the model's generalization to new data, and makes
the tree smaller and faster to use.
- Answer: You should prefer to split on X1 because it results in a pure split (all
instances are Y=t), whereas X2 leaves some impurity in the classification. This
indicates that X1 provides significantly more information for classification
compared to X2 and is the better choice for splitting.
- Answer:
Using Bayes' theorem:
P(A | Purchase) = [P(Purchase | A) × P(A)] / P(Purchase)
Where:
Thus, the probability of a customer being from City A given they made a
purchase is approximately 70.59%.
2. What is hard clustering and soft (fuzzy) clustering? How do they differ?
Answer: Hard clustering assigns each data point exclusively to one cluster; no
overlapping occurs between clusters. Soft (fuzzy) clustering allows a single
data point to belong to multiple clusters with varying degrees of membership
ranging from 0 (not belonging) to 1 (fully belonging).
3. Describe the k-means algorithm's basic steps.
Answer: The k-means algorithm involves four main steps:
- Initialization: Partition objects into (k) non-empty subsets
randomly.
- Centroid Calculation: Compute new centroids by calculating the
mean of all points assigned to each cluster.
- Assignment Step: Assign each object/data point to its nearest
centroid's cluster.
- Iteration: Repeat steps 2 and 3 until there’s no change in
assignments or centroids stabilize.
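A compact NumPy sketch of these steps (the sample points and k are made up, and the empty-cluster edge case is ignored for brevity):
```python
import numpy as np

def kmeans(points, k, max_iter=100):
    # Initialization: partition the points into k non-empty subsets
    # (a simple round-robin assignment here; random partitions are also common)
    labels = np.arange(len(points)) % k
    for _ in range(max_iter):
        # Centroid calculation: mean of all points assigned to each cluster
        centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Assignment: move each point to the cluster of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Iteration: stop when assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels, centroids = kmeans(points, k=2)
print(labels)      # e.g. [0 0 1 1]
print(centroids)
```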
1. Hold-Out Method:
- Split your dataset into two parts: one for training and one for testing.
- Pros: Simple to implement and fast.
- Cons: High variance; results can change based on how you split the data.
4. Bootstrap Method:
- Involves randomly sampling from your dataset with replacement to create
multiple “bootstrap” datasets, which are then used to estimate model accuracy.
- Pros: Helps in understanding variability and can provide confidence intervals
around model performance estimates.
Overview of Cross-Validation
Cross-validation is a technique used to assess how well your model performs on
unseen data. It helps ensure that your model generalizes well beyond just the
training data. Here, we'll illustrate four common methods: Hold-Out, K-Fold,
Leave-One-Out (LOOCV), and Bootstrap.
1. Hold-Out Method
Concept: You split your dataset into two parts: one for training and one for
testing.
Example Dataset:
Imagine we have a small dataset with 6 samples:
Sample   Feature 1   Feature 2   Label
1        2           3           A
2        4           5           B
3        5           7           A
4        6           -1          B
5        -1          -2          A
6        -3          -4          B
In this method:
- Train your model using the training set.
- Test its performance using the testing set.
Example of Hold-Out:
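A minimal sketch of a hold-out split, assuming scikit-learn and the six-sample dataset above:
```python
from sklearn.model_selection import train_test_split

# Features and labels from the six-sample dataset above
X = [[2, 3], [4, 5], [5, 7], [6, -1], [-1, -2], [-3, -4]]
y = ["A", "B", "A", "B", "A", "B"]

# Hold out one third of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

print(len(X_train), "training samples,", len(X_test), "test samples")
```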
2. K-Fold Cross-Validation
Concept: The dataset is divided into (k) equal parts or "folds". The model is
trained on (k-1) folds and tested on the remaining fold. This process repeats
until each fold has been used as the test data once.
Example:
Iteration #1:
- Train on Folds B & C → Test on Fold A
- Train with Samples {3, 4, 5, 6}, Test with Samples {1, 2}
Iteration #2:
- Train on Folds A & C → Test on Fold B
- Train with Samples {1, 2, 5, 6}, Test with Samples {3, 4}
Iteration #3:
- Train on Folds A & B → Test on Fold C
- Train with Samples {1, 2, 3, 4}, Test with Samples {5, 6}
After all iterations are complete you average out performance metrics like
accuracy from each iteration's result to get an overall assessment.
Example of K-Fold
If in each iteration you got accuracies like 70%,85%,90%, you'll average them
giving an estimated overall accuracy around:
1. Average Percentage = (∑ of Scores) / (Number of Scores)
2. Substituting the values: (70 + 85 + 90) / 3 = 245 / 3
3. Simplify: ≈ 81.67%
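A short sketch of 3-fold splitting on the same six samples, assuming scikit-learn (the accuracies above would come from whatever model is trained in each iteration):
```python
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[2, 3], [4, 5], [5, 7], [6, -1], [-1, -2], [-3, -4]])
y = np.array(["A", "B", "A", "B", "A", "B"])

kf = KFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # train a model on X[train_idx], evaluate it on X[test_idx]
    print(f"Iteration #{fold}: train on samples {train_idx + 1}, test on samples {test_idx + 1}")
```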
3. Leave-One-Out Cross-Validation (LOOCV)
Concept: In LOOCV you leave out one sample from your dataset during each
iteration while using all other samples for training.
Example:
For our original sample size of six (n = 6), you'd perform these iterations:
Iterate over all samples leaving one out at a time until every sample has been
tested once.
For instance,
Iteration #1: Exclude Sample 1, train using Samples {2, 3, 4, 5, 6}; test against Sample 1.
Next,
Iteration #2: Exclude Sample 2, train using Samples {1, 3, 4, 5, 6}; test against Sample 2.
And so forth until every sample has been processed!
Averaging the results across all six iterations gives the final performance metric:
Average Accuracy = (∑ of Accuracies) / (Number of Measurements)
Substituting the values: (85 + 75 + 90 + 80 + 95 + 60) / 6 = 485 / 6
Simplify: Average Accuracy ≈ 80.8%
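A sketch of the same idea with scikit-learn's LeaveOneOut; each split simply shows which sample is held out (the per-iteration accuracies above are given values, not computed here):
```python
from sklearn.model_selection import LeaveOneOut
import numpy as np

X = np.array([[2, 3], [4, 5], [5, 7], [6, -1], [-1, -2], [-3, -4]])

loo = LeaveOneOut()
for i, (train_idx, test_idx) in enumerate(loo.split(X), start=1):
    # train on the five remaining samples, test on the excluded one
    print(f"Iteration #{i}: exclude sample {test_idx[0] + 1}, train on samples {train_idx + 1}")
```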
4. Bootstrap Method
- Example:
A single bootstrap sample is drawn by randomly sampling from the original data
with replacement, so some points appear multiple times while others are left out.
You repeat this process numerous times (say, up to 1,000 bootstrap samples).
Computing metrics from models built on these sampled datasets provides an
estimate of the confidence interval and variation around those metrics.
This helps gauge robustness across random variations instead of relying solely
on a single fixed validation split.
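A minimal NumPy sketch of bootstrap resampling; the scores array is a stand-in for whatever per-resample metric you would actually compute:
```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([85, 75, 90, 80, 95, 60])   # illustrative per-sample scores

n_bootstraps = 1000
means = []
for _ in range(n_bootstraps):
    # sample with replacement: some points repeat, others are left out
    resample = rng.choice(data, size=len(data), replace=True)
    means.append(resample.mean())

# 95% confidence interval around the metric
low, high = np.percentile(means, [2.5, 97.5])
print(f"Bootstrap 95% CI: [{low:.1f}, {high:.1f}]")
```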
Points Coordinates
A (1,3)
B (2,5)
C (4,8)
D (7,9)
The initialization of the membership table might look like this:
Let's recalculate the centroid for C1, including point D with coordinates (7, 9)
and an assumed membership value for C1. For now, let's assume γ_D1 = 0.5.
The membership value for a point is calculated based on its distance from a
reference point. The formula is:
Membership value = 1 / (1 + Distance)
Where:
Distance is the Euclidean distance between the point and the reference point
(0.5, 0.5), and √(0.5² + 0.5²) ≈ 0.707.
γ_A1 = 1 − √((1 − 0.5)² + (3 − 0.5)²) / 0.707 = 0.8
γ_B1 = 1 − √((2 − 0.5)² + (5 − 0.5)²) / 0.707 = 0.7
γ_C1 = 1 − √((4 − 0.5)² + (8 − 0.5)²) / 0.707 = 0.2
γ_D1 = 1 − √((7 − 0.5)² + (9 − 0.5)²) / 0.707 = 0.5
Summary of membership values for C1:
γ_A1 = 0.8, γ_B1 = 0.7, γ_C1 = 0.2, γ_D1 = 0.5
Fuzziness Parameter: m = 2
Coordinates of Points:
A = (2, 3), B = (4, 1), C = (1, 5), D = (7, 9)
Using the fuzzy centroid formula C1 = Σ (γ_j1)^m · x_j / Σ (γ_j1)^m with the membership values above:
Σ (γ_j1)² = 0.8² + 0.7² + 0.2² + 0.5² = 0.64 + 0.49 + 0.04 + 0.25 = 1.42
C1_x = (0.64·2 + 0.49·4 + 0.04·1 + 0.25·7) / 1.42 = 5.03 / 1.42 ≈ 3.54
C1_y = (0.64·3 + 0.49·1 + 0.04·5 + 0.25·9) / 1.42 = 4.86 / 1.42 ≈ 3.42
C1 ≈ (3.54, 3.42)
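A quick NumPy check of this centroid calculation:
```python
import numpy as np

points = np.array([[2, 3], [4, 1], [1, 5], [7, 9]], dtype=float)
memberships = np.array([0.8, 0.7, 0.2, 0.5])   # gamma values for cluster C1
m = 2                                          # fuzziness parameter

weights = memberships ** m
centroid = (weights[:, None] * points).sum(axis=0) / weights.sum()
print(centroid.round(2))   # [3.54 3.42]
```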
Conclusion:
If this data represents customer locations, the centroid could indicate the
approximate "central location" of customers most closely aligned with Cluster
C 1. Businesses could use this information to determine where to allocate
resources, such as opening a new store or focusing marketing efforts.
The hold-out method is a simple and widely used approach for evaluating the
performance of a machine learning model. It involves splitting the available
dataset into two (or sometimes three) parts:
Key Steps:
1. Dataset Splitting:
o Training Set: A portion of the data is used to train the machine
learning model. This is where the model learns patterns from the input
data.
o Test Set: The remaining portion of the data is used to test the model's
performance on unseen data. This helps evaluate how well the model
generalizes to new, unseen data.
o Optionally, a validation set can be used if hyperparameter tuning or
model selection is involved.
2. Model Training:
o The model is trained only on the training set.
o During this phase, the model learns to fit the data based on its
algorithm.
3. Model Evaluation:
o Once the training is complete, the model is tested on the test set.
o Performance metrics (e.g., accuracy, precision, recall, F1-score, etc.)
are calculated to determine how well the model performs.
Dataset Split Ratio: A common split is 70-80% of the data for training and 20-30% for testing.
Advantages:
1. Simplicity:
o Easy to implement and interpret.
2. Fast:
o Works well when you have a large dataset where splitting doesn’t
significantly reduce the available training data.
Disadvantages:
1. Dependency on Split:
o Results can vary depending on how the data is split.
o If the test set isn’t representative, the evaluation may not be reliable.
2. Wasted Data:
o A portion of the data is left out during training, which could
potentially reduce the model’s ability to learn better.
Example in Context:
Visualization:
The plot shows the alignment between true labels (blue points) and predicted
labels (red points), confirming all predictions are accurate.
Step 1: Initialize Membership Values
We start by initializing a membership matrix with random values. Each value
represents the degree of membership of each data point in each cluster.
The centroids are then computed as:
C_i = Σ_j (u_ij)^m · x_j / Σ_j (u_ij)^m
Where:
C_i = centroid of cluster i
u_ij = membership degree of data point j in cluster i
m = fuzziness parameter (usually set to 2)
x_j = data point j
n = number of data points
Calculation of Centroids
For Cluster 1:
Example Calculation
For data point (1, 3) to centroid C1 (1.57, 4.05), the Euclidean distance is:
d = √((1 − 1.57)² + (3 − 4.05)²) ≈ 1.19
Final Iteration
Repeat Steps 2-4 until the membership values stabilize (i.e., change is less than
a predefined tolerance).
This example illustrates how Fuzzy C-Means works step by step, using a simple dataset.
Each step involves calculations that help refine the clustering based on the membership
values and distances to centroids.
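A condensed sketch of these steps for two clusters, assuming NumPy and m = 2; the initial memberships are random, so the exact numbers will differ from the worked example above:
```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2, tol=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the membership matrix with random values (rows sum to 1)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids are membership-weighted means of the data points
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to each centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = 1.0 / dist ** (2 / (m - 1))
        new_U = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop once the membership values stabilize
        if np.abs(new_U - U).max() < tol:
            U = new_U
            break
        U = new_U
    return U, centroids

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]], dtype=float)
U, centroids = fuzzy_c_means(X)
print(U.round(2))          # membership of each point in each cluster
print(centroids.round(2))  # final cluster centres
```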
BIDA311: Data Mining Ch.12: DM
Applications Lecture 8
Question 1: (3 Marks)
Define sentiment analysis and explain its primary objective. How is it
related to opinion mining?
Answer:
Sentiment analysis, also known as opinion mining, is the process of
determining the attitude, polarity, or emotions expressed in a piece of text. The
primary objective of sentiment analysis is to answer the question, "What do
people feel about a certain topic?" by analyzing data related to opinions using
various automated tools. It identifies the sentiment polarity (positive, negative,
or neutral) and emotions (angry, sad, happy, etc.) expressed in the text.
Sentiment analysis is closely related to opinion mining because both aim to
extract subjective information from data, such as beliefs, views, and opinions,
to understand public sentiment on a given topic.
Question 2: (4 Marks)
Discuss the applications of sentiment analysis in business. Provide
examples of how it can be used in brand monitoring and customer service.
Answer:
Sentiment analysis has several applications in business, including brand
monitoring, customer service, and market research. For example, in brand
monitoring it is used to track social media mentions and gauge public
perception of a brand, while in customer service it helps flag frustrated
customers so their issues can be prioritized.
Question 3: (5 Marks)
Explain the four steps involved in the sentiment analysis process. Provide
an example for each step.
Answer:
The sentiment analysis process consists of four main steps:
Question 4: (4 Marks)
What are the challenges faced in sentiment analysis? How do these
challenges affect the accuracy of the analysis?
Answer:
Sentiment analysis faces several challenges that can affect its accuracy:
2. Rhetorical Devices: The use of sarcasm, irony, and implied meanings can
mislead sentiment analysis tools. For example, the sentence "Oh great, another
delay!" may seem positive but is actually negative due to sarcasm.
Question 5: (5 Marks)
Describe how sentiment analysis can be implemented using the VADER
Sentiment Analyzer in Python. Provide an example of how text polarity is
classified.
Answer:
Sentiment analysis can be implemented using the VADER (Valence Aware
Dictionary and sEntiment Reasoner) Sentiment Analyzer in Python. VADER is
a pre-trained model that provides sentiment scores for text, including positive,
negative, neutral, and a compound score.
Example:
For the text "I love this product! It works perfectly and makes my life easier,"
VADER would calculate a compound score greater than 0.05, classifying it as
"Positive." Similarly, for "This was a terrible experience," the compound score
would be less than -0.05, classifying it as "Negative."
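A short sketch of this classification rule, assuming the vaderSentiment package is installed:
```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

for text in ["I love this product! It works perfectly and makes my life easier.",
             "This was a terrible experience."]:
    compound = analyzer.polarity_scores(text)["compound"]
    if compound > 0.05:
        label = "Positive"
    elif compound < -0.05:
        label = "Negative"
    else:
        label = "Neutral"
    print(label, compound)
```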
1. What is sentiment analysis, and why is it important?
Answer:
Sentiment analysis is the process of determining the attitude, polarity, or
emotions expressed in a piece of text. It is important because it helps
businesses and organizations understand public opinion, customer feedback,
and emotions, enabling them to make informed decisions. Sentiment analysis is
used to gain insights into how people feel about a topic, product, or service. It
helps in applications like customer service, brand monitoring, and market
research.
Explanation:
Sentiment analysis helps businesses monitor public perception of their brand,
improve customer service, and gain insights into market trends and consumer
behavior.
Explanation:
By understanding customer sentiment, businesses can improve their offerings
and respond to feedback effectively.
Explanation:
Sentiment analysis can be misleading without proper context, and choosing the
right tools or models is critical for accuracy.
Example:
Text: "Oh great, another delay. Just what I needed!"
- Without context, this might be classified as positive due to the word "great,"
but the true sentiment is negative.
Explanation:
These models provide ready-to-use tools for sentiment analysis, saving time
and effort in training custom models.
12. How does sentiment detection work in the sentiment analysis process?
Answer:
Sentiment detection identifies whether a piece of text is objective (fact) or
subjective (opinion).
Example Code:
text = "The sky is blue."
# Naive keyword check: text containing the word "opinion" is treated as subjective
if "opinion" in text.lower():
    print("Subjective")
else:
    print("Objective")
The code uses a simple keyword check to decide whether a given text expresses
an opinion or a fact. In this case, it labels the statement as objective.
**Example:**
Text: "The camera quality of this phone is amazing."
- Target: "camera quality"
**Explanation:**
Accurate identification of the target ensures that the sentiment is correctly
attributed to the intended subject.
---
Example Code:
```python
# Average the per-sentence scores to get a document-level sentiment
sentiments = [0.5, -0.3, 0.2]
overall_sentiment = sum(sentiments) / len(sentiments)
print(f"Overall Sentiment: {overall_sentiment}")
```
**Explanation:**
The code calculates the average sentiment score from a list of sentiment values,
providing a holistic view of the sentiment expressed.
---
**Example:**
Text: "The stock market is showing positive growth."
- Sentiment: Positive
**Explanation:**
The sentiment analysis identifies positive sentiment, which could indicate
optimism in the financial market.
---
**Explanation:**
It ensures that the model is tested on unseen data, providing a better estimate of
its accuracy.
---
### **18. What are some tools or libraries used for sentiment analysis?**
**Answer:**
1. NLTK
2. TextBlob
3. VADER
4. Syuzhet
**Explanation:**
These tools provide functionalities for sentiment analysis, ranging from
lexicon-based to machine learning approaches.
---
### **19. How does machine learning enhance sentiment analysis for customer
service?**
**Answer:**
Machine learning automates the sorting of user emails, identifies frustrated
users, and prioritizes their issues.
**Example Code:**
```python
emails = ["I need help now!", "This is fine."]

# Simple keyword rule: flag emails that mention "help" as urgent
for email in emails:
    if "help" in email.lower():
        print("Urgent")
    else:
        print("Non-Urgent")
```
**Explanation:**
The code detects urgency in emails based on keywords, helping prioritize
customer service tasks.
---
**Example:**
Text: "What a wonderful day to get stuck in traffic!"
- True Sentiment: Negative
- Predicted Sentiment (without context): Positive
**Explanation:**
Understanding context is crucial to accurately interpret sarcasm in sentiment
analysis.
---