⇶Data Mining--2
→Data:
Data are the most elementary descriptions of things, events, activities, and
transactions.
→Information:
Information is processed or organized data that provides meaning or context
for understanding.
→Knowledge:
Knowledge is the understanding and awareness gained through experience,
education, or the acquisition of information.
Business Intelligence:
Business intelligence (BI) refers to the processes, technologies, and tools used to
collect, analyze, and present data to help businesses make informed decisions,
optimize performance, and gain competitive advantages.
What is data mining?
Data mining is the process of discovering patterns, trends, and insights from large
sets of data using statistical techniques, machine learning, and algorithms.
Applications of data mining include:
1. Healthcare
2. E-commerce
3. Telecommunications
4. Manufacturing
Common data mining tasks include:
1. Classification
2. Clustering
3. Association Rule Learning
4. Regression
5. Dimensionality Reduction
Data Warehousing:
Data warehousing is the process of collecting, storing, and managing large volumes
of structured data from different sources in a central repository, known as a data
warehouse, for analysis and reporting.
OLAP (Online Analytical Processing)
● OLAP tools allow users to interactively analyze data stored in the data
warehouse by performing complex queries and generating reports. OLAP
operations like drill-down (viewing detailed data), roll-up (viewing
aggregated data), and slice-and-dice (viewing data from different
perspectives) help in multi-dimensional analysis.
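The OLAP operations above can be sketched on a small in-memory fact table. This is a minimal illustration, not a real OLAP engine; the regions, cities, years, and sales figures are invented for the example.

```python
# A minimal sketch of OLAP-style operations on an in-memory fact table.
# The records, dimensions, and measures here are hypothetical examples.
from collections import defaultdict

# Fact table rows: (region, city, year, sales)
facts = [
    ("North", "Leeds", 2023, 100),
    ("North", "York",  2023, 150),
    ("North", "Leeds", 2024, 120),
    ("South", "Dover", 2023,  80),
    ("South", "Dover", 2024,  90),
]

def roll_up(rows):
    """Roll-up: aggregate sales from city level up to region level."""
    totals = defaultdict(int)
    for region, _city, _year, sales in rows:
        totals[region] += sales
    return dict(totals)

def drill_down(rows, region):
    """Drill-down: view the detailed city-level rows behind one region."""
    return [r for r in rows if r[0] == region]

def slice_by_year(rows, year):
    """Slice: fix one dimension (year) and keep all the others."""
    return [r for r in rows if r[2] == year]

print(roll_up(facts))                   # {'North': 370, 'South': 170}
print(len(drill_down(facts, "North")))  # 3 detailed rows
print(len(slice_by_year(facts, 2023)))  # 3 rows in the 2023 slice
```

Dice would simply apply two or more such filters at once (e.g., one region and one year).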
Data mining tools include:
1. DB Miner
2. Oracle
Data mining systems can be classified according to several criteria:
Database Mined:
Relational Database Mining: Focuses on extracting patterns from structured,
table-based databases.
Object-Oriented Database Mining: Involves mining complex data stored in
object-oriented databases.
Transactional Database Mining: Deals with discovering patterns from
transactional data like sales records.
Spatial, Temporal, and Multimedia Database Mining: Analyzes specialized
databases containing spatial, time-series, or multimedia data.
Knowledge Mined:
Descriptive Data Mining: Focuses on summarizing the data and uncovering
patterns or relationships, like clustering, association rules, and summaries.
Predictive Data Mining: Aims to predict unknown or future outcomes based on
historical data, using techniques like classification, regression, and time-series
analysis.
Techniques Utilized:
Classification: Assigns data into predefined categories (e.g., decision trees, neural
networks).
Clustering: Groups similar data points into clusters without predefined labels (e.g.,
k-means, DBSCAN).
Regression: Predicts continuous values based on input data (e.g., linear regression,
support vector machines).
Association: Identifies relationships or patterns between variables (e.g., Apriori,
FP-Growth).
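Of the techniques listed above, clustering is the simplest to sketch end to end. Below is a toy k-means implementation using only the standard library; the 2-D points, k = 2, and the naive "first k points" initialization are all illustrative choices, not how production libraries initialize.

```python
# A minimal sketch of k-means clustering on 2-D points (toy data, k = 2).
import math

def kmeans(points, k, iters=10):
    """Cluster 2-D points into k groups by alternating assign/update steps."""
    centroids = points[:k]  # naive init: first k points act as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (Euclidean distance).
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update each centroid to the mean of its assigned points.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two groups separate
```

No labels are given in advance: the two groups emerge purely from the distances between points, which is what distinguishes clustering from classification.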
Application Adapted:
● Business: market analysis, customer segmentation, and customer
relationship management.
● Healthcare: diagnosis support and discovering patterns in patient records.
● Finance: credit scoring, risk assessment, and fraud detection.
● Retail: market-basket analysis and targeted promotions.
KDD PROCESS:
The KDD (Knowledge Discovery in Databases) process is a series of steps for
discovering useful knowledge from large datasets. It involves the following stages:
1. Data Selection: Identify relevant data from the dataset, focusing on useful
attributes.
2. Data Preprocessing: Clean and preprocess data to handle missing values,
noise, and inconsistencies.
3. Data Transformation: Transform data into a suitable format for mining (e.g.,
normalization, feature extraction).
4. Data Mining: Apply algorithms (e.g., classification, clustering) to discover
patterns and relationships in the data.
5. Interpretation/Evaluation: Interpret the mined results, evaluate their validity,
and extract actionable knowledge.
The KDD process helps in turning raw data into valuable insights for
decision-making.
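The five KDD stages can be walked through on a toy dataset. The records below and the one-rule "model" in step 4 are invented purely to make each stage concrete.

```python
# A toy walk-through of the five KDD stages; the records and the simple
# age-threshold "model" are illustrative assumptions, not a real pipeline.

raw = [
    {"age": 25, "income": 30000, "churn": "no"},
    {"age": 47, "income": None,  "churn": "yes"},  # missing value
    {"age": 52, "income": 82000, "churn": "yes"},
    {"age": 31, "income": 45000, "churn": "no"},
]

# 1. Data Selection: keep only the attributes relevant to the question.
selected = [{"age": r["age"], "income": r["income"], "churn": r["churn"]}
            for r in raw]

# 2. Data Preprocessing: drop records with missing values.
clean = [r for r in selected if r["income"] is not None]

# 3. Data Transformation: min-max normalize income into [0, 1].
incomes = [r["income"] for r in clean]
lo, hi = min(incomes), max(incomes)
for r in clean:
    r["income_norm"] = (r["income"] - lo) / (hi - lo)

# 4. Data Mining: a trivial one-rule classifier stands in for a real model.
def predict(record):
    return "yes" if record["age"] > 40 else "no"

# 5. Interpretation/Evaluation: measure the accuracy of the mined rule.
accuracy = sum(predict(r) == r["churn"] for r in clean) / len(clean)
print(accuracy)  # 1.0 on this tiny cleaned set
```

In practice each stage is far more involved, but the flow — select, clean, transform, mine, evaluate — is the same.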
Advantages of data mining:
1. Better Decision-Making: Extracts valuable insights for informed decisions.
2. Predictive Analytics: Forecasts trends and behaviors for future planning.
3. Cost Efficiency: Reduces operational costs by automating analysis.
4. Fraud Detection: Identifies anomalies to prevent fraud.
5. Personalization: Enables customized services and marketing.
Assignment:
Normalization table (1NF, 2NF, 3NF)
What is Database Normalization?
Normalization is a database design technique that reduces data redundancy and
eliminates undesirable characteristics like Insertion, Update and Deletion Anomalies.
(2NF):
● Rule 1 - Be in 1NF
● Rule 2 - Every non-key attribute must be fully functionally dependent on
the whole primary key; no non-key attribute may depend on only a
subset of a composite candidate key
Itemset
An itemset is a set containing one or more items in the transaction
dataset. For instance, {Milk}, {Milk, Bread}, {Tea, Ketchup}, and
{Milk, Tea, Coffee} are all itemsets.
•An itemset can also be an empty set.
•An itemset may combine items even if those items never appear
together in any single transaction of the dataset.
Support:
Support indicates how frequently an item appears in the data.
Support({milk, bread}) = Number of transactions containing
{milk, bread} / Total number of transactions
= 100 / 1000
= 10%
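The support computation above can be reproduced directly. The 1,000 transactions below are synthetic, constructed so that exactly 100 contain both milk and bread, matching the worked figures.

```python
# Reproducing the support figure from the worked example; the transactions
# are synthetic, built so that 100 of 1,000 contain {milk, bread}.
transactions = [{"milk", "bread"}] * 100 + [{"tea"}] * 900

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"milk", "bread"}, transactions))  # 0.1, i.e. 10%
```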
Confidence:
Confidence is a measure of the likelihood that an itemset will
appear if another itemset appears.
Confidence("If a customer buys milk, they will also buy bread")
= Number of transactions containing
{milk, bread} / Number of transactions containing {milk}
= 100 / 200
= 50%
Lift:
Lift is a measure of the strength of the association between two
items, taking into account the frequency of both items in the
dataset:
Lift(milk → bread) = Confidence(milk → bread) / Support({bread})
A lift above 1 means the items occur together more often than
expected if they were independent; a lift below 1 means less often.
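Confidence and lift can be computed together on one synthetic dataset. The transaction counts below are arranged to match the worked figures above: 1,000 transactions, 200 containing milk, 100 of which also contain bread, plus an assumed 150 bread-only transactions so that bread's own support is defined.

```python
# Confidence and lift for the rule "milk -> bread" on synthetic data
# arranged to match the counts in the examples above (bread-only count
# of 150 is an added assumption).
transactions = ([{"milk", "bread"}] * 100 + [{"milk"}] * 100
                + [{"bread"}] * 150 + [{"tea"}] * 650)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent) / support(consequent)

print(confidence({"milk"}, {"bread"}))  # 0.5, i.e. 50%
print(lift({"milk"}, {"bread"}))        # 2.0: milk buyers are twice as
                                        # likely as average to buy bread
```

A lift of 2.0 here signals a genuine association: knowing a customer bought milk doubles the expected chance of bread relative to its 25% baseline support.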
Assignment:
Data set that utilizes support, confidence, and lift.