Model question paper and solution_DWDM.docx
Q.1 What are support and confidence in association rule mining? Explain with an example. (2023)
Ans. Support and confidence are two measures of rule interestingness. They
respectively reflect the usefulness and certainty of discovered rules. A support of
2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together. A confidence of 60% means that
60% of the customers who purchased a computer also bought the software.
Typically, association rules are considered interesting if they satisfy both a
minimum support threshold and a minimum confidence threshold. These
thresholds can be set by users or domain experts. Additional analysis can be
performed to discover interesting statistical
correlations between associated items.
Support
In data mining, support refers to the relative frequency of an item set in a dataset.
For example, if an itemset occurs in 5% of the transactions in a dataset, it has a
support of 5%. Support is often used as a threshold for identifying frequent item
sets in a dataset, which can be used to generate association rules. For example, if
we set the support threshold to 5%, then any itemset that occurs in at least 5%
of the transactions in the dataset will be considered a frequent itemset.
The support of an itemset is the number of transactions in which the itemset
appears, divided by the total number of transactions. For example, suppose we
have a dataset of 1000 transactions, and the itemset {milk, bread} appears in 100
of those transactions. The support of the itemset {milk, bread} would be calculated
as follows:
Support({milk, bread}) = Number of transactions containing
{milk, bread} / Total number of transactions
= 100 / 1000
= 10%
So the support of the itemset {milk, bread} is 10%. This means that in 10% of the
transactions, the items milk and bread were both purchased.
In general, the support of an itemset can be calculated using the following formula:
Support(X) = (Number of transactions containing X) / (Total number of
transactions)
where X is the itemset for which you are calculating the support.
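To make the formula concrete, here is a minimal Python sketch (illustrative only; the sample transactions and the helper name support() are assumptions, not part of the question paper):

def support(transactions, itemset):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    count = sum(1 for t in transactions if itemset.issubset(t))
    return count / len(transactions)

transactions = [{"milk", "bread"}, {"milk"}, {"bread", "butter"}, {"milk", "bread", "eggs"}]
print(support(transactions, {"milk", "bread"}))   # 2 of 4 transactions -> 0.5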
Confidence
In data mining, confidence is a measure of the reliability or support for a given
association rule. It is defined as the proportion of cases in which the association
rule holds true, or in other words, the percentage of times that the items in the
antecedent (the “if” part of the rule) appear in the same transaction as the items in
the consequent (the “then” part of the rule).
Confidence is a measure of the likelihood that an itemset will appear if another
itemset appears. For example, suppose we have a dataset of 1000 transactions, and
the itemset {milk, bread} appears in 100 of those transactions. The itemset {milk}
appears in 200 of those transactions. The confidence of the rule “If a customer
buys milk, they will also buy bread” would be calculated as follows:
Confidence("If a customer buys milk, they will also buy bread")
= Number of transactions containing
{milk, bread} / Number of transactions containing {milk}
= 100 / 200
= 50%
So the confidence of the rule “If a customer buys milk, they will also buy bread” is
50%. This means that in 50% of the transactions where milk was purchased, bread
was also purchased.
In general, the confidence of a rule can be calculated using the following formula:
Confidence(X => Y) = (Number of transactions containing X and Y) / (Number of
transactions containing X)
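Similarly, the rule confidence can be computed by reusing the support() helper from the previous sketch (again a hedged illustration, not prescribed by the syllabus):

def confidence(transactions, X, Y):
    # Confidence(X => Y) = support(X union Y) / support(X).
    X, Y = set(X), set(Y)
    return support(transactions, X | Y) / support(transactions, X)

print(confidence(transactions, {"milk"}, {"bread"}))   # support({milk, bread}) / support({milk})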
Q.2 Draw and explain various components of a 3-tier data warehouse
architecture. (2023)
Ans. The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference
data model for the whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse population. In
some cases, the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily
prepared using the corporate applications, or generating data flows to feed external
processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A
disadvantage of this structure is the extra storage space required by the redundant
reconciled layer. It also moves the analytical tools a little further away
from being real-time.
Viewed as the standard bottom, middle, and top tiers, the architecture contains:
1. The bottom tier is a warehouse database server, which is almost always a
relational database system. Back-end tools and utilities are used to feed data into
this tier from operational databases or other external sources.
2. The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or a
multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that
directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and
so on).
Q. Explain the steps of the KDD (Knowledge Discovery in Databases) process.
Ans. The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge. At that point, the loop is closed and
active data mining starts. Subsequently, changes would need to be made in the
application domain, for example, offering various features to cell phone users in
order to reduce churn. This closes the loop, the impacts are then measured on
the new data repositories, and the KDD process starts again. Following is a concise
description of the nine-step KDD process, beginning with a managerial step:
1. Building up an understanding of the application domain
This is the initial preliminary step. It sets the scene for understanding what
should be done with the various decisions (transformation, algorithms,
representation, etc.). The individuals who are in charge of a KDD project need to
understand and characterize the objectives of the end-user and the environment
in which the knowledge discovery process will take place (including relevant prior
knowledge).
2. Selecting and creating the data set on which discovery will be performed
Once the objectives are defined, the data that will be utilized for the knowledge
discovery process should be determined. This includes finding out what data is
accessible, obtaining additional important data, and then integrating all of it into
one data set, including the attributes that will be considered for the process. This
step is important because Data Mining learns and discovers from the accessible
data; this is the evidence base for building the models. If some significant attributes
are missing, the entire study may be unsuccessful, and in this respect the more
attributes that are considered, the better. On the other hand, organizing, collecting,
and operating advanced data repositories is expensive, so there is a trade-off with
the opportunity to best understand the phenomena. This trade-off is one place where
the interactive and iterative nature of KDD shows itself: the process begins with the
best available data sets and later expands, observing the effect in terms of knowledge
discovery and modeling.
4. Data Transformation
In this stage, the data is prepared and transformed into a form appropriate for Data
Mining. Techniques here include dimension reduction (for example, feature
selection and extraction, and record sampling) and attribute transformation (for
example, discretization of numerical attributes and functional transformations). This
step can be essential for the success of the entire KDD project, and it is typically
very project-specific. For example, in medical assessments, the quotient of
attributes may often be the most significant factor rather than each attribute by itself.
In business, we may need to consider effects beyond our control as well as efforts
and transient issues, for example, studying the effect of accumulated advertising.
However, even if we do not use the right transformation at the start, we may
obtain a surprising result that hints at the transformation needed in the next
iteration. Thus, the KDD process feeds back on itself and leads to an
understanding of the transformation required.
5. Choosing the appropriate Data Mining task
We are now ready to decide which kind of Data Mining to use, for example
classification, regression, or clustering. This mainly depends on the KDD
objectives and on the previous steps. There are two major goals in Data Mining:
the first is prediction and the second is description. Prediction is usually referred
to as supervised Data Mining, while descriptive Data Mining covers the
unsupervised and visualization aspects of Data Mining. Most Data Mining
techniques rely on inductive learning, where a model is built explicitly or implicitly
by generalizing from an adequate number of training examples. The fundamental
assumption of the inductive approach is that the trained model applies to future
cases. The technique also takes into account the level of meta-learning for the
specific set of accessible data.
6. Choosing the Data Mining algorithm
Having chosen the task, we now decide on the strategy. This stage includes
selecting a specific technique to be used for searching for patterns, possibly
involving multiple inducers. For example, considering precision versus
understandability, the former is better with neural networks, while the latter is
better with decision trees. For each strategy of meta-learning there are several
possibilities for how it can be applied. Meta-learning focuses on explaining what
causes a Data Mining algorithm to succeed or fail on a particular problem. Thus,
this methodology attempts to understand the conditions under which a Data Mining
algorithm is most suitable. Each algorithm has parameters and strategies of
learning, such as ten-fold cross-validation or another division for training and
testing.
7. Employing the Data Mining algorithm
At last, the Data Mining algorithm is applied. In this stage, we may need to run the
algorithm several times until a satisfying outcome is obtained, for example by
tuning the algorithm's control parameters, such as the minimum number of
instances in a single leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules, and their
reliability, with respect to the objectives defined in the first step. Here we consider
the preprocessing steps in terms of their impact on the Data Mining algorithm's
results, for example adding a feature in step 4 and repeating from there. This step
focuses on the comprehensibility and usefulness of the induced model, and the
identified knowledge is also documented for further use.
9. Using the discovered knowledge
The last step is the use of, and overall feedback on, the discovery results acquired
by Data Mining.
Data transformation, together with data cleaning, data integration, and data
reduction, is one of the major categories of data preprocessing. Data transformation
may be done using a variety of strategies, including the following:
1. Smoothing, which works to remove noise from the data, using techniques such as
binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data,
for example to compute monthly and annual totals from daily sales data.
4. Normalization, where the attribute data are scaled so as to fall within a smaller
range, such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior). The labels, in turn, can be recursively organized into
higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
(A short code sketch after this list illustrates strategies 4 and 5.)
6. Concept hierarchy generation for nominal data, where attributes such as
street can be generalized to higher-level concepts, like city or country. Many
hierarchies for nominal attributes are implicit within the database schema and can
be automatically
defined at the schema definition level.
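The following short sketch illustrates normalization (strategy 4) and discretization (strategy 5) above; pandas, the column name age, and the bin edges are assumptions made for the example:

import pandas as pd

df = pd.DataFrame({"age": [16, 25, 47, 52, 56, 60, 72]})

# Normalization: min-max scaling of 'age' into the range [0.0, 1.0].
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Discretization: replace raw ages with conceptual labels (youth, adult, senior).
df["age_group"] = pd.cut(df["age"], bins=[0, 20, 60, 120], labels=["youth", "adult", "senior"])
print(df)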
Q. Imagine that you need to analyze “All Electronics” sales and customer
data (data related to the sales of electronic items). Note that many
tuples have no recorded values for several attributes such as customer
income. How can you go about filling in the missing values for this
attribute? Explain some of the methods to handle the problem.
Ans: Missing values arise in a situation where some tuples contain no value, or a
null value, for one or more attributes. In this case the analysis of the data becomes
difficult. We may handle the situation in the following ways:
1. Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably. By ignoring the
tuple, we do not make use of the remaining attributes’ values in the tuple. Such
data could have been useful to the task at hand.
2. Fill in the missing value manually: In general, this approach is time consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant. If missing values are replaced by, say, “Unknown,”
then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence,
although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value: In this approach we fill the missing values with
a central value, such as the mean or median of the remaining data (see the sketch
after this list).
5. Use the attribute mean or median for all samples belonging to the same
class as the given tuple: For example, if classifying customers according to credit
risk, we may replace the missing value with the mean income value for customers
in the same credit risk category as that of the given tuple. If the data distribution
for a given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for income.
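To make methods 4 and 5 concrete, here is a minimal pandas sketch (the column names, values, and credit-risk classes are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "income":      [30000, None, 52000, None, 41000, 78000],
    "credit_risk": ["low", "low", "high", "high", "low", "high"],
})

# Method 4: fill missing income with the overall median.
overall = df["income"].fillna(df["income"].median())

# Method 5: fill missing income with the median of the tuple's own credit-risk class.
by_class = df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.median()))

print(overall)
print(by_class)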
Q. What kinds of data can be mined? Explain.
Ans:
1. Database Data
A database system, also called a database management system (DBMS), consists
of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data. The software programs provide
mechanisms for defining database structures and data storage; for specifying and
managing concurrent, shared, or distributed data access; and for ensuring
consistency and security of the information stored despite system crashes or
attempts at unauthorized access.
2. Data Warehouses
A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of
attributes in the schema, and each cell stores the value of some aggregate measure
such as count or sum.
3. Transactional Data
In general, each record in a transactional database captures a transaction, such as
a customer’s purchase, a flight booking, or a user’s clicks on a web page. A
transaction typically includes a unique transaction identity number (trans ID) and a
list of the items making up the transaction, such as the items purchased in the
transaction. A transactional database may have additional tables, which contain
other information related to the transactions, such as item description, information
about the salesperson or the branch, and so on.
4. Other Kinds of Data
Besides relational database data, data warehouse data, and transaction data, there
are many other kinds of data that have versatile forms and structures and rather
different semantic meanings. Such kinds of data can be seen in many applications:
time-related or sequence data (e.g., historical records, stock exchange data, and
time-series and biological sequence data), data streams (e.g., video surveillance
and sensor data, which are continuously transmitted), spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or
integrated circuits), hypertext and multimedia data (including text, image, video,
and audio data), graph and networked data (e.g., social and information networks),
and the Web (a huge, widely distributed information repository made available by
the Internet). These applications bring about new challenges, like how to handle
data carrying special structures (e.g., sequences, trees, graphs, and networks) and
specific semantics (such as ordering, image, audio and video contents, and
connectivity), and how to mine patterns that carry rich structures and semantics.
Here, the pink coloured Dimension tables are the common ones among both the
star schemas. Green coloured fact tables are the fact tables of their respective star
schemas.
Ans: ETL, which stands for extract, transform and load, is a data integration
process that combines data from multiple data sources into a single, consistent data
store that is loaded into a data warehouse or other target system.
As databases grew in popularity in the 1970s, ETL was introduced as a process
for integrating and loading data for computation and analysis, eventually becoming
the primary method to process data for data warehousing projects.
ETL provides the foundation for data analytics and machine learning workstreams.
Through a series of business rules, ETL cleanses and organizes data in a way
that addresses specific business intelligence needs, like monthly reporting, but it
can also tackle more advanced analytics, which can improve back-end processes
or end-user experiences. The ETL process offers an organization the following advantages:
1. Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from multiple
sources and systems, making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security by
controlling access to the data warehouse and ensuring that only authorized users
can access the data.
4. Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and simplify
the ETL process, reducing the time and effort required to load and update data
in the warehouse.
However, the ETL process also has some disadvantages:
1. High cost: The ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may
not be able to handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it
may not be able to handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as
large amounts of data are collected, stored, and analyzed.
Overall, the ETL process is an essential process in data warehousing that helps to
ensure that the data in the data warehouse is accurate, complete, and up-to-date.
However, it also comes with its own set of challenges and limitations, and
organizations need to carefully weigh the costs and benefits before implementing it.
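To make the extract-transform-load flow concrete, here is a minimal, hedged sketch; the source file name, column names, and the SQLite target are assumptions for illustration, not part of the answer above:

import sqlite3
import pandas as pd

# Extract: read raw sales records from an assumed source file.
raw = pd.read_csv("sales_source.csv")

# Transform: apply simple business rules - parse dates, drop bad rows, aggregate by month.
raw["sale_date"] = pd.to_datetime(raw["sale_date"])
raw = raw.dropna(subset=["amount"])
monthly = raw.groupby(raw["sale_date"].dt.to_period("M"))["amount"].sum().reset_index()
monthly["sale_date"] = monthly["sale_date"].astype(str)

# Load: write the transformed data into a warehouse table (SQLite as a stand-in target).
with sqlite3.connect("warehouse.db") as conn:
    monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)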
Q. How can you measure dispersion of data? Explain the concept of Range, Quartile, Outliers and
Boxplot.
Range
The range is the simplest measure of dispersion or variability of data. The range is
obtained by subtracting the lowest value from the highest value. A wide range
indicates high variability, and a small range indicates low variability in the
distribution. To calculate a range, arrange all the values in ascending order, then
subtract the lowest value from the highest value.
Interquartile Range (IQR)
The interquartile range (IQR) is the range between Q1 (the boundary between the
first and second quartile) and Q3 (the boundary between the third and fourth
quartile), i.e., IQR = Q3 − Q1. The IQR is often preferred over the range because,
unlike the range, it is not influenced by outliers. The IQR measures variability by
splitting a data set into four equal quartiles.
The IQR is used with a box plot to find outliers. To estimate the IQR, all the values
are first sorted in ascending order; otherwise the calculation can give a negative
value, which misleads the identification of outliers.
Standard Deviation
The standard deviation is the square root of the variance, which brings the measure
back to the original units of the data. A low standard deviation indicates that the
data points are close to the mean. The normal distribution is a conventional aid to
understanding the standard deviation.
Box Plot
A boxplot captures the summary of the data effectively and efficiently with only a
simple box and whiskers. It summarizes sample data using the 25th, 50th, and 75th
percentiles, so one can get insights (quartiles, median, and outliers) into a dataset
just by looking at its boxplot. In such a plot, points falling beyond the whiskers (in
the example referred to above, values greater than 10) are flagged as outliers.
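The sketch below (the sample values are an assumption) computes the range, the quartiles, the IQR, and the usual 1.5 × IQR fences that a boxplot uses to flag outliers:

import numpy as np

data = np.array([2, 3, 4, 4, 5, 5, 6, 7, 8, 30])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # boxplot whisker fences
outliers = data[(data < lower) | (data > upper)]

print("range:", data.max() - data.min())
print("median:", q2, "IQR:", iqr)
print("fences:", (lower, upper), "outliers:", outliers)   # 30 falls outside the upper fence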
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation
on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Figure shows the result of a roll-up operation performed on the central cube by climbing up the concept
hierarchy for location given earlier. This hierarchy was defined as the total order “street <
city < province or state < country.” The roll-up operation shown aggregates the data by ascending
the location hierarchy from the level of city to the level of country. In other words, rather than
grouping the data by city, the resulting cube groups the data by country. When roll-up is performed by
dimension reduction, one or more dimensions are removed from the given cube. For example, consider a
sales data cube containing only the location and time dimensions. Roll-up may be performed by
removing, say,
the time dimension, resulting in an aggregation of the total sales by location, rather than by location and
by time.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions. Figure given shows the result of a drill-down operation performed on
the central cube by stepping down a concept hierarchy for time defined as “day < month < quarter <
year.” Drill-down
occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The
resulting data cube details the total sales per month rather than summarizing them by quarter. Because a
drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a
cube.
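As a hedged illustration of roll-up and drill-down on a tiny sales table (the column names and numbers are assumptions), grouping by country instead of city climbs the location hierarchy, while grouping by month instead of quarter steps down the time hierarchy:

import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "dollars_sold": [1000, 800, 1200, 900],
})

# Roll-up: aggregate from the city level up to the country level.
rollup = sales.groupby("country")["dollars_sold"].sum()

# Drill-down: move from quarterly totals to the more detailed monthly totals.
drilldown = sales.groupby(["quarter", "month"])["dollars_sold"].sum()

print(rollup)
print(drilldown)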
Dice
The OLAP dice operation selects two or more dimensions from a given cube and defines a new
sub-cube, much as the slice operation defines a sub-cube by selecting on a single dimension.
The sub-cube is obtained by specifying the values of interest on each of the selected dimensions.
The diagram below shows how the dice operation works:
Pivot
This OLAP operation (also called rotate) rotates the axes of a cube to provide an alternative
presentation of the data. Pivot groups the data along different dimensions, which helps analyze
the performance of a company or enterprise.
Here's an example of pivot in operation:
A spatial database system has the following characteristics:
● It is a database system.
● It offers spatial data types (SDTs) in its data model and query language.
● It supports spatial data types in its implementation, providing at least spatial indexing
and efficient algorithms for spatial join.
Example
A road map is a visualization of geographic information. A road map is a
2-dimensional object which contains points, lines, and polygons that can
represent cities, roads, and political boundaries such as states or provinces.
● Vector data: This data is represented as discrete points, lines and polygons
● Raster data: This data is represented as a matrix of square cells.
Clustering Methods
Clustering methods can be classified into the following categories −
● Partitioning Method
● Hierarchical Method
● Agglomerative Approach
● Divisive Approach
● Density-based Method
● Grid-Based Method
● Model-Based Method
● Constraint-based Method
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used
to group unlabeled datasets into clusters; it is also known as hierarchical cluster
analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as a dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar,
but they differ in how they work: in hierarchical clustering there is no requirement to
predetermine the number of clusters, as there is in the K-means algorithm.
Density-based clustering
Density-based clustering refers to one of the most popular unsupervised learning
methodologies used in model building and machine learning algorithms. Data points lying
in the low-density regions that separate two clusters are considered noise. The
surroundings within a radius ε of a given object are known as the ε-neighborhood of the
object. If the ε-neighborhood of an object contains at least a minimum number of objects,
MinPts, then the object is called a core object.
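A hedged sketch of density-based clustering using scikit-learn's DBSCAN (the sample points and the eps / min_samples settings are assumptions): points whose ε-neighborhood contains at least min_samples points become core objects, and points labeled -1 are treated as noise.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point that should come out as noise.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [4.0, 15.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # two cluster labels plus -1 for the isolated noise point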
Holdout
In the holdout method, the given dataset is randomly divided into three subsets:
● A training set is a subset of the dataset which is used to build
predictive models.
● The validation set is a subset of the dataset which is used to assess
the performance of the model built in the training phase. It provides a test
platform for fine-tuning the model's parameters and selecting the
best-performing model. Not all modeling algorithms need a validation set.
● The test set, or set of unseen examples, is the subset of the dataset used to
assess the likely future performance of the model. If a model fits the training set
much better than it fits the test set, overfitting is probably the cause.
Typically, two-thirds of the data are allocated to the training set and the
remaining one-third is allocated to the test set, as in the sketch below.
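A minimal sketch of the holdout split (scikit-learn and the synthetic dataset are assumptions for illustration): roughly two-thirds of the data go to training, and the remaining third is divided between validation and test sets.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Hold out one-third of the data, then split that third into validation and test halves.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=1/3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 200 / 50 / 50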
Random Subsampling
Random subsampling is a variation of the holdout method in which the holdout split is
repeated several times with different random partitions; the overall accuracy estimate is
taken as the average of the accuracies obtained from each iteration.
Bootstrapping
In the bootstrap method, the training set is formed by sampling the given tuples uniformly
with replacement, so the same tuple may be selected more than once; the tuples that are
never selected form the test set.
Neural Network
A neural network is a computing paradigm: an information-processing system inspired by the
human nervous system. As in the human nervous system, we have artificial neurons in the
neural network. The human brain has about 10 billion neurons, each connected to an average
of 10,000 other neurons. Each neuron receives a signal through a synapse, which controls
the effect of the signal on the neuron.
Backpropagation
Backpropagation is an algorithm widely used in training neural networks. It is used to compute
the gradient of the loss function with respect to the weights of the network, and it does so
efficiently, computing the gradient for each weight directly. With its help, gradient methods can
be used to train multi-layer networks and update the weights to minimize loss; variants such as
gradient descent or stochastic gradient descent are often used.
The main work of the backpropagation algorithm is to compute the gradient of the loss
function to each weight via the chain rule, computing the gradient layer by layer and
iterating backwards from the last layer to avoid redundant computation of intermediate
terms in the chain rule.
Features of Backpropagation:
There are several notable features of backpropagation, as follows.
1. It is a gradient method used to train networks of simple perceptrons with
differentiable units.
2. It differs from other networks in the process by which the weights are calculated
during the learning period of the network.
3. Training proceeds in three stages:
o the feed-forward of the input training pattern,
o the calculation and backpropagation of the error, and
o the updating of the weights.
A neural network generates an output vector from the input vector on which it operates.
The generated output is compared with the desired output, and an error is computed if the
result does not match the desired output vector. The weights are then adjusted accordingly
to get the desired output. Backpropagation is based on gradient descent and updates the
weights by minimizing the error between predicted and actual output.
Training of backpropagation consists of three stages:
● Forward propagation of input data.
● Backward propagation of error.
● Updating weights to reduce the error.
Let’s walk through an example of backpropagation in machine learning. Assume the
neurons use the sigmoid activation function for the forward and backward pass. The
target output is 0.5 and the learning rate is 1.
● w_{i,j} represents the weight between the i-th input and the j-th neuron.
● a_j is the weighted sum of the inputs reaching neuron j.
● o_j (output): after applying the activation function to a_j, we get the output of the
neuron, o_j = activation(a_j).
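Since the numerical walk-through is not completed in the text above, here is a minimal, hedged sketch of one forward and one backward pass for a single sigmoid neuron with two inputs. The input values and initial weights are assumptions; the target output 0.5 and learning rate 1 come from the example set-up above, and a squared-error loss is assumed.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.35, 0.9])       # assumed inputs
w = np.array([0.30, 0.55])      # assumed initial weights w_{i,j}
target, lr = 0.5, 1.0           # target output and learning rate from the example

# Forward pass: weighted sum a_j, then output o_j = sigmoid(a_j).
a = np.dot(w, x)
o = sigmoid(a)

# Backward pass for loss L = 0.5 * (o - target)**2, using the chain rule:
# dL/da = (o - target) * o * (1 - o), and dL/dw_i = dL/da * x_i.
delta = (o - target) * o * (1 - o)
grad = delta * x

# Gradient-descent weight update.
w = w - lr * grad
print("output:", o, "updated weights:", w)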
Q. Explain the concept of partitioning methods in cluster analysis. What are the
advantages and limitations of partitioning methods?
Ans: This clustering method classifies the information into multiple groups based on
the characteristics and similarity of the data. It requires the data analyst to specify
the number of clusters that have to be generated. In the partitioning method, given a
database D that contains N objects, the method constructs K user-specified partitions
of the data, in which each partition represents a cluster and a particular region. Many
algorithms come under the partitioning method; some of the popular ones are
K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). Below, we
look at the working of the K-Means algorithm in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input
parameter K from the user and partitions the dataset containing N objects into K
clusters so that the resulting similarity among the data objects inside a group
(intra-cluster) is high, while the similarity of data objects with objects outside the
cluster is low (inter-cluster). The similarity of a cluster is determined with respect to
the mean value of the cluster. It is a type of squared-error algorithm. At the start, K
objects are chosen at random from the dataset, each representing a cluster mean
(centre). Each of the remaining data objects is assigned to the nearest cluster based
on its distance from the cluster mean. The new mean of each cluster is then
calculated from the objects assigned to it.
Algorithm: K-Means
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the
newly assigned objects.
4. Repeat steps 2 and 3 until no change occurs.
Figure – K-Means clustering flowchart
Example: Suppose we want to group the visitors to a website using just their age
as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. Therefore, using the
K-Means algorithm, we get the two clusters (16-29) and (36-66).
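The same grouping can be reproduced with a few lines of code (a hedged sketch; scikit-learn is assumed, and the random seed is fixed only so the run is repeatable):

import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
for label in np.unique(km.labels_):
    members = ages[km.labels_ == label].ravel()
    print("cluster", label, ":", members, "centroid:", km.cluster_centers_[label][0])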
Advantages:
● Simple implementation: K-Means is relatively easy to understand and implement,
making it accessible to both novice and professional data miners.
● Fast computation: The algorithm is computationally efficient, allowing for quick
clustering of large datasets. It can handle a high volume of data points in a
reasonable amount of time.
● Scalability: K-Means can handle datasets with a large number of dimensions without
sacrificing performance. This makes it suitable for analyzing complex data structures
found in various applications.
● Flexibility: The algorithm allows for flexibility in defining the number of clusters
desired. Data analysts can select the appropriate number of clusters based on their
specific requirements.
● Averaging of noise: Because each centroid is the mean of its cluster members, small
amounts of noise tend to average out, which limits the impact of mildly noisy data on
the overall clustering result (although, as noted under the disadvantages, strong
outliers can still distort the centroids).
● Interpretable results: The output generated by K-Means is easy to interpret, since
each cluster represents a distinct group or subset of the dataset based on similarity
or proximity.
● Versatility: K-Means can be used for various types of data analysis tasks, including
customer segmentation, image compression, anomaly detection, and
recommendation systems.
● Incremental updating: The K-Means algorithm can be updated incrementally when
new data points are added to or removed from the dataset, making it suitable for
real-time or streaming applications.
● Applicable to large datasets: K-Means has been successfully applied to big data
problems due to its efficiency and scalability.
● Widely supported: Many programming languages and software libraries provide
implementations of the K-Means algorithm, making it readily available across
different platforms.
Disadvantages:
● Sensitivity to initial cluster centers: The outcome of K-Means clustering heavily
depends on the initial selection of cluster centers. Different initializations can lead to
different final results, making it challenging to obtain the optimal clustering solution.
● Assumes isotropic and spherical clusters: K-Means assumes that clusters are
isotropic (having equal variance) and spherical in shape. This assumption may not
hold for all types of datasets, especially when dealing with irregularly shaped or
overlapping clusters.
● Difficulty handling categorical variables: K-Means is primarily designed for numerical
data analysis and struggles with categorical variables. It cannot handle non-numeric
attributes directly, since the distance between categorical values cannot be
calculated effectively.
● Influence of outliers: Outliers can significantly impact the performance of K-Means
clustering. Since K-Means is sensitive to distance measures, outliers can distort the
centroids and affect cluster assignments, leading to less accurate results.
● Requires a predefined number of clusters: One major drawback of K-Means is that you
need to specify the number of desired clusters before running the algorithm.
Determining an appropriate number of clusters in advance can be challenging and
subjective, especially when working with complex datasets.
● Struggles with high-dimensional data: As the dimensionality of the data increases, so
does the "curse of dimensionality." In high-dimensional spaces, distances between
points become less meaningful, making it difficult for K-Means to find meaningful
clusters accurately.
● Lack of robustness against noise or outliers: Beyond the point above about outliers,
even a small amount of noise can severely impact the performance of K-Means
clustering by leading to incorrect cluster assignments.
● Limited applicability to non-linear data: K-Means assumes that clusters are linearly
separable, which means it may not perform well on datasets with non-linear
structures where the decision boundaries are curved or irregular.
Bayes' theorem states that P(H|E) = [P(E|H) · P(H)] / P(E), where P(H|E) is the
posterior probability of the hypothesis H given the event E, P(E|H) is the likelihood
or conditional probability of the event given the hypothesis, P(H) is the prior
probability of the hypothesis, and P(E) is the probability of the event.
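As a small numerical illustration (the probability values are invented for the example), the posterior follows directly from the three quantities on the right-hand side of the theorem:

# Assumed example values for the three inputs of Bayes' theorem.
p_h = 0.01           # P(H): prior probability of the hypothesis
p_e_given_h = 0.9    # P(E|H): likelihood of the event given the hypothesis
p_e = 0.05           # P(E): overall probability of the event

p_h_given_e = p_e_given_h * p_h / p_e    # P(H|E) by Bayes' theorem
print(p_h_given_e)                       # 0.18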
Q. Explain classification along with decision tree algorithm.
Ans: Classification is a form of data analysis in which a model (a classifier) is built
from labeled training data and then used to predict categorical class labels for new,
unseen data. The decision tree is one of the most widely used classification
techniques. A decision tree is a simple diagram that shows different choices and
their possible results, helping you make decisions easily. It has a hierarchical tree
structure that starts with one main question at the top, called a node, which further
branches out into different possible outcomes, where:
● Root Node: The starting point that represents the entire dataset.
● Branches: The lines that connect nodes, showing the flow from one decision to
another.
● Internal Nodes: Points where decisions are made based on the input features.
● Leaf Nodes: The terminal nodes at the end of branches that represent final
outcomes or predictions.
We have mainly two types of decision tree based on the nature of the target
variable: classification trees and regression trees.
● Classification trees: They are designed to predict categorical outcomes, meaning
they classify data into different classes. For example, they can determine whether
an email is “spam” or “not spam” based on various features of the email.
● Regression trees: These are used when the target variable is continuous. They
predict numerical values rather than categories. For example, a regression tree
can estimate the price of a house based on its size, location, and other features.
The leaves are the decisions or the final outcomes, and the decision nodes are the
points where the data is split based on a question.
An example of a decision tree can be explained using the binary tree above. Let's say
you want to predict whether a person is fit given their information like age, eating
habit, and physical activity, etc. The decision nodes here are questions like ‘What’s
the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizzas’? And the leaves, which
are outcomes like either ‘fit’, or ‘unfit’. In this case this was a binary classification
problem (a yes no type problem).
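A hedged sketch of this classification task with a decision tree (scikit-learn and the tiny "fitness" dataset below are assumptions chosen to mirror the fit/unfit example):

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, exercises (1/0), eats a lot of pizza (1/0)]; label: 1 = fit, 0 = unfit.
X = [[25, 1, 0], [40, 0, 1], [30, 1, 1], [55, 0, 0], [22, 1, 0], [48, 0, 1]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "exercises", "pizza"]))
print(tree.predict([[35, 1, 0]]))   # classify a new, unseen person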
An Executive Information System (EIS) combines:
● Business intelligence
● Financial intelligence
● Data, with technology support to analyze it
The components of a typical EIS include:
● Hardware
● Software
● User interface (UI)
● Telecommunications capability
Hardware
An EIS’s hardware should include input devices that executives can use to enter, check, and
update data; a central processing unit (CPU) that controls the entire system; data storage for
saving and archiving useful business information; and output devices (e.g., monitors, printers,
etc.) that show visual representations of the data executives need to keep or read.
Software
An EIS’s software should be able to integrate all available data into cohesive results. It should be
capable of handling text and graphics; connected to a database that contains all relevant internal
and external data; and have a model base that performs routine and special statistical, financial,
and other quantitative analyses.
Characteristics of an EIS include:
● Detailed data – EIS provides absolute (detailed) data from its existing database.
● Integration of external and internal data – EIS integrates external and internal data; the
external data is collected from various sources.
● Presenting information – EIS presents the available data in graphical form, which helps to
analyze it easily.
● Trend analysis – EIS helps executives of the organization to make predictions based on trend
data.
● Easy to use – It is a very simple system to use.
Advantages of EIS
● Trend Analysis
● Improvement of corporate performance in the marketplace
● Development of managerial leadership skills
● Improves decision-making
● Simple to use by senior executives
● Better reporting method
● Improved office efficiency
Disadvantage of EIS
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to
store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers
include optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends to have greater scalability than MOLAP
technology. The DSS server of MicroStrategy, for example, adopts the ROLAP approach.
Disadvantages of ROLAP:
● Poor query performance.
● Some limitations of scalability, depending on the technology architecture that is utilized.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views through
array-based multidimensional storage engines. They map multidimensional views directly to data cube
array structures. The advantage of using a data cube is that it allows fast indexing to precomputed
summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the
data set is sparse.
Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets:
Denser sub cubes are identified and stored as array structures, whereas sparse sub cubes employ
compression technology for efficient storage utilization.
MOLAP includes the following components −
● Database server.
● MOLAP server.
● Front-end tool.
Advantages
● MOLAP allows the fastest indexing of the precomputed summarized data.
● It helps users connected to a network who need to analyze larger, less-defined data.
● It is easier to use, and therefore MOLAP is suitable for inexperienced users.
Disadvantages:
● MOLAP servers are not capable of containing detailed data.
● The storage utilization may be low if the data set is sparse.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a
HOLAP server may allow large volumes of detailed data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid
OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query
language and query processing support for SQL queries over star and snowflake schemas in a read-only
environment.
Following figure shows a summary fact table that contains both base fact data and aggregated data. The
schema is “(record identifier (RID), item, . . . , day, month, quarter, year, dollars sold),” where day, month,
quarter, and year define the sales date, and dollars sold is the sales amount. Consider the tuples with an
RID of 1001 and 1002, respectively. The data of these tuples are at the base fact level, where the sales
dates are October 15, 2010, and October 23, 2010, respectively. Consider the tuple with an RID of 5001.
This tuple is at a more general level of abstraction than the tuples 1001 and 1002. The day value has
been generalized to all, so that the corresponding time value is October 2010. That is, the dollars sold
amount shown is an aggregation representing the entire month of October 2010, rather than just October
15 or 23, 2010. The special value all is used to represent subtotals in summarized data.
MOLAP uses multidimensional array structures to store data for online analytical processing. Most data
warehouse systems adopt a client-server architecture. A relational data store always resides at the data
warehouse/data mart server site. A multidimensional data store can reside at either the database server
site or the client site.
Q.What do you understand by The FASMI test in context of OLAP?
Ans: The FASMI test (Fast Analysis of Shared Multidimensional Information) describes the
characteristics of an OLAP application in a specific way, without dictating how it should be
implemented.
Fast − The system is targeted to deliver most responses to users within about five seconds,
with the simplest analyses taking no more than one second and very few taking more than 20
seconds.
Independent research in the Netherlands has shown that end-users consider a process to have
failed if results are not received within 30 seconds, and they are apt to hit 'Alt+Ctrl+Delete'
unless the system warns them that the report will take longer.
Analysis − The system should be able to cope with any business logic and statistical analysis
that is relevant for the application and the user, and yet keep it easy enough for the target user.
Although some pre-programming may be required, it is not acceptable if all application
definitions have to be completed using a professional 4GL.
It is necessary to enable the user to define new ad hoc calculations as part of the analysis and
to report on the data in any desired way, without having to program, so this criterion excludes
products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation
flexibility.
Shared − The system should implement all the security requirements for confidentiality
(possibly down to cell level) and, if multiple write access is required, concurrent update locking
at an appropriate level. Not all applications require users to write data back, but for the
increasing number that do, the system must be able to handle multiple updates in a timely,
secure manner. This is a major area of weakness in some OLAP products, which tend to
assume that all OLAP applications will be read-only, with simple security controls.
Multidimensional − The system should provide a multidimensional conceptual view of the
data, including full support for hierarchies and multiple hierarchies. The definition does not set
a specific minimum number of dimensions that must be handled, as this is too application
dependent and most products seem to have enough for their target market.
Information − Information is all of the data and derived data needed, wherever it is and
however much is relevant for the application. We measure the capacity of products in terms of
how much input data they can handle, not how many gigabytes they take to store it.