Model Question Paper and Solution – DWDM

1.​ Define the term support and confidence in association rule mining.

(2023)
Ans. Support and confidence are two measures of rule interestingness. They
respectively reflect the usefulness and certainty of discovered rules. A support of
2% for the rule computer => antivirus software means that 2% of all the transactions
under analysis show that computer and antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold. These
thresholds can be set by users or domain experts. Additional analysis can be
performed to discover interesting statistical correlations between associated items.
Support
In data mining, support refers to the relative frequency of an item set in a dataset.
For example, if an itemset occurs in 5% of the transactions in a dataset, it has a
support of 5%. Support is often used as a threshold for identifying frequent item
sets in a dataset, which can be used to generate association rules. For example, if
we set the support threshold to 5%, then any itemset that occurs in more than 5%
of the transactions in the dataset will be considered a frequent itemset.
The support of an itemset is the number of transactions in which the itemset
appears, divided by the total number of transactions. For example, suppose we
have a dataset of 1000 transactions, and the itemset {milk, bread} appears in 100
of those transactions. The support of the itemset {milk, bread} would be calculated
as follows:
Support({milk, bread}) = Number of transactions containing
{milk, bread} / Total number of transactions
= 100 / 1000
= 10%
So the support of the itemset {milk, bread} is 10%. This means that in 10% of the
transactions, the items milk and bread were both purchased.
In general, the support of an itemset can be calculated using the following formula:
Support(X) = (Number of transactions containing X) / (Total number of
transactions)
where X is the itemset for which you are calculating the support.
Confidence
In data mining, confidence is a measure of the reliability or support for a given
association rule. It is defined as the proportion of cases in which the association
rule holds true, or in other words, the percentage of times that the items in the
antecedent (the “if” part of the rule) appear in the same transaction as the items in
the consequent (the “then” part of the rule).
Confidence is a measure of the likelihood that an itemset will appear if another
itemset appears. For example, suppose we have a dataset of 1000 transactions, and
the itemset {milk, bread} appears in 100 of those transactions. The itemset {milk}
appears in 200 of those transactions. The confidence of the rule “If a customer
buys milk, they will also buy bread” would be calculated as follows:
Confidence("If a customer buys milk, they will also buy bread")
= Number of transactions containing
{milk, bread} / Number of transactions containing {milk}
= 100 / 200
= 50%
So the confidence of the rule “If a customer buys milk, they will also buy bread” is
50%. This means that in 50% of the transactions where milk was purchased, bread
was also purchased.
In general, the confidence of a rule can be calculated using the following formula:
Confidence(X => Y) = (Number of transactions containing X and Y) / (Number of
transactions containing X)
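To make the two formulas above concrete, the following is a minimal Python sketch (the transaction list is illustrative, not taken from this document) that computes support and confidence over a small set of market-basket transactions.

```python
# A minimal sketch of the support and confidence formulas above.
# The transactions and item names are illustrative only.

transactions = [
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"milk", "bread"}, transactions))       # 0.4  (2 of 5 transactions)
print(confidence({"milk"}, {"bread"}, transactions))  # 0.5  (2 of the 4 milk transactions)
```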
Q.2 Draw and explain various components of a 3-tier data warehouse
architecture. (2023)
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference
data model for the whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse population. In
some cases, the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily
prepared using the corporate applications, or generating data flows to feed external
processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A
disadvantage of this structure is the extra storage space consumed by the redundant
reconciled layer. It also places the analytical tools a little further away
from being real-time.

1. The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data into
the bottom tier from operational databases or other external sources (e.g., customer
profile information provided by external consultants). These tools and utilities
perform data extraction, cleaning, and transformation (e.g., to merge similar data
from different sources into a unified format), as well as load and refresh functions
to update the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS
and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database
Connectivity).
This tier also contains a metadata repository, which stores information about
the data warehouse and its contents.

2. The middle tier is an OLAP server that is typically implemented using either a
relational OLAP(ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or a
multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that
directly implements multidimensional data and operations).

3. The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and
so on).

Q. Explain knowledge discovery in databases (KDD) in detail with the help of a diagram. (2020)
KDD stands for Knowledge Discovery in Databases. It refers to the broad process
of discovering knowledge in data and emphasizes the high-level application of
particular data mining techniques. It is an area of interest to researchers in several
fields, such as artificial intelligence, machine learning, pattern recognition,
databases, statistics, knowledge acquisition for expert systems, and data
visualization.
The main objective of the KDD process is to extract useful knowledge from data in
the context of large databases. It does this by using data mining algorithms to
identify what is considered knowledge. Knowledge Discovery in Databases
is treated as a programmed, exploratory analysis and modeling of large data
repositories. KDD is the organized process of identifying valid, useful, and
understandable patterns from large and complex data sets.
KDD is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data. The process implies that KDD involves
many steps, including data preparation, search for patterns, knowledge
evaluation, and refinement, all repeated over multiple iterations. By non-trivial, it
is meant that some search or inference is involved; that is, it is not a simple
computation of predefined quantities such as calculating the average value of a set of
numbers.
Data Mining is the core of the KDD process, involving the application of
algorithms that explore the data, develop models, and discover
previously unknown patterns. The model is used for extracting knowledge
from the data, analyzing the data, and making predictions.
Data mining is a step in the KDD process that consists of applying data analysis and
discovery algorithms that, under acceptable computational efficiency limitations,
produce a particular enumeration of patterns (or models) over the data.
The space of patterns is often infinite, and the enumeration of patterns involves
some form of search in this space. Practical computational constraints place
serious limits on the subspace that can be explored by a data mining algorithm.
The KDD process involves using the database along with any required selection,
preprocessing, subsampling, and transformations of it; applying data mining methods
(algorithms) to enumerate patterns from it; and evaluating the products of data
mining to identify the subset of the enumerated patterns deemed knowledge.
The data mining component of the KDD process is concerned with the algorithmic
means by which patterns are extracted and enumerated from the data. The
complete KDD process includes the evaluation and possible interpretation of the
mined patterns to decide which patterns can be treated as new knowledge.

The knowledge discovery process (illustrated in the accompanying figure) is iterative
and interactive, and comprises nine steps. The process is iterative at each stage,
implying that moving back to previous steps might be required. The process
has many imaginative aspects in the sense that one cannot present a single formula or
a complete scientific categorization of the correct decisions for each step
and application type. Thus, it is necessary to understand the process and the different
requirements and possibilities at each stage.

The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge. At that point, the loop is closed, and
Active Data Mining starts. Subsequently, changes would need to be made in the
application domain; for example, offering different features to cell phone users in
order to reduce churn. This closes the loop, the impacts are then measured on
the new data repositories, and the KDD process is started again. The following is a
concise description of the nine-step KDD process, beginning with a managerial step:
1. Building up an understanding of the application domain

This is the initial preliminary step. It sets the scene for understanding what
should be done with the various decisions (transformation, algorithms,
representation, etc.). The people in charge of a KDD project need to
understand and characterize the objectives of the end user and the environment
in which the knowledge discovery process will take place (including relevant prior
knowledge).

2. Choosing and creating a data set on which discovery will be performed

Once the objectives are defined, the data that will be used for the knowledge
discovery process should be determined. This involves finding out what data
is available, obtaining additional relevant data, and then integrating all the data
for knowledge discovery into one data set, including the attributes that will be
considered for the process. This step is important because Data Mining learns and
discovers from the available data, which is the evidence base for building the
models. If some significant attributes are missing, the entire study may fail;
from this respect, the more attributes that are considered, the better.
On the other hand, organizing, collecting, and operating large data repositories is
expensive, so there is a trade-off against the opportunity to best understand
the phenomena. This trade-off is one place where the interactive and
iterative nature of KDD comes into play: one begins with the best available data
sets and later expands them and observes the effect in terms of knowledge discovery
and modeling.

3. Preprocessing and cleansing

In this step, data reliability is improved. It includes data cleaning, for
example handling missing values and removing noise or outliers. It
might involve complex statistical techniques or the use of a Data Mining algorithm in
this context. For example, when one suspects that a specific attribute lacks
reliability or has many missing values, this attribute could become the
target of a supervised Data Mining algorithm: a prediction model for the
attribute is built, and the missing values can then be predicted. The
extent to which one pays attention to this step depends on many factors.
Regardless, studying these aspects is important and is often revealing in itself
with respect to enterprise data systems.

4. Data Transformation

In this stage, data appropriate for Data Mining is prepared and
developed. Techniques here include dimension reduction (for example, feature
selection and extraction, and record sampling) and attribute transformation (for
example, discretization of numerical attributes and functional transformations). This
step can be essential for the success of the entire KDD project, and it is typically
very project-specific. For example, in medical assessments, the ratio between
attributes may often be the most significant factor rather than each attribute by
itself. In business, we may need to consider effects beyond our control as well as
efforts and transient issues, for example studying the accumulated impact of
advertising. However, even if we do not use the right transformation at the start,
we may obtain a surprising effect that hints at the transformation required in
the next iteration. Thus, the KDD process feeds back on itself and leads to an
understanding of the transformations required.

5. Prediction and description

We are now ready to decide which kind of Data Mining to use, for
example classification, regression, or clustering. This depends mainly on the
KDD objectives, and also on the previous steps. There are two major
goals in Data Mining: the first is prediction, and the second is
description. Prediction is often referred to as supervised Data Mining, while
descriptive Data Mining includes the unsupervised and visualization
aspects of Data Mining. Most Data Mining techniques rely on inductive
learning, where a model is built explicitly or implicitly by generalizing from an
adequate number of training examples. The fundamental assumption of the
inductive approach is that the trained model applies to future cases. The
technique also takes into account the level of meta-learning for the specific set of
available data.

6. Selecting the Data Mining algorithm

Having chosen the technique, we now decide on the strategy. This stage includes
selecting a particular method to be used for searching for patterns, possibly
involving multiple inducers. For example, considering precision versus
understandability, the former is better with neural networks, while the latter is
better with decision trees. For each strategy of meta-learning, there are several
possibilities for how it can be accomplished. Meta-learning focuses on explaining
what causes a Data Mining algorithm to succeed or fail on a specific problem. Thus,
this methodology attempts to understand the conditions under which a Data Mining
algorithm is most suitable. Each algorithm has parameters and strategies of learning,
such as ten-fold cross-validation or another split into training and testing sets.

7. Utilizing the Data Mining algorithm

At last, the implementation of the Data Mining algorithm is reached. In this stage,
we may need to run the algorithm several times until a satisfactory result is
obtained, for example by tuning the algorithm's control parameters, such as the
minimum number of instances in a single leaf of a decision tree.
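As an illustration of this step, the following is a minimal sketch using scikit-learn (assumed to be available; the data set and parameter values are illustrative, not part of the original text) that re-runs a decision tree with different values of the minimum-instances-per-leaf parameter and compares ten-fold cross-validated accuracy.

```python
# A minimal sketch of step 7: re-running a mining algorithm with different
# control parameters (here, the minimum number of instances allowed in a
# single leaf of a decision tree) and keeping the best-scoring setting.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for min_leaf in (1, 5, 10, 20):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    score = cross_val_score(tree, X, y, cv=10).mean()  # ten-fold cross-validation
    print(f"min_samples_leaf={min_leaf}: mean accuracy={score:.3f}")
```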

8. Evaluation

In this step, we assess and interpret the mined patterns and rules, and evaluate
their reliability with respect to the objectives defined in the first step. Here we
also consider the preprocessing steps with respect to their impact on the Data Mining
algorithm's results; for example, adding a feature in step 4 and repeating from
there. This step focuses on the comprehensibility and usefulness of the induced
model. The identified knowledge is also documented for further use. The final step
is the use of, and overall feedback on, the discovery results obtained by Data Mining.

9. Using the discovered knowledge


Now we are ready to incorporate the knowledge into another system for further
action. The knowledge becomes active in the sense that we can make changes
to the system and measure the effects. The success of this step determines
the effectiveness of the whole KDD process. There are numerous challenges in this
step, such as losing the "laboratory conditions" under which we have worked. For
example, the knowledge was discovered from a certain static snapshot (usually a set
of data), but now the data becomes dynamic. Data structures may change, certain
quantities may become unavailable, and the data domain may be modified, for example
an attribute may take a value that was not expected previously.

Q. Discuss data integration and data transformation in detail. (2023)
Data integration is the process of combining data from multiple sources into a
cohesive and consistent view. This process involves identifying and accessing the
different data sources, mapping the data to a common format, and reconciling any
inconsistencies or discrepancies between the sources. The goal of data integration
is to make it easier to access and analyze data that is spread across multiple
systems or platforms, in order to gain a more complete and accurate understanding
of the data.
Data integration can be challenging due to the variety of data formats, structures,
and semantics used by different data sources. Different data sources may use
different data types, naming conventions, and schemas, making it difficult to
combine the data into a single view. Data integration typically involves a
combination of manual and automated processes, including data profiling, data
mapping, data transformation, and data reconciliation.
Data integration is used in a wide range of applications, such as business
intelligence, data warehousing, master data management, and analytics. Data
integration can be critical to the success of these applications, as it enables
organizations to access and analyze data that is spread across different systems,
departments, and lines of business, in order to make better decisions, improve
operational efficiency, and gain a competitive advantage.
There are two major approaches to data integration: the "tight coupling" approach
and the "loose coupling" approach.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store
the integrated data. The data is extracted from various sources, transformed and
loaded into a data warehouse. Data is integrated in a tightly coupled manner,
meaning that the data is integrated at a high level, such as at the level of the entire
dataset or schema. This approach is also known as data warehousing, and it enables
data consistency and integrity, but it can be inflexible and difficult to change or
update.
●​ Here, a data warehouse is treated as an information retrieval component.
●​ In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of
individual data elements or records. Data is integrated in a loosely coupled manner,
meaning that the data is integrated at a low level, and it allows data to be integrated
without having to create a central repository or data warehouse. This approach is
also known as data federation, and it enables data flexibility and easy updates, but
it can be difficult to maintain consistency and integrity across multiple data
sources.
●​ Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand, and then sends the query directly to
the source databases to obtain the result.
●​ And the data only remains in the actual source databases.
Data transformation is an essential phase in the data mining process. It entails
transforming unprocessed data into an analytically useful format. Data
transformation seeks to enhance the consistency and relevance of the data for the
desired analysis while reducing redundancy and improving data quality. Data
transformation is an essential element of data mining for several reasons. Firstly,
analyzing unstructured, erroneous or incomplete raw data can be challenging and
time-consuming. Therefore, the primary objective of data transformation is to tidy
up and organize the data to facilitate further analysis.
Second, data transformation aids in bringing down the data's complexity. For data
mining algorithms to find patterns, trends, and linkages, structured data is
necessary. By eliminating superfluous or unnecessary information and translating
the data into an appropriate format, data transformation aids in the data's
simplification.
Thirdly, data transformation makes sure the data is reliable and pertinent for the
analysis that is being performed. Different data sources may use different formats,
scales, and measurement units. Data transformation aids in standardizing the data,
allowing for better comparison and analysis.
The accuracy and efficiency of data mining algorithms may also be increased with
the aid of data transformation. Data mining algorithms can more precisely and
successfully find patterns and trends by translating the data into a suitable format.

Data transformation may be done using a variety of methods. Data cleaning, data
integration, and data reduction are three basic categories that may be used to
group these procedures.

Data Cleaning: Finding and fixing mistakes, inconsistencies, and inaccuracies in the
data is known as data cleaning.
Data Integration: Data integration is the process of merging information from many
datasets.
Data Reduction: Reducing the quantity and complexity of the data is known as data
reduction.
Strategies for data transformation include the following:
1.​ Smoothing, which works to remove noise from the data. Techniques
include binning, regression, and clustering.
2.​ Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of attributes to
help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and
annual total amounts. This step is typically used in constructing a data cube for
data analysis at multiple abstraction levels.

4. Normalization, where the attribute data are scaled so as to fall within a smaller
range, such as −1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior). The labels, in turn, can be recursively organized into
higher-level concepts, resulting in a concept hierarchy for the numeric attribute.

6. Concept hierarchy generation for nominal data, where attributes such as
street can be generalized to higher-level concepts, like city or country. Many
hierarchies for nominal attributes are implicit within the database schema and can
be automatically defined at the schema definition level.
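The following is a minimal Python sketch of two of the strategies listed above, min-max normalization and discretization; the age values and cut-points are illustrative assumptions, not taken from this document.

```python
# A minimal sketch of min-max normalization to [0.0, 1.0] and discretization
# of a numeric attribute (age) into conceptual labels. Values are illustrative.

ages = [13, 22, 35, 47, 58, 64, 71]

# Min-max normalization: v' = (v - min) / (max - min)
lo, hi = min(ages), max(ages)
normalized = [(v - lo) / (hi - lo) for v in ages]

# Discretization into conceptual labels (hypothetical cut-points)
def label(age):
    if age <= 20:
        return "youth"
    elif age <= 60:
        return "adult"
    return "senior"

labels = [label(v) for v in ages]
print(normalized)
print(labels)  # ['youth', 'adult', 'adult', 'adult', 'adult', 'senior', 'senior']
```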

Q.​Imagine that you need to analyze “All Electronics “ sales and customer
data (Data related to the sales of electronic items). Note that many
tuples have no recorded values for several attributes such as customer
income. How can you go about filling in the missing values for this
attribute? Explain some of the methods to handle the problem.
Ans: Missing values arise when some tuples contain no value (or a null value) for one
or more attributes, such as customer income. In this case the analysis of the data
becomes difficult. We may handle the situation in the following ways:

1. Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably. By ignoring the
tuple, we do not make use of the remaining attributes’ values in the tuple. Such
data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant. If missing values are replaced by, say, “Unknown,”
then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence,
although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value: In this approach we fill the missing values by
the central value like mean or median of the remaining data.
5. Use the attribute mean or median for all samples belonging to the same
class as the given tuple: For example, if classifying customers according to credit
risk, we may replace the missing value with the mean income value for customers
in the same credit risk category as that of the given tuple. If the data distribution
for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for income.
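The following is a minimal pandas sketch (column names and values are hypothetical) of strategies 4 and 5 above: filling missing income with the overall median, and with the mean income of the tuple's own credit-risk class.

```python
# A minimal sketch of two missing-value strategies: the overall median of the
# attribute, and the class-wise mean (mean income per credit-risk group).

import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, None, 31000, None, 48000],
})

# Strategy 4: use a measure of central tendency for the attribute
df["income_median_filled"] = df["income"].fillna(df["income"].median())

# Strategy 5: use the mean income of samples in the same credit-risk class
df["income_class_filled"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)
```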

Q. What are the different types of databases / information repositories on which the
process of data mining can be executed?
Ans: Different Types of Data that can be mined:

1. Database Data
A database system, also called a database management system (DBMS), consists
of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data. The software programs provide
mechanisms for defining database structures and data storage; for specifying and
managing concurrent, shared, or distributed data access; and for ensuring
consistency and security of the information stored despite system crashes or
attempts at unauthorized access.

2. Data Warehouses
A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of
attributes in the schema, and each cell stores the value of some aggregate measure
such as count or sum.
3. Transactional Data
In general, each record in a transactional database captures a transaction, such as
a customer’s purchase, a flight booking, or a user’s clicks on a web page. A
transaction typically includes a unique transaction identity number (trans ID) and a
list of the items making up the transaction, such as the items purchased in the
transaction. A transactional database may have additional tables, which contain
other information related to the transactions, such as item description, information
about the salesperson or the branch, and so on.
4. Other Kinds of Data
Besides relational database data, data warehouse data, and transaction data, there
are many other kinds of data that have versatile forms and structures and rather
different semantic meanings. Such kinds of data can be seen in many applications:
time-related or sequence data (e.g., historical records, stock exchange data, and
time-series and biological sequence data), data streams (e.g., video surveillance
and sensor data, which are continuously transmitted), spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or
integrated circuits), hypertext and multimedia data (including text, image, video,
and audio data), graph and networked data (e.g., social and information networks),
and the Web (a huge, widely distributed information repository made available by
the Internet). These applications bring about new challenges, like how to handle
data carrying special structures (e.g., sequences, trees, graphs, and networks) and
specific semantics (such as ordering, image, audio and video contents, and
connectivity), and how to mine patterns that carry rich structures and semantics.

Q. Explain the data models used in a data warehouse.

Ans: A data warehouse is based on a multidimensional data model, which views data in
the form of a data cube. A data cube allows data to be modeled and viewed in multiple
dimensions; it is defined by dimensions and facts.
Various multidimensional models are:
1.​ star schema,
2.​ snowflake schema, and
3.​ fact constellation.
These models use two different types of tables: fact tables and dimension tables.
Dimensions are the perspectives or entities with respect to which an organization
wants to keep records. For example, an enterprise may create a sales data
warehouse in order to keep records of the store's sales with respect to the
dimensions time, item, branch, and location. These dimensions allow the store to
keep track of things like monthly sales of items and the branches and locations at
which the items were sold. Each dimension may have a table associated with it,
called a dimension table. The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension tables.
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of
the data, with no redundancy, and (2) a set of smaller attendant tables (dimension
tables), one for each dimension. The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern around the central fact table. In this
example, sales are considered along four dimensions: time, item, branch, and
location. The schema contains a central fact table for sales that contains keys to
each of the four dimensions, along with two measures: dollars sold and units sold.
In the star schema, each dimension is represented by only one table, and each table
contains a set of attributes. For example, the location dimension table contains
the attribute set {location key, street, city, province or state, country}. This
constraint may introduce some redundancy. For example, "Urbana" and "Chicago"
are both cities in the state of Illinois, USA. Entries for such cities in the location
dimension table will create redundancy among the attributes province or state and
country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the
attributes within a dimension table may form either a hierarchy (total order) or a
lattice (partial order).
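As a sketch of how the sales star schema described above could be laid out physically, the following Python/sqlite3 snippet creates one central fact table with keys to the four dimension tables and the two measures; the exact column names and DDL are an assumption, not taken from this document.

```python
# A minimal sqlite3 sketch of the sales star schema described above: a central
# fact table holding keys to the four dimensions plus the two measures.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("star schema created")
```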
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data
into additional tables. The resulting schema graph forms a shape similar to a
snowflake. The major difference between the snowflake and star schema models is
that the dimension tables of the snowflake model may be kept in normalized form
to reduce redundancies. Such a table is easy to maintain and saves storage space.
However, this space savings is negligible in comparison to the typical magnitude of
the fact table. Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query. Consequently, the
system performance may be adversely impacted.Hence, although the snowflake
schema reduces redundancy, it is not as popular as the star schema in data
warehouse design.
Here, the sales fact table is identical to that of the star schema. The main difference
between the two schemas is in the definition of dimension tables. The single
dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables. For example, the item dimension table
now contains the attributes item key, item name, brand, type, and supplier key,
where supplier key is linked to the supplier dimension table, containing supplier
key and supplier type information. Similarly, the single dimension table for
location in the star schema can be normalized into two new tables: location and
city. The city key in the new location table links to the city dimension. Notice that,
when desirable, further normalization can be performed on province or state and
country in the snowflake schema.

Fact constellation: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of stars,
and hence is called a galaxy schema or a fact constellation.
It is a schema for representing a multidimensional model. It is a collection of
multiple fact tables sharing some common dimension tables. It can be viewed as a
collection of several star schemas and hence is also known as a galaxy schema. It is
one of the widely used schemas for data warehouse design, and it is much more
complex than the star and snowflake schemas. For complex systems, we require fact
constellations.
Figure – General structure of a fact constellation: the common dimension tables are
shared between the two star schemas, while each star schema keeps its own fact table.

Q. Discuss the issues and benefits of data mining.


Ans: Data mining is a dynamic and fast-expanding field with great strengths. We
briefly outline the major issues in data mining research, partitioning them into
five groups:
1​ mining methodology,
a. Mining various and new kind of knowledge
b. Mining Knowledge in multidimensional space
2​ user interaction,
a. Interactive mining
b. Incorporation of background knowledge
c. Ad hoc data mining and data mining query languages
d. Presentation and visualization of data mining results
3​ efficiency and scalability,
a. Efficiency of data mining algorithms
b. Parallel, distributed and incremental algorithms
4​ diversity of data types,
a.​ Handling complex type of data
b.​ Mining dynamic and global repositories
5​ data mining and society.

1.​ Mining Methodology


Researchers have been vigorously developing new data mining methodologies.
This involves the investigation of new kinds of knowledge, mining in
multidimensional space, integrating methods from other disciplines, and the
consideration of semantic ties among data objects. In addition, mining
methodologies should consider issues such as data uncertainty, noise, and
incompleteness. Some mining methods explore how user specified measures can
be used to assess the interestingness of discovered patterns as well as guide the
discovery process.

Various aspects of mining methodology:


a.​ Mining various and new kinds of knowledge: Data mining
covers a wide spectrum of data analysis and knowledge
discovery tasks, from data characterization and discrimination
to association and correlation analysis, classification,
regression, clustering, outlier analysis, sequence analysis, and
trend and evolution analysis. These tasks may use the same
database in different ways and require the development of
numerous data mining techniques. Due to the diversity of
applications, new mining tasks continue to emerge, making data
mining a dynamic and fast-growing field. For example, for
effective knowledge discovery in information networks,
integrated clustering and ranking may lead to the discovery of
high-quality clusters and object ranks in large networks.
b.​ Mining knowledge in multidimensional space: When
searching for knowledge in large data sets, we can explore the
data in multidimensional space. That is, we can search for
interesting patterns among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is
known as (exploratory) multidimensional data mining. In many
cases, data can be aggregated or viewed as a multidimensional
data cube. Mining knowledge in cube space can substantially
enhance the power and flexibility of data mining.

2.​ User Interaction


The user plays an important role in the data mining process. Interesting areas of
research include how to interact with a data mining system, how to incorporate a
user’s background knowledge in mining, and how to visualize and comprehend
data mining results.
a.​ Interactive mining: The data mining process should be highly interactive.
Thus, it is important to build flexible user interfaces and an exploratory
mining environment, facilitating the user’s interaction with the system. A
user may like to first sample a set of data, explore general characteristics of
the data, and estimate potential mining results. Interactive mining should
allow users to dynamically change the focus of a search, to refine mining
requests based on returned results, and to drill, dice, and pivot through the
data and knowledge space interactively, dynamically exploring “cube space”
while mining.
b.​ Incorporation of background knowledge: Background knowledge,
constraints, rules, and other information regarding the domain under study
should be incorporated into the knowledge discovery process. Such
knowledge can be used for pattern evaluation as well as to guide the search
toward interesting patterns.
c.​ Ad hoc data mining and data mining query languages: Query languages
(e.g., SQL) have played an important role in flexible searching because they
allow users to pose ad hoc queries. Similarly, high-level data mining query
languages or other high-level flexible user interfaces will give users the
freedom to define ad hoc data mining tasks. This should facilitate
specification of the relevant sets of data for analysis, the domain knowledge,
the kinds of knowledge to be mined, and the conditions and constraints to be
enforced on the discovered patterns. Optimization of the processing of such
flexible mining requests is another promising area of study.
d.​ Presentation and visualization of data mining results: How can a data
mining system present data mining results, vividly and flexibly, so that the
discovered knowledge can be easily understood and directly usable by
humans? This is especially crucial if the data mining process is interactive. It
requires the system to adopt expressive knowledge representations,
user-friendly interfaces, and visualization techniques.
3.​ Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining
algorithms. As data amounts continue to multiply, these two factors are especially
critical.
a.​ Efficiency and scalability of data mining algorithms: Data mining
algorithms must be efficient and scalable in order to effectively extract
information from huge amounts of data in many data repositories or in
dynamic data streams. In other words, the running time of a data mining
algorithm must be predictable, short, and acceptable by applications.
Efficiency, scalability, performance, optimization, and the ability to execute
in real time are key criteria that drive the development of many new data
mining algorithms.
b.​ Parallel, distributed, and incremental mining algorithms: The humongous
size of many data sets, the wide distribution of data, and the computational
complexity of some data mining methods are factors that motivate the
development of parallel and distributed data-intensive mining
algorithms. Such algorithms first partition the data into “pieces.” Each piece
is processed, in parallel, by searching for patterns. The parallel processes
may interact with one another. The patterns from each partition are
eventually merged.
c.​ Cloud computing and cluster computing, which use computers in a
distributed
and collaborative way to tackle very large-scale computational tasks, are also
active research themes in parallel data mining. In addition, the high cost of some
data mining processes and the incremental nature of input promote incremental
data mining, which incorporates new data updates without having to mine the
entire data “from scratch.” Such methods perform knowledge modification
incrementally to amend and strengthen what was previously discovered.
4.​ Diversity of Database Types
The wide diversity of database types brings about challenges to data mining. These
include:
a.​ Handling complex types of data: Diverse applications generate a wide
spectrum of new data types, from structured data such as relational and data
warehouse data to semi-structured and unstructured data; from stable data
repositories to dynamic data streams; from simple data objects to temporal
data, biological sequences, sensor data, spatial data, hypertext data,
multimedia data, software program code, Web data, and social network
data. It is unrealistic to expect one data mining system to mine all kinds of
data, given the diversity of data types and the different goals of data mining.
Domain- or application-dedicated data mining systems are being constructed for in
depth mining of specific kinds of data. The construction of effective and efficient
data mining tools for diverse applications remains a challenging and active area of
research.
b.​ Mining dynamic, networked, and global data repositories: Multiple sources
of data are connected by the Internet and various kinds of networks, forming
gigantic, distributed, and heterogeneous global information systems and
networks. The discovery of knowledge from different sources of
structured, semi-structured, or unstructured yet interconnected data
with diverse data semantics poses great challenges to data mining.
Mining such gigantic, interconnected information networks may help
disclose many more patterns and knowledge in heterogeneous data sets than
can be discovered from a small set of isolated data repositories. Web mining,
multisource data mining, and information network mining have become
challenging and fast-evolving data mining fields.

5.​ Data Mining and Society


How does data mining impact society?
What steps can data mining take to preserve the privacy of individuals?
Do we use data mining in our daily lives without even knowing that we do?
These questions raise the following issues:
Social impacts of data mining: With data mining penetrating our everyday lives, it
is important to study the impact of data mining on society.
How can we use data mining technology to benefit society?
How can we guard against its misuse?
The improper disclosure or use of data and the potential violation of individual
privacy and data protection rights are areas of concern that need to be addressed.
Privacy-preserving data mining: Data mining will help scientific discovery,
business management, economy recovery, and security protection (e.g., the
real-time discovery of intruders and cyberattacks). However, it poses the risk of
disclosing an individual’s personal information. Studies on privacy-preserving data
publishing and data mining are ongoing. The philosophy is to observe data
sensitivity and preserve people’s privacy while performing successful data mining.

Q. What do you understand by ETL? What is its significance in data mining?

Ans: ETL, which stands for extract, transform and load, is a data integration
process that combines data from multiple data sources into a single, consistent data
store that is loaded into a data warehouse or other target system.
As the databases grew in popularity in the 1970s, ETL was introduced as a process
for integrating and loading data for computation and analysis, eventually becoming
the primary method to process data for data warehousing projects.

ETL provides the foundation for data analytics and machine learning workstreams.
Through a series of business rules, ETL cleanses and organizes data in a way
which addresses specific business intelligence needs, like monthly reporting, but it
can also tackle more advanced analytics, which can improve back-end processes
or end user experiences. ETL is often used by an organization to:

●​ Extract data from legacy systems
●​ Cleanse the data to improve data quality and establish consistency
●​ Load data into a target database
The easiest way to understand how ETL works is to understand what happens in
each step of the process.
Extract
During data extraction, raw data is copied or exported from source locations to a
staging area. Data management teams can extract data from a variety of data
sources, which can be structured or unstructured. Those sources include but are not
limited to:
●​ SQL or NoSQL servers
●​ CRM and ERP systems
●​ Flat files
●​ Email
●​ Web pages
Transform
This is also called the staging area. In the staging area, the raw data undergoes
data processing. Here, the data is transformed and consolidated for its intended
analytical use case. This phase can involve the following tasks:
●​ Filtering, cleansing, de-duplicating, validating, and authenticating the data.
●​ Performing calculations, translations, or summarizations based on the raw
data. This can include changing row and column headers for consistency,
converting currencies or other units of measurement, editing text strings, and
more.
●​ Conducting audits to ensure data quality and compliance
●​ Removing, encrypting, or protecting data governed by industry or
governmental regulators
●​ Formatting the data into tables or joined tables to match the schema of the
target data warehouse.
Load
In this last step, the transformed data is moved from the staging area into a target
data warehouse. Typically, this involves an initial loading of all data, followed by
periodic loading of incremental data changes and, less often, full refreshes to erase
and replace data in the warehouse. For most organizations that use ETL, the
process is automated, well-defined, continuous and batch-driven. Typically, ETL
takes place during off-hours when traffic on the source systems and the data
warehouse is at its lowest.
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse
builder, CloverETL, and MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake,
Redshift, BigQuery, and Firebolt.
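The following is a minimal, illustrative ETL sketch in Python (the file name, column names, and target table are hypothetical): it extracts rows from a CSV source, transforms them (cleansing, de-duplication, simple conversions), and loads them into a SQLite table standing in for the warehouse.

```python
# A minimal ETL sketch: extract from a CSV source, transform (cleanse and
# de-duplicate), and load into a SQLite "warehouse" table. Names are hypothetical.

import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    seen, out = set(), []
    for r in rows:
        key = r["order_id"]
        if key in seen or not r["amount"]:      # de-duplicate, drop bad rows
            continue
        seen.add(key)
        out.append((key, r["customer"].strip().title(),
                    round(float(r["amount"]), 2)))
    return out

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("daily_sales.csv")), conn)
```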

ADVANTAGES and DISADVANTAGES:

Advantages of ETL process in data warehousing:

1.​ Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
2.​ Better data integration: ETL process helps to integrate data from multiple
sources and systems, making it more accessible and usable.
3.​ Increased data security: ETL process can help to improve data security by
controlling access to the data warehouse and ensuring that only authorized users
can access the data.
4.​ Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
5.​ Increased automation: ETL tools and technologies can automate and simplify
the ETL process, reducing the time and effort required to load and update data
in the warehouse.

Disadvantages of ETL process in data warehousing:

1.​ High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
2.​ Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or resources.
3.​ Limited flexibility: ETL process can be limited in terms of flexibility, as it may
not be able to handle unstructured data or real-time data streams.
4.​ Limited scalability: ETL process can be limited in terms of scalability, as it
may not be able to handle very large amounts of data.
5.​ Data privacy concerns: ETL process can raise concerns about data privacy, as
large amounts of data are collected, stored, and analyzed.
Overall, ETL process is an essential process in data warehousing that helps to
ensure that the data in the data warehouse is accurate, complete, and up-to-date.
However, it also comes with its own set of challenges and limitations, and
organizations need to carefully consider the costs and benefits before implementing
them.

Q. How can you measure dispersion of data? Explain the concept of Range, Quartile, Outliers and
Boxplot.

Ans: Dispersion of data is used to understand the distribution of data. It helps to
understand the variation of the data and provides information about its
distribution. Range, IQR, variance, and standard deviation are the methods used to
understand the distribution of data. Dispersion of data also helps to identify
outliers in a given dataset. Various methods of measuring dispersion are the range,
the interquartile range (IQR), the boxplot, outlier analysis, etc.
Range

The range is the simplest measure of dispersion or variability. The range is obtained
by subtracting the lowest value from the highest value. A wide range indicates high
variability, and a small range indicates low variability in the distribution. To
calculate the range, arrange all the values in ascending order, then subtract the
lowest value from the highest value.

Range = Highest_value – Lowest_value

For example, if the highest mark is 20 and the lowest mark is 2, the range of marks
is 18.
The range can be influenced by outliers: a single extreme value can change the range
considerably.

Interquartile Range (IQR)

The IQR is the range between Q1 (the boundary between the first and second quartile)
and Q3 (the boundary between the third and fourth quartile). The IQR is preferred
over the range because, unlike the range, the IQR is not influenced by outliers. The
IQR measures variability by splitting a data set into four equal quartiles.
The IQR is often used with a box plot to find outliers. To estimate the IQR, sort all
the values in ascending order; otherwise the calculation may yield a negative value,
which affects outlier detection.

Formula to find outliers:

[Q1 – 1.5 * IQR, Q3 + 1.5 * IQR]
If a value does not fall within the above range, it is considered an outlier.
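The following is a minimal Python sketch (with illustrative numbers) of the IQR-based outlier rule above, together with the simple range.

```python
# A minimal sketch of the IQR outlier rule: values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.

import statistics

data = sorted([2, 4, 4, 5, 6, 7, 7, 8, 9, 30])   # 30 is an obvious outlier

q1, _, q3 = statistics.quantiles(data, n=4)       # quartile boundaries
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print("Range:", max(data) - min(data))            # simple range
print("IQR:", iqr, "bounds:", (lower, upper))
print("Outliers:", outliers)                      # [30]
```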

Standard Deviation

The standard deviation is the square root of the variance, which brings the measure
back to the original units of the data. A low standard deviation indicates that the
data points are close to the mean, while a high standard deviation indicates that
they are spread out. The normal distribution is conventionally used to help interpret
the standard deviation: σ = sqrt( Σ(xᵢ − x̄)² / N ), where x̄ denotes the mean value.

Box Plot
It captures the summary of the data effectively and efficiently with only a
simple box and whiskers. Boxplot summarizes sample data using 25th, 50th,
and 75th percentiles. One can just get insights(quartiles, median, and outliers)
into the dataset by just looking at its boxplot.

In the accompanying boxplot, one can clearly see that values above 10 act as
outliers.

Q. What is meant by slice and dice in context of OLAP. Explain.

Ans: Slice and Dice are two important operations of OLAP.

1.​ Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is
selected by choosing the following dimensions with criteria:
●​ Location = "Delhi" or "Kolkata"
●​ Time = "Q1" or "Q2"
●​ Item = "Car" or "Bus"
2.​ Slice: It selects a single dimension from the OLAP cube, which results in the
creation of a new sub-cube. In the cube given in the overview section, Slice is
performed on the dimension Time = "Q1" (see the sketch below).
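A minimal pandas sketch of both operations, using the dimension values listed above on a small, hypothetical sales cube (the sales figures are illustrative, not from this document):

```python
# A minimal sketch of Dice and Slice on a small sales "cube" held as a
# DataFrame indexed by Location, Time, and Item.

import pandas as pd

cube = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai"],
    "Time":     ["Q1",    "Q2",    "Q1",      "Q3",      "Q1"],
    "Item":     ["Car",   "Bus",   "Car",     "Bus",     "Car"],
    "Sales":    [120,     80,      95,        60,        150],
}).set_index(["Location", "Time", "Item"])

# Dice: select a sub-cube on two or more dimensions
dice = cube.query(
    'Location in ["Delhi", "Kolkata"] and Time in ["Q1", "Q2"] and Item in ["Car", "Bus"]'
)

# Slice: fix a single dimension (Time = "Q1") to get a new sub-cube
slice_q1 = cube.xs("Q1", level="Time")

print(dice)
print(slice_q1)
```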

Q. Compare OLTP an OLAP systems.


OLAP stands for On-Line Analytical Processing. It is used for the analysis of database
information from multiple database systems at one time, such as sales analysis and
forecasting, market research, budgeting, etc. A data warehouse is an example of an
OLAP system.
OLTP stands for On-Line Transaction Processing. It is used for maintaining online
transactions and record integrity in multiple-access environments. OLTP is a system
that manages a very large number of short online transactions, for example, ATM
transactions.

Comparison of OLAP and OLTP:

1. Basic: OLAP is used for data analysis, while OLTP is used to manage a very large
number of short online transactions.
2. Database type: OLAP uses a data warehouse, while OLTP uses a traditional DBMS.
3. Data modification: OLAP is mainly used for data reading, while OLTP manages all
insert, update, and delete transactions.
4. Response time: OLAP processing is a little slow, while OLTP responds in
milliseconds.
5. Normalization: Tables in an OLAP database are not normalized, while tables in an
OLTP database are normalized.
Q. What is OLAP? What are the different operations performed in OLAP? Explain.
ANS: OLAP (Online Analytical Processing)
Data warehouse systems serve users or knowledge workers in the role of data analysis and decision
making. Such systems can organize and present data in various formats in order to accommodate the
diverse needs of different users. These systems are known as online analytical processing (OLAP)
systems. In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies. This organization provides users
with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist
to materialize these different views, allowing interactive querying and analysis of the data at hand. OLAP
operations −
●​ Roll-up
●​ Drill-down/roll down
●​ Slice and dice
●​ Pivot (rotate)

Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation
on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Figure shows the result of a roll-up operation performed on the central cube by climbing up the concept
hierarchy for location given earlier. This hierarchy was defined as the total order “street <
city < province or state < country.” The roll-up operation shown aggregates the data by ascending
the location hierarchy from the level of city to the level of country. In other words, rather than
grouping the data by city, the resulting cube groups the data by country. When roll-up is performed by
dimension reduction, one or more dimensions are removed from the given cube. For example, consider a
sales data cube containing only the location and time dimensions. Roll-up may be performed by
removing, say,
the time dimension, resulting in an aggregation of the total sales by location, rather than by location and
by time.

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions. Figure given shows the result of a drill-down operation performed on
the central cube by stepping down a concept hierarchy for time defined as “day < month < quarter <
year.” Drill-down
occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The
resulting data cube details the total sales per month rather than summarizing them by quarter. Because a
drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a
cube.

Here’s a typical example of a Drill-up or roll up OLAP operations example:


Drill down
OLAP Drill-down is the operation opposite to Drill-up. It is carried out either by descending a concept hierarchy for a dimension or by adding a new dimension. It lets a user obtain highly detailed data from a less detailed cube; when the operation is run, one or more dimensions are added to (or descended in) the data cube to provide more detailed information.
Have a look at an OLAP Drill-down example in use:
Slice
The next pair we are going to discuss is the slice and dice operations in OLAP. The Slice operation selects one particular dimension from a given cube and produces a new sub-cube, which presents the information from another point of view. It creates the sub-cube by fixing a single value (or level) for the chosen dimension, so the use of Slice implies a specified granularity level of that dimension.
An OLAP Slice example looks the following way:

Dice
The Dice operation selects two or more dimensions from a given cube and produces a new sub-cube, just as the Slice operation does for a single dimension. It works by specifying the values (or value ranges) of interest on each of the chosen dimensions.
The diagram below shows how the Dice operation works:
Pivot
This OLAP operation rotates the data axes of a cube to provide an alternative presentation of the data, for example by interchanging rows and columns, which helps to analyze the performance of a company or enterprise from a different perspective.
Here’s an example of Pivot in operation:

Q. What do you mean by a spatial database? Explain.


A spatial database stores a huge amount of space-related data, including maps, preprocessed
remote sensing or medical imaging records, and VLSI chip design data. Spatial databases
have several features that distinguish them from relational databases. They carry topological
and/or distance information, usually organized by sophisticated, multidimensional spatial
indexing structures that are accessed by spatial data access methods and often require spatial
reasoning, geometric computation, and spatial knowledge representation techniques.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such mining demands the
unification of data mining with spatial database technologies. It can be used for learning spatial
records, discovering spatial relationships and relationships among spatial and nonspatial records,
constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial
queries.
It is expected to have broad applications in geographic data systems, marketing, remote sensing,
image database exploration, medical imaging, navigation, traffic control, environmental studies,
and many other areas where spatial data are used.
A central challenge to spatial data mining is the exploration of efficient spatial data mining techniques because of the large amount of spatial data and the complexity of spatial data types and
spatial access methods. Statistical spatial data analysis has been a popular approach to analyzing
spatial data and exploring geographic information.
The term geostatistics is often associated with continuous geographic space, whereas the term
spatial statistics is often associated with discrete space. In a statistical model that manages
non-spatial records, one generally considers statistical independence among different areas of
data.
There is no such independence among spatially distributed records because, in reality, spatial objects are interrelated, or more exactly spatially co-located, in the sense that the closer the two objects are placed, the more likely they share similar properties. For example, natural resources,
climate, temperature, and economic situations are likely to be similar in geographically closely
located regions.
Such a property of close interdependency across nearby space leads to the notion of spatial
autocorrelation. Based on this notion, spatial statistical modeling methods have been developed
with success. Spatial data mining will create spatial statistical analysis methods and extend them
for large amounts of spatial data, with more emphasis on effectiveness, scalability, cooperation
with database and data warehouse systems, enhanced user interaction, and the discovery of new
kinds of knowledge.

Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.

Characteristics of Spatial Database


A spatial database system has the following characteristics

●​ It is a database system
●​ It offers spatial data types (SDTs) in its data model and query language.
●​ It supports spatial data types in its implementation, providing at least spatial indexing
and efficient algorithms for spatial join.

Example
A road map is a visualization of geographic information. A road map is a
2-dimensional object which contains points, lines, and polygons that can
represent cities, roads, and political boundaries such as states or provinces.

In general, spatial data can be of two types −

● Vector data: This data is represented as discrete points, lines and polygons.
● Raster data: This data is represented as a matrix of square cells (both types are illustrated in the sketch below).
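As a rough illustration (a hypothetical sketch not tied to any particular GIS library), vector data can be held as coordinate geometry while raster data is a grid of cells:

```python
# Hypothetical sketch: representing vector vs. raster spatial data in plain Python/NumPy.
import numpy as np

# Vector data: discrete geometric objects stored as coordinates.
city_point = (43.65, -79.38)                       # a point (lat, lon)
road_line = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.5)]   # a polyline as a list of vertices
state_polygon = [(0, 0), (4, 0), (4, 3), (0, 3)]   # a polygon boundary

# Raster data: a matrix of square cells, e.g. elevation values on a 4x4 grid.
elevation = np.array([
    [12, 14, 15, 15],
    [13, 16, 18, 17],
    [15, 19, 21, 20],
    [16, 20, 22, 23],
])

# A simple spatial computation: Euclidean distance between two vector points.
def distance(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

print(distance((0, 0), (3, 4)))   # 5.0
print(elevation.mean())           # average cell value of the raster
```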

Q. What is cluster analysis? Discuss major clustering methods.


o A cluster is a subset of similar objects. Clustering is the process of grouping a set of abstract objects into classes of similar objects. Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups. This process is often used for exploratory data analysis and can help identify patterns or relationships within the data that may not be immediately obvious. There are many different algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice of algorithm depends on the specific requirements of the analysis and the nature of the data being analyzed.

Clustering Methods
Clustering methods can be classified into the following categories −

● Partitioning Method
● Hierarchical Method
o Agglomerative Approach
o Divisive Approach
● Density-based Method
● Grid-Based Method
● Model-Based Method
● Constraint-based Method

Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters that have to be generated. Given a database D that contains N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc. K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intra-cluster) is high but the similarity of data objects with data objects outside the cluster is low (inter-cluster).

Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

Sometimes the results of K-Means clustering and hierarchical clustering may look similar, but they differ in how they work: unlike the K-Means algorithm, there is no requirement to predetermine the number of clusters.

The hierarchical clustering technique has two approaches:

1.​ Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm


starts with taking all data points as single clusters and merging them until one
cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach (a short sketch of the hierarchical approach follows below).
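A minimal sketch of the agglomerative (bottom-up) approach, assuming SciPy is available; the sample points are invented, and divisive (top-down) clustering is less commonly provided by standard libraries:

```python
# Minimal agglomerative (bottom-up) hierarchical clustering sketch using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Build the merge tree (dendrogram structure) with Ward linkage.
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # e.g. [1 1 1 2 2 2]

# dendrogram(Z) can be passed to matplotlib to visualize the hierarchy as a tree.
```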

Density-based clustering
Density-Based Clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. The data points in the low-density regions that separate two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the object is called a core object.
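A small sketch using scikit-learn's DBSCAN (assumed to be available); eps and min_samples correspond to the ε radius and MinPts described above, and the sample points are invented:

```python
# Minimal DBSCAN sketch: eps is the ε-neighborhood radius, min_samples is MinPts.
# Points that belong to no dense region are labelled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],      # dense region A
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],      # dense region B
              [50.0, 50.0]])                            # an isolated point (noise)

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]  -> two clusters plus one noise point
```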

Q. What is prediction accuracy and how is it measured in classification and prediction tasks?

Data mining can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Prediction accuracy is the ability of a trained model to correctly predict the class label (in classification) or the numeric value (in prediction) of new, previously unseen data; it is normally estimated on data that was not used to build the model.

Different methods of accuracy measurement are:

HoldOut

In the holdout method, the given dataset is randomly divided into three subsets:
● A training set is the subset of the dataset that is used to build predictive models.
● The validation set is the subset of the dataset that is used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model’s parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
● Test sets, or unseen examples, are the subset of the dataset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Basically, two-thirds of the data are allocated to the training set and the remaining one-third is allocated to the test set.

Random Subsampling

● Random subsampling is a variation of the holdout method, in which the holdout method is repeated K times.
● Each repetition involves randomly splitting the data into a training set and a test set.
● The model is trained on the training set and the mean square error (MSE) is obtained from the predictions on the test set.
● Because the MSE depends on the split, a new split can give a new MSE, so relying on a single split is not recommended.
● The overall accuracy is calculated as E = (1/K) Σ_{i=1..K} E_i.
Cross-Validation

● K-fold cross-validation is used when there is only a limited amount of data available, to achieve an unbiased estimate of the performance of the model.
● Here, we divide the data into K subsets of equal size.
● We build models K times, each time leaving out one of the subsets from training and using it as the test set.
● If K equals the sample size, this is called “Leave-One-Out” cross-validation.

Bootstrapping

Bootstrapping is a technique used to make estimations from the data by taking the average of estimates obtained from smaller data samples.

The bootstrapping method involves the iterative resampling of a dataset with replacement.

● With resampling, instead of estimating the statistics only once on the complete data, we can do it many times.
● Repeating this multiple times helps to obtain a vector of estimates.
● Bootstrapping can compute the variance, expected value, and other relevant statistics of these estimates (a small sketch of these estimation methods follows below).
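A compact sketch of these estimation methods, assuming scikit-learn is available; the Iris dataset and decision-tree classifier are placeholders, not part of the original answer:

```python
# Sketch of holdout, k-fold cross-validation, and bootstrap resampling
# for estimating prediction accuracy. Dataset and classifier are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Holdout: roughly two-thirds for training, one-third for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# K-fold cross-validation: overall accuracy is the mean of the K fold accuracies.
cv_acc = cross_val_score(model, X, y, cv=10).mean()

# Bootstrapping (simplified illustration): repeatedly resample with replacement
# and average the estimates; a stricter bootstrap would test on out-of-bag samples.
boot_accs = []
for i in range(100):
    Xb, yb = resample(X, y, replace=True, random_state=i)
    boot_accs.append(model.fit(Xb, yb).score(X_te, y_te))
boot_acc = np.mean(boot_accs)

print(holdout_acc, cv_acc, boot_acc)
```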

Q. What is classification by backpropagation? How does it apply to neural networks?

Backpropagation is an algorithm that propagates the error from the output nodes back to the input nodes; in other words, it is used to propagate back all the errors. It is applied in various data mining applications, such as character recognition, signature verification, etc.

Neural Network
A neural network is an information-processing paradigm inspired by the human nervous system. Just as the human nervous system has neurons, a neural network has artificial neurons. The human brain has about 10 billion neurons, each connected to an average of 10,000 other neurons. Each neuron receives a signal through a synapse, which controls the effect of the signal on the neuron.
Backpropagation
It is an algorithm widely used for training neural networks. It computes the gradient of the loss function with respect to the weights of the network, and it is efficient because it computes the gradient for each weight directly. With its help, it is possible to use gradient methods to train multi-layer networks and update the weights to minimize the loss; variants such as gradient descent or stochastic gradient descent are often used.

The main work of the backpropagation algorithm is to compute the gradient of the loss function with respect to each weight via the chain rule, computing the gradient layer by layer and iterating backwards from the last layer to avoid redundant computation of intermediate terms in the chain rule.

Features of Backpropagation:
There are several features of backpropagation. These features are as follows.

1. It is a gradient method, as used in the case of a simple perceptron network with a differentiable unit.
2. It differs from other networks in the way the weights are calculated during the learning period of the network.
3. There are three stages used in training. These three stages are as follows:

o the feed-forward of the input training pattern, the calculation, and the backpropagation of the error;
o the updation of the weights.
A neural network generates an output vector from the input vector it operates on. It compares the generated output with the desired output and generates an error report if the result does not match the desired output vector. Then it adjusts the weights accordingly to get the desired output. It is based on gradient descent and updates the weights by minimizing the error between the predicted and actual output.
Training by backpropagation consists of three stages:
● Forward propagation of input data.
● Backward propagation of error.
● Updating weights to reduce the error.
Let’s walk through an example of backpropagation in machine learning. Assume the neurons use the sigmoid activation function for the forward and backward pass, the target output is 0.5, and the learning rate is 1.

The weighted sum at each node is calculated as:

a_j = Σ_i (w_{i,j} · x_i)

where:
● a_j is the weighted sum of all the inputs and weights at node j,
● w_{i,j} represents the weight between the i-th input and the j-th neuron,
● x_i represents the value of the i-th input.

Output: after applying the activation function to a_j, we get the output of the neuron:

o_j = activation function(a_j)
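The single-neuron case of this example (sigmoid activation, target output 0.5, learning rate 1) can be sketched as follows; the input values and initial weights are assumed purely for illustration:

```python
# Single-neuron backpropagation sketch: sigmoid activation, target 0.5, learning rate 1.
# The inputs and initial weights below are assumed purely for illustration.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x = [0.35, 0.9]          # inputs x_i (assumed)
w = [0.3, 0.5]           # weights w_i (assumed)
target, lr = 0.5, 1.0

for step in range(3):
    # Forward pass: weighted sum a_j and output o_j = sigmoid(a_j).
    a = sum(wi * xi for wi, xi in zip(w, x))
    o = sigmoid(a)

    # Gradient of the squared error E = 1/2 * (target - o)^2 with respect to a (chain rule).
    delta = (o - target) * o * (1 - o)

    # Backward pass: update each weight in the direction that reduces the error.
    w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    print(f"step {step}: output = {o:.4f}, error = {0.5 * (target - o) ** 2:.5f}")
```

Each iteration the output moves closer to the target 0.5 as the weights are adjusted by the propagated error.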
​ ​

Q. Explain the concept of partitioning methods in cluster analysis. What are the
advantages and limitations of partitioning methods?
This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters that have to be generated. Given a database D that contains N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc. The working of the K-Means algorithm is described below. K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intra-cluster) is high but the similarity of data objects with data objects outside the cluster is low (inter-cluster). The similarity of a cluster is determined with respect to the mean value of the cluster; it is a type of squared-error algorithm. At the start, K objects are chosen randomly from the dataset, each of which represents a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then calculated with the added data objects.

Algorithm: K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated memberships.
4. Repeat Steps 2 and 3 until no change occurs.
Figure – K-Means clustering flowchart

Example: Suppose we want to group the visitors to a website using just their age
as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Initial Cluster:
K=2

Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No change Between Iteration 3 and 4, so we stop. Therefore we get the
clusters (16-29) and (36-66) as 2 clusters we get using K Mean Algorithm.
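The same example can be reproduced with a short Python sketch; the initial centroids 16 and 22 are the ones chosen above:

```python
# K-Means sketch on the website-visitor ages from the example above (K = 2).
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]           # initial centroids chosen as in the example

while True:
    # Assignment step: put each age in the cluster with the nearest centroid.
    clusters = [[], []]
    for age in ages:
        k = min(range(2), key=lambda i: abs(age - centroids[i]))
        clusters[k].append(age)

    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # stop when no change occurs
        break
    centroids = new_centroids

print(centroids)   # approximately [20.5, 48.89]
print(clusters)    # the clusters (16-29) and (36-66), matching the iterations above
```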

Advantages:
●​ Simple Implementation K-Means is relatively easy to understand and implement,
making it accessible to both novice and professional data miners.
●​ Fast Computation The algorithm is computationally efficient, allowing for quick
clustering of large datasets. It can handle a high volume of data points in a
reasonable amount of time.
●​ Scalability K-Means can handle datasets with a large number of dimensions without
sacrificing performance. This makes it suitable for analyzing complex data structures
found in various applications.
●​ Flexibility The algorithm allows for flexibility in defining the number of clusters
desired. Data analysts can select the appropriate number of clusters based on their
specific requirements.
● Simple centroid representation K-Means uses the mean of the cluster members as the centroid, which gives a compact and easily computed summary of each cluster (although, as noted under the disadvantages, this also makes it sensitive to noise and outliers).
●​ Interpretable Results The output generated by K-Means is easy to interpret since
each cluster represents a distinct group or subset of the dataset based on similarity
or proximity.
●​ Versatility K-Means can be used for various types of data analysis tasks, including
customer segmentation, image compression, anomaly detection, and
recommendation systems.
●​ Incremental Updating The K-Means algorithm can be updated incrementally when
new data points are added or removed from the dataset, making it suitable for
real-time or streaming applications.
●​ Applicable to Large Datasets K-Means has been successfully applied to deal with big
data problems due to its efficiency and scalability.
●​ Widely Supported Many programming languages and software libraries provide
implementations for K-Means algorithm, making it readily available and applicable
across different platforms.

Disadvantages
●​ Sensitivity to initial cluster centers The outcome of K-Means clustering heavily
depends on the initial selection of cluster centers. Different initializations can lead to
different final results, making it challenging to obtain the optimal clustering solution.
●​ Assumes isotropic and spherical clusters K-Means assumes that clusters are
isotropic (having equal variance) and spherical in shape. This assumption may not
hold for all types of datasets, especially when dealing with irregularly shaped or
overlapping clusters.
●​ Difficulty handling categorical variables K-Means is primarily designed for numerical
data analysis and struggles with categorical variables. It cannot handle non-numeric
attributes directly since the distance between categorical values cannot be
calculated effectively.
●​ Influence of outliers Outliers can significantly impact the performance of K-Means
clustering. Since K-Means is sensitive to distance measures, outliers can distort the
centroids and affect cluster assignments, leading to less accurate results.
●​ Requires predefined number of clusters One major drawback of K-Means is that you
need to specify the number of desired clusters before running the algorithm.
Determining an appropriate number of clusters in advance can be challenging and
subjective, especially when working with complex datasets.
●​ Struggles with high-dimensional data As the dimensionality of data increases, so
does the "curse of dimensionality." In high-dimensional spaces, distances between
points become less meaningful, making it difficult for K-Means to find meaningful
clusters accurately.
●​ Lack of robustness against noise or outliers While mentioning this point earlier
regarding outliers, it's worth noting that even a small amount of noise or outliers can
severely impact the performance of K-Means clustering by leading to incorrect
cluster assignments.
●​ Limited applicability to non-linear data K-Means assumes that clusters are linearly
separable, which means it may not perform well on datasets with non-linear
structures where the decision boundaries are curved or irregular.

Q. Differentiate classification and clustering. Explain the major idea of Bayesian classification.
Both classification and clustering are used for the categorization of objects into one or more classes based on their features. They appear to be similar processes, as the basic difference is minute: in the case of classification, there are predefined labels assigned to each input instance according to its properties, whereas in clustering those labels are missing.

Comparison between Classification and Clustering:

Parameter | CLASSIFICATION | CLUSTERING
Type | Used for supervised learning | Used for unsupervised learning
Basic | Process of classifying the input instances based on their corresponding class labels | Grouping the instances based on their similarity without the help of class labels
Need | It has labels, so a training and testing dataset is needed for verifying the model created | There is no need of a training and testing dataset
Complexity | More complex as compared to clustering | Less complex as compared to classification
Example Algorithms | Logistic regression, Naive Bayes classifier, Support Vector Machines, etc. | k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.

Differences between Classification and Clustering


1.​ Classification is used for supervised learning whereas clustering is used for
unsupervised learning.
2.​ The process of classifying the input instances based on their
corresponding class labels is known as classification whereas grouping the
instances based on their similarity without the help of class labels is known
as clustering.
3. As classification has labels, a training and testing dataset is needed for verifying the model created, but there is no need for a training and testing dataset in clustering.
4.​ Classification is more complex as compared to clustering as there are
many levels in the classification phase whereas only grouping is done in
clustering.
5.​ Classification examples are Logistic regression, Naive Bayes classifier,
Support vector machines, etc. Whereas clustering examples are k-means
clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM)
clustering algorithm, etc.

Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership
probabilities such as the probability that a given tuple belongs to a particular
class.
The main idea behind the Naive Bayes classifier is to use Bayes' theorem to classify data based on the probabilities of different classes given the features of the data. It is used mostly in high-dimensional text classification.
● The Naive Bayes classifier is a simple probabilistic classifier with a very small number of parameters, which is used to build ML models that can predict at a faster speed than other classification algorithms.
● It is called naive because it assumes that one feature in the model is independent of the existence of any other feature. In other words, each feature contributes to the prediction with no relation to the others.
● The Naïve Bayes algorithm is used in spam filtering, sentiment analysis, classifying articles, and many more applications.
● Bayes' theorem states that the probability of a hypothesis H given some observed event E is proportional to the likelihood of the event given the hypothesis, multiplied by the prior probability of the hypothesis, as shown below:

P(H|E) = P(E|H) · P(H) / P(E)

where P(H|E) is the posterior probability of the hypothesis given the event E, P(E|H) is the likelihood or conditional probability of the event given the hypothesis, P(H) is the prior probability of the hypothesis, and P(E) is the probability of the event.
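As a brief illustration with hypothetical numbers, Bayes' theorem can be applied directly, and a library classifier such as scikit-learn's GaussianNB applies the same rule per class under the naive independence assumption:

```python
# Bayes' theorem with hypothetical numbers: P(spam | word) from priors and likelihoods.
p_spam = 0.2                   # prior P(H): 20% of mail is spam (assumed)
p_word_given_spam = 0.6        # likelihood P(E|H) (assumed)
p_word_given_ham = 0.05        # P(E|not H) (assumed)

p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)   # P(E)
p_spam_given_word = p_word_given_spam * p_spam / p_word                 # P(H|E)
print(round(p_spam_given_word, 3))   # 0.75 -> the posterior probability

# A Naive Bayes classifier applies the same rule for each class, assuming the
# features are conditionally independent; e.g. with scikit-learn:
from sklearn.naive_bayes import GaussianNB
X = [[1.0, 2.1], [1.2, 1.9], [7.8, 8.0], [8.1, 7.9]]   # toy feature vectors
y = [0, 0, 1, 1]
clf = GaussianNB().fit(X, y)
print(clf.predict([[1.1, 2.0]]))   # -> [0]
```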
Q. Explain classification along with decision tree algorithm.

Classification is a supervised data mining task that assigns data items to predefined classes on the basis of their attribute values; a decision tree is one of the most widely used classification models. A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. It has a hierarchical tree structure that starts with one main question at the top, called a node, which further branches out into different possible outcomes, where:
● Root Node is the starting point that represents the entire dataset.
● Branches: These are the lines that connect nodes. They show the flow from one decision to another.
● Internal Nodes are points where decisions are made based on the input features.
● Leaf Nodes: These are the terminal nodes at the end of branches that represent final outcomes or predictions.

We have mainly two types of decision tree based on the nature of the target
variable: classification trees and regression trees.
●​ Classification trees: They are designed to predict categorical outcomes
means they classify data into different classes. They can determine
whether an email is “spam” or “not spam” based on various features of the
email.
● Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features.
The leaves are the decisions or the final outcomes. And the decision nodes are

where the data is split.

An example of a decision tree can be explained using a binary tree. Let’s say you want to predict whether a person is fit given information like their age, eating habits, and physical activity. The decision nodes are questions like ‘What’s the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizzas?’, and the leaves are outcomes like ‘fit’ or ‘unfit’. In this case it is a binary classification problem (a yes/no type problem).
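A compact sketch of such a classification tree with scikit-learn (assumed available); the fitness features and labels below are invented to mirror the example:

```python
# Classification-tree sketch mirroring the "fit or unfit" example.
# Features (age, exercises, many_pizzas) and labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 1, 0], [45, 0, 1], [30, 1, 1], [52, 0, 0], [23, 1, 0], [40, 0, 1]]
y = ["fit", "unfit", "fit", "unfit", "fit", "unfit"]   # target class labels

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned tree splits on the most informative feature at each internal node.
print(export_text(tree, feature_names=["age", "exercises", "many_pizzas"]))
print(tree.predict([[35, 1, 0]]))   # e.g. ['fit']
```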

Q.​Explain EIS in context of business information.


Ans: An executive information system (EIS) is a decision support system (DSS) used to assist
senior executives in the decision-making process. It does this by providing easy access to
important data needed to achieve strategic goals in an organization. An EIS normally features
graphical displays on an easy-to-use interface. Executive information systems can be used in
many different types of organizations to monitor enterprise performance as well as to identify
opportunities and problems.
Current EIS data is available on local area networks (LANs) throughout the company or
enterprise, facilitated by personal computers and workstations. Employees can access company
data to help make decisions in their workplaces, departments, divisions, etc. This enables
employees to provide relevant information and ideas both above and below their level in the company.
Executive support systems are intended to be used directly by senior managers to support
unscheduled strategic management decisions. Often such information is external, unstructured
and even uncertain. Often, the exact scope and context of such information are not known in
advance.
This information is based on data,

●​ Business intelligence
●​ Financial intelligence
●​ Data with technology support to analyze

An EIS has four major components, which are:

●​ Hardware
●​ Software
●​ User interface (UI)
●​ Telecommunications capability

Hardware
An EIS’s hardware should include input devices that executives can use to enter, check, and
update data; a central processing unit (CPU) that controls the entire system; data storage for
saving and archiving useful business information; and output devices (e.g., monitors, printers,
etc.) that show visual representations of the data executives need to keep or read.

Software
An EIS’s software should be able to integrate all available data into cohesive results. It should be
capable of handling text and graphics; connected to a database that contains all relevant internal
and external data; and have a model base that performs routine and special statistical, financial,
and other quantitative analyses.

User interface (UI)


This component should be capable of producing scheduled reports, FAQs, and other information.
It would be best if it’s menu-driven, too, allowing executives to pick from predetermined choices
for their needs. And since not all executives are tech-savvy, it’s ideal for the UI to accept inputs
and produce outputs using programming (i.e., for the tech-savvy) and natural language (i.e., for
the not tech-savvy).
Telecommunications capability
Since most executives often travel, an EIS should have telecommunications capability. That way,
it remains accessible regardless of location.

Executive Information System-Key Characteristics


The below-mentioned figure describes the key characteristics of an EIS,

● Detailed data – EIS provides absolute data from its existing database.
● Integrate external and internal data – EIS integrates external and internal data; the external data is collected from various sources.
● Presenting information – EIS presents the available data in graphical form, which helps to analyze it easily.
● Trend analysis – EIS helps the executives of an organization with data prediction based on trend data.
● Easy to use – It is a very simple system to use.

Advantages of EIS

●​ Trend Analysis
●​ Improvement of corporate performance in the marketplace
●​ Development of managerial leadership skills
●​ Improves decision-making
●​ Simple to use by senior executives
●​ Better reporting method
●​ Improved office efficiency

Disadvantage of EIS

● Due to its technical functions, it is not easy for everyone to use


●​ Executives may encounter overload of information
●​ Difficult to manage database due to the large size of data
●​ Excessive costs for small business organizations

Q.​Explain different types of OLAP servers.


Ans: We have four types of OLAP servers −
●​ Relational OLAP (ROLAP)
●​ Multidimensional OLAP (MOLAP)
●​ Hybrid OLAP (HOLAP)
●​ Specialized SQL Servers

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to
store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers
include optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends to have greater scalability than MOLAP
technology. The DSS server of MicroStrategy, for example, adopts the ROLAP approach.

Relational OLAP Architecture


ROLAP includes the following components −
●​ Database server
●​ ROLAP server
●​ Front-end tool.
Advantages
●​ ROLAP servers can be easily used with existing RDBMS.
●​ Data can be stored efficiently, since no zero facts can be stored.
●​ ROLAP tools do not use pre-calculated data cubes.
● The DSS server of MicroStrategy adopts the ROLAP approach.

Disadvantages
●​ Poor query performance.
● Some limitations of scalability, depending on the technology architecture that is utilized.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views through
array-based multidimensional storage engines. They map multidimensional views directly to data cube
array structures. The advantage of using a data cube is that it allows fast indexing to precomputed
summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the
data set is sparse.

Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets:
Denser sub cubes are identified and stored as array structures, whereas sparse sub cubes employ
compression technology for efficient storage utilization.
MOLAP includes the following components −
●​ Database server.
●​ MOLAP server.
●​ Front-end tool.
Advantages
●​ MOLAP allows fastest indexing to the pre-computed summarized data.
●​ Helps the users connected to a network who need to analyze larger, less-defined data.
●​ Easier to use, therefore MOLAP is suitable for inexperienced users.
Disadvantages:
● MOLAP is not capable of containing detailed data.
●​ The storage utilization may be low if the data set is sparse.

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a
HOLAP server may allow large volumes of detailed data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid
OLAP server.

Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query
language and query processing support for SQL queries over star and snowflake schemas in a read-only
environment.

Following figure shows a summary fact table that contains both base fact data and aggregated data. The
schema is “(record identifier (RID), item, . . . , day, month, quarter, year, dollars sold),” where day, month,
quarter, and year define the sales date, and dollars sold is the sales amount. Consider the tuples with an
RID of 1001 and 1002, respectively. The data of these tuples are at the base fact level, where the sales
dates are October 15, 2010, and October 23, 2010, respectively. Consider the tuple with an RID of 5001.
This tuple is at a more general level of abstraction than the tuples 1001 and 1002. The day value has
been generalized to all, so that the corresponding time value is October 2010. That is, the dollars sold
amount shown is an aggregation representing the entire month of October 2010, rather than just October
15 or 23, 2010. The special value all is used to represent subtotals in summarized data.
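A small pandas sketch (with hypothetical dollar amounts echoing the RID example) shows how the month-level tuple is an aggregation of the base fact tuples:

```python
# Hypothetical sketch of base facts vs. an aggregated ("all" day) summary tuple.
import pandas as pd

base_facts = pd.DataFrame({
    "RID":          [1001, 1002],
    "item":         ["TV", "TV"],
    "day":          [15, 23],
    "month":        ["October", "October"],
    "quarter":      ["Q4", "Q4"],
    "year":         [2010, 2010],
    "dollars_sold": [250.60, 175.00],    # assumed amounts, not from the text
})

# Rolling up the day dimension to "all" yields the month-level summary row
# (comparable to the tuple with RID 5001 in the text).
summary = (base_facts
           .groupby(["item", "month", "quarter", "year"], as_index=False)["dollars_sold"]
           .sum())
summary.insert(2, "day", "all")   # the special value "all" marks the subtotal
print(summary)
```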
MOLAP uses multidimensional array structures to store data for online analytical processing. Most data
warehouse systems adopt a client-server architecture. A relational data store always resides at the data
warehouse/data mart server site. A multidimensional data store can reside at either the database server
site or the client site.
Q. What do you understand by the FASMI test in the context of OLAP?
Ans: It represents the characteristics of an OLAP application in a specific way, without dictating how it should be implemented.
Fast − It means that the system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.
Independent research in the Netherlands has shown that end-users assume that a process has failed if results are not received within 30 seconds, and they are apt to hit ‘Alt+Ctrl+Delete’ unless the system warns them that the report will take longer.
Analysis − It means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user. Although some pre-programming may be needed, it is not acceptable if all application definitions have to be completed using a professional 4GL.
It is necessary to allow the user to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program, so this excludes products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility.
Shared − It means that the system implements all the security requirements for confidentiality (possibly down to cell level) and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications require users to write data back, but for the increasing number that do, the system must be able to handle multiple updates in a timely, secure manner. This is a major area of weakness in some OLAP products, which tend to assume that all OLAP applications will be read-only, with simple security controls.
Multidimensional − The system should support a multidimensional conceptual view of the
data, including full support for hierarchies and multiple hierarchies. No specific minimum number of dimensions is mandated, as this is too application-dependent and most products seem to have enough for their target markets.
Information − Information is all of the data and derived information needed, wherever it is and however much is relevant for the application. The capacity of products is measured in terms of how much input data they can handle, not how many gigabytes they take to store it.
