DW notes

The document provides an overview of On-Line Transaction Processing (OLTP) systems, detailing their characteristics, advantages, disadvantages, and challenges, as well as the data warehouse development lifecycle, features, and tools. OLTP systems are designed for transaction-oriented applications, facilitating quick data entry and retrieval, while data warehouses centralize data for analysis and reporting. The document also outlines the steps involved in data warehouse project planning and development, emphasizing the importance of business requirements and system design.


1. What is OLTP / advantages, disadvantages / challenges?

Ans-
On-Line Transaction Processing (OLTP) systems are systems that manage transaction-oriented applications.
They are designed to support online transactions and process queries quickly, often over the Internet.
For example, the POS (point of sale) system of any supermarket is an OLTP system.
Every industry in today's world uses OLTP systems to record its transactional data. The main concern of OLTP
systems is to enter, store, and retrieve data. They cover all day-to-day operations of an organization, such as
purchasing, manufacturing, payroll, and accounting. Such systems have large numbers of users who conduct short
transactions. They support simple database queries, so the response time to any user action is very fast.
The data acquired through an OLTP system is stored in a commercial RDBMS, which can be used by an OLAP system
for data analytics and other business intelligence operations.
Some other examples of OLTP systems include order entry, retail sales, and financial transaction systems.
Advantages of an OLTP System:
 OLTP systems are user friendly and can be used by anyone with a basic understanding of them.
 They allow users to read, write, and delete data quickly.
 They respond to user actions immediately because queries are processed very quickly.
 These systems are the original source of the data.
 They help administer and run fundamental business tasks.
 They help widen an organization's customer base by simplifying individual processes.
Disadvantages of an OLTP System:
 OLTP lacks proper methods of transferring products to buyers by themselves.
 OLTP systems are prone to hackers and cybercriminals due to worldwide availability.
 Server failure can lead to the loss of a large amount of data from the system.
 The number of queries and updates to the system is limited.
 In business-to-business (B2B) transactions, some transactions must go offline to complete some stages,
leading to buyers and suppliers losing some OLTP efficiency benefits.
OLTP Challenges:
 Performance: OLTP systems require high-performance hardware and optimized software to maintain
responsiveness and support real-time transaction processing.
 Data Security: Ensuring the confidentiality, integrity, and availability of sensitive data is crucial in OLTP
systems, necessitating robust security measures.
 Scalability: As transaction volumes increase, OLTP systems must scale to accommodate growing workloads
without compromising performance or data integrity.
 Concurrency Control: OLTP systems need to manage multiple users accessing and modifying data
simultaneously while maintaining data consistency.
 System Maintenance: OLTP systems require regular maintenance, such as backups, updates, and tuning, to
ensure optimal performance and reliability.
OLTP Characteristics
1. Short response time
OLTP systems maintain very short response times to be effective for users. For example, responses from an ATM
operation need to be quick to make the process effective, worthwhile, and convenient.
2. Process small transactions
OLTP systems support numerous small transactions, each touching a small amount of data, executed simultaneously over the
network. The workload can be a mixture of queries and Data Manipulation Language (DML) statements, normally
including insertions, deletions, updates, and related actions. Response time measures the effectiveness of OLTP
transactions, and millisecond responses are becoming common.
3. Data maintenance operations
Data maintenance operations are data-intensive computational reporting and data update programs that run
alongside OLTP systems without interfering with user queries.
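To make the idea of a short OLTP transaction concrete, below is a minimal, hypothetical sketch using Python's built-in sqlite3 module. The pos.db file, the sales table, and its columns are assumptions made for this illustration only.

```python
import sqlite3

# Hypothetical point-of-sale schema; names and columns are illustrative only.
conn = sqlite3.connect("pos.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sales (
    sale_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    item_code TEXT    NOT NULL,
    quantity  INTEGER NOT NULL,
    amount    REAL    NOT NULL,
    sold_at   TEXT    DEFAULT CURRENT_TIMESTAMP)""")

def record_sale(item_code: str, quantity: int, unit_price: float) -> int:
    """A typical short OLTP transaction: one small, atomic write."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO sales (item_code, quantity, amount) VALUES (?, ?, ?)",
            (item_code, quantity, quantity * unit_price),
        )
        return cur.lastrowid

print(record_sale("SKU-101", 2, 49.50))  # a fast, millisecond-scale insert
```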
2. Data warehouse development lifecycle?
Ans-
Data warehouse development life cycle
Data warehousing is a process used to gather and manage structured and unstructured data from multiple
sources in a centralized repository in order to drive actionable business decisions. With all of your data in one place, it
becomes easier to perform analysis and reporting and to discover meaningful insights at different levels of
aggregation. A data warehouse environment includes an extraction, transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of
gathering data and delivering it to business users. The term data warehouse life-cycle indicates the steps a data
warehouse system goes through over its lifetime. The following is the life-cycle of data warehousing:
Requirement Specification: It is the first step in the development of the Data Warehouse and is done by business
analysts. In this step, Business Analysts prepare business requirement specification documents. More than 50% of
requirements are collected from the client side and it takes 3-4 months to collect all the requirements. After the
requirements are gathered, the data modeler starts recognizing the dimensions, facts & combinations based on the
requirements. We can say that this is an overall blueprint of the data warehouse. But, this phase is more about
determining business needs and placing them in the data warehouse.
Data Modelling: This is the second step in the development of the Data Warehouse. Data Modelling is the process of
visualizing data distribution and designing databases by fulfilling the requirements to transform the data into a
format that can be stored in the data warehouse. For example, whenever we start building a house, we put all the
things in the correct position as specified in the blueprint. That’s what data modeling is for data warehouses. Data
modelling helps to organize data, creates connections between data sets, and it’s useful for establishing data
compliance and its security that line up with data warehousing goals. It is the most complex phase of data
warehouse development. And, there are many data modelling techniques that businesses use for warehouse design.
Data modelling typically takes place at the data mart level and branches out in a data warehouse. It’s the logic of
how the data is stored concerning other data. There are three data models for data warehouses:
Star Schema
Snowflake Schema
Galaxy Schema.
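As an illustration of the first of these models, below is a minimal, hypothetical star schema for a sales subject area, written as SQLite DDL run from Python. The table and column names are assumptions for this sketch, not taken from the notes.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Minimal star schema: one fact table surrounded by denormalized dimension tables.
conn.executescript("""
CREATE TABLE IF NOT EXISTS dim_date (
    date_key  INTEGER PRIMARY KEY,          -- surrogate key, e.g. 20240131
    full_date TEXT, month TEXT, quarter TEXT, year INTEGER);

CREATE TABLE IF NOT EXISTS dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT, category TEXT, brand TEXT);

CREATE TABLE IF NOT EXISTS dim_store (
    store_key INTEGER PRIMARY KEY,
    city TEXT, region TEXT, country TEXT);

CREATE TABLE IF NOT EXISTS fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL);
""")
```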
ETL Design and Development: This is the third step in the development of the Data Warehouse. An ETL (Extract,
Transform, Load) tool may extract data from various source systems and store it in a data lake. An ETL process can
extract the data from the lake, then transform it and load it into a data warehouse for reporting. For optimal
speed, good visualization, and the ability to build easy, replicable, and consistent data pipelines between all of the
existing architecture and the new data warehouse, we need ETL tools. This is where ETL tools like SAS Data
Management, IBM Information Server, Hive, etc. come into the picture. A good ETL process can be helpful in
constructing a simple yet functional data warehouse that's valuable throughout every layer of the organization.
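To make the extract-transform-load flow concrete, here is a deliberately small, hypothetical Python sketch that extracts rows from the OLTP sales table assumed earlier, applies a simple cleansing rule, and loads the result into the fact_sales table from the star schema sketch above. The queries and the cleansing rule are assumptions for illustration, not a description of any specific ETL tool.

```python
import sqlite3

source = sqlite3.connect("pos.db")            # operational OLTP source (assumed)
warehouse = sqlite3.connect("warehouse.db")   # target data warehouse (assumed)

# Extract: pull transactions from the source system.
rows = source.execute(
    "SELECT item_code, quantity, amount, sold_at FROM sales").fetchall()

# Transform: derive a date key and drop obviously bad records (a simple rule).
def to_date_key(timestamp: str) -> int:
    return int(timestamp[:10].replace("-", ""))   # '2024-01-31 ...' -> 20240131

clean = [(to_date_key(sold_at), qty, amt)
         for item_code, qty, amt, sold_at in rows
         if qty > 0 and amt >= 0]

# Load: insert into the warehouse fact table (dimension-key lookups omitted here).
with warehouse:
    warehouse.executemany(
        "INSERT INTO fact_sales (date_key, units_sold, revenue) VALUES (?, ?, ?)",
        clean)
```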
OLAP Cubes: This is the fourth step in the development of the Data Warehouse. An OLAP cube, also known as a
multidimensional cube or hypercube, is a data structure that allows fast analysis of data according to the multiple
dimensions that define a business problem. A data warehouse would extract information from multiple data sources
and formats like text files, excel sheets, multimedia files, etc. The extracted data is cleaned and transformed and is
loaded into an OLAP server (or OLAP cube) where information is pre-processed in advance for further analysis.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in
row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, with
data typically obtained from different and unrelated sources. Employing a spreadsheet isn’t an optimum choice. The
cube will store and analyze multidimensional data in a logical and orderly manner. Data warehouses are now
offered as fully built products that are configurable and capable of staging multiple types of data. OLAP cubes are
becoming dated because they cannot deliver real-time analysis and reporting, which businesses increasingly
expect along with high performance.
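Although production OLAP servers pre-compute cubes internally, the idea of a multidimensional, pre-aggregated structure can be sketched with pandas; the data below is invented for illustration.

```python
import pandas as pd

# Toy multidimensional data; all values are invented.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "product": ["Phone", "Laptop", "Phone", "Laptop", "Phone", "Laptop"],
    "city":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "revenue": [120, 340, 150, 300, 90, 280],
})

# A cube-like structure: revenue pre-aggregated along three dimensions.
cube = sales.pivot_table(index="product", columns=["city", "quarter"],
                         values="revenue", aggfunc="sum", fill_value=0)
print(cube)
```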
UI Development: This is the fifth step in the development of the Data Warehouse. So far, the processes discussed
have taken place at the backend. There is a need for a user interface for how the user and a computer system
interact, in particular the use of input devices and software, to immediately access the data warehouse for analysis
and generating reports. The main aim of a UI is to enable a user to effectively manage a device or machine they’re
interacting with. There are plenty of tools in the market that help with UI development. BI tools like Tableau or
Power BI are popular choices, including for teams using BigQuery.
Maintenance: This is the sixth step in the development of the Data Warehouse. In this phase, we can update or
make changes to the schema and to the data warehouse's application domain or requirements. Data warehouse
maintenance systems must also provide means to keep track of schema modifications. At the schema level, we can
insert or change dimensions and categories; such changes include, for example, adding or deleting user-defined
attributes.
Test and Deployment: This is the final step in the Data Warehouse development cycle. Businesses and
organizations test data warehouses to verify whether the required business requirements have been implemented
successfully. Warehouse testing involves the scrutiny of enormous volumes of data. Data that has to be compared
comes from heterogeneous data sources such as relational databases, flat files, operational data, etc. The overall data
warehouse project testing phases include: data completeness, data transformation, data loading by means of ETL
tools, data integrity, etc. After testing, the data warehouse is deployed so that users can immediately access the
data and perform analysis. Basically, in this phase, the data warehouse is turned on and users can take
advantage of it. At the time of data warehouse deployment, most of its functions are implemented. Data
warehouses can be deployed in the organization's own data center or on the cloud.
3. Data warehouse features (characteristics) / advantages, disadvantages?
Ans-
Characteristics of a Data Warehouse
Integrated Data
One of the key characteristics of a data warehouse is that it contains integrated data. This means that the data is
collected from various sources, such as transactional systems, and then cleaned, transformed, and consolidated into
a single, unified view. This allows for easy access and analysis of the data, as well as the ability to track data over
time.
Subject-Oriented
A data warehouse is also subject-oriented, which means that the data is organized around specific subjects, such as
customers, products, or sales. This allows for easy access to the data relevant to a specific subject, as well as the
ability to track the data over time.
Non-Volatile
Another characteristic of a data warehouse is that it is non-volatile. This means that the data in the warehouse is
never updated or deleted, only added to. This is important because it allows for the preservation of historical data,
making it possible to track trends and patterns over time.
Time-Variant
A data warehouse is also time-variant, which means that the data is stored with a time dimension. This allows for
easy access to data for specific time periods, such as last quarter or last year. This makes it possible to track trends
and patterns over time.
Advantages of a Data Warehouse:
 Data warehouses facilitate end users' access to a variety of data.
 Assist in the operation of applications for decision support systems such as trend reports, for instance,
obtaining the products that have sold the most in a specific area over the past two years; exception reports,
reports that compare the actual outcomes to the predetermined goals.
 Using numerous data warehouses can increase the operational value of business systems, especially
customer relationship management.
 Enables higher-quality decision-making.
 For the medium and long term, it is especially helpful.
 Installing these systems is quite straightforward if the data sources and goals are clear.
Disadvantages of a Data Warehouse:
 Data warehouses may incur substantial expenditure over their lifetime. A data warehouse is
typically not static, and maintenance costs are considerable.
 Data warehouses could soon become outdated.
 They sometimes need to prepare complete information in advance of a request for information, which also costs
the organization money.
 Between data warehouses and operational systems, there is frequently a fine line. It is necessary to
determine which of these features can be used and which ones should be implemented in the data
warehouse, since it would be expensive to carry out needless activities or to stop carrying out those that
would be required.
 It can be less useful for making decisions in real time due to the prolonged processing time it may
require. In any event, the trend of modern products (along with technological advancements) addresses this
issue by turning the drawback into a benefit.
 Regarding the various objectives a company seeks to achieve, challenges may arise during implementation.
4. Steps involved in planning and development of a DW project?
Ans-
Major steps in DW Projects
Business and Technical Justification – In this phase, the project’s sponsors detail the business justification,
opportunities and benefits as well as the technical justifications for the DW project. Staffing and other necessary
resources are identified.
Business Justification:
– Review business initiatives and processes
– Enlist BI sponsors and stakeholders (e.g., potential BI users)
– Document business benefits and outcomes in terms of adding business value
– Project scope and Budgeting
Technical Justification:
– Product evaluations for proof-of-concept and technology roadmaps
– Assessing necessary technical skills/expertise
– Assessing data quality

Gathering Business Requirements (KPI’s) – In this phase business users are interviewed to determine what
measurements / metrics they require. These are called Key Performance Indicators (KPI’s) and are generally
calculated by summing and combining OLTP transactions data. This stage clearly requires deep involvement of the
business managers and others in higher level decision making positions.
Examples of KPIs for Telecommunications Industry
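As a hypothetical illustration of such a KPI, a telecom measure like average revenue per user (ARPU) can be computed by summing OLTP billing transactions; all names and figures below are invented.

```python
# Hypothetical KPI: average revenue per user (ARPU) for one month,
# computed by summing OLTP billing transactions. All values are invented.
billing_transactions = [
    {"subscriber_id": 1, "amount": 299.0},
    {"subscriber_id": 2, "amount": 199.0},
    {"subscriber_id": 1, "amount": 50.0},
    {"subscriber_id": 3, "amount": 399.0},
]

total_revenue = sum(t["amount"] for t in billing_transactions)
active_subscribers = len({t["subscriber_id"] for t in billing_transactions})
arpu = total_revenue / active_subscribers
print(f"ARPU = {arpu:.2f}")   # 947.0 / 3 subscribers = 315.67
```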
System Design / Modeling – In this phase, the overall system is designed using conceptual modeling at three levels:
System Architecture Design – Overall technology architecture (hardware and DBMS software integration) are
designed. This step can be done in parallel with data and application design.
Data Modeling – Data models (Dimensions and facts) are created and mappings / pipelines from existing operational
systems are designed.
BI Application Design – Applications are designed at the conceptual level. For example, reports, user interfaces, etc.
can be mocked up and reviewed by users.
This stage is carried out by systems analysts in conjunction with the business stakeholders.
System Development – In this phase, the designs are implemented in hardware and software. DBMS vendors are
selected, data warehouse schemas are created, ETL code is written/configured, and BI applications are coded. This
stage is carried out almost exclusively by technologists (programmers, DBAs, etc.) although business users may be
called upon for testing.
Phased deployment – In this phase, users are brought online (onboarded) to the data warehouse.
Maintenance and evolution – Data warehouses undergo continuous evolution as new KPI’s and data sources are
defined and integrated.
5. Tools for data warehouse / components?
Ans-
1.Snowflake
Snowflake is a cloud-based data warehousing platform that offers a fully managed and scalable solution for data
storage, processing, and analysis. It is designed to address the challenges of traditional on-premises data
warehousing by providing a modern and cloud-native architecture. Here are the key features of Snowflake:
Snowflake is built from the ground up for the cloud. It runs entirely in cloud environments like AWS, Azure, and
Google Cloud Platform (GCP).
The platform uses a multi-cluster, shared data architecture, which means that multiple users and workloads can
concurrently access and analyze the same data without interference.
2. SAP Data Warehouse Cloud
SAP Data Warehouse Cloud is a cloud-based data warehousing solution developed by SAP. It is designed to provide
organizations with a modern, scalable, and integrated platform for data storage, data modeling, data integration,
and analytics. Here are key features and aspects of SAP Data Warehouse Cloud:
The platform allows you to integrate data from a wide range of sources, including on-premises databases, cloud-
based applications, spreadsheets, and more.
Data Warehouse Cloud features a semantic layer that abstracts complex data structures and provides a business-
friendly view of data.
3. Oracle Autonomous Data Warehouse
Oracle Autonomous Data Warehouse (ADW) is a cloud-based data warehousing service offered by Oracle
Corporation. It is designed to simplify data management and analytics tasks by automating many of the traditionally
complex and time-consuming processes associated with data warehousing. Here are key aspects and features of
Oracle Autonomous Data Warehouse:
It supports data integration and ETL (Extract, Transform, Load) processes with built-in features for data loading and
transformation.
4. Panoply
Panoply is a managed ELT and a cloud data warehouse platform that allows users to set up a data warehouse
architecture. The cloud data warehouse eliminates the need for you to set up and maintain your own on-premises
data warehouse, saving time and resources.
Here are the key features of Panoply:
Various built-in connectors to ingest data from multiple sources
Built-in scheduler for automation
5. Teradata Vantage
Teradata Vantage is a data warehousing and analytics platform designed to handle large volumes of data and
support complex analytical workloads. The platform uses SQL as its primary query language, which means it is
mostly meant for users with SQL skills. Here are some key aspects of Teradata Vantage for data warehousing:
It can access data from various sources, including data warehouses, data lakes, on-premises systems, and cloud platforms.

Components of Data Warehouse Architecture and their tasks :


1. Operational Source –
An operational source is a data source consisting of operational data and external data.
Data can come from a relational DBMS like Informix or Oracle.
2. Load Manager –
The Load Manager performs all operations associated with the extraction of loading data in the data warehouse.
These tasks include the simple transformation of data to prepare data for entry into the warehouse.
3. Warehouse Manager –
The warehouse manager is responsible for the warehouse management process.
The operations performed by the warehouse manager are the analysis, aggregation, backup and collection of data,
de-normalization of the data.
4. Query Manager –
Query Manager performs all the tasks associated with the management of user queries.
The complexity of the query manager is determined by the end-user access operations tool and the features
provided by the database.
5. Detailed Data –
It is used to store all the detailed data in the database schema.
Detailed data is loaded into the data warehouse to complement the data collected.
6. Summarized Data –
Summarized Data is a part of the data warehouse that stores predefined aggregations
These aggregations are generated by the warehouse manager.
7. Archive and Backup Data –
The Detailed and Summarized Data are stored for the purpose of archiving and backup.
The data is relocated to storage archives such as magnetic tapes or optical disks.
8. Metadata –
Metadata is basically data about data.
It is used by the extraction and loading process, the warehouse management process, and the query management process.
9. End User Access Tools –
End-User Access Tools consist of Analysis, Reporting, and mining.
By using end-user access tools users can link with the warehouse.
6. Federation and types of federation (with diagram)?
Ans-
A federated data warehouse is a practical approach to achieving the “single version of the truth” across the
organization. The federated data warehouse is used to integrate key business measures and dimensions. The
foundations of the federated data warehouse are the common business model and common staging area.
The architecture of federated data warehouse
Regional federation possible in federated data warehouse
A big organization has various regions that provide business to customers globally. Different regional data
warehouses are built for each region to meet specific business needs. A global data warehouse is also built
to provide analytical capabilities to executives at the global level.
The difference between the regional and global data warehouse systems is the nature of the data residing at each
system level. In the regional federated data warehouse architecture, there are two data flows between the
regional and global data warehouses:
Upward federation – only fact data are moved from the regional data warehouses to the global data warehouse. The
aggregation of data can take place at the global data warehouse after the data is integrated, or during data movement.
Downward federation – in the downward federation, reference data flows from the global to the regional level. This
ensures the consistency and integrity of data across the organization. Transactional data from corporate operational
systems such as ERP and CRM are sourced at the global level and then extracted, transformed, and loaded into the
respective regional data warehouses.

Functional federation possible in federated data warehouse


A functional federated data warehouse is used when an organization has different data warehouse systems
built for specific applications such as ERP or CRM, or for specific subjects. The components of a functional federated data
warehouse architecture include data marts, custom-built data warehouses, ETL tools, cross-functional reporting
systems, real-time data stores, and reporting tools.
Benefits of federated data warehouse
Ease of implementation – A federated data warehouse integrates all legacy data warehouses and business intelligence
systems into a newer system that provides analytical capabilities across functions. The federated data warehouse
approach does not try to rebuild a new system, which would potentially cause major points of conflict.
Shorter implementation time – By integrating all legacy BI systems, the federated data warehouse approach has a
shorter implementation time in comparison with the lengthy process of building an enterprise data warehouse.
Cross-functional analytics requirements – Cross-functional analytics requirements are accomplished using common
business models across the different BI systems of each department. A federated data warehouse is the dynamic
cooperation of various business intelligence systems that makes them talk to each other.
7. Detail ER modelling and dimensional modelling / difference?
Ans-
ER model is used for logical representation or the conceptual view of data. It is a high level of the conceptual data
model. It forms a virtual representation of data that describes how all the data are related to each other. It is a
complex diagram that is used to represent multiple processes. It helps to describe entities, attributes, and
relationships. It helps to analyze data requirements systematically to produce a well-designed database. At the view
level, the ER model is considered a good option for designing databases.

Data in a warehouse are usually in the multidimensional form. Dimensional modeling prefers keeping the table
denormalized. The primary purpose of dimensional modeling is to optimize the database for faster retrieval of the
data. The concept of Dimensional Modelling was developed by Ralph Kimball and consists of “fact” and “dimension”
tables. The primary purpose of dimensional modeling is to enable business intelligence (BI) reporting, query, and
analysis.

Dimensional modeling is a form of modeling of data that is more flexible from the perspective of the user. These
dimensional and relational models have their unique way of data storage that has specific advantages. Dimensional
models are built around business processes. They need to ensure that dimension tables use a surrogate key.
Dimension tables store the history of the dimensional information.

ER Modeling vs Dimensional Modeling

 ER modeling is transaction-oriented; dimensional modeling is subject-oriented.
 ER modeling works with entities and relationships; dimensional modeling works with fact tables and dimension tables.
 ER models have few levels of granularity; dimensional models have multiple levels of granularity.
 ER models hold real-time information; dimensional models hold historical information.
 ER modeling eliminates redundancy; dimensional modeling plans for redundancy.
 ER models support high transaction volumes using few records at a time; dimensional models support low transaction volumes using many records at a time.
 ER model data is highly volatile; dimensional model data is non-volatile.
 ER modeling covers both the physical and the logical model; dimensional modeling covers the physical model.
 Normalization is suggested in ER modeling; de-normalization is suggested in dimensional modeling.
 ER modeling is used for OLTP applications; dimensional modeling is used for OLAP applications.

8. Inside dimension table and fact table / attribute hierarchy?
Ans-
A Dimension Table is a crucial component of a data warehouse or star schema used in Data Warehousing and
Business Intelligence (BI) systems. It stores dimensions or descriptive attributes related to a specific business
domain, such as time, geography, or products. These attributes help data scientists and analysts interpret facts in
Fact Tables, and facilitate the organization of large amounts of data for efficient querying, reporting, and analytics.
Dimension Tables provide unique features that improve data processing and analytics, such as:
 Descriptive attributes: Attributes in the Dimension Table offer context to numerical data in Fact Tables.
 Primary key: Each Dimension Table has a primary key to uniquely identify records and establish relationships
with Fact Tables.
 Denormalization: Dimension Tables are often denormalized, meaning attributes are grouped together,
simplifying querying, and reducing the need for join operations.
 Hierarchy and aggregation: Attributes in Dimension Tables can be organized into hierarchies for better
aggregation, facilitating roll-up and drill-down operations on data.
A Fact Table is a central table in a star schema or snowflake schema of a dimensional data model used in data
warehousing and Business Intelligence (BI) applications. It contains quantitative data (known as facts) and foreign
keys that are connected to related dimension tables. Fact tables are essential for data processing and analytics,
enabling users to analyze business measures across various dimensions, such as time, geography, and products.
Fact Table serves as the foundation for analytical queries and reporting. Key features include:
 Facts: Quantifiable data related to business processes, such as sales figures, revenue, or product costs.
 Foreign Keys: Columns that refer to primary keys in dimension tables, establishing relationships between
tables.
 Granularity: The level of detail represented by the facts in the table.
 Aggregation: Fact tables store aggregated data, which provides a summarized view of the data for analysis.
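The relationship between the two kinds of tables can be sketched with a tiny, invented example: the fact table carries foreign keys and measures, and analysts join it to the dimension table's descriptive attributes before aggregating.

```python
import pandas as pd

# Tiny, invented dimension and fact tables to show the key relationship.
dim_product = pd.DataFrame({
    "product_key": [1, 2, 3],
    "name":        ["Phone", "Laptop", "Tablet"],
    "category":    ["Mobile", "Computer", "Mobile"],
})
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1, 3, 2],        # foreign keys into dim_product
    "units_sold":  [5, 2, 7, 3, 1],        # facts (measures)
    "revenue":     [500, 1800, 700, 450, 950],
})

# Join facts to descriptive attributes, then aggregate by a dimension attribute.
report = (fact_sales.merge(dim_product, on="product_key")
                    .groupby("category")[["units_sold", "revenue"]]
                    .sum())
print(report)
```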

Attribute Hierarchies
By default, attribute members are organized into two level hierarchies, consisting of a leaf level and an All level. The
All level contains the aggregated value of the attribute's members across the measures in each measure group to
which the dimension of which the attribute is related is a member. However, if the IsAggregatable property is set to
False, the All level is not created. For more information, see Dimension Attribute Properties Reference.

Attributes can be, and typically are, arranged into user-defined hierarchies that provide the drill-down paths by
which users can browse the data in the measure groups to which the attribute is related. In client applications,
attributes can be used to provide grouping and constraint information. When attributes are arranged into user-
defined hierarchies, you define relationships between hierarchy levels when levels are related in a many-to-one or a
one-to-one relationship (called a natural relationship). For example, in a Calendar Time hierarchy, a Day level should
be related to the Month level, the Month level related to the Quarter level, and so on. Defining relationships
between levels in a user-defined hierarchy enables Analysis Services to define more useful aggregations to increase
query performance and can also save memory during processing.
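A hierarchy such as Day -> Month -> Quarter can be illustrated with a small, hypothetical roll-up, aggregating leaf-level facts to higher levels; the figures are invented.

```python
import pandas as pd

# Invented daily sales; the Day -> Month -> Quarter hierarchy is the point here.
daily = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11",
                               "2024-04-02", "2024-05-30"]),
    "revenue": [100, 150, 120, 200, 180],
})
daily["month"]   = daily["date"].dt.to_period("M")
daily["quarter"] = daily["date"].dt.to_period("Q")

# Roll-up: aggregate leaf-level (day) facts to higher levels of the hierarchy.
by_month   = daily.groupby("month")["revenue"].sum()
by_quarter = daily.groupby("quarter")["revenue"].sum()
print(by_month)
print(by_quarter)
```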
9. Difference: star schema vs snowflake schema?
Ans-
Star Schema vs Snowflake Schema

 In a star schema, fact tables and dimension tables are contained; in a snowflake schema, fact tables, dimension tables, and sub-dimension tables are contained.
 A star schema is a top-down model; a snowflake schema is a bottom-up model.
 A star schema uses more space; a snowflake schema uses less space.
 A star schema takes less time for query execution; a snowflake schema takes more time than a star schema.
 In a star schema, normalization is not used; in a snowflake schema, both normalization and denormalization are used.
 The star schema design is very simple; the snowflake schema design is complex.
 The query complexity of a star schema is low; the query complexity of a snowflake schema is higher.
 A star schema is very simple to understand; a snowflake schema is more difficult to understand.
 A star schema has fewer foreign keys; a snowflake schema has more foreign keys.
 A star schema has high data redundancy; a snowflake schema has low data redundancy.

10. Metadata / categories / requirement gathering (management)?
Ans-
Metadata is data that describes and contextualizes other data. It provides information about the content, format,
structure, and other characteristics of data, and can be used to improve the organization, discoverability, and
accessibility of data.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized using metadata standards
and schemas. There are many metadata standards that have been developed to facilitate the creation and
management of metadata, such as Dublin Core, schema.org, and the Metadata Encoding and Transmission Standard
(METS). Metadata schemas define the structure and format of metadata and provide a consistent framework for
organizing and describing data.
Metadata can be used in a variety of contexts, such as libraries, museums, archives, and online platforms. It can be
used to improve the discoverability and ranking of content in search engines and to provide context and additional
information about search results. Metadata can also support data governance by providing information about the
ownership, use, and access controls of data, and can facilitate interoperability by providing information about the
content, format, and structure of data, and by enabling the exchange of data between different systems and
applications. Metadata can also support data preservation by providing information about the context, provenance,
and preservation needs of data, and can support data visualization by providing information about the data’s
structure and content, and by enabling the creation of interactive and customizable visualizations.
Types/categories/ of Metadata
Metadata in a data warehouse fall into three major parts:
 Operational Metadata
 Extraction and Transformation Metadata
 End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the enterprise. These source
systems contain different data structures. The data elements selected for the data warehouse have various field
lengths and data types.
In selecting information from the source systems for the data warehouse, we split records, combine parts of
records from different source files, and deal with multiple coding schemes and field lengths. When we deliver
information to the end-users, we must be able to tie that back to the source data sets. Operational metadata
contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the removal of data from the source systems, namely,
the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this category of
metadata contains information about all the data transformation that takes place in the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to find data from
the data warehouses. The end-user metadata allows the end-users to use their business terminology and look for
the information in those ways in which they usually think of the business.
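As a small, hypothetical illustration, the three categories above could be recorded for a single warehouse column roughly as follows; all names and values are invented.

```python
# Hypothetical metadata record for one warehouse column, grouped by the three
# categories described above. All names and values are invented.
column_metadata = {
    "operational": {
        "source_system": "POS",
        "source_field":  "sales.amount",
        "source_type":   "DECIMAL(10,2)",
    },
    "extraction_and_transformation": {
        "extraction_frequency": "daily",
        "business_rule":        "exclude voided transactions",
        "staging_table":        "stg_sales",
    },
    "end_user": {
        "business_name": "Sales Revenue",
        "description":   "Total value of items sold, in rupees",
    },
}
print(column_metadata["end_user"]["business_name"])
```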

Effective metadata management / Requirement gathering

Effective metadata management provides rich context to enterprise data assets, powering their efficient use and reuse. It is also a requirement for
government agencies across the world to ensure data governance. According to the 2020 Dataversity Report on
Trends in Data Management, many organizations place increasing value on data governance and metadata
management to build a data landscape that informs about all the data assets in the organization. Early adoption of
best practices can go a long way in ensuring complete success.
1. Define a metadata strategy
Metadata management begins with defining a metadata strategy to provide a foundation for assessing the long-term
value. Gartner Research recommends that metadata management initiatives must clearly support the organization’s
business vision and resulting business objectives.
Some critical questions to ask to arrive at a metadata strategy are:
 For what type of information do you want to manage metadata, now and in the future?
 For what type of information are you not able to manage metadata?
 What are your current and prospective use cases for metadata?
 What are your use cases for showing relationships between metadata objects?
A robust metadata strategy can offer alignment with business objectives, identify high priority activities and help
evaluate an implementation methodology. It is also important to link metadata management efforts to digital
transformation efforts, such as digitization, omnichannel enablement, or enterprise-resource-planning
modernization as these efforts typically depend on data availability and quality.
2. Establish scope and ownership
Identifying the scope is always a best practice to channel the efforts in the right direction and focus on high-priority
activities.
 Identify your potential use cases for metadata – data governance, data quality, data analysis, risk and
compliance are some of the top use cases.
 Capture functional metadata requirements – examples include resource discovery, digital identification or
archiving.
 Specify all possible ways you will use the proposed solution – metadata capture and storage, integration,
publication and more.
 Define the scope considering essential requirements and critical use cases.
 Your organization will have various roles of metadata creators, consumers, and managers. Defining clear
ownership and responsibilities ensures accountability for metadata quality. Well-articulated roles will help
you optimize resource utilization. Critical data is no more than 10 to 20 percent of total data in most
organizations. Prioritize data assets and focus metadata leadership accordingly.

3. Add value with the right metadata management tool


Dataversity report on Emerging Trends in Metadata Management states that more than 69% of organizations look
for search and storage as the top functionalities for metadata management tools. Based on the metadata strategy
and scope that you have defined, you can identify the key functionalities you need. The first blog in this series
discusses the critical capabilities of fully automated metadata management tools.

4. Adopt the metadata standards


The recently published DoD Data Strategy emphasizes metadata tagging and common metadata standards for data-
centric organizations. Common metadata standards assure uniform usage or interpretation with your vendor and
customer communities.

Metadata standards have been evolving over the years and they vary in levels of details and complexity. The general
metadata standards like the Dublin Core Metadata Element Set apply to broader communities and make your data
more interoperable with other standards. The subject-specific metadata standards, on the other hand, help search
data more easily. For example, the ISO 19115 standard works well for the geospatial community. You can evaluate
which standards align the best with your use cases and your communities.

5. Make it an enterprise-wide ongoing process


Any best practices list will not be complete if you do not engage the people. Start from the top and onboard all to
get total support across the organization. When people are excited and committed to the vision of data enablement,
they’re more likely to help ensure that data is high quality and safe. Communicate your objectives and progress.
11. Metadata tools / classification?
ANS-
TOOLS
Atlan: active metadata management
Atlan, "the company reinventing data management for the cloud era,” is a leading active metadata management
platform. They offer a personalized metadata experience, powerful collaboration capabilities, and an open API
architecture to support greater connectivity.
Main Atlan active metadata management offerings are
 Data Discovery,
 Column-level Lineage,
 Data Governance,
 Data Glossary, and others.
Collibra: complex data governance for various workflows
Collibra offers the Data Intelligence Cloud Platform that simplifies and automates key data management aspects. It
was positioned as a leader in the IDC MarketScape: Worldwide Data Catalog Software 2022 Vendor Assessment. Its
suite of products includes
 Data Catalog,
 Data Governance,
 Data Privacy,
 Data Lineage, and
 Data Quality & Observability.
Alation: support for self-service analytics and BI
Alation is an industry-recognized provider whose data management solutions focus primarily on fueling self-service
analytics, data governance, and cloud data migration. They cater to data specialists (data engineers, data scientists,
and data analysts) and business leaders alike, driving the transition “from data-rich to data-driven.”
Alation supports active metadata management with its Data Governance App and Data Catalog tools. The platform
helps capture, organize, understand, retrieve, and exchange metadata. It serves as a database for all the company
data, allowing users to run queries that are then used in analytics and BI tools. You can also leverage data lineage,
impact analysis, and other handy functionality.
Informatica: data management software with ML-based data cataloging
Informatica is another provider of a full-fledged data management system – Intelligent Data Management Cloud
(IDMC). It supports data integration, data quality, master data management, and other aspects including metadata
management.Its award-winning Enterprise Data Catalog tool is built on the ML-based discovery engine to scan and
catalog digital assets across multiple sources. It provides data consumers with
 robust search functionality,
 automated relationship discovery,
 detailed data lineage,
 profiling statistics,
 data quality scorecards,
 data similarity recommendations,
 impact analysis feature, and
 an integrated business glossary.
 Active metadata serves as the unifying foundation for IDMC, fueling further analytics and other data
management processes.

Users especially highlight the data curation and autocorrection features as well as general ease of use, though some
pointed out the insufficient feature list and poor platform performance.
Classification of metadata?
Ans-Metadata comes in many shapes and flavors, carrying additional information about where a resource was
produced, by whom, when was the last time it was accessed, what is it about and many more details around it.
Similar to the library cards, describing a book, metadata describes objects and adds more granularity to the way they
are represented. Three main types of metadata exist: descriptive, structural and administrative.
 Descriptive metadata adds information about who created a resource, and most importantly – what the
resource is about, what it includes. This is best applied using semantic annotation.
 Structural metadata includes additional data about the way data elements are organized – their
relationships and the structure they exist in.
 Administrative metadata provides information about the origin of resources, their type and access rights.
12. Tool selection for a data warehouse?
Ans-
While the specifics are, well, specific to every company, there are six key criteria to keep in mind when choosing a
data warehouse:
 Cloud vs. on-prem
 Tool ecosystem
 Implementation cost and time
 Ongoing costs and maintenance
 Ease of scalability
 Support and community
In many cases, these criteria are really trade-offs— for example, a data warehouse that's quick to implement may be
a pain to scale. But you'll be better prepared to make the right decision if you understand what you're getting into
before you buy.
Cloud vs. on-premise storage
Even as recently as a few years ago, you might have struggled with whether to go with a cloud-based or on premises
approach. Today, the battle is over.
There are a few circumstances where it still makes sense to consider an on-prem approach. For example, if most of
your critical databases are on-premises and they are old enough that they don't work well with cloud-based data
warehouses, an on-premises approach might be the way to go. Or if your company is subject to byzantine regulatory
requirements that make on-prem your only choice.
Data tool ecosystem
If you work at a company that’s already heavily invested in a data tool ecosystem and doesn't have a lot of data
sources residing outside of it, you're probably going to pick that ecosystem's tool.
Data warehouse implementation
They say the devil is in the details, and that’s doubly true when it comes to data warehouse implementation. Here
are some of the finer points you should consider:
Cost-When deciding between data warehouse tools, money is often a major driver.
Time- Cost matters, but time often matters more—especially for startups that are trying to move as quickly as they
can.
Ongoing storage costs and maintenance
In addition to the costs of getting started, you'll also need to take into account ongoing costs—which sometimes can
be substantially higher than the resources you allocate at the beginning.
There are several ongoing costs you'll need to consider:
Storage and compute: As your data and usage grow, so will your monthly storage bill. Having a good sense of how your data and usage will grow helps you estimate these ongoing costs.
Ease of scalability
If you’re part of a fast-growing business, one of the things you want to find out is, what's involved in scaling up your
data warehouse? To figure that out, first you need to get a rough sense of what your current business needs are,
including how much data you currently have, how quickly your needs are likely to grow, and how much confidence
you have in your assessment of your needs for scale.
Support and community
When you run into trouble with your data warehouse, how likely are you to get the help you need when you need it?
While no one chooses a data warehouse tool based solely on the support they can get, if two data warehouse
systems are pretty equal, it could be the deciding factor.
13. Granularity?
Ans-
Granularity refers to the level of detail or depth present in a dataset or database. In data warehousing, granularity
determines the extent to which data is broken down. Higher granularity means data is dissected into finer, more
detailed parts, while lower granularity results in broader, less detailed data aggregation. The choice of granularity
level depends on the business's particular needs and data analysis goals.
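A tiny, hypothetical example of the same facts held at two granularity levels (values invented): individual sale lines versus a daily aggregate.

```python
import pandas as pd

# High granularity: one row per individual sale line (invented values).
sale_lines = pd.DataFrame({
    "sold_at": pd.to_datetime(["2024-01-05 09:10", "2024-01-05 14:32",
                               "2024-01-06 11:05", "2024-01-06 16:40"]),
    "item":    ["Phone", "Laptop", "Phone", "Tablet"],
    "revenue": [500, 1800, 700, 450],
})

# Lower granularity: the same facts aggregated to one row per day.
daily_totals = (sale_lines
                .groupby(sale_lines["sold_at"].dt.date)["revenue"]
                .sum())
print(daily_totals)   # fewer rows, less detail, smaller storage footprint
```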

Functionality and Features


Granularity in data warehousing is not a tool or software but a concept that guides data storage strategy. It's a
crucial factor when designing a data warehouse because it affects storage space, data processing speed, and the
type of queries that can be answered. High granularity allows for detailed, complex analysis, while low granularity is
useful for overview-type queries and saves storage space.

Benefits and Use Cases


Granularity comes with several advantages, primarily being able to tailor the level of detail in the data to specific use
cases. For instance, high granularity is ideal for fine-tuned analysis, anomalies detection, and precise forecasting. On
the other hand, low granularity is beneficial for aggregative reporting and trend analysis across larger datasets.

Challenges and Limitations


Choosing the right granularity level can be challenging as it involves a trade-off between storage space and analytical
capabilities. High granularity requires larger storage and slower processing times due to the high data volume.
Additionally, it might lead to sparsity, where there's a large number of records, but only a tiny portion is useful or
relevant.

Integration with Data Lakehouse


Granularity is a valuable concept within a data lakehouse environment. While traditional data warehouses have fixed
granularity, data lakehouses provide the flexibility to adjust granularity levels according to specific analytic needs.
This feature enhances scalability and performance, allowing for more intricate analysis without overburdening
storage or compute resources.

Security Aspects
While granularity itself doesn't offer any inherent security measures, granularity decisions can impact data
management and security. For example, high granularity may necessitate stricter access controls to safeguard
detailed, potentially sensitive data.

Performance
The performance of a data warehousing system is directly impacted by its granularity. Higher granularity data might
require more computational resources for processing, potentially slowing down query response times. Conversely,
less detailed, lower granularity data can usually be processed faster, resulting in quicker response times.
Consequently, striking a balance between detail level and system performance is a key consideration when
determining granularity.
14. OLAP FASMI / characteristics / OLAP vs data warehouse?
Ans-The FASMI Test
It summarizes the characteristics of an OLAP application in a specific way, without dictating how they should be
implemented.
Fast − It means that the system is targeted to deliver most responses to users within about five seconds, with the
simplest analyses taking no more than one second and very few taking more than 20 seconds.
Independent research in the Netherlands has shown that end-users assume a process has failed if results
are not received within 30 seconds, and they are apt to hit 'Alt+Ctrl+Delete' unless the system has warned them that
the report will take longer.
Analysis − It means that the system can cope with any business logic and statistical analysis that is relevant
for the application and the user, while keeping it easy enough for the target user. Although some pre-programming may
be required, it is not acceptable if all application definitions have to be completed using a professional
4GL. It is necessary to allow the user to define new ad hoc calculations as part of the analysis and to report on
the data in any desired way, without having to program, so products (like Oracle Discoverer) that do not
allow adequate end-user-oriented calculation flexibility are excluded.
Shared − It means that the system implements all the security requirements for confidentiality (possibly down to
cell level) and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications
require users to write data back, but for the increasing number that do, the system must be able to handle
multiple updates in a timely, secure manner. This is a major area of weakness in some OLAP products, which
tend to assume that all OLAP applications will be read-only, with simple security controls.
Multidimensional − The system should support a multidimensional conceptual view of the data, including complete
support for hierarchies and multiple hierarchies. It does not set a specific minimum number of dimensions that
should be managed, as this is too software dependent and most products seem to have enough for their target
industries.
Information − Information is all of the data and derived data needed, wherever it is and however much is relevant
for the application. We are measuring the capacity of products in terms of how much input data they can
handle, not how many gigabytes they take to store it.

The main characteristics of OLAP are as follows:


 Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of
the data in the data warehouse. It helps in carrying out slice and dice operations.
 Multi-User Support: Since OLAP techniques are shared, the OLAP operation should provide normal
database operations, including retrieval, update, concurrency control, integrity, and security.
 Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP operations should
be sitting between data sources (e.g., data warehouses) and an OLAP front-end.
 Storing OLAP results: OLAP results are kept separate from data sources.
 Uniform reporting performance: Increasing the number of dimensions or the database size should not
significantly degrade the reporting performance of the OLAP system.
 OLAP provides for distinguishing between zero values and missing values so that aggregates are computed
correctly.
 OLAP system should ignore all missing values and compute correct aggregate values.
 OLAP facilitate interactive query and complex analysis for the users.
 OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along a single
business dimension or across multiple dimensions (see the sketch after this list).
 OLAP provides the ability to perform intricate calculations and comparisons.
 OLAP presents results in a number of meaningful ways, including charts and graphs.
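As referenced in the list above, roll-up, drill-down, and slice operations can be sketched with pandas on invented data; a real OLAP server performs these on pre-built cubes, so this is only an approximation of the idea.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "product": ["Phone", "Phone", "Laptop", "Tablet", "Phone", "Laptop"],
    "revenue": [120, 90, 340, 80, 110, 360],
})

# Roll-up: aggregate revenue up the time dimension (quarter -> year).
rollup = sales.groupby("year")["revenue"].sum()

# Drill-down: break the yearly totals back down by quarter and city.
drilldown = sales.groupby(["year", "quarter", "city"])["revenue"].sum()

# Slice: fix one dimension member (city == 'Delhi') and analyze the rest.
delhi_slice = sales[sales["city"] == "Delhi"].pivot_table(
    index="product", columns="quarter", values="revenue", aggfunc="sum")

print(rollup, drilldown, delhi_slice, sep="\n\n")
```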
Data Warehousing vs OLAP (feature by feature)

 Definition: Data warehousing is a process of collecting, storing, and managing data from various sources to provide meaningful business insights; OLAP is a technology that allows users to analyze information from multiple database systems at the same time, based on the multidimensional data model.
 Purpose: Data warehousing makes data accessible and understandable for business users; OLAP provides quick and interactive analysis of data from multiple sources.
 Data structure: Relational database for data warehousing; multidimensional data model for OLAP.
 Data source: Multiple data sources for both.
 Data type: Historical data for data warehousing; current and historical data for OLAP.
 Data processing: Batch processing for data warehousing; real-time processing for OLAP.
 Operations: Data cleaning, consolidation, integration, transformation, analysis, and reporting for data warehousing; drill-down, roll-up, slice, dice, and pivot for OLAP.
 Cube creation: Not applicable for data warehousing; in OLAP, cubes are created to support fast and efficient analysis.
 Query performance: Slower in data warehousing due to complex querying and data processing; faster in OLAP due to pre-aggregation and indexing.
 User type: Business users and data analysts for both.
 Use case: Decision-making and strategic planning for data warehousing; real-time analysis and interactive reporting for OLAP.
15. Hypercube and multicube / difference?
Ans-
Hypercube
Hypercube model design is basically a top-down process. Multidimensional database data can be
represented for an application using two types of cubes, i.e., the hypercube and the multicube. As shown in the figure
(HYPERCUBE), all the data appears logically as a single cube in a hypercube. All parts of the model have
identical dimensionality, which is represented by this hypercube.
In a hypercube, each dimension belongs to only one cube; a dimension is normally owned by the hypercube. This
simplicity makes it easier for users to use and understand. As hypercube design is a top-down process, it
includes three major steps:
1. First decide which business process you want to capture in the model, such as sales activity.
2. Now identify the values you want to capture, such as sales amount. This information is always
numeric in nature.
3. Now identify the granularity of the data, i.e., the lowest level of data you want to capture; these elements
are the dimensions. Some common dimensions are geography, time, customer, and product.
Multicube
In the multicube model, data is segmented into a set of smaller cubes, each of which is composed of a subset of the
available dimensions, as shown in Figure 2. They are used to handle multiple fact tables, each with different
dimensionality. A dimension can be part of multiple cubes. Dimensions are not owned by any one cube, like under
the hypercube model. Rather, they are available to all cubes, or there can be some dimensions that do not belong to
any cube. This makes it much more efficient and versatile. It is also a more efficient way of storing very sparse data,
and it can reduce the pre-calculation database explosion effect, which will be covered in a later section. The
drawback is that this is less straightforward than hypercube and can carry steeper learning curves. Some systems use
the combined approach of hypercube and multicubes, by separating the storage, processing, and presentation
layers. It stores data as multicubes but presents as a hypercube.
Multiple Hypercubes vs. True Multicubes
In a true multicube system, dimensions "exist" in the catalog or the schema. Different subcubes use a subset of these
dimensions. By contrast, in a multiple-hypercube system, the catalog or the schema has multiple cubes, each with its
own set of dimensions.
There is an important distinction between these two systems for OLE DB for OLAP schema rowsets. In the true
multicube system, if there are two rows in the DIMENSIONS rowset for which the DIMENSION_NAME value is the
same (and the CUBE_NAME value is different), these two rows represent the same dimension. This is because the
subcubes are built from the same pool of available dimensions. However, in a multiple hypercube scenario, it is
possible for two hypercubes to have a dimension of the same name, each of which has different characteristics. In
this case, the DIMENSION_UNIQUE_NAME value is guaranteed to be different. Consumers checking dimension
names between cubes for doing natural joins must make sure to check DIMENSION_UNIQUE_NAME values, not
DIMENSION_NAME values.
multi dimensional data / advantages disadvantages / example?
Ans-
The multi-dimensional data model is a method for organizing data in a database so that its contents are well arranged and assembled. It allows users to ask analytical questions related to market or business trends, unlike relational databases, which let users access data only through queries. Users receive answers to such requests quickly because the data can be created and examined comparatively fast. OLAP (online analytical processing) and data warehousing use multi-dimensional databases, which present multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow the data to be modeled and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
Advantages of Multi Dimensional Data Model
The following are the advantages of a multi-dimensional data model :
 A multi-dimensional data model is easy to handle.
 It is easy to maintain.
 Its performance is better than that of normal databases (e.g. relational databases).
 The representation of data is better than traditional databases. That is because the multi-dimensional
databases are multi-viewed and carry different types of factors.
 It is workable on complex systems and applications, contrary to the simple one-dimensional database
systems.
 Its compatibility is an advantage for projects that have limited bandwidth for maintenance staff.
Disadvantages of Multi Dimensional Data Model
 The multi-dimensional data model is somewhat complicated in nature, and it requires professionals to recognize and examine the data in the database.
 When the system caches during the operation of a multi-dimensional data model, the working of the system is strongly affected.
 Because of its complexity, the databases are generally dynamic in design.
 The path to achieving the end product is complicated most of the time.
 Since the multi-dimensional data model involves complicated systems with a large number of databases, the system is very insecure when there is a security breach.
For Example :
1. Consider a firm: its revenue and cost can be analyzed on the basis of different factors such as the geographical location of the firm's workplace, the firm's products, the advertisements done, the time taken for a product to flourish, etc.
2. Consider the data of a factory that sells products per quarter in Bangalore. The data is represented in the table given below:
[Table: quarterly sales by item for the Bangalore branch, figures in thousands of rupees]
In the presentation above, the factory's sales for Bangalore are shown along the time dimension, which is organized into quarters, and the item dimension, which is sorted according to the kind of item sold. The facts are given in rupees (in thousands). Here the sales data is represented as a two-dimensional table. If we wish to view the sales data in three dimensions, we consider the data according to item, time, and location (for example Kolkata, Delhi, Mumbai).
data preprocessing / techniques in detail with diagram?
Ans-
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values,
outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and
transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration
can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as
record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used
in data transformation include normalization, standardization, and discretization. Normalization is used to scale the
data to a common range, while standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data
reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection
involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the
data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals. Discretization is
often used in data mining and machine learning algorithms that require categorical data. Discretization can be
achieved through techniques such as equal width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1.
Normalization is often used to handle data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
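As a small, hedged illustration of the normalization techniques mentioned above, the following Python sketch applies min-max and z-score normalization to a made-up numeric attribute (the values and column meaning are illustrative only):

import numpy as np

# Hypothetical numeric attribute (e.g. yearly income in thousands)
income = np.array([25.0, 40.0, 55.0, 70.0, 120.0])

# Min-max normalization: rescale the values to the range [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: transform to zero mean and unit variance
z_score = (income - income.mean()) / income.std()

print(min_max)  # all values now lie between 0 and 1
print(z_score)  # values are centred around 0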
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results. The
specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more accurate.
Preprocessing in Data Mining: Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
[Diagram: data preprocessing steps — data cleaning, data transformation, data reduction]
Techniques-
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling
of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or the
most probable value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed on each segment to complete the task. Each segment is handled separately: one can replace all the data in a segment by its mean, or the boundary values can be used to complete the task (a short sketch of smoothing by bin means follows this list of methods).
Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
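A minimal sketch of the binning method described above (smoothing by bin means), assuming sorted numeric values and equal-size bins; the data and bin size are made up:

# Smoothing noisy data by bin means, using equal-size bins of 3 values each
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 28])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    bin_values = values[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    # every value in the segment is replaced by the mean of its bin
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)  # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.67, 25.67, 25.67]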
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This involves
following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For Example-The attribute “city” can be
converted to “country”.
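A small sketch of concept hierarchy generation as described above, rolling a hypothetical "city" attribute up to "country" with a lookup table (the mapping and records are illustrative only):

# Hypothetical concept hierarchy: city (lower level) -> country (higher level)
city_to_country = {
    "Mumbai": "India",
    "Delhi": "India",
    "New York": "USA",
    "Chicago": "USA",
}

records = [
    {"city": "Mumbai", "sales": 120},
    {"city": "Chicago", "sales": 95},
]

# Generalize each record from the city level to the country level
for record in records:
    record["country"] = city_to_country[record["city"]]

print(records)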
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset while
preserving the important information. This is done to improve the efficiency of data analysis and to avoid overfitting
of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF); a brief PCA sketch is given after this list of steps.
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size
of the dataset while preserving the important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the
size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques
such as wavelet compression, JPEG compression, and gzip compression.
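A brief, hedged sketch of feature extraction with PCA (as referenced in the Feature Extraction step above), projecting a small made-up 4-dimensional dataset onto 2 principal components; scikit-learn is assumed to be available:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 6 samples with 4 correlated features
X = np.array([
    [2.5, 2.4, 1.2, 0.5],
    [0.5, 0.7, 0.3, 0.1],
    [2.2, 2.9, 1.1, 0.6],
    [1.9, 2.2, 0.9, 0.4],
    [3.1, 3.0, 1.5, 0.7],
    [2.3, 2.7, 1.0, 0.5],
])

# Project the data onto a lower-dimensional space (2 principal components)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component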
data cleaning/ tasks/ methods?
Ans-
Data cleaning is an essential step in the data mining process. It is crucial to the construction of a model. The step that
is required, but frequently overlooked by everyone, is data cleaning. The major problem with quality information
management is data quality. Problems with data quality can happen at any place in an information system. Data
cleansing offers a solution to these issues.
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted, duplicated, or
insufficient data from a dataset. Even if results and algorithms appear to be correct, they are unreliable if the data is
inaccurate. There are numerous ways for data to be duplicated or incorrectly labeled when merging multiple data
sources.
In general, data cleaning lowers errors and raises the quality of the data. Although it can be a time-consuming and laborious operation, fixing data mistakes and removing incorrect information must be done. Data mining, a method for finding useful information in data, is also a crucial aid for cleaning it up: data quality mining is a novel methodology that uses data mining methods to find and fix data quality issues in sizable databases. Because data mining automatically pulls intrinsic and hidden information from large data sets, data cleansing can be accomplished using a variety of data mining approaches.
To arrive at a precise final analysis, it is crucial to understand and improve the quality of your data. The data must be prepared so that key patterns can be identified; this is understood as exploratory data mining. Data cleaning in data mining enables the user to identify erroneous or missing data before doing business analysis and gaining insights.
Because it is so time-consuming, data cleaning before data mining often requires IT personnel to assist in the initial step of reviewing your data. But if your final analysis is inaccurate or produces an erroneous result, poor data quality is a likely cause.
Tasks-
Steps for Cleaning Data
You can follow these fundamental stages to clean your data even if the techniques employed may vary depending on
the sorts of data your firm stores:
1. Remove duplicate or irrelevant observations
Remove duplicate or pointless observations as well as undesirable observations from your dataset. The majority of
duplicate observations will occur during data gathering. Duplicate data can be produced when you merge data sets
from several sources, scrape data, or get data from clients or other departments. One of the most important factors
to take into account in this procedure is de-duplication. Those observations are deemed irrelevant when you
observe observations that do not pertain to the particular issue you are attempting to analyze.
You might eliminate those useless observations, for instance, if you wish to analyze data on millennial clients but
your dataset also includes observations from earlier generations. This can improve the analysis's efficiency, reduce
deviance from your main objective, and produce a dataset that is easier to maintain and use.
2. Fix structural errors
Structural errors are odd naming practices, typos, or wrong capitalization that you notice when measuring or transferring data. These inconsistencies can result in mislabelled categories or classes. For instance, "N/A" and "Not Applicable" might both appear on a given sheet, but they ought to be analyzed under the same heading.
3. Filter unwanted outliers
There will frequently be isolated findings that, at first glance, do not seem to fit the data you are analyzing.
Removing an outlier if you have a good reason to, such as incorrect data entry, will improve the performance of the
data you are working with.
However, occasionally the emergence of an outlier will support a theory you are investigating. And just because
there is an outlier, that doesn't necessarily indicate it is inaccurate. To determine the reliability of the number, this
step is necessary. If an outlier turns out to be incorrect or unimportant for the analysis, you might want to remove it.
4. Handle missing data
Because many algorithms won't tolerate missing values, you can't overlook missing data. There are a few options for handling it; none of them is ideal, but each can be considered (a short pandas sketch of de-duplication and missing-value handling follows this list):
You can remove observations with missing values, but doing so results in a loss of information, so proceed with caution.
You can fill in missing values based on other observations, but this risks undermining the integrity of the data, since you are then working from assumptions rather than actual observations.
You may need to change the way the data is used so that null values can be navigated efficiently.
5. Validate and QA
As part of fundamental validation, you ought to be able to respond to the following queries once the data cleansing
procedure is complete:
 Are the data coherent?
 Does the data abide by the regulations that apply to its particular field?
 Does it support or refute your working theory? Does it offer any new information?
 To support your next theory, can you identify any trends in the data?
 If not, is there a problem with the data's quality?
Inaccurate or noisy data can lead to false conclusions that inform poor company strategy and decision-making. False conclusions can also result in an embarrassing situation in a reporting meeting when you discover that your data cannot withstand further scrutiny. Before you get there, it is crucial to establish a culture of quality data in your organization and to document the tools you might employ to develop this plan.
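As a hedged pandas sketch of steps 1 and 4 above (removing duplicate observations and handling missing values); the column names and values are made up, and pandas is assumed to be installed:

import pandas as pd

# Hypothetical customer data with one duplicate row and some missing values
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, 29, 29, None, 41],
    "city":        ["Pune", "Delhi", "Delhi", "Mumbai", None],
})

df = df.drop_duplicates()                       # step 1: remove duplicate observations
df["age"] = df["age"].fillna(df["age"].mean())  # step 4: impute missing age with the mean
df = df.dropna(subset=["city"])                 # or drop rows whose values cannot be imputed

print(df)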
Techniques/Methods for Cleaning Data
The data should be passed through one of the various data-cleaning procedures available. The procedures are explained below:
Ignore the tuples: This approach is not very practical, as it is only useful when a tuple has multiple attributes with missing values.
Fill in the missing value: This strategy is also not very practical or effective. Additionally, it could be a time-
consuming technique. One must add the missing value to the approach. The most common method for doing this is
manually, but other options include using attribute means or the most likely value.
Binning method: This strategy is fairly easy to comprehend. The values nearby are used to smooth the sorted data.
The information is subsequently split into several equal-sized parts. The various techniques are then used to finish
the assignment.
Regression: With the use of the regression function, the data is smoothed out. Regression may be multivariate or
linear. Multiple regressions have more independent variables than linear regressions, which only have one.
Clustering: This technique focuses mostly on the group. Data are grouped using clustering. After that, clustering is
used to find the outliers. After that, the comparable values are grouped into a "group" or "cluster".
data transformation / strategies / tasks / types / need?
Ans-
Data transformation in data mining refers to the process of converting raw data into a format that is suitable for
analysis and modeling. The goal of data transformation is to prepare the data for data mining so that it can be used
to extract useful insights and knowledge. Data transformation typically involves several steps, including:
 Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the data.
 Data integration: Combining data from multiple sources, such as databases and spreadsheets, into a single
format.
 Data normalization: Scaling the data to a common range of values, such as between 0 and 1, to facilitate
comparison and analysis.
 Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant features or
attributes.
 Data discretization: Converting continuous data into discrete categories or bins.
 Data aggregation: Combining data at different levels of granularity, such as by summing or averaging, to
create new features or attributes.
Data transformation is an important step in the data mining process as it helps to ensure that the data is in a format that is suitable for analysis and modeling, and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms, by reducing the dimensionality of the data and by scaling the data to a common range of values.
Tasks-
1. Discovery
The first step is to identify and understand the data in its original source format with the help of data profiling tools, finding all the sources and data types that need to be transformed. This step helps in understanding how the data needs to be transformed to fit into the desired format.
2. Mapping
The transformation is planned during the data mapping phase. This includes determining the current structure, and
the consequent transformation that is required, then mapping the data to understand at a basic level, the way
individual fields would be modified, joined or aggregated.
3. Code generation
The code, which is required to run the transformation process, is created in this step using a data transformation
platform or tool.
4. Execution
The data is finally converted into the selected format with the help of the code. The data is extracted from the
source(s), which can vary from structured to streaming, telemetry to log files. Next, transformations are carried out
on data, such as aggregation, format conversion or merging, as planned in the mapping stage. The transformed data
is then sent to the destination system which could be a dataset or a data warehouse.
5. Review
The transformed data is evaluated to ensure the conversion has had the desired results in terms of the format of the
data.
It must also be noted that not all data needs transformation; at times it can be used as is.
Types-
There are several different ways to transform data, such as:
Scripting: Data transformation through scripting involves using Python or SQL to write the code that extracts and transforms data. Python and SQL are scripting languages that allow you to automate certain tasks in a program and to extract information from data sets. Scripting languages require less code than traditional programming languages, so this approach is less labor-intensive.
On-Premises ETL Tools: ETL tools take the required work to script the data transformation by automating the
process. On-premises ETL tools are hosted on company servers. While these tools can help save you time, using
them often requires extensive expertise and significant infrastructure costs.
Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the cloud. These tools are often
the easiest for non-technical users to utilize. They allow you to collect data from any cloud source and load it into
your data warehouse. With cloud-based ETL tools, you can decide how often you want to pull data from your source,
and you can monitor your usage.
Strategies-
1. Smoothing: It is a process that is used to remove noise from the dataset using certain algorithms. It allows important features of the dataset to be highlighted and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce any variance or other form of noise. The idea behind data smoothing is that it can identify simple changes to help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not see otherwise.
2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a summary format.
The data may be obtained from multiple data sources to integrate these data sources into a data analysis
description. This is a crucial step since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used. Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collected data is useful for everything from decisions concerning financing or business strategy to product, pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual total amounts.
3. Discretization: It is a process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many of the existing data mining frameworks are unable to handle them. Even when a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discretized values (a short sketch using pandas.cut is given after this list). For example, (1-10, 11-20) or (age: young, middle age, senior).
4. Attribute Construction: Where new attributes are created & applied to assist the mining process from the given
set of attributes. This simplifies the original data & makes the mining more efficient.
5. Generalization: It converts low-level data attributes to high-level data attributes using concept hierarchy. For
Example Age initially in Numerical form (22, 25) is converted into categorical value (young, old). For example,
Categorical attributes, such as house addresses, may be generalized to higher-level definitions, such as town or
country.
6. Normalization: Data normalization involves converting all data variables into a given range.
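Following the age example in strategy 3 above, here is a minimal discretization sketch using pandas.cut; the bin edges and labels are illustrative assumptions:

import pandas as pd

ages = pd.Series([19, 25, 37, 45, 58, 66, 72])

# Discretize the continuous "age" attribute into three conceptual levels
age_groups = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "middle age", "senior"])

print(age_groups.value_counts())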
data reduction and integration / strategies?
Ans-
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most
important information. This can be beneficial in situations where the dataset is too large to be processed efficiently,
or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
Data Sampling: This technique involves selecting a subset of the data to work with, rather than using the entire
dataset. This can be useful for reducing the size of a dataset while still preserving the overall trends and patterns in
the data.
Dimensionality Reduction: This technique involves reducing the number of features in the dataset, either by
removing features that are not relevant or by combining multiple features into a single feature.
Data Compression: This technique involves using techniques such as lossy or lossless compression to reduce the size
of a dataset.
Data Discretization: This technique involves converting continuous data into discrete data by partitioning the range
of possible values into intervals or bins.
Feature Selection: This technique involves selecting a subset of features from the dataset that are most relevant to
the task at hand.
It’s important to note that data reduction can have a trade-off between the accuracy and the size of the data. The
more data is reduced, the less accurate the model will be and the less generalizable it will be.
Methods/strategies of data reduction:
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information you gathered for your analysis for the years 2012 to 2014 includes your company's revenue for every three months. If the analysis is concerned with annual sales rather than quarterly figures, the data can be summarized so that it records the total sales per year instead of per quarter (a short aggregation sketch is given after this list).
2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep just the attributes required for our analysis. Dimension reduction reduces data size by eliminating outdated or redundant features.
3. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms (Huffman
Encoding & run-length Encoding). We can divide it into two types based on their compression techniques.
Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data size reduction. Lossless data
compression uses algorithms to restore the precise original data from the compressed data.
Lossy Compression –
Methods such as the Discrete Wavelet transform technique, PCA (principal component analysis) are examples of this
compression.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with mathematical models or smaller representations of the data, so only the model parameters need to be stored. Alternatively, non-parametric methods such as clustering, histograms, and sampling can be used.
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes into data with intervals. Many continuous values of the attributes are replaced by labels of small intervals, so that mining results are presented in a concise and easily understandable way.
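A small sketch of the data cube aggregation idea from strategy 1 above: quarterly sales figures are rolled up into annual totals with pandas (the figures are invented):

import pandas as pd

# Hypothetical quarterly sales figures
quarterly = pd.DataFrame({
    "year":    [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 150, 130, 170, 140, 160, 155, 180],
})

# Aggregate up the time dimension: total sales per year instead of per quarter
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)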
Data integration
Data integration in data mining refers to the process of combining data from multiple sources into a single, unified
view. This can involve cleaning and transforming the data, as well as resolving any inconsistencies or conflicts that
may exist between the different sources. The goal of data integration is to make the data more useful and
meaningful for the purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M> where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mapping between the queries over the source and global schemas.
What is data integration :
Data integration is the process of combining data from multiple sources into a cohesive and consistent view. This
process involves identifying and accessing the different data sources, mapping the data to a common format, and
reconciling any inconsistencies or discrepancies between the sources. The goal of data integration is to make it
easier to access and analyze data that is spread across multiple systems or platforms, in order to gain a more
complete and accurate understanding of the data.
There are mainly two methods – one is the "tight coupling approach" and the other is the "loose coupling approach".
Tight Coupling: This approach involves creating a centralized repository or data warehouse to store the integrated
data. The data is extracted from various sources, transformed and loaded into a data warehouse. Data is integrated
in a tightly coupled manner, meaning that the data is integrated at a high level, such as at the level of the entire
dataset or schema.
Loose Coupling: This approach involves integrating data at the lowest level, such as at the level of individual data
elements or records. Data is integrated in a loosely coupled manner, meaning that the data is integrated at a low
level, and it allows data to be integrated without having to create a central repository or data warehouse. This
approach is also known as data federation, and it enables data flexibility and easy updates, but it can be difficult to
maintain consistency and integrity across multiple data sources.
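A small sketch of integrating two heterogeneous sources into a single unified view by joining on a shared key, in the spirit of the approaches above; the source names, columns, and values are hypothetical:

import pandas as pd

# Source 1: customer master data, e.g. exported from a CRM system
crm = pd.DataFrame({"cust_id": [101, 102, 103],
                    "name": ["Asha", "Ravi", "Meera"]})

# Source 2: order totals, e.g. taken from a transactional database
orders = pd.DataFrame({"cust_id": [101, 103, 104],
                       "total_spend": [2500.0, 1200.0, 900.0]})

# Integrate both sources into one unified view on the common key
unified = crm.merge(orders, on="cust_id", how="outer")
print(unified)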
discretization and concept hierarchy generation?
Ans-
Data discretization refers to a method of converting a huge number of data values into smaller ones so that the
evaluation and management of data become easy. In other words, data discretization is a method of converting
attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends on the way the operation proceeds; it works on a top-down splitting strategy or a bottom-up merging strategy.
Some Famous techniques of data discretization
Histogram analysis-Histogram refers to a plot used to represent the underlying frequency distribution of a
continuous data set. Histogram assists the data inspection for data distribution. For example, Outliers, skewness
representation, normal distribution representation, etc.
Binning - Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies.
Cluster Analysis - Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing the values of an attribute x into clusters, isolating a computational feature of x.
Concept Hierarchy-
In data mining, the concept of a concept hierarchy refers to the organization of data into a tree-like structure, where
each level of the hierarchy represents a concept that is more general than the level below it. This hierarchical
organization of data allows for more efficient and effective data analysis, as well as the ability to drill down to more
specific levels of detail when needed. The concept of hierarchy is used to organize and classify data in a way that
makes it more understandable and easier to analyze. The main idea behind the concept of hierarchy is that the same
data can have different levels of granularity or levels of detail and that by organizing the data in a hierarchical
fashion, it is easier to understand and perform analysis.
Types of Concept Hierarchies
 Schema Hierarchy: Schema Hierarchy is a type of concept hierarchy that is used to organize the schema of a
database in a logical and meaningful way, grouping similar objects together. A schema hierarchy can be used
to organize different types of data, such as tables, attributes, and relationships, in a logical and meaningful
way. This can be useful in data warehousing, where data from multiple sources needs to be integrated into a
single database.
 Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of concept hierarchy that is based on set theory,
where each set in the hierarchy is defined in terms of its membership in other sets. Set-grouping hierarchy
can be used for data cleaning, data pre-processing and data integration. This type of hierarchy can be used
to identify and remove outliers, noise, or inconsistencies from the data and to integrate data from multiple
sources.
 Operation-Derived Hierarchy: An Operation-Derived Hierarchy is a type of concept hierarchy that is used to
organize data by applying a series of operations or transformations to the data. The operations are applied
in a top-down fashion, with each level of the hierarchy representing a more general or abstract view of the
data than the level below it. This type of hierarchy is typically used in data mining tasks such as clustering
and dimensionality reduction. The operations applied can be mathematical or statistical operations such as
aggregation, normalization
 Rule-based Hierarchy: Rule-based Hierarchy is a type of concept hierarchy that is used to organize data by
applying a set of rules or conditions to the data. This type of hierarchy is useful in data mining tasks such as
classification, decision-making, and data exploration. It allows to the assignment of a class label or decision
to each data point based on its characteristics and identifies patterns and relationships between different
attributes of the data.
ETL steps in detail?
Ans-
The ETL process is an iterative process that is repeated as new data is added to the warehouse. The process is
important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps
to ensure that the data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as Informatica, Talend, DataStage,
and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL
tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it
into the Data Warehouse system.
Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems, which can be in various formats such as relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, and not directly in the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the
extracted data to convert it into a single standard format. It may involve following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and America into USA,
etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data
warehouse. Sometimes the data is updated by loading into the data warehouse very frequently and sometimes it is
done after longer but regular intervals. The rate and period of loading solely depends on the requirements and
varies from system to system.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. Likewise, while transformed data is being loaded into the data warehouse, already extracted data can be transformed. The block diagram of the pipelining of the ETL process is shown below:
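To complement the description above, here is a minimal end-to-end sketch of the extract-transform-load flow, assuming a hypothetical sales.csv source file (columns order_id, country, amount) and an SQLite database as the target warehouse:

import csv
import sqlite3

# Extract: read raw records from the source file into a staging list
with open("sales.csv", newline="") as f:
    staged = list(csv.DictReader(f))

# Transform: clean and standardize the staged data
for row in staged:
    row["country"] = {"U.S.A": "USA", "United States": "USA"}.get(row["country"], row["country"])
    row["amount"] = float(row["amount"] or 0)  # fill missing amounts with a default value

# Load: write the transformed rows into the target warehouse table
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (order_id, country, amount) VALUES (:order_id, :country, :amount)",
    staged,
)
conn.commit()
conn.close()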
data extraction/ methods?
Ans-
Data extraction is the process of analyzing and crawling through data sources (such as databases) to recover vital
information in a specific pattern. Data is processed further, including metadata and other data integration; this is
another step in the data workflow.
Unstructured data sources and various data formats account for most data extraction. Tables, indexes, and analytics
can all be used to store unstructured data.
Data in a warehouse can come from various places, and a data warehouse must use three different approaches to
use it. Extraction, Transformation, and Loading are the terms for these procedures (ETL).
Data extraction entails retrieving information from disorganized data sources. The data extracts are subsequently
imported into the relational Database's staging area. The source system is queried for data via application
programming interfaces, and extraction logic is applied. Due to this process, the data is now ready to go through the
transformation phase of the ETL process.
Data extraction techniques
From a logical and physical standpoint, the projected amount of data to be extracted and the stage in the ETL
process (initial load or data maintenance) may also influence how to extract. Essentially, you must decide how to
conceptually and physically extract data.
Methods of Logical Extraction
Logical extraction can be divided into two types −
Full Extraction
The data is fully pulled from the source system. There's no need to keep track of data source changes because this
Extraction reflects all of the information saved on the source system after the last successful Extraction.
The source data will be delivered in its current state, with no further logical information (such as timestamps)
required on the source site. An export file of a specific table and a remote SQL query scanning the entire source table are two examples of full extractions.
Incremental Extraction
Only data that has changed since a particular occurrence in the past will be extracted at a given time. This event
could be the end of the extraction process or a more complex business event such as the last day of a fiscal period's
bookings. To detect this delta change, there must be a way to identify all the changed information since this precise
time event.
This information can be provided by the source data itself, such as an application column indicating the last-changed
timestamp, or by a changing table in which a separate mechanism keeps track of the modifications in addition to the
originating transactions. Using the latter option, in most situations, entails adding extraction logic to the source
system.
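A hedged sketch of incremental extraction driven by a last-changed timestamp column, assuming a source table named orders with a last_updated column in an SQLite database (all names are illustrative):

import sqlite3

def extract_incremental(conn, last_extracted):
    # Pull only the rows that changed since the previous successful extraction
    cursor = conn.execute(
        "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_extracted,),
    )
    return cursor.fetchall()

# Usage: remember the high-water mark of the previous run and pass it in
source = sqlite3.connect("source.db")
delta_rows = extract_incremental(source, "2024-01-31 23:59:59")
print(len(delta_rows), "changed rows to move to the staging area")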
Methods of Physical Extraction
Physically extracting the data can be done in two ways, depending on the chosen logical extraction method and the
source site's capabilities and limits. The data can be extracted online from the source system or offline from a
database. An offline structure like this could already exist or be created by an extraction routine.
Physical Extraction can be done in the following ways −
Online Extraction
The information is taken directly from the source system. The extraction procedure can link directly to the source
system to access the source tables or connect to an intermediate system to store the data in a predefined format
(for example, snapshot logs or change tables). It's worth noting that the intermediary system doesn't have to be
physically distinct from the source system.
It would be best to evaluate whether the distributed transactions use source objects or prepared source objects
when using online extractions.
Offline Extraction
The data is staged intentionally outside the source system rather than extracted straight from it. The data was either
created by an extraction method or already had a structure (redo logs, archive logs, or transportable tablespaces).
data loading/ techniques?
Ans-
The data warehouse is structured by integrating data from different sources. Several factors separate the data warehouse from the operational database: since the two systems provide vastly different functionality and require different types of data, it is necessary to keep the warehouse database separate from the operational database. A data warehouse is a repository of knowledge gathered from multiple sources, kept under a unified schema, and usually residing on a single site. A data warehouse is built through the process of data cleaning, data integration, data transformation, data loading, and periodic data refresh. Loading is the final step in the ETL process: the extracted and transformed data is loaded into the target database. To make the data load efficient, it is necessary to index the database and disable constraints before loading the data. All three steps in the ETL process can run in parallel. Data extraction takes time, so the transformation phase is executed simultaneously, preparing the data for the loading stage. As soon as some data is ready, it is loaded without waiting for the previous steps to be completed.
Techniques/Methods-
Data Loading-
Data is physically moved to the data warehouse. The loading takes place within a "load window". The tendency is toward near real-time updates as data warehouses are increasingly used for operational applications.

Loading the Dimension Tables
The procedure for maintaining the dimension tables includes two functions: the initial loading of the tables and thereafter applying the changes on an ongoing basis. System-generated keys are used in a data warehouse, while the records in the source system have their own keys. Therefore, before an initial load or an ongoing load, the production keys must be converted to system-generated keys in the data warehouse. Another issue relates to the application of Type 1, Type 2, and Type 3 changes to the data warehouse; the figure shows how to handle it.
[Figure: Loading changes to a dimension table]
Loading the Fact Tables: History and Incremental Loads
The key of the fact table is the concatenation of keys from the dimension tables. For this reason, the dimension records are loaded first, and a concatenated key is then created from the keys of the corresponding dimension tables.
Methods for data loading
Cloud-based: ETL solutions in the cloud are frequently able to process data in real-time and are designed for speed
and scalability. They also contain the vendor’s experience and ready-made infrastructure, which may offer advice on
best practices for each organization’s particular configuration and requirements.
Batch processing: Data is moved every day or every week via ETL systems that use batch processing. Large data sets
and organizations that don’t necessarily require real-time access to their data are the greatest candidates for it.
Open-source: Since their codebases are shared, editable, and publicly available, many open-source ETL systems are
extremely affordable. Despite being a decent substitute for commercial solutions, many tools may still need some
hand-coding or customization.
data mining/process/tasks/tools/application?
Ans-
Data mining is the process of extracting knowledge or insights from large amounts of data using various statistical
and computational techniques. The data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and relationships in the data that can be used to make
informed decisions or predictions. This involves exploring the data using various techniques such as clustering,
classification, regression analysis, association rule mining, and anomaly detection.
Data mining has a wide range of applications across various industries, including marketing, finance, healthcare, and
telecommunications. For example, in marketing, data mining can be used to identify customer segments and target
marketing campaigns, while in healthcare, it can be used to identify risk factors for diseases and develop
personalized treatment plans.
The data mining process typically involves the following steps:
Business Understanding: This step involves understanding the problem that needs to be solved and defining the
objectives of the data mining project. This includes identifying the business problem, understanding the goals and
objectives of the project, and defining the KPIs that will be used to measure success. This step is important because
it helps ensure that the data mining project is aligned with business goals and objectives.
Data Understanding: This step involves collecting and exploring the data to gain a better understanding of its
structure, quality, and content. This includes understanding the sources of the data, identifying any data quality
issues, and exploring the data to identify patterns and relationships. This step is important because it helps ensure
that the data is suitable for analysis.
Data Preparation: This step involves preparing the data for analysis. This includes cleaning the data to remove any
errors or inconsistencies, transforming the data to make it suitable for analysis, and integrating the data from
different sources to create a single dataset. This step is important because it ensures that the data is in a format that
can be used for modeling.
Modeling: This step involves building a predictive model using machine learning algorithms. This includes selecting
an appropriate algorithm, training the model on the data, and evaluating its performance. This step is important
because it is the heart of the data mining process and involves developing a model that can accurately predict
outcomes on new data.
Evaluation: This step involves evaluating the performance of the model. This includes using statistical measures to
assess how well the model is able to predict outcomes on new data. This step is important because it helps ensure
that the model is accurate and can be used in the real world.
Deployment: This step involves deploying the model into the production environment. This includes integrating the
model into existing systems and processes to make predictions in real-time. This step is important because it allows
the model to be used in a practical setting and to generate value for the organization.
Data Mining techniques:-
1) Classification:-
The classification technique of data mining is used to classify data into different classes. It is based on machine learning: each item in a data set is assigned to one of a set of predefined classes or groups. This technique uses mathematical concepts such as decision trees, linear programming, neural networks, and statistics. In classification, software learns how to classify the data items into groups (a short scikit-learn sketch is given after this list). Algorithms used for classification include Logistic Regression, Naive Bayes, and K-Nearest Neighbors.
2) Clustering:-
The clustering technique of data mining is used to identify data items that are similar to each other. It helps to understand the similarities and differences between the data. Clustering builds meaningful classes and puts objects with similar characteristics into the same class. Algorithms used for clustering include the hierarchical clustering algorithm, etc.
3) Regression:-
Regression is a data mining technique that analyzes the relationship between variables and creates predictive models. A regression technique can analyze and predict results based on previously known data by applying formulas, which makes it very useful for deriving information from existing known information. Algorithms used for regression are the multivariate and multiple regression algorithms, etc.
4) Association:-
Association is a technique that helps find connections between two or more items. It can uncover hidden patterns in a data set; these patterns are discovered on the basis of relationships between items that occur in the same transaction. This technique is used in market basket analysis to identify products that customers frequently purchase together.
5) Outlier detection:-
Outlier detection is a data mining technique that refers to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc.
6) Sequential Patterns:-
Sequential pattern mining is a data mining technique that finds similar patterns and trends in transaction data over a certain period of time. With historical transaction data, a vendor can identify items that customers buy together at different times of the year. This technique also helps customers get better deals on products based on their previous purchase data.
7) Prediction:-
Prediction is a data mining technique that combines other data mining techniques such as sequential patterns, clustering, and classification. It analyzes past events to predict a future event. The prediction technique can be used in sales to predict profit or loss for the future.
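A short, hedged classification sketch with scikit-learn (assumed to be installed), using its bundled iris dataset and the K-Nearest Neighbors algorithm mentioned above; this is an illustration, not a prescribed workflow:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labelled dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# K-Nearest Neighbors: assigns each item to one of the predefined classes
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # fraction of test items classified correctly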
Tools For Data Mining-
1. Orange Data Mining: As Orange is component-based software, its components are called "widgets."
These widgets range from preprocessing and data visualization to the assessment of algorithms and predictive
modeling.
Widgets deliver significant functionalities such as:
 Displaying data table and allowing to select features
 Data reading
 Training predictors and comparison of learning algorithms
 Data element visualization, etc.
2. SAS Data Mining:
SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and data
management. SAS can mine data, change it, manage information from various sources, and analyze statistics. It
offers a graphical UI for non-technical users.
SAS data miner allows users to analyze big data and provide accurate insight for timely decision-making purposes.
SAS has distributed memory processing architecture that is highly scalable. It is suitable for data mining,
optimization, and text mining purposes.
3. DataMelt Data Mining:
DataMelt is a computation and visualization environment which offers an interactive structure for data analysis and
visualization. It is primarily designed for students, engineers, and scientists. It is also known as DMelt.
DMelt is a multi-platform utility written in JAVA. It can run on any operating system which is compatible with JVM
(Java Virtual Machine). It consists of Science and mathematics libraries.
Applications-
Scientific Analysis: Scientific simulations are generating bulks of data every day. This includes data collected from
nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of the analysis of these
data. Now we can capture and store more new data faster than we can analyze the old data already accumulated.
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions
often involve stealing valuable network resources. Data mining techniques play a vital role in detecting intrusions, network attacks, and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets and in classifying relevant data for intrusion detection.
Business Transactions: Every business transaction is recorded for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data for competitive decision-making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.
Market Basket Analysis: Market basket analysis is the careful study of the purchases made by a customer in a supermarket. It identifies the patterns of items that customers frequently purchase together. This analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this task (a small co-occurrence sketch is given after this list of applications).
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. Using EDM, we can perform various educational tasks.
Research: A data mining technique can perform predictions, classification, clustering, associations, and grouping of data with precision in the research area. Rules generated by data mining are unique for finding results. In most technical research in data mining, we create a training model and a testing model; this train/test strategy is used to measure the precision of the proposed model.
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and figure out which marketing activities will have the best effect in the upcoming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify the behavior patterns of risky customers, and identify fraudulent behavior.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer merchandise organization can apply data mining to improve its sales process to retailers.
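As a small sketch of the market basket idea above, the following counts how often pairs of items appear together in made-up transactions; real association rule mining would also compute confidence and lift, which is omitted here:

from collections import Counter
from itertools import combinations

# Hypothetical supermarket transactions (market baskets)
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter", "eggs"},
]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs that appear in at least half of the baskets ("frequent" pairs)
min_support = len(transactions) / 2
frequent_pairs = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent_pairs)  # e.g. {('bread', 'butter'): 3, ('bread', 'milk'): 2, ...}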
data mining vs query processing?
Ans-
Aspect | Data Mining | Query Processing
Data Source | Typically operates on large volumes of historical or real-time data. | Operates on structured data stored in databases or data warehouses.
Goal | To extract actionable insights, patterns, trends, and knowledge from data. | To efficiently retrieve specific information based on user queries.
Complexity | Involves complex algorithms and statistical techniques for pattern discovery. | Involves optimization algorithms and data access strategies for query execution.
Output | Generates patterns, rules, clusters, trends, and predictive models. | Returns query results as structured data (e.g., rows and columns).
User Involvement | Typically requires domain expertise and understanding of data mining techniques. | Users formulate queries using SQL or other query languages without needing deep technical knowledge.
Tools and Software | Uses data mining tools such as WEKA, RapidMiner, R, Python libraries like scikit-learn, etc. | Relational database management systems (RDBMS) like MySQL, PostgreSQL, Oracle, SQL Server, etc., handle query processing.
Performance Metrics | Evaluated based on accuracy, precision, recall, F1-score, AUC-ROC, etc. | Evaluated based on query execution time, optimization cost, throughput, and resource utilization.
Data Preparation | Involves data cleaning, transformation, normalization, and feature engineering. | Data is already stored in a structured format and may require indexing or partitioning for optimization.
Use Cases | Customer segmentation, fraud detection, market basket analysis, recommendation systems, etc. | Inventory management, transaction processing, reporting, data analysis, etc.
kdd process in brief?
Ans-
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process is iterative and requires multiple passes over the steps below to extract accurate knowledge from the data. The following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as removal of noisy and irrelevant data from collection.
 Cleaning in case of Missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with Data discrepancy detection and Data transformation tools.
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse). Data integration uses data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process.
Data Selection
Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data
collection. For this we can use Neural network, Decision Trees, Naive bayes, Clustering, and Regression methods.
Data Transformation
Data Transformation is defined as the process of transforming data into appropriate form required by mining
procedure. Data Transformation is a two step process:
Data Mapping: Assigning elements from source base to destination to capture transformations.
Code generation: Creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying truly interesting patterns representing knowledge based on given interestingness measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make decisions.
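To make the steps concrete, here is a minimal, illustrative sketch of a KDD-style pipeline, not part of the original notes. It assumes pandas and scikit-learn; the column names "age", "income", and "segment" are hypothetical.

```python
# Minimal KDD-style pipeline on a toy table (column names are hypothetical).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

raw = pd.DataFrame({"age": [23, 45, None, 31, 52],
                    "income": [30_000, 80_000, 52_000, 45_000, None]})

clean = raw.dropna()                                    # data cleaning: drop missing values
selected = clean[["age", "income"]]                     # data selection: keep relevant attributes
transformed = StandardScaler().fit_transform(selected)  # data transformation: normalize

model = KMeans(n_clusters=2, n_init=10, random_state=0)  # data mining: discover groups
labels = model.fit_predict(transformed)

# Pattern evaluation / knowledge representation: attach and inspect the discovered groups.
clean = clean.assign(segment=labels)
print(clean)
```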
Difference between KDD and Data Mining
Definition: KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data; data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective: KDD aims to find useful knowledge from data; data mining aims to extract useful information from data.
Techniques Used: KDD covers data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization; data mining covers association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output: KDD produces structured information, such as rules and models, that can be used to make decisions or predictions; data mining produces patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus: KDD focuses on the discovery of useful knowledge, rather than simply finding patterns in data; data mining focuses on the discovery of patterns or relationships in data.
clustering/problems/applications/tools/requirments?
Ans-
The process of making a group of abstract objects into classes of similar objects is known as clustering.
One group is treated as a cluster of data objects
In the process of cluster analysis, the first step is to partition the set of data into groups with the help of data
similarity, and then groups are assigned to their respective labels.
The biggest advantage of clustering over classification is that it can adapt to changes and helps single out useful features that differentiate different groups.
Applications of cluster analysis :
It is widely used in many applications such as image processing, data analysis, and pattern recognition.
It helps marketers to find the distinct groups in their customer base and they can characterize their customer groups
by using purchasing patterns.
It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with the same
capabilities.
It also helps in information discovery by classifying documents on the web.
Clustering Methods:
It can be classified based on the following categories.
 Model-Based Method
 Hierarchical Method
 Constraint-Based Method
 Grid-Based Method
 Partitioning Method
 Density-Based Method
Requirements of clustering in data mining:
The following are some points why clustering is important in data mining.
 Scalability – we require highly scalable clustering algorithms to work with large databases.
 Ability to deal with different kinds of attributes – Algorithms should be able to work with the type of data
such as categorical, numerical, and binary data.
 Discovery of clusters with attribute shape – The algorithm should be able to detect clusters in arbitrary
shapes and it should not be bounded to distance measures.
 Interpretability – The results should be comprehensive, usable, and interpretable.
 High dimensionality – The algorithm should be able to handle high dimensional space instead of only
handling low dimensional data.
Clustering Problems/Challenges-
1. Determining the optimal number of clusters: One of the fundamental challenges in clustering is determining the appropriate number of clusters in the data. Choosing an incorrect number of clusters can lead to poor clustering results (a minimal sketch of the elbow method for choosing k appears after this list).
2. Handling high-dimensional data: Clustering algorithms can struggle with high-dimensional data due to the curse of
dimensionality. As the number of dimensions increases, the distance or similarity measures between data points
become less meaningful. High-dimensional data can also lead to sparsity issues and increase the computational
complexity of clustering algorithms
3. Sensitivity to initialization: Many clustering algorithms, such as K-means, are sensitive to the initial configuration
or random seed. Different initializations can lead to different clustering outcomes, making it challenging to obtain
consistent and reliable results
4. Dealing with different cluster shapes and sizes: Clustering algorithms typically assume certain underlying
structures, such as convex or isotropic clusters. However, real-world data often contains clusters with irregular
shapes, varying sizes, or overlapping boundaries.
5. Handling noisy or outlier data: Outliers or noisy data points can significantly affect the clustering process by
distorting cluster boundaries or forming spurious clusters. Clustering algorithms should be robust enough to handle
outliers or provide mechanisms for outlier detection and removal. Outlier detection techniques like the Local Outlier
Factor (LOF) or clustering ensembles can help address this challenge.
6. Interpreting and validating clustering results: Unlike supervised learning, clustering often lacks explicit ground
truth labels for evaluation. Interpreting and validating clustering results can be subjective and require domain
knowledge or additional external information. Internal evaluation measures, such as silhouette scores or cohesion-
separation metrics, can provide some insight into the quality of clustering, but they may not always align with the
desired outcomes or real-world interpretations.
7. Scalability to large datasets: Some clustering algorithms have difficulty scaling to large datasets due to their
computational complexity or memory requirements. As the dataset size increases, the clustering process can
become time-consuming or even infeasible. Scalability is an important consideration, and techniques like mini-batch
clustering, distributed clustering, or online clustering can be employed to handle large-scale data.
Conclusion: Addressing these challenges requires careful consideration of the dataset characteristics, selecting
appropriate algorithms, and employing preprocessing techniques or algorithmic modifications. It’s important to
evaluate and validate the clustering results based on domain knowledge and additional analyses to ensure the
quality and reliability of the clustering outcomes.
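As referenced under challenge 1 above, the elbow method is a common heuristic for choosing the number of clusters. Below is a minimal sketch assuming scikit-learn; the data is synthetic and the range of k values is arbitrary.

```python
# Elbow method sketch for picking the number of clusters (synthetic data).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares

# The "elbow" is the k after which inertia stops dropping sharply.
for k, sse in inertias.items():
    print(f"k={k}: inertia={sse:.1f}")
```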
Tools-
IBM SPSS Modeler:
Developed by IBM, SPSS Modeler is a comprehensive data mining and predictive analytics software that includes
clustering algorithms.It offers a visual interface for building and deploying predictive models, including clustering
models for segmenting data.
SAS Enterprise Miner:
SAS Enterprise Miner is part of the SAS analytics suite and provides a range of data mining and machine learning
capabilities, including clustering.It offers various clustering algorithms such as k-means clustering, hierarchical
clustering, and others, along with tools for model evaluation and deployment.
RapidMiner:
RapidMiner is a powerful data science platform that supports various tasks, including data preprocessing, modeling,
and evaluation.It offers clustering algorithms like k-means, DBSCAN, hierarchical clustering, and others, along with a
user-friendly visual interface for workflow creation.
KNIME Analytics Platform:
KNIME is an open-source data analytics platform that also has commercial offerings with additional features and
support.It provides clustering capabilities through various plugins and extensions, allowing users to perform
clustering tasks using different algorithms.
Alteryx:
Alteryx is a data analytics and automation platform that includes clustering capabilities for segmenting and analyzing
data.It offers a drag-and-drop interface for data preparation, modeling, and visualization, making it accessible to
users with varying levels of technical expertise.
Statistica:
Statistica, now owned by TIBCO Software, is a data analytics and visualization platform that includes clustering
algorithms.
It provides tools for exploratory data analysis, clustering model building, and interpretation of clustering results.
clustering methods?
Ans- Clustering Methods
Clustering methods can be classified into the following categories −
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partition of data. Each
partition will represent a cluster and k ≤ n. It means that it will classify the data into k groups, which satisfy the
following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical
methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group.
It keeps on merging the objects or groups that are close to one another. It keeps doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as the
density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the radius of a
given cluster has to contain at least a minimum number of points.
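DBSCAN is the classic example of a density-based method. The following is only an illustrative sketch assuming scikit-learn; the eps and min_samples values are example settings, not tuned recommendations.

```python
# Illustrative density-based clustering with DBSCAN (parameters are examples only).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# A point joins a cluster if its eps-neighborhood contains at least min_samples points.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(set(db.labels_))  # cluster ids; -1 marks noise points outside any dense region
```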
Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite number of cells that form a grid
structure.
It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method
locates the clusters by clustering the density function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics,
taking outlier or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A
constraint refers to the user expectation or the properties of desired clustering results. Constraints provide us with
an interactive way of communication with the clustering process. Constraints can be specified by the user or the
application requirement.
hierarchical clustering method/TECHNIQUES/agglomerative hierarchical clustering/divisive hierarchical
clustering/diff?
Ans-
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by
treating every data point as a separate cluster. Then, it repeatedly executes the subsequent steps:
 Identify the 2 clusters which can be closest together, and
 Merge the 2 maximum comparable clusters. We need to continue these steps until all the clusters are
merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the
clusters in a dataset. The method starts by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. The result of hierarchical clustering is a tree-like
structure, called a dendrogram, which illustrates the hierarchical relationships among the clusters.
Types-
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
 Agglomerative Clustering
 Divisive clustering
1. Agglomerative Clustering
Initially consider every data point as an individual Cluster and at every step, merge the nearest pairs of the cluster. (It
is a bottom-up method). At first, every dataset is considered an individual entity or cluster. At every iteration, the
clusters merge with different clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical Clustering is:
 Calculate the similarity of one cluster with all the other clusters (calculate proximity matrix)
 Consider every data point as an individual cluster
 Merge the clusters which are highly similar or close to each other.
 Recalculate the proximity matrix for each cluster
 Repeat Steps 3 and 4 until only a single cluster remains.
Example-
Let’s say we have six data points A, B, C, D, E, and F.
Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other
clusters.
Step-2: In the second step comparable clusters are merged together to form a single cluster. Let’s say cluster (B) and
cluster (C) are very similar to each other therefore we merge them in the second step similarly to cluster (D) and (E)
and at last, we get the clusters [(A), (BC), (DE), (F)]
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters([(DE), (F)])
together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged together to form a new
cluster. We’re now left with clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
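The same bottom-up procedure can be sketched in code. This is only an illustration assuming SciPy is available; the coordinates for the six points A to F are arbitrary and chosen purely for demonstration, so the exact merge order may differ from the example above.

```python
# Bottom-up (agglomerative) clustering of six made-up points A..F using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[0.0, 0.0], [5.0, 5.1], [5.1, 5.0],
                   [9.0, 1.0], [9.1, 1.1], [2.0, 8.0]])

# linkage() repeatedly merges the two closest clusters, the bottom-up procedure described above.
Z = linkage(points, method="average")

# Cut the merge tree into 3 clusters and show which point landed where.
assignment = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(labels, assignment)))

# dendrogram(Z, labels=labels) would draw the full merge tree (requires matplotlib).
```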
2. Divisive Hierarchical clustering
We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative Hierarchical clustering. In
Divisive Hierarchical clustering, we take into account all of the data points as a single cluster and in every iteration,
we separate the data points from the clusters which aren’t comparable. In the end, we are left with N clusters.
Techniques of Hierarchical Clustering-
1. Single Linkage (Minimum Linkage):
- Single linkage, also known as minimum linkage, calculates the distance between two clusters based on the shortest distance between any point in one cluster and any point in the other cluster.
- In single linkage, the distance between two clusters A and B is determined by the minimum distance between any point in A and any point in B.
- Single linkage tends to produce long, elongated clusters and is sensitive to outliers or noise in the data.
- It is computationally efficient but can lead to the chaining effect, where clusters are connected through a few distant points.
2. Complete Linkage (Maximum Linkage):
- Complete linkage, also known as maximum linkage, calculates the distance between two clusters based on the maximum distance between any point in one cluster and any point in the other cluster.
- In complete linkage, the distance between two clusters A and B is determined by the maximum distance between any point in A and any point in B.
- Complete linkage tends to produce compact, spherical clusters and is less sensitive to outliers compared to single linkage.
- It is computationally more expensive than single linkage but can handle clusters of varying shapes and sizes more effectively.
3. Average Linkage:
- Average linkage calculates the distance between two clusters based on the average distance between all pairs of points in the two clusters.
- In average linkage, the distance between two clusters A and B is determined by averaging the distances between every point in A and every point in B.
- Average linkage strikes a balance between single linkage and complete linkage, producing clusters that are less sensitive to outliers compared to single linkage while avoiding the tight clustering tendency of complete linkage.
- It is computationally moderate and often provides reasonable results for a wide range of datasets.
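A short sketch comparing the three linkage criteria above on the same data (illustrative only; assumes SciPy and scikit-learn, with synthetic blobs):

```python
# Comparing single, complete, and average linkage on the same synthetic data.
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster

X, _ = make_blobs(n_samples=60, centers=3, random_state=1)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)            # same data, different cluster-distance rule
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, "-> cluster sizes:", sorted((labels == c).sum() for c in set(labels)))
```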
Agglomerative Clustering vs Divisive Clustering:
Direction: Agglomerative is a bottom-up approach where each data point starts as a separate cluster and clusters merge iteratively; divisive is a top-down approach where all data points start in one cluster and are split iteratively.
Process: Agglomerative starts with individual data points as clusters and merges similar clusters until reaching the desired number of clusters; divisive starts with all data points in one cluster and recursively divides them into smaller clusters until reaching the desired number of clusters.
Complexity: Agglomerative is generally slower than divisive clustering, as it involves merging clusters iteratively; divisive is generally faster, as it involves splitting clusters iteratively.
Initialization: Agglomerative requires defining a similarity/distance measure and a linkage method (e.g., single, complete, average); divisive requires defining a criterion for splitting clusters (e.g., variance, distance threshold).
Cluster Structure: Agglomerative tends to produce clusters with a nested hierarchical structure (dendrogram); divisive tends to produce clusters with a flatter structure.
Suitability: Agglomerative is suitable for datasets where the number of clusters is not known a priori and a hierarchical structure is desired; divisive is suitable when the number of clusters is known a priori or a flat clustering structure is preferred.
Algorithm Types: Agglomerative examples include single-linkage, complete-linkage, and average-linkage hierarchical clustering algorithms; divisive examples include bisecting (divisive) k-means and DIANA.
Scalability: Agglomerative can be computationally intensive, especially for large datasets, due to the iterative merging process; divisive can be more scalable, especially when efficient initialization and termination conditions are used.
Interpretability: Agglomerative provides a hierarchical representation (dendrogram) that shows how clusters are merged; divisive provides clusters that may be easier to interpret and analyze in some cases.
cluster evaluation/demands/measures?
Ans-
Cluster evaluation in data mining refers to the process of assessing the quality and effectiveness of clustering
algorithms in partitioning a dataset into meaningful groups or clusters. Various metrics and techniques are used to
evaluate the performance of clustering algorithms and the resulting clusters. Here are some common methods and
metrics used for cluster evaluation:
External Index Measures:
Purity: Measures how accurately clusters match known class labels or ground truth. Higher purity indicates better
clustering performance.
Rand Index: Compares the similarity between pairs of data points in the clusters with the similarity between pairs in
the ground truth. A higher Rand Index indicates better clustering.
Jaccard Index: Measures the similarity between two sets by comparing the intersection over the union of the sets.
Higher Jaccard Index values indicate better clustering.
Internal Cluster Evaluation:
Silhouette Coefficient: Measures the compactness and separation of clusters. Values close to +1 indicate well-
separated clusters, while values close to 0 indicate overlapping clusters.
Davies-Bouldin Index: Evaluates the average similarity of each cluster with its most similar cluster, where lower
values indicate better clustering.
Dunn Index: Measures the compactness and separation of clusters using the ratio of the minimum inter-cluster
distance to the maximum intra-cluster distance. Higher Dunn Index values indicate better clustering.
Cluster Stability:
Cluster Stability Index: Measures the stability of clusters by comparing clustering results on subsamples or
perturbed versions of the dataset. Higher stability indicates more robust clustering.
Visual Inspection and Interpretability:
Cluster Visualization: Visual inspection of clustering results using techniques like scatter plots, dendrograms,
heatmaps, or t-SNE projections to assess cluster separability and structure.
Interpretability: Evaluating the interpretability and meaningfulness of clusters based on domain knowledge or
expert judgment.
Cross-Validation and Resampling:
Cross-Validation: Using techniques like k-fold cross-validation or holdout validation to assess clustering performance
on different subsets of the data.
Bootstrap Resampling: Generating multiple bootstrap samples and evaluating clustering stability and consistency
across these samples.
Domain-Specific Metrics:
Application-Specific Metrics: Tailoring cluster evaluation metrics based on the specific goals and requirements of the
data mining application (e.g., customer segmentation, anomaly detection, pattern recognition).
By employing these evaluation methods and metrics, data miners can assess the quality, robustness, and
effectiveness of clustering algorithms, leading to more reliable and meaningful cluster analysis results.
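For the internal measures mentioned above, scikit-learn provides ready-made functions. A minimal sketch follows; the data is synthetic and the choice of k is arbitrary.

```python
# Internal cluster evaluation with scikit-learn (silhouette and Davies-Bouldin).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

print("Silhouette (closer to +1 is better):", round(silhouette_score(X, labels), 3))
print("Davies-Bouldin (lower is better):", round(davies_bouldin_score(X, labels), 3))
```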
Measures-
Cluster evaluation measures fall into three types - external, internal, and relative - each described below with examples:
1. External Evaluation Measures:
 Definition: These measures compare the clustering results with a ground truth or known
class labels to assess how well the clusters correspond to the actual groups in the data.
 Examples: Purity, Rand Index, Jaccard Index, Fowlkes-Mallows Index.
 Explanation: Purity measures the extent to which clusters contain data points from a single
class. The Rand Index measures the similarity between pairs of data points in the clusters and
pairs in the ground truth. The Jaccard Index measures the similarity between two sets based
on the intersection over the union of the sets.
2. Internal Evaluation Measures:
 Definition: These measures evaluate the quality of clusters based solely on the data itself,
without reference to external information or ground truth labels.
 Examples: Silhouette Coefficient, Davies-Bouldin Index, Dunn Index, Within Cluster Sum
of Squares (WCSS).
 Explanation: The Silhouette Coefficient measures the compactness and separation of
clusters. The Davies-Bouldin Index evaluates the average similarity of each cluster with its
most similar cluster. The Dunn Index measures the compactness and separation of clusters
using the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
3. Relative Evaluation Measures:
 Definition: These measures compare different clustering results or algorithms to determine
which one produces better clusters.
 Examples: Cluster Stability Index, Consensus Index, Relative Validity Measures.
 Explanation: The Cluster Stability Index quantifies the stability of clusters by comparing
clustering results across different samples or perturbed versions of the dataset. Consensus
Index measures the consensus among multiple clustering results. Relative Validity Measures
compare clustering solutions based on internal or external criteria.
These three types of cluster evaluation measures provide a comprehensive framework for assessing the
quality, robustness, and effectiveness of clustering algorithms and the resulting clusters. External measures
validate the clustering against known ground truth, internal measures evaluate clustering based on data
characteristics, and relative measures compare different clustering solutions or algorithms to determine the
best clustering outcome.
k mean algorithm/ adv disadv?
Ans-
Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data and
enabling the algorithm to operate on that data without supervision. Without any previous data training, the
machine’s job in this case is to organize unsorted data according to parallels, patterns, and variations.
K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by randomly placing the cluster centroids in the space. Then each data point is assigned to one of the clusters based on its distance from the cluster centroid. After assigning each point to one of the clusters, new cluster centroids are computed. This process runs iteratively until it finds good clusters. In this analysis we assume that the number of clusters is given in advance and we have to put points into one of the groups.
In some cases, K is not clearly defined, and we have to think about the optimal value of K. K-means clustering performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is faster compared to other clustering techniques and provides strong coupling between the data points. However, K-means does not provide clear information regarding the quality of clusters, and different initial assignments of cluster centroids may lead to different clusters. Also, the K-means algorithm is sensitive to noise and may get stuck in local minima.
What is the objective of k-means clustering?
The goal of clustering is to divide the population or set of data points into a number of groups so that the data points
within each group are more comparable to one another and different from the data points within the other groups.
It is essentially a grouping of things based on how similar and different they are to one another.
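A minimal k-means run, assuming scikit-learn and synthetic data with k = 3 taken as known (illustrative only):

```python
# Minimal k-means run with scikit-learn (synthetic data; k=3 is assumed known).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)   # assign each point to the nearest centroid, refined iteratively

print("Centroids:\n", km.cluster_centers_)
print("First ten assignments:", labels[:10])
```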
Advantages
Simplicity and ease of use
The K-means algorithm's simplicity is a major advantage. Its straightforward concept of partitioning data into
clusters based on similarity makes it easy to understand and implement. This accessibility is especially valuable for
newcomers to the field of machine learning.
Efficiency and speed
K-means is known for its computational efficiency, making it suitable for handling large datasets. Its complexity is
relatively low, allowing it to process data quickly and efficiently. This speed is advantageous for real-time or near-
real-time applications.
Scalability
The algorithm's efficiency scales well with the increase in the number of data points. This scalability makes K-means
applicable to datasets of varying sizes, from small to large.
Unsupervised Learning
K-means operates under the unsupervised learning paradigm, requiring no labeled data for training. It autonomously
discovers patterns within data, making it valuable for exploratory data analysis and uncovering hidden insights.
Disadvantages
Sensitive to initial placement
K-means' convergence to a solution is sensitive to the initial placement of cluster centroids. Different initial
placements can result in different final clusterings. Techniques like the k-means++ initialization method help mitigate
this issue.
Assumption of equal-sized Clusters and spherical shapes
K-means assumes that clusters are of equal sizes and have spherical shapes. This assumption can lead to suboptimal
results when dealing with clusters of varying sizes or non-spherical shapes.
Dependence on number of Clusters
The algorithm's performance heavily depends on the correct choice of the number of clusters ('k'). Incorrect 'k'
values can lead to clusters that are not meaningful or informative.
Sensitive to outliers
K-means is sensitive to outliers, which can skew the placement of cluster centroids and affect the overall clustering results.
Not suitable for non-linear data
K-means assumes that clusters are separated by linear boundaries. It may not perform well on datasets with complex or non-linear cluster structures.
Outliers?
Ans-
Outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different
manner. They can be caused by measurement or execution errors. The analysis of outlier data is referred to as
outlier analysis or outlier mining. An outlier cannot be termed noise or an error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
 Global (or Point) Outliers
 Collective Outliers
 Contextual (or Conditional) Outliers
1. Global Outliers
1. Definition: Global outliers are data points that deviate significantly from the overall distribution of a dataset.
2. Causes: Errors in data collection, measurement errors, or truly unusual events can result in global outliers.
3. Impact: Global outliers can distort data analysis results and affect machine learning model performance.
4. Detection: Techniques include statistical methods (e.g., z-score, Mahalanobis distance), machine learning algorithms (e.g., isolation forest, one-class SVM), and data visualization techniques (a minimal detection sketch follows this list).
5. Handling: Options may include removing or correcting outliers, transforming data, or using robust methods.
6. Considerations: Carefully considering the impact of global outliers is crucial for accurate data analysis and machine
learning model outcomes.
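As referenced in the detection point above, here is a minimal sketch of two of those techniques, a z-score rule and scikit-learn's IsolationForest, applied to a made-up one-dimensional sample:

```python
# Two outlier-detection techniques on a made-up 1-D sample.
import numpy as np
from sklearn.ensemble import IsolationForest

values = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12, 11], dtype=float)

# z-score: flag points more than 2.5 standard deviations from the mean.
z = (values - values.mean()) / values.std()
print("z-score outliers:", values[np.abs(z) > 2.5])

# IsolationForest: -1 marks points the model treats as likely outliers.
iso = IsolationForest(contamination=0.1, random_state=0)
flags = iso.fit_predict(values.reshape(-1, 1))
print("IsolationForest outliers:", values[flags == -1])
```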
2. Collective Outliers
1. Definition: Collective outliers are groups of data points that collectively deviate significantly from the overall
distribution of a dataset.
2. Characteristics: Collective outliers may not be outliers when considered individually, but as a group, they exhibit
unusual behavior.
3. Detection: Techniques for detecting collective outliers include clustering algorithms, density-based methods, and
subspace-based approaches.
4. Impact: Collective outliers can represent interesting patterns or anomalies in data that may require special
attention or further investigation.
5. Handling: Handling collective outliers depends on the specific use case and may involve further analysis of the
group behavior, identification of contributing factors, or considering contextual information.
6. Considerations: Detecting and interpreting collective outliers can be more complex than individual outliers, as the
focus is on group behavior rather than individual data points. Proper understanding of the data context and domain
knowledge is crucial for effective handling of collective outliers.
3. Contextual Outliers
1. Definition: Contextual outliers are data points that deviate significantly from the expected behavior within a
specific context or subgroup.
2. Characteristics: Contextual outliers may not be outliers when considered in the entire dataset, but they exhibit
unusual behavior within a specific context or subgroup.
3. Detection: Techniques for detecting contextual outliers include contextual clustering, contextual anomaly
detection, and context-aware machine learning approaches.
4. Contextual Information: Contextual information such as time, location, or other relevant factors are crucial in
identifying contextual outliers.
5. Impact: Contextual outliers can represent unusual or anomalous behavior within a specific context, which may
require further investigation or attention.
6. Handling: Handling contextual outliers may involve considering the contextual information, contextual
normalization or transformation of data, or using context-specific models or algorithms.
7. Considerations: Proper understanding of the context and domain-specific knowledge is crucial for accurate
detection and interpretation of contextual outliers, as they may vary based on the specific context or subgroup being
considered.
data mining trends?
Ans-1. Application exploration
Data mining is increasingly used to explore applications in other areas, such as financial analysis,
telecommunications, biomedicine, wireless security, and science.
2. Multimedia Data Mining
This is one of the latest methods which is catching up because of the growing ability to capture useful data
accurately. It involves data extraction from different kinds of multimedia sources such as audio, text, hypertext,
video, images, etc. The data is converted into a numerical representation in different formats. This method can be
used in clustering and classifications, performing similarity checks, and identifying associations.
3. Ubiquitous Data Mining
This method involves mining data from mobile devices to get information about individuals. Despite several challenges, such as complexity, privacy, and cost, this method has enormous potential in various industries, especially in studying human-computer interactions.
4. Distributed Data Mining
This type of data mining is gaining popularity as it involves mining a huge amount of information stored in different
company locations or at different organizations. Highly sophisticated algorithms are used to extract data from
different locations and provide proper insights and reports based on them.
5. Embedded Data Mining
Data mining features are increasingly finding their way into many enterprise software use cases, from sales
forecasting in CRM SaaS platforms to cyber threat detection in intrusion detection/prevention systems. The
embedding of data mining into vertical market software applications enables prediction capabilities for any number
of industries and opens up new realms of possibilities for unique value creation.
6. Spatial and Geographic Data Mining
This new trending type of data mining includes extracting information from environmental, astronomical, and
geographical data, including images taken from outer space. This type of data mining can reveal various aspects such
as distance and topology, which are mainly used in geographic information systems and other navigation
applications.
7. Time Series and Sequence Data Mining
The primary application of this type of data mining is the study of cyclical and seasonal trends. This practice is also
helpful in analyzing even random events which occur outside the normal series of events. Retail companies mainly
use this method to assess customers' buying patterns and behaviors.
8. Data Mining Dominance in the Pharmaceutical And Health Care Industries
Both the pharmaceutical and health care industries have long been innovators in the category of data mining. The
recent rapid development of coronavirus vaccines is directly attributed to advances in pharmaceutical testing data
mining techniques, specifically signal detection during the clinical trial process for new drugs.
9. Increasing Automation In Data Mining
Today's data mining solutions typically integrate ML and big data stores to provide advanced data management
functionality alongside sophisticated data analysis techniques. Earlier incarnations of data mining involved manual
coding by specialists with a deep background in statistics and programming.
10. Data Mining Vendor Consolidation
If history is any indication, significant product consolidation in the data mining space is imminent as larger database
vendors acquire data mining tooling startups to augment their offerings with new features. The current fragmented
market and a broad range of data mining players resemble the adjacent big data vendor landscape that continues to
undergo consolidation.
11. Biological data mining
Mining DNA and protein sequences, mining high dimensional microarray data, biological pathway and network
analysis, link analysis across heterogeneous biological data, and information integration of biological data by data
mining are interesting topics for biological data mining research.
web mining/content mining/structure mining/usage mining?
Ans-
Web mining can widely be seen as the application of adapted data mining techniques to the web, whereas data
mining is defined as the application of the algorithm to discover patterns on mostly structured data embedded into a
knowledge discovery process. Web mining has a distinctive property to provide a set of various data types. The web
has multiple aspects that yield different approaches for the mining process, such as web pages consist of text, web
pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to
the differentiation between the three areas are web content mining, web structure mining, web usage mining.
1. Web Content Mining:
Web content mining can be used to extract useful data, information, knowledge from the web page content. In web
content mining, each web page is considered as an individual document. The individual can take advantage of the
semi-structured nature of web pages, as HTML provides information that concerns not only the layout but also
logical structure. The primary task of content mining is data extraction, where structured data is extracted from
unstructured websites. The objective is to facilitate data aggregation over various web sites by using the extracted
structured data. Web content mining can be utilized to distinguish topics on the web. For Example, if any user
searches for a specific task on the search engine, then the user will get a list of suggestions.
2. Web Structure Mining:
Web structure mining can be used to discover the link structure of hyperlinks, i.e., how data is linked across web pages or within a direct link network. In web structure mining, an individual considers the web as a directed graph,
with the web pages being the vertices that are associated with hyperlinks. The most important application in this
regard is the Google search engine, which estimates the ranking of its outcomes primarily with the PageRank
algorithm. It characterizes a page to be exceptionally relevant when frequently connected by other highly related
pages. Structure and content mining methodologies are usually combined. For example, web structured mining can
be beneficial to organizations to regulate the network between two commercial sites.
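Since the paragraph above mentions the PageRank algorithm, here is a tiny, illustrative PageRank computation using the networkx Python library (assumed available); the pages and links are fictional.

```python
# Tiny PageRank computation on a made-up link graph using networkx.
import networkx as nx

web = nx.DiGraph()
# Directed edges mean "page X links to page Y".
web.add_edges_from([
    ("home", "products"), ("home", "blog"),
    ("blog", "products"), ("products", "home"),
    ("blog", "home"),
])

scores = nx.pagerank(web, alpha=0.85)  # alpha is the usual damping factor
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```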
3. Web Usage Mining:
Web usage mining is used to extract useful data, information, knowledge from the weblog records, and assists in
recognizing the user access patterns for web pages. When mining the usage of web resources, the individual considers the records of requests made by visitors of a website, which are often collected as web server logs. While the content and structure of the collection of web pages follow the intentions of the authors of the pages, the individual requests demonstrate how the consumers see these pages. Web usage mining may disclose relationships that were not intended by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
I. Session and visitor analysis:
The analysis of preprocessed data can be accomplished in session analysis, which incorporates the guest records, days, time, sessions, etc. This data can be utilized to analyze the visitor's behavior. A document is created after this analysis, which contains the details of repeatedly visited web pages and common entry and exit pages.
II. OLAP (Online Analytical Processing):
OLAP accomplishes a multidimensional analysis of advanced data. OLAP can be performed on various parts of log-related data over a specific period, and OLAP tools can be used to infer important business intelligence metrics.
database approach for web mining categories?
Ans-
Multilevel Database Structure:
 The system incorporates multiple levels of databases, including raw data storage, intermediate processing
databases, and analytical databases.
 Raw data storage: Stores the original weblog data collected from web servers, including access logs, user
interactions, clickstreams, session data, etc.
 Intermediate processing databases: Store preprocessed and transformed data ready for analysis, such as
cleaned and normalized log entries, extracted features, and aggregated data.
 Analytical databases: Store the results of data mining and analysis tasks, including patterns, trends, clusters,
and insights derived from weblog data.
W3QL (Web Query Language):
 W3QL is a specialized query language designed for web-related queries, enabling expressive and efficient
querying of weblog data and web-related information.
 W3QL includes functionalities for querying web logs, extracting specific data elements (e.g., user sessions,
page views, referrer information), filtering data based on criteria (e.g., time range, user agents), and
aggregating data for analysis.
 W3QL may also support advanced features such as pattern matching, sequence analysis, and correlation
analysis specific to web mining tasks.
Components of the System:
 Data Collection: Weblog data is collected from web servers, web applications, or third-party sources and
stored in the raw data storage layer.
 Preprocessing Module: Data preprocessing tasks such as cleaning, filtering, normalization, and feature
extraction are performed on the raw data to prepare it for analysis.
 W3QL Query Engine: The system includes a query engine that interprets and executes W3QL queries on the
intermediate processing databases, allowing users to query and retrieve relevant web data.
 Data Mining and Analysis: Advanced data mining algorithms and analytical techniques are applied to the
preprocessed data in the analytical databases to discover patterns, trends, anomalies, user behavior, etc.
 Visualization and Reporting: The system may include modules for visualizing analysis results through charts,
graphs, dashboards, and generating reports for stakeholders and decision-makers.
Web Mining Tasks Supported:
 The multilevel database web query system using W3QL can support various web mining tasks, including:
 User behavior analysis: Analyzing user sessions, navigation patterns, clickstreams, and conversions.
 Content analysis: Analyzing web content, keywords, topics, sentiment analysis, and content
recommendation.
 Web structure analysis: Analyzing web graphs, hyperlinks, page ranking, and link analysis.
 Usage mining: Analyzing user interactions, access logs, session data, and building recommendation systems.
 Anomaly detection: Detecting unusual patterns, fraud, security threats, and abnormal user behavior.
By integrating multilevel databases, a specialized query language like W3QL, and advanced analytical capabilities,
this system facilitates comprehensive weblog analysis and web mining tasks, providing valuable insights for
businesses, researchers, and web administrators.
web mining tasks/tools/applications?
Ans-
Web mining encompasses a range of tasks aimed at extracting valuable insights, patterns, and knowledge from web
data. These tasks can be broadly categorized into three main types: content mining, structure mining, and usage
mining. Here's an overview of each type along with specific tasks within web mining:
Content Mining:
Web Content Extraction: Extracting relevant information from web pages, including text, images, videos, metadata,
and structured data.
Text Mining: Analyzing and extracting insights from textual content, such as sentiment analysis, topic modeling,
keyword extraction, and named entity recognition.
Multimedia Mining: Analyzing and processing multimedia content like images, videos, and audio files for content-
based retrieval, classification, and recommendation.
Web Page Clustering: Grouping similar web pages based on content similarity, enabling organization and navigation
of web data.
Duplicate Detection: Identifying and removing duplicate or near-duplicate content across web pages to improve data
quality.
Structure Mining:
Web Link Analysis: Analyzing hyperlink structures, web graphs, and page connections to understand web topology,
page ranking, and link-based algorithms.
PageRank Calculation: Calculating PageRank scores to measure the importance or relevance of web pages based on
their inbound links.
Community Detection: Identifying communities or clusters within web graphs to understand communities of interest
or related topics.
Web Navigation Analysis: Analyzing user navigation paths, clickstreams, and session data to optimize website
usability and user experience.
Ontology Extraction: Extracting ontologies or semantic structures from web data to represent knowledge domains
and relationships.
Usage Mining:
User Behavior Analysis: Analyzing user interactions, preferences, behavior patterns, and engagement metrics on
websites or web applications.
Sessionization: Segmenting user sessions based on time intervals, page views, actions, and events to analyze user
journeys and behavior flows.
Recommendation Systems: Building personalized recommendation systems based on user profiles, historical
behavior, collaborative filtering, and content-based filtering.
Anomaly Detection: Detecting anomalies, outliers, fraud, security threats, and unusual patterns in user behavior or
web traffic.
Clickstream Analysis: Analyzing sequences of user clicks, navigation patterns, and conversion paths to optimize
website design, content placement, and marketing strategies.
These tasks are crucial for various applications such as e-commerce, digital marketing, user experience optimization,
information retrieval, fraud detection, and business intelligence on the web. Effective web mining enables
organizations to make data-driven decisions, improve user engagement, enhance search relevancy, and gain
competitive insights in the digital landscape.
Tools-
SAS Enterprise Miner:
Industry: Banking, finance, healthcare, retail, marketing.
Features: Offers data mining and machine learning capabilities, including web data extraction, text mining, predictive
modeling, and cluster analysis.
Use Cases: Customer segmentation, fraud detection, churn analysis, sentiment analysis, market basket analysis.
IBM SPSS Modeler:
Industry: Retail, telecommunications, healthcare, education.
Features: Provides data mining, predictive analytics, and text analytics functionalities, including web data extraction,
social media analysis, and sentiment analysis.
Use Cases: Customer profiling, campaign optimization, risk assessment, demand forecasting, social media
monitoring.
RapidMiner:
Industry: Manufacturing, e-commerce, energy, government.
Features: Offers a visual workflow environment for data preparation, modeling, and analysis, including web scraping,
text mining, machine learning, and visualization.
Use Cases: Predictive maintenance, supply chain optimization, customer churn prediction, sentiment analysis,
anomaly detection.
KNIME Analytics Platform:
Industry: Healthcare, pharmaceuticals, finance, retail.
Features: Provides an open-source platform for data integration, analysis, and reporting, with extensions for web
data extraction, text processing, and machine learning.
Use Cases: Drug discovery, patient analytics, financial risk modeling, customer segmentation, market research.
Applications-
Web mining applications span a wide range of industries and use cases, leveraging techniques and tools to extract
valuable insights, patterns, and knowledge from web data. Here are some common applications of web mining
across various domains:
E-Commerce and Retail:
Customer Segmentation: Analyzing customer behavior, preferences, and purchase history to segment customers for
targeted marketing campaigns and personalized recommendations.
Market Basket Analysis: Identifying associations and patterns in customer shopping baskets to optimize product
placement, cross-selling, and upselling strategies.
Competitor Analysis: Monitoring competitor websites, pricing strategies, product offerings, and customer reviews to
gain competitive intelligence.
Digital Marketing and Advertising:
Social Media Monitoring: Analyzing social media platforms for brand sentiment, customer feedback, trends,
influencers, and campaign performance.
Ad Campaign Optimization: Analyzing web traffic, click-through rates, conversions, and ad performance metrics to
optimize digital advertising campaigns.
Search Engine Optimization (SEO): Analyzing search engine results, keywords, backlinks, and website traffic to
improve search engine rankings and visibility.
Finance and Banking:
Fraud Detection: Analyzing transaction data, user behavior, and account activities to detect fraudulent patterns,
anomalies, and suspicious activities.
Risk Assessment: Assessing credit risk, investment opportunities, market trends, and financial indicators using web
data and market information.
Customer Insights: Understanding customer preferences, investment behaviors, financial goals, and market
sentiments to offer personalized financial services.
Healthcare and Pharmaceuticals:
Drug Discovery: Analyzing biomedical literature, research papers, clinical trials, and drug interactions to identify
potential drug candidates and therapeutic targets.
Patient Analytics: Analyzing patient records, medical histories, treatment outcomes, and disease patterns to improve
healthcare delivery, patient care, and disease management.
text mining/types(agent based)/techniques/steps/tools?
Ans-
What is Text Mining-
Text mining is a component of data mining that deals specifically with unstructured text data. It involves the use of
natural language processing (NLP) techniques to extract useful information and insights from large amounts of
unstructured text data. Text mining can be used as a preprocessing step for data mining or as a standalone process
for specific tasks.
Text Mining in Data Mining-
Text mining in data mining is mostly used to transform unstructured text data into structured data that can be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Techniques-
Tokenization:
Definition: Tokenization is the process of breaking down a text into smaller units, typically words, phrases, or
sentences, known as tokens.
Example: The sentence "The quick brown fox jumps over the lazy dog" can be tokenized into individual words:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
Purpose: Tokenization is a fundamental step in text processing as it allows for further analysis and processing of text
data at a granular level.
Term Frequency (TF):
Definition: Term frequency (TF) is a metric used to quantify the frequency of a term (word) within a document.
Formula: TF(term) = (Number of times term appears in a document) / (Total number of terms in the document)
Example: In the sentence "The quick brown fox jumps over the lazy dog," the TF for the term "fox" is 1/9 since "fox"
appears once in the nine-word sentence.
Purpose: TF is used in techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to weigh the
importance of terms in documents and text corpora.
Stemming:
Definition: Stemming is the process of reducing words to their base or root form by removing suffixes or prefixes.
Example: The word "running" can be stemmed to "run," "played" to "play," and "happily" to "happy."
Purpose: Stemming helps in normalizing words and reducing variations, which can improve text processing tasks
such as search, retrieval, and indexing.
Lemmatization:
Definition: Lemmatization is the process of reducing words to their canonical or dictionary form (lemma), which
involves identifying the base form of a word based on its part of speech and context.
Example: The word "better" can be lemmatized to "good," "cats" to "cat," and "went" to "go."
Purpose: Lemmatization is more sophisticated than stemming as it considers linguistic rules and context, resulting in
more accurate transformations of words to their base forms.
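A small sketch tying the four techniques above together, assuming the NLTK library is installed (the nltk.download() calls fetch the required resources on first run):

```python
# Tokenization, term frequency, stemming, and lemmatization with NLTK.
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("wordnet")

text = "The quick brown fox jumps over the lazy dog"

tokens = word_tokenize(text.lower())                            # tokenization
tf = {t: c / len(tokens) for t, c in Counter(tokens).items()}   # term frequency
print("TF of 'fox':", tf["fox"])                                # 1/9, as in the example above

stemmer = PorterStemmer()
print("Stem of 'jumps':", stemmer.stem("jumps"))                # stemming -> 'jump'

lemmatizer = WordNetLemmatizer()
print("Lemma of 'jumps':", lemmatizer.lemmatize("jumps", pos="v"))  # lemmatization -> 'jump'
```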
Steps-
 Gathering unstructured information from various sources accessible in various document formats, for example, plain text, web pages, PDF records, etc.
 Pre-processing and data cleansing tasks are performed to distinguish and eliminate inconsistency in the data. The data cleansing process makes sure to capture the genuine text; it eliminates stop words, performs stemming (the process of identifying the root of a certain word), and indexes the data.
 Processing and controlling tasks are applied to review and further clean the data set.
 Pattern analysis is implemented in the Management Information System.
 Information processed in the above steps is utilized to extract important and applicable data for a powerful and convenient decision-making process and trend analysis.
Tools-
Python Natural Language Toolkit (NLTK):
Description: NLTK is a leading platform for natural language processing (NLP) and text mining in Python. It provides
libraries and tools for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, sentiment
analysis, and more.
Features: Comprehensive NLP functionalities, easy-to-use APIs, support for various text processing tasks and
algorithms.
Apache OpenNLP:
Description: OpenNLP is an open-source NLP library that offers tools and models for tasks like tokenization, sentence
detection, named entity recognition (NER), part-of-speech tagging, chunking, parsing, and coreference resolution.
Features: Scalable and customizable, supports multiple languages, provides pre-trained models for text analysis
tasks.
GATE (General Architecture for Text Engineering):
Description: GATE is a powerful open-source text mining and NLP framework that supports various text processing
tasks, including information extraction, document classification, sentiment analysis, ontology engineering, and text
annotation.
Features: Extensible architecture, graphical development environment, integration with external tools and libraries,
support for multiple languages.
RapidMiner:
Description: RapidMiner is an integrated data science platform that offers text mining capabilities along with data
preparation, machine learning, predictive analytics, and visualization tools. It supports tasks like text preprocessing,
sentiment analysis, text classification, and topic modeling.
Features: Visual workflow environment, drag-and-drop interface, machine learning algorithms for text analysis,
deployment options.
KNIME Analytics Platform:
Description: KNIME is an open-source data analytics platform that includes text mining and NLP extensions. It
provides tools for text preprocessing, text classification, sentiment analysis, named entity recognition, topic
modeling, and text mining workflows.
Features: Graphical user interface, extensive collection of nodes for text processing tasks, integration with other
data sources and analytics tools.
IBM Watson Natural Language Understanding:
Description: IBM Watson NLU is a cloud-based text analysis service that offers advanced NLP capabilities, including
entity recognition, sentiment analysis, keyword extraction, concept tagging, emotion analysis, and document
categorization.
Features: Cognitive computing capabilities, API-based access, multilingual support, customizable models, integration
with IBM Watson ecosystem.
Lexalytics Salience:
Description: Lexalytics Salience is a text analytics and sentiment analysis software that provides tools for entity
extraction, concept extraction, sentiment scoring, language detection, and categorization of text data.
Features: Named entity recognition, entity linking, thematic extraction, summarization, industry-specific models
(e.g., finance, healthcare, social media).
data visualization/dashboard-kpi?
Ans-
Data visualization is the graphical representation of data points and information, designed to make them easy and
quick for users to understand. A good visualization has a clear meaning and purpose and is easy to interpret without
requiring additional context. Data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data using visual elements such as charts, graphs, and maps.
Characteristics of an Effective Graphical Visual:
It shows or visualizes data clearly in an understandable manner.
It encourages viewers to compare different pieces of data.
It closely integrates statistical and verbal descriptions of the data set.
It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human brain tends to focus on
visual data more than written data.
It also helps in identifying areas that need more attention and improvement.
Using graphical representation, a story can be told more efficiently; it also takes less time to understand a picture
than to understand textual data.
Categories of Data Visualization:
Data visualization is critical to market research, where both numerical and categorical data can be visualized; this
increases the impact of insights and helps reduce the risk of analysis paralysis. Data visualization is categorized
into the following categories:
Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an amount, such as the height,
weight, or age of a person. Numerical data visualization is the easiest way to visualize data and is generally used to
help others digest large data sets and raw numbers in a way that makes them easier to act on. Numerical data is
categorized into two categories:
Continuous Data –
Data that can take any value within a range (Example: height measurements).
Discrete Data –
Data that takes only distinct, countable values (Example: the number of cars or children a household has).
The visualization techniques used to represent numerical data are charts and numerical values; examples include pie
charts, bar charts, averages, and scorecards.
Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents groups, consisting of categorical
variables used to describe characteristics such as a person's ranking or gender. Categorical data visualization is all
about depicting key themes, establishing connections, and lending context. Categorical data is classified into three
categories:
Binary Data –
Data that takes only two possible values (Example: agree or disagree).
Nominal Data –
Categories with no inherent order, classified by attribute (Example: male or female).
Ordinal Data –
Categories with a meaningful order (Example: stages of a timeline or process).
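As a quick illustration of these two categories, the sketch below plots made-up continuous numerical data as a histogram and made-up categorical counts as a bar chart using matplotlib; all values are hypothetical.

import matplotlib.pyplot as plt

heights_cm = [160, 165, 170, 172, 168, 175, 180, 158, 169, 171]   # continuous numerical data
responses = {"Agree": 12, "Disagree": 5, "Neutral": 3}             # categorical counts

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights_cm, bins=5)                                       # numerical: distribution of heights
ax1.set_title("Numerical: height distribution")
ax2.bar(list(responses.keys()), list(responses.values()))          # categorical: count per group
ax2.set_title("Categorical: survey responses")
plt.tight_layout()
plt.show()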
Dashboard KPI-
A KPI dashboard displays key performance indicators in interactive charts and graphs, allowing for quick, organized
review and analysis. Key performance indicators are quantifiable measures of performance over time for specific
strategic objectives. Modern KPI dashboards allow any user to easily explore the data behind the KPIs and uncover
actionable insights. In this way, a KPI dashboard transforms massive data sets from across an organization into data-
driven decisions that can improve your business.
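As a minimal sketch of how a single KPI chart on such a dashboard might be produced, the example below computes monthly revenue with pandas and renders it with matplotlib; the sales table and its figures are hypothetical illustration data.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales records (illustration only).
sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-01-28", "2024-02-10", "2024-02-20"]),
    "revenue": [1200.0, 800.0, 1500.0, 950.0],
})

# KPI: a quantifiable measure of performance over time (total revenue per month).
monthly_revenue = sales.groupby(sales["order_date"].dt.to_period("M"))["revenue"].sum()

monthly_revenue.plot(kind="bar", title="Monthly Revenue (KPI)")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()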
business intelligence/future of bi?
Ans-
Business Intelligence is central to today's changing and growing business world. It can be defined as a set of
concepts and methodologies for improving decision-making in business through the use of facts and fact-based systems.
The goal of Business Intelligence is to improve business decision-making and analysis. Business Intelligence is not
just a single concept but a group of concepts and methodologies, and in practice it combines analytics with human
judgment when making decisions.
Business intelligence refers to a collection of mathematical models and analysis methods that utilize data to produce
valuable information and insight for making important decisions.
Main Components of Business Intelligence System:
 Data Source
 Data Mart / Data Warehouse
 Data Exploration
 Data Mining
 Optimization
 Decisions
1.Data Source:
To begin, the first step is gathering and consolidating data from an array of primary and secondary sources. These
sources vary in origin and format, consisting mainly of operational system data but also potentially containing
unstructured documents like emails and data from external providers.
2.Data Mart / Data Warehouse:
Through the utilization of extraction and transformation tools, also known as extract, transform, load (ETL), data is
acquired from various sources and saved in databases designed specifically for business intelligence analysis. These
databases, commonly known as data warehouses and data marts, serve as a centralized location for the gathered
data.
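A minimal ETL sketch in Python is shown below: rows are extracted from a stand-in operational source, lightly transformed, and loaded into a data-mart table in SQLite. The table name, column names, and values are assumptions made purely for illustration.

import sqlite3
import pandas as pd

# Extract: raw order records (an in-memory stand-in for an operational source system).
raw_orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["100.50", "200.00", "75.25"],      # stored as text in the source
    "order_date": ["2024-01-05", "2024-01-09", "2024-02-01"],
})

# Transform: cast types and derive a month column for BI queries to group by.
orders = raw_orders.assign(
    amount=raw_orders["amount"].astype(float),
    order_month=pd.to_datetime(raw_orders["order_date"]).dt.to_period("M").astype(str),
)

# Load: write the cleansed rows into the (hypothetical) data-mart fact table.
with sqlite3.connect("datamart.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="replace", index=False)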
3.Data Exploration:
The third level of the pyramid offers essential resources for conducting a passive analysis in business intelligence.
These resources include query and reporting systems, along with statistical methods. These techniques are referred
to as passive because decision makers must first develop ideas or establish criteria for data extraction before
utilizing analysis tools to uncover answers and confirm their initial theories. For example, a sales manager might
observe a decrease in revenues in a particular geographic region for a specific demographic of customers. In
response, she could utilize extraction and visualization tools to confirm her hypothesis and then use statistical
testing to validate her findings based on the data.
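For instance, the manager's hypothesis could be checked with a simple statistical test; the sketch below uses SciPy with made-up revenue figures for the two periods being compared.

from scipy import stats

# Hypothetical order revenues for the region in two consecutive quarters.
revenue_q1 = [1200, 1150, 1300, 1250, 1100]
revenue_q2 = [1000, 980, 1050, 990, 1020]

# Two-sample t-test: is the drop in average revenue statistically significant?
t_stat, p_value = stats.ttest_ind(revenue_q1, revenue_q2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # a small p-value supports the hypothesis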
4.Data Mining:
The fourth level, known as active business intelligence methodologies, focuses on extracting valuable information
and knowledge from data. These methodologies include techniques such as mathematical models, pattern recognition,
machine learning, and data mining. Unlike the tools discussed at the previous level, active models do not rely on
decision makers to come up with hypotheses but instead aim to enhance their understanding.
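As a minimal illustration of an active methodology, the sketch below fits a scikit-learn decision tree to entirely hypothetical customer records: the model learns a churn pattern from historical data rather than testing a hypothesis supplied by the decision maker.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer features [age, yearly_spend] and churn labels (1 = churned).
X = [[25, 300], [40, 1200], [35, 800], [50, 1500], [23, 200], [45, 1400]]
y = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[30, 400]]))   # predicted churn risk for a new customer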
5.Optimization:
As you ascend the pyramid, you’ll encounter optimization models that empower you to choose the most optimal
course of action among various alternatives, which can often be quite extensive or even endless. These models have
also been effectively incorporated in marketing and logistics.
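In the logistics spirit mentioned above, the sketch below uses linear programming (SciPy's linprog) to choose shipment quantities that minimize cost under demand and capacity constraints; all costs, capacities, and the demand figure are hypothetical.

from scipy.optimize import linprog

costs = [4.0, 6.0]            # cost per unit shipped from warehouse A and warehouse B
A_ub = [[-1.0, -1.0]]         # -xA - xB <= -100  is equivalent to  xA + xB >= 100 (demand)
b_ub = [-100.0]
bounds = [(0, 80), (0, 80)]   # each warehouse can ship at most 80 units

result = linprog(c=costs, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.x, result.fun)   # optimal shipment quantities and the minimum total cost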
6.Decisions:
At last, the pinnacle of the pyramid reflects the ultimate decision made and put into action, serving as the logical end
to the decision-making process. Despite the availability and effective utilization of business intelligence
methodologies, the decision still lies in the hands of the decision makers, who can incorporate informal and
unstructured information to fine-tune and revise the suggestions and outcomes generated by mathematical models.
future of bi?
Ans-
The future scope of Business Intelligence (BI) is promising, driven by technological advancements, data proliferation,
and evolving business needs. Here are key areas shaping the future of BI:
Advanced Analytics and AI: BI is moving beyond descriptive analytics to predictive and prescriptive analytics
powered by AI and machine learning. This includes predictive modeling, anomaly detection, natural language
processing (NLP), and automated decision-making.
Big Data and Real-Time Analytics: With the exponential growth of data, BI is embracing big data technologies like
Hadoop, Spark, and NoSQL databases for processing and analyzing large volumes of structured and unstructured
data in real time.
Data Visualization and Storytelling: BI tools are enhancing data visualization capabilities with interactive dashboards,
geospatial analytics, and storytelling features to convey insights effectively and facilitate data-driven decision-
making.
Self-Service BI and Citizen Data Scientists: Empowering business users with self-service BI tools, drag-and-drop
interfaces, and easy-to-use analytics capabilities to explore data, create reports, and derive insights without heavy
reliance on IT or data scientists.
Embedded BI and Integration: Integrating BI capabilities into operational systems, applications, and workflows to
provide context-aware insights, personalized recommendations, and actionable intelligence directly within business
processes.
Mobile BI and Accessibility: Enabling mobile access to BI platforms, allowing users to access, analyze, and share data
anytime, anywhere, and on any device for improved collaboration and decision-making on-the-go.
Data Governance and Privacy: Addressing data governance challenges, ensuring data quality, compliance with
regulations (e.g., GDPR, CCPA), and implementing robust security measures to protect sensitive information in BI
systems.
Cloud-Based BI and SaaS Solutions: Adoption of cloud-based BI platforms and Software-as-a-Service (SaaS) BI
solutions for scalability, agility, cost-effectiveness, and seamless integration with other cloud services and data
sources.
Industry-Specific BI Solutions: Tailoring BI solutions to specific industries (e.g., healthcare, retail, finance) with
industry-specific analytics, KPIs, and domain expertise to address unique business challenges and opportunities.
Augmented Analytics and Data Democratization: Leveraging augmented analytics, data discovery tools, and natural
language querying to democratize data access, insights, and decision-making across organizations, fostering a data-
driven culture.
Overall, the future of BI lies in harnessing data as a strategic asset, leveraging advanced technologies, fostering data-
driven cultures, and empowering users with actionable insights to drive innovation, competitiveness, and business
success.