DW notes
1. What is an OLTP system? Advantages, disadvantages, and characteristics?
Ans-
An On-Line Transaction Processing (OLTP) system is a system that manages transaction-oriented applications.
These systems are designed to support online transactions and process queries quickly, often over the Internet.
For example, the POS (point of sale) system of any supermarket is an OLTP system.
Every industry today uses OLTP systems to record transactional data. The main concern of OLTP systems is to
enter, store, and retrieve data. They cover all day-to-day operations of an organization, such as purchasing,
manufacturing, payroll, and accounting. Such systems have large numbers of users who conduct short transactions.
They support simple database queries, so the response time to any user action is very fast.
The data acquired through an OLTP system is stored in a commercial RDBMS, which can be used by an OLAP system
for data analytics and other business intelligence operations.
Some other examples of OLTP systems include order entry, retail sales, and financial transaction systems.
Advantages of an OLTP System:
OLTP systems are user friendly and can be used by anyone with a basic understanding of the application.
They allow users to perform operations like reading, writing, and deleting data quickly.
They respond to user actions immediately because queries are processed very quickly.
These systems are the original source of the data.
They help administer and run fundamental business tasks.
They help widen an organization's customer base by simplifying individual processes.
Disadvantages of an OLTP System:
OLTP lacks proper methods of transferring products to buyers by themselves.
OLTP systems are prone to hackers and cybercriminals due to worldwide availability.
Server failure can lead to the loss of a large amount of data from the system.
The number of queries and updates to the system is limited.
In business-to-business (B2B) transactions, some transactions must go offline to complete some stages,
leading to buyers and suppliers losing some OLTP efficiency benefits.
OLTP Challenges:
Performance: OLTP systems require high-performance hardware and optimized software to maintain
responsiveness and support real-time transaction processing.
Data Security: Ensuring the confidentiality, integrity, and availability of sensitive data is crucial in OLTP
systems, necessitating robust security measures.
Scalability: As transaction volumes increase, OLTP systems must scale to accommodate growing workloads
without compromising performance or data integrity.
Concurrency Control: OLTP systems need to manage multiple users accessing and modifying data
simultaneously while maintaining data consistency.
System Maintenance: OLTP systems require regular maintenance, such as backups, updates, and tuning, to
ensure optimal performance and reliability.
OLTP Characteristics
1. Short response time
OLTP systems maintain very short response times to be effective for users. For example, responses from an ATM
operation need to be quick to make the process effective, worthwhile, and convenient.
2. Process small transactions
OLTP systems support numerous small transactions, each touching a small amount of data, executed simultaneously
over the network. The workload is typically a mixture of queries and Data Manipulation Language (DML) statements;
the queries normally include insertions, deletions, updates, and related actions. Response time measures the
effectiveness of OLTP transactions, and millisecond responses are becoming common.
3. Data maintenance operations
Data maintenance operations are data-intensive computational reporting and data update programs that run
alongside OLTP systems without interfering with user queries.
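As a concrete illustration of such short transactions, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a commercial RDBMS; the sales table and the record_sale helper are hypothetical.
```python
# Minimal sketch of a short OLTP-style transaction (hypothetical point-of-sale table),
# using Python's built-in sqlite3 module as a stand-in for a commercial RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    item TEXT NOT NULL,
    qty INTEGER NOT NULL,
    amount REAL NOT NULL,
    sold_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def record_sale(item, qty, amount):
    # A typical OLTP operation: a small insert committed immediately,
    # so the response time stays in the millisecond range.
    with conn:  # opens a transaction and commits (or rolls back) automatically
        conn.execute("INSERT INTO sales (item, qty, amount) VALUES (?, ?, ?)",
                     (item, qty, amount))

record_sale("soap", 2, 3.50)
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # -> 1
```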
2. Data warehouse development lifecycle?
Ans-
Data warehouse development life cycle
Data warehousing is the process of gathering and handling structured and unstructured data from multiple
sources into a centralized repository in order to drive actionable business decisions. With all of your data in one place, it
becomes easier to perform analysis and reporting and to discover meaningful insights at completely different combination
levels. A data warehouse environment includes an extraction, transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of
gathering data and delivering it to the business. The term data warehouse life-cycle indicates the steps a data
warehouse system goes through from when it is built onward. The following is the life-cycle of data warehousing:
Requirement Specification: It is the first step in the development of the Data Warehouse and is done by business
analysts. In this step, Business Analysts prepare business requirement specification documents. More than 50% of
requirements are collected from the client side and it takes 3-4 months to collect all the requirements. After the
requirements are gathered, the data modeler starts recognizing the dimensions, facts & combinations based on the
requirements. We can say that this is an overall blueprint of the data warehouse. But, this phase is more about
determining business needs and placing them in the data warehouse.
Data Modelling: This is the second step in the development of the Data Warehouse. Data Modelling is the process of
visualizing data distribution and designing databases by fulfilling the requirements to transform the data into a
format that can be stored in the data warehouse. For example, whenever we start building a house, we put all the
things in the correct position as specified in the blueprint. That’s what data modeling is for data warehouses. Data
modelling helps to organize data, creates connections between data sets, and it’s useful for establishing data
compliance and its security that line up with data warehousing goals. It is the most complex phase of data
warehouse development. And, there are many data modelling techniques that businesses use for warehouse design.
Data modelling typically takes place at the data mart level and branches out in a data warehouse. It’s the logic of
how the data is stored concerning other data. There are three data models for data warehouses:
Star Schema
Snowflake Schema
Galaxy Schema.
ETL Design and Development: This is the third step in the development of the Data Warehouse. An ETL (Extract,
Transform, Load) tool may extract data from various source systems and store it in a data lake. An ETL process can
then extract the data from the lake, transform it, and load it into a data warehouse for reporting. For optimal
speeds, good visualization, and the ability to build easy, replicable, and consistent data pipelines between all of the
existing architecture and the new data warehouse, we need ETL tools. This is where ETL tools like SAS Data
Management, IBM Information Server, Hive, etc. come into the picture. A good ETL process can help in
constructing a simple yet functional data warehouse that is valuable throughout every layer of the organization.
OLAP Cubes: This is the fourth step in the development of the Data Warehouse. An OLAP cube, also known as a
multidimensional cube or hypercube, is a data structure that allows fast analysis of data according to the multiple
dimensions that define a business problem. A data warehouse would extract information from multiple data sources
and formats like text files, excel sheets, multimedia files, etc. The extracted data is cleaned and transformed and is
loaded into an OLAP server (or OLAP cube) where information is pre-processed in advance for further analysis.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in
row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, with
data typically obtained from different and unrelated sources. Employing a spreadsheet isn't an optimal choice. The
cube can store and analyze multidimensional data in a logical and orderly manner. Data warehouses are now
offered as fully built products that are configurable and capable of staging multiple types of data. However, OLAP
cubes are becoming outdated because they cannot deliver real-time analysis and reporting, which businesses now
expect with high performance.
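To get a rough feel for this kind of pre-aggregation, here is a small pandas sketch (assuming pandas is installed); the item/quarter sales data below is purely illustrative.
```python
# Hedged sketch: a tiny "cube"-style aggregation with pandas, assuming a toy
# sales dataset with item, location and quarter dimensions (all names hypothetical).
import pandas as pd

sales = pd.DataFrame({
    "item":     ["soap", "soap", "oil", "oil"],
    "location": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "amount":   [100, 150, 200, 250],
})

# Pre-aggregate the fact (amount) across two dimensions, the way an OLAP cube
# pre-processes data for fast slicing and dicing.
cube = pd.pivot_table(sales, values="amount", index="item",
                      columns="quarter", aggfunc="sum", fill_value=0)
print(cube)
```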
UI Development: This is the fifth step in the development of the Data Warehouse. So far, the processes discussed
have taken place at the backend. There is a need for a user interface for how the user and a computer system
interact, in particular the use of input devices and software, to immediately access the data warehouse for analysis
and generating reports. The main aim of a UI is to enable a user to effectively manage a device or machine they’re
interacting with. There are plenty of tools in the market that help with UI development. BI tools like Tableau, or
Power BI for those using BigQuery, are great choices.
Maintenance: This is the sixth step in the development of the Data Warehouse. In this phase, we can update or
make changes to the schema and the data warehouse's application domain or requirements. Data warehouse
maintenance systems must provide means to keep track of schema modifications as well as instance (data)
modifications. At the schema level, we can perform insertion operations and change dimensions and categories.
Examples of such changes are adding or deleting user-defined attributes.
Test and Deployment: This is the final step in the Data Warehouse development cycle. Businesses and
organizations test data warehouses to ensure that the required business problems have been implemented
successfully. Warehouse testing involves the scrutiny of enormous volumes of data. Data that has to be compared
comes from heterogeneous data sources like relational databases, flat files, operational data, etc. The overall data
warehouse project testing phases include data completeness, data transformation, data loading by means of ETL
tools, data integrity, etc. After testing, the data warehouse is deployed so that users can immediately access the
data and perform analysis. Basically, in this phase, the data warehouse is turned on and users can take advantage
of it. At the time of data warehouse deployment, most of its functions are implemented. The data warehouse can
be deployed in the organization's own data center or on the cloud.
3. Data warehouse features (characteristics) / advantages and disadvantages?
Ans-
Characteristics of a Data Warehouse
Integrated Data
One of the key characteristics of a data warehouse is that it contains integrated data. This means that the data is
collected from various sources, such as transactional systems, and then cleaned, transformed, and consolidated into
a single, unified view. This allows for easy access and analysis of the data, as well as the ability to track data over
time.
Subject-Oriented
A data warehouse is also subject-oriented, which means that the data is organized around specific subjects, such as
customers, products, or sales. This allows for easy access to the data relevant to a specific subject, as well as the
ability to track the data over time.
Non-Volatile
Another characteristic of a data warehouse is that it is non-volatile. This means that the data in the warehouse is
never updated or deleted, only added to. This is important because it allows for the preservation of historical data,
making it possible to track trends and patterns over time.
Time-Variant
A data warehouse is also time-variant, which means that the data is stored with a time dimension. This allows for
easy access to data for specific time periods, such as last quarter or last year. This makes it possible to track trends
and patterns over time.
Advantages of a Data Warehouse:
Data warehouses facilitate end users' access to a variety of data.
Assist in the operation of applications for decision support systems such as trend reports, for instance,
obtaining the products that have sold the most in a specific area over the past two years; exception reports,
reports that compare the actual outcomes to the predetermined goals.
Using numerous data warehouses can increase the operational value of business systems, especially
customer relationship management.
Enables decisions of higher quality.
It is especially helpful for the medium and long term.
Installing these systems is quite straightforward if the data sources and goals are clear.
Disadvantages of a Data Warehouse:
A data warehouse may incur substantial expenditures throughout its life. The data warehouse is
typically not static, and maintenance costs are considerable.
Data warehouses could soon become outdated.
They occasionally need to provide complete information before a request for information, which also costs
the organization money.
Between data warehouses and operational systems, there is frequently a fine line. It is necessary to
determine which of these features can be used and which ones should be implemented in the data
warehouse since it would be expensive to carry out needless activities or to stop carrying out those that
would be required.
It may be less useful for making decisions in real time due to the prolonged processing time it can
require. In any event, the trend in modern products (along with technological advancements) addresses this
issue by turning the drawback into a benefit.
Regarding the various objectives a company seeks to achieve, challenges may arise during implementation.
Steps involved in planning and development of a DW project?
Ans-
Major steps in DW Projects
Business and Technical Justification – In this phase, the project’s sponsors detail the business justification,
opportunities and benefits as well as the technical justifications for the DW project. Staffing and other necessary
resources are identified.
Business Justification:
– Review business initiatives and processes
– Enlist BI sponsors and stakeholders (e.g., potential BI users)
– Document business benefits and outcomes in terms of adding business value
– Project scope and Budgeting
Technical Justification:
– Product evaluations for proof-of-concept and technology roadmaps
– Assessing necessary technical skills/expertise
– Assessing data quality
Gathering Business Requirements (KPI’s) – In this phase business users are interviewed to determine what
measurements / metrics they require. These are called Key Performance Indicators (KPI’s) and are generally
calculated by summing and combining OLTP transactions data. This stage clearly requires deep involvement of the
business managers and others in higher level decision making positions.
Examples of KPIs for Telecommunications Industry
System Design / Modeling – In this phase, the overall system is designed using conceptual modeling at three levels:
System Architecture Design – Overall technology architecture (hardware and DBMS software integration) are
designed. This step can be done in parallel with data and application design.
Data Modeling – Data models (Dimensions and facts) are created and mappings / pipelines from existing operational
systems are designed.
BI Application Design – Applications are designed at the conceptual level. For example, reports, user interfaces, etc.
can be mocked up and reviewed by users.
This stage is carried out by systems analysts in conjunction with the business stakeholders.
System Development – In this phase, the designs are implemented in hardware and software. DBMS vendors are
selected, data warehouse schemas are created, ETL code is written/configured, and BI applications are coded. This
stage is carried out almost exclusively by technologists (programmers, DBAs, etc.) although business users may be
called upon for testing.
Phased deployment – In this phase, users are brought on-line (on boarded) to the data warehouse.
Maintenance and evolution – Data warehouses undergo continuous evolution as new KPI’s and data sources are
defined and integrated.
Tools for data warehouse / components?
Ans-
1.Snowflake
Snowflake is a cloud-based data warehousing platform that offers a fully managed and scalable solution for data
storage, processing, and analysis. It is designed to address the challenges of traditional on-premises data
warehousing by providing a modern and cloud-native architecture. Here are the key features of Snowflake:
Snowflake is built from the ground up for the cloud. It runs entirely in cloud environments like AWS, Azure, and
Google Cloud Platform (GCP).
The platform uses a multi-cluster, shared data architecture, which means that multiple users and workloads can
concurrently access and analyze the same data without interference.
2. SAP Data Warehouse Cloud
SAP Data Warehouse Cloud is a cloud-based data warehousing solution developed by SAP. It is designed to provide
organizations with a modern, scalable, and integrated platform for data storage, data modeling, data integration,
and analytics. Here are key features and aspects of SAP Data Warehouse Cloud:
The platform allows you to integrate data from a wide range of sources, including on-premises databases, cloud-
based applications, spreadsheets, and more.
Data Warehouse Cloud features a semantic layer that abstracts complex data structures and provides a business-
friendly view of data.
3. Oracle Autonomous Data Warehouse
Oracle Autonomous Data Warehouse (ADW) is a cloud-based data warehousing service offered by Oracle
Corporation. It is designed to simplify data management and analytics tasks by automating many of the traditionally
complex and time-consuming processes associated with data warehousing. Here are key aspects and features of
Oracle Autonomous Data Warehouse:
It supports data integration and ETL (Extract, Transform, Load) processes with built-in features for data loading and
transformation.
4. Panoply
Panoply is a managed ELT and a cloud data warehouse platform that allows users to set up a data warehouse
architecture. The cloud data warehouse eliminates the need for you to set up and maintain your own on-premises
data warehouse, saving time and resources.
Here are the key features of Panoply:
Various built-in connectors to ingest data from multiple sources
Built-in scheduler for automation
5. Teradata Vantage
Teradata Vantage is a data warehousing and analytics platform designed to handle large volumes of data and
support complex analytical workloads. The platform uses SQL as its primary query language, which means it is
mostly meant for users with SQL skills. Here are some key aspects of Teradata Vantage for data warehousing:
It can connect to various sources, including data warehouses, data lakes, on-premises systems, and cloud platforms.
Data in a warehouse is usually in multidimensional form. Dimensional modeling prefers keeping the tables
denormalized. The concept of dimensional modelling was developed by Ralph Kimball and consists of "fact" and
"dimension" tables. The primary purpose of dimensional modeling is to optimize the database for faster retrieval of
the data and to enable business intelligence (BI) reporting, query, and analysis.
Dimensional modeling is a form of modeling of data that is more flexible from the perspective of the user. These
dimensional and relational models have their unique way of data storage that has specific advantages. Dimensional
models are built around business processes. They need to ensure that dimension tables use a surrogate key.
Dimension tables store the history of the dimensional information.
OLTP vs. Data Warehouse:
An OLTP system is transaction-oriented; a data warehouse is subject-oriented.
An OLTP system handles high transaction volumes using few records at a time; a data warehouse handles low transaction volumes using many records at a time.
Attribute Hierarchies
By default, attribute members are organized into two level hierarchies, consisting of a leaf level and an All level. The
All level contains the aggregated value of the attribute's members across the measures in each measure group to
which the dimension of which the attribute is related is a member. However, if the IsAggregatable property is set to
False, the All level is not created. For more information, see Dimension Attribute Properties Reference.
Attributes can be, and typically are, arranged into user-defined hierarchies that provide the drill-down paths by
which users can browse the data in the measure groups to which the attribute is related. In client applications,
attributes can be used to provide grouping and constraint information. When attributes are arranged into user-
defined hierarchies, you define relationships between hierarchy levels when levels are related in a many-to-one or a
one-to-one relationship (called a natural relationship). For example, in a Calendar Time hierarchy, a Day level should
be related to the Month level, the Month level related to the Quarter level, and so on. Defining relationships
between levels in a user-defined hierarchy enables Analysis Services to define more useful aggregations that increase
query performance and can also save memory during processing.
Difference between star schema and snowflake schema?
Ans-
Star Schema vs. Snowflake Schema:
In a star schema, the fact tables and dimension tables are contained; in a snowflake schema, the fact tables, dimension tables, as well as sub-dimension tables are contained.
A star schema takes less time for the execution of queries; a snowflake schema takes more time than a star schema for the execution of queries.
The query complexity of a star schema is low; the query complexity of a snowflake schema is higher than that of a star schema.
A star schema has a smaller number of foreign keys; a snowflake schema has a larger number of foreign keys.
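The difference is easy to see in table definitions. The following sketch uses Python's sqlite3 module with hypothetical table names to show the same product dimension modelled star-style (denormalized) and snowflake-style (split into a sub-dimension).
```python
# Illustrative sketch only: the same dimension modelled star-style (denormalized)
# versus snowflake-style (normalized into a sub-dimension). Table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one denormalized dimension table referenced by the fact table.
conn.execute("""CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT)""")

# Snowflake schema: the category is split out into its own sub-dimension,
# adding one more foreign key and one more join at query time.
conn.execute("""CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT)""")
conn.execute("""CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id))""")

conn.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)""")
```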
Metadata standards have been evolving over the years and they vary in levels of details and complexity. The general
metadata standards like the Dublin Core Metadata Element Set apply to broader communities and make your data
more interoperable with other standards. The subject-specific metadata standards, on the other hand, help search
data more easily. For example, the ISO 19115 standard works well for the geospatial community. You can evaluate
which standards align the best with your use cases and your communities.
Classification of metadata?
Ans-Metadata comes in many shapes and flavors, carrying additional information about where a resource was
produced, by whom, when was the last time it was accessed, what is it about and many more details around it.
Similar to a library card describing a book, metadata describes objects and adds more granularity to the way they
are represented. Three main types of metadata exist: descriptive, structural, and administrative.
Descriptive metadata adds information about who created a resource, and most importantly – what the
resource is about, what it includes. This is best applied using semantic annotation.
Structural metadata includes additional data about the way data elements are organized – their
relationships and the structure they exist in.
Administrative metadata provides information about the origin of resources, their type and access rights.
Tool selection for a data warehouse?
Ans-
While the specifics are, well, specific to every company, there are six key criteria to keep in mind when choosing a
data warehouse:
Cloud vs. on-prem
Tool ecosystem
Implementation cost and time
Ongoing costs and maintenance
Ease of scalability
Support and community
In many cases, these criteria really involve trade-offs; for example, a data warehouse that's quick to implement may
be a pain to scale. But you'll be better prepared to make the right decision if you understand what you're getting into
before you buy.
Cloud vs. on-premise storage
Even as recently as a few years ago, you might have struggled with whether to go with a cloud-based or on premises
approach. Today, the battle is over.
There are a few circumstances where it still makes sense to consider an on-prem approach. For example, if most of
your critical databases are on-premises and they are old enough that they don't work well with cloud-based data
warehouses, an on-premises approach might be the way to go. Or if your company is subject to byzantine regulatory
requirements that make on-prem your only choice.
Data tool ecosystem
If you work at a company that’s already heavily invested in a data tool ecosystem and doesn't have a lot of data
sources residing outside of it, you're probably going to pick that ecosystem's tool.
Data warehouse implementation
They say the devil is in the details, and that’s doubly true when it comes to data warehouse implementation. Here
are some of the finer points you should consider:
Cost: When deciding between data warehouse tools, money is often a major driver.
Time: Cost matters, but time often matters more, especially for startups that are trying to move as quickly as they
can.
Ongoing storage costs and maintenance
In addition to the costs of getting started, you'll also need to take into account ongoing costs—which sometimes can
be substantially higher than the resources you allocate at the beginning.
There are several ongoing costs you'll need to consider:
Storage and compute: As your data and usage grow, so will your monthly storage and compute bill. Having a good sense of how your data volume and query workload will grow helps you estimate these ongoing costs.
Ease of scalability
If you’re part of a fast-growing business, one of the things you want to find out is, what's involved in scaling up your
data warehouse? To figure that out, first you need to get a rough sense of what your current business needs are,
including how much data you currently have, how quickly your needs are likely to grow, and how much confidence
you have in your assessment of your needs for scale.
Support and community
When you run into trouble with your data warehouse, how likely are you to get the help you need when you need it?
While no one chooses a data warehouse tool based solely on the support they can get, if two data warehouse
systems are pretty equal, it could be the deciding factor.
Granularity?
Ans-
Granularity refers to the level of detail or depth present in a dataset or database. In data warehousing, granularity
determines the extent to which data is broken down. Higher granularity means data is dissected into finer, more
detailed parts, while lower granularity results in broader, less detailed data aggregation. The choice of granularity
level depends on the business's particular needs and data analysis goals.
Security Aspects
While granularity itself doesn't offer any inherent security measures, granularity decisions can impact data
management and security. For example, high granularity may necessitate stricter access controls to safeguard
detailed, potentially sensitive data.
Performance
The performance of a data warehousing system is directly impacted by its granularity. Higher granularity data might
require more computational resources for processing, potentially slowing down query response times. Conversely,
less detailed, lower granularity data can usually be processed faster, resulting in quicker response times.
Consequently, striking a balance between detail level and system performance is a key consideration when
determining granularity.
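A small pandas sketch (hypothetical daily sales data) shows how the same facts can be kept at higher (daily) or lower (monthly) granularity.
```python
# Sketch of choosing granularity with pandas (hypothetical sales data): the same
# facts kept at daily (high) granularity versus rolled up to monthly (lower) granularity.
import pandas as pd

daily = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-01", "2024-01-02", "2024-02-01"]),
    "amount": [100.0, 150.0, 200.0],
})

# Lower granularity: fewer, broader rows that are cheaper to store and query,
# but the day-level detail is no longer available for analysis.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["amount"].sum()
print(monthly)
```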
OLAP FASMI / characteristics / OLAP vs data warehouse?
Ans-The FASMI Test
The FASMI test summarizes the characteristics of an OLAP application in a specific way, without dictating how it
should be implemented.
Fast − The system is targeted to deliver most responses to users within about five seconds, with the simplest
analyses taking no more than one second and very few taking more than 20 seconds.
Independent research in the Netherlands has shown that end users assume a process has failed if results
are not received within 30 seconds, and they are apt to hit 'Ctrl+Alt+Delete' unless the system warns them that
the report will take longer.
Analysis − The system can cope with any business logic and statistical analysis that is relevant for the application
and the user, and keep it easy enough for the target user. Although some pre-programming may be needed, it is
not acceptable if all application definitions have to be completed using a professional 4GL. It is necessary to allow
the user to define new ad hoc calculations as part of the analysis and to report on the data in any desired way,
without having to program, so products (like Oracle Discoverer) that do not allow adequate end-user oriented
calculation flexibility are excluded.
Shared − The system implements all the security requirements for confidentiality (possibly down to
cell level) and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications
require users to write data back, but for the increasing number that do, the system must be able to handle
multiple updates in a timely, secure manner. This is a major area of weakness in some OLAP products, which
tend to assume that all OLAP applications will be read-only, with simple security controls.
Multidimensional − The system should provide a multidimensional conceptual view of the data, including full
support for hierarchies and multiple hierarchies. No specific minimum number of dimensions is mandated, as that
is too software dependent and most products seem to have enough for their target industries.
Information − Information is all of the data and derived data needed, wherever it is and however much is relevant
for the application. We measure the capacity of various products in terms of how much input data they can handle,
not how many gigabytes they take to store it.
For example, consider a factory's sales for Bangalore shown along the time dimension (organized into quarters) and
the item dimension (sorted by the kind of item sold), with the facts represented in rupees (in thousands). Here the
sales data is represented as a two-dimensional table. If we wish to view the sales data in three dimensions, we
consider the data according to item, time, and location (such as Kolkata, Delhi, Mumbai).
Data preprocessing / techniques in detail with diagram?
Ans-
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values,
outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and
transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration
can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as
record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used
in data transformation include normalization, standardization, and discretization. Normalization is used to scale the
data to a common range, while standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data
reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection
involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the
data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals. Discretization is
often used in data mining and machine learning algorithms that require categorical data. Discretization can be
achieved through techniques such as equal width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1.
Normalization is often used to handle data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
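A minimal NumPy sketch of the three normalization techniques named above, applied to a toy list of values:
```python
# Sketch of min-max, z-score, and decimal-scaling normalization on toy values.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())   # scaled into [0, 1]
z_score = (x - x.mean()) / x.std()              # zero mean, unit variance
# Decimal scaling: divide by 10^j so the largest absolute value falls below 1.
j = np.floor(np.log10(np.abs(x).max())) + 1
decimal = x / 10 ** j

print(min_max, z_score, decimal, sep="\n")
```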
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results. The
specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more accurate.
Preprocessing in Data Mining: Data preprocessing is a data mining technique which is used to transform the raw
data into a useful and efficient format.
Techniques-
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling
of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or the
most probable value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data
collection, data entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and
then various methods are applied to each segment. Each segment is handled separately: one can replace all data in
a segment by its mean, or boundary values can be used to complete the task (see the sketch after this list).
Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one
independent variable) or multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
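The binning method referenced above can be sketched as follows with NumPy (toy values): sorted data is split into equal-size segments and each value is replaced by its segment mean.
```python
# Sketch of smoothing by bin means: sorted data split into equal-size bins,
# then each value replaced by its bin mean to reduce noise.
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(data, 3)                  # three equal-size segments
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)                                 # each segment replaced by its mean
```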
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This involves
following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For Example-The attribute “city” can be
converted to “country”.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset while
preserving the important information. This is done to improve the efficiency of data analysis and to avoid overfitting
of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the
important information. Feature extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size
of the dataset while preserving the important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the
size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques
such as wavelet compression, JPEG compression, and gzip compression.
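As one hedged illustration of feature extraction for data reduction, the following sketch uses scikit-learn's PCA on random, purely illustrative data (assuming scikit-learn is installed).
```python
# Sketch of dimensionality reduction via PCA: 10 original features projected
# onto 3 principal components while retaining most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # 100 rows, 10 original features

pca = PCA(n_components=3)            # keep 3 components
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # (100, 10) -> (100, 3)
```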
Data cleaning / tasks / methods?
Ans-
Data cleaning is an essential step in the data mining process. It is crucial to the construction of a model. The step that
is required, but frequently overlooked by everyone, is data cleaning. The major problem with quality information
management is data quality. Problems with data quality can happen at any place in an information system. Data
cleansing offers a solution to these issues.
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted, duplicated, or
insufficient data from a dataset. Even if results and algorithms appear to be correct, they are unreliable if the data is
inaccurate. There are numerous ways for data to be duplicated or incorrectly labeled when merging multiple data
sources.
In general, data cleaning lowers errors and raises the caliber of the data. Although it might be a time-consuming and
laborious operation, fixing data mistakes and removing incorrect information must be done. A crucial method for
cleaning up data is data mining. A method for finding useful information in data is data mining. Data quality mining is
a novel methodology that uses data mining methods to find and fix data quality issues in sizable databases. Data
mining mechanically pulls intrinsic and hidden information from large data sets. Data cleansing can be accomplished
using a variety of data mining approaches.
To arrive at a precise final analysis, it is crucial to comprehend and improve the quality of your data. To identify key
patterns, the data must be prepared; this is understood as exploratory data mining. Before doing business analysis and
gaining insights, data cleaning in data mining enables the user to identify erroneous or missing data.
Because data cleaning is so time-consuming, it often requires IT personnel to assist in the initial step of reviewing
your data. But if your final analysis is inaccurate or you get an erroneous result, it is possibly due to poor data quality.
Tasks-
Steps for Cleaning Data
You can follow these fundamental stages to clean your data even if the techniques employed may vary depending on
the sorts of data your firm stores:
1. Remove duplicate or irrelevant observations
Remove duplicate or pointless observations as well as undesirable observations from your dataset. The majority of
duplicate observations will occur during data gathering. Duplicate data can be produced when you merge data sets
from several sources, scrape data, or get data from clients or other departments. One of the most important factors
to take into account in this procedure is de-duplication. Observations are deemed irrelevant when they do not
pertain to the particular issue you are attempting to analyze.
You might eliminate those useless observations, for instance, if you wish to analyze data on millennial clients but
your dataset also includes observations from earlier generations. This can improve the analysis's efficiency, reduce
deviance from your main objective, and produce a dataset that is easier to maintain and use.
2. Fix structural errors
Structural errors are odd naming conventions, typos, or incorrect capitalization found when you measure or transfer
data. These inconsistencies may result in mislabelled categories or classes. For instance, "N/A" and
"Not Applicable" might both appear on a given sheet, but they ought to be analyzed under the same heading.
3. Filter unwanted outliers
There will frequently be isolated findings that, at first glance, do not seem to fit the data you are analyzing.
Removing an outlier if you have a good reason to, such as incorrect data entry, will improve the performance of the
data you are working with.
However, occasionally the emergence of an outlier will support a theory you are investigating. And just because
there is an outlier, that doesn't necessarily indicate it is inaccurate. To determine the reliability of the number, this
step is necessary. If an outlier turns out to be incorrect or unimportant for the analysis, you might want to remove it.
4. Handle missing data
Because many algorithms won't tolerate missing values, you can't overlook missing data. There are a few options for
handling missing data. While neither is ideal, both can be taken into account, for example:
Although you can remove observations with missing values, doing so will result in the loss of information, so proceed
with caution.
Again, there is a chance to undermine the integrity of the data since you can be working from assumptions rather
than actual observations when you input missing numbers based on other observations.
To browse null values efficiently, you may need to change the way the data is used.
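A minimal pandas sketch of steps 1 and 4 above (toy data, hypothetical column names): duplicates are dropped and a missing value is imputed with the attribute mean.
```python
# Sketch of two cleaning steps with pandas: remove duplicate observations and
# fill a missing value with the attribute mean (imputation).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "b", "c"],
    "age":      [25, 25, np.nan, 40],
})

df = df.drop_duplicates()                        # remove duplicate observations
df["age"] = df["age"].fillna(df["age"].mean())   # impute missing value with the mean
print(df)
```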
5. Validate and QA
As part of fundamental validation, you ought to be able to respond to the following queries once the data cleansing
procedure is complete:
Are the data coherent?
Does the data abide by the regulations that apply to its particular field?
Does it support or refute your working theory? Does it offer any new information?
To support your next theory, can you identify any trends in the data?
If not, is there a problem with the data's quality?
Inaccurate or noisy data can lead to false conclusions that inform poor company strategy and decision-making.
False conclusions can also result in an embarrassing situation in a reporting meeting when you find out your data
couldn't withstand further scrutiny. Establishing a culture of quality data in your organization is crucial before you
get there, and the tools you might employ to develop this plan should be documented.
Data transformation / tasks / types / strategies?
Ans-
Tasks-
1. Discovery
The first step is to identify and understand data in its original source format with the help of data profiling tools.
Finding all the sources and data types that need to be transformed. This step helps in understanding how the data
needs to be transformed to fit into the desired format.
2. Mapping
The transformation is planned during the data mapping phase. This includes determining the current structure and
the consequent transformation that is required, then mapping the data to understand, at a basic level, the way
individual fields will be modified, joined, or aggregated.
3. Code generation
The code, which is required to run the transformation process, is created in this step using a data transformation
platform or tool.
4. Execution
The data is finally converted into the selected format with the help of the code. The data is extracted from the
source(s), which can vary from structured to streaming, telemetry to log files. Next, transformations are carried out
on data, such as aggregation, format conversion or merging, as planned in the mapping stage. The transformed data
is then sent to the destination system which could be a dataset or a data warehouse.
5. Review
The transformed data is evaluated to ensure the conversion has had the desired results in terms of the format of the
data.
It must also be noted that not all data will need transformation, at times it can be used as is.
Types-
There are several different ways to transform data, such as:
Scripting: Data transformation through scripting involves Python or SQL to write the code to extract and transform
data. Python and SQL are scripting languages that allow you to automate certain tasks in a program. They also allow
you to extract information from data sets. Scripting languages require less code than traditional programming
languages. Therefore, it is less intensive.
On-Premises ETL Tools: ETL tools take the required work to script the data transformation by automating the
process. On-premises ETL tools are hosted on company servers. While these tools can help save you time, using
them often requires extensive expertise and significant infrastructure costs.
Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the cloud. These tools are often
the easiest for non-technical users to utilize. They allow you to collect data from any cloud source and load it into
your data warehouse. With cloud-based ETL tools, you can decide how often you want to pull data from your source,
and you can monitor your usage.
Strategies-
1. Smoothing: It is a process used to remove noise from the dataset using some algorithms. It allows important
features present in the dataset to be highlighted and helps in predicting patterns. When collecting data, it
can be manipulated to eliminate or reduce variance or any other form of noise. The concept behind data
smoothing is that it can identify simple changes to help predict different trends and patterns. This serves
as a help to analysts or traders who need to look at a lot of data, which can often be difficult to digest, for finding
patterns they would not see otherwise.
2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a summary format.
The data may be obtained from multiple data sources to integrate these data sources into a data analysis
description. This is a crucial step since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used. Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning financing or business
strategy of the product, pricing, operations, and marketing strategies. For example, sales data may be aggregated to
compute monthly and annual total amounts.
3. Discretization: It is a process of transforming continuous data into a set of small intervals. Most data mining
activities in the real world involve continuous attributes, yet many of the existing data mining frameworks are
unable to handle these attributes. Also, even if a data mining task can manage a continuous attribute, it can
significantly improve its efficiency by replacing a continuous attribute with its discrete values. For example, intervals
such as (1-10, 11-20), or (age: young, middle age, senior).
4. Attribute Construction: Where new attributes are created & applied to assist the mining process from the given
set of attributes. This simplifies the original data & makes the mining more efficient.
5. Generalization: It converts low-level data attributes to high-level data attributes using concept hierarchy. For
Example Age initially in Numerical form (22, 25) is converted into categorical value (young, old). For example,
Categorical attributes, such as house addresses, may be generalized to higher-level definitions, such as town or
country.
6. Normalization: Data normalization involves converting all data variables into a given range.
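A short pandas sketch of the aggregation and generalization strategies above, on hypothetical data: sales summed to monthly totals and ages mapped to categorical levels.
```python
# Sketch of aggregation (monthly sales totals) and generalization (age -> age group).
import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount": [120.0, 80.0, 200.0],
    "age":    [22, 45, 67],
})

# Aggregation: roll individual sales up to monthly totals.
monthly_totals = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# Generalization: convert a low-level numeric attribute into higher-level categories.
sales["age_group"] = pd.cut(sales["age"], bins=[0, 30, 60, 120],
                            labels=["young", "middle age", "senior"])

print(monthly_totals)
print(sales[["age", "age_group"]])
```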
Data reduction and integration / strategies?
Ans-
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most
important information. This can be beneficial in situations where the dataset is too large to be processed efficiently,
or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
Data Sampling: This technique involves selecting a subset of the data to work with, rather than using the entire
dataset. This can be useful for reducing the size of a dataset while still preserving the overall trends and patterns in
the data.
Dimensionality Reduction: This technique involves reducing the number of features in the dataset, either by
removing features that are not relevant or by combining multiple features into a single feature.
Data Compression: This technique involves using techniques such as lossy or lossless compression to reduce the size
of a dataset.
Data Discretization: This technique involves converting continuous data into discrete data by partitioning the range
of possible values into intervals or bins.
Feature Selection: This technique involves selecting a subset of features from the dataset that are most relevant to
the task at hand.
It’s important to note that data reduction can have a trade-off between the accuracy and the size of the data. The
more data is reduced, the less accurate the model will be and the less generalizable it will be.
Concept Hierarchy-
In data mining, a concept hierarchy refers to the organization of data into a tree-like structure, where
each level of the hierarchy represents a concept that is more general than the level below it. This hierarchical
organization of data allows for more efficient and effective data analysis, as well as the ability to drill down to more
specific levels of detail when needed. The concept of hierarchy is used to organize and classify data in a way that
makes it more understandable and easier to analyze. The main idea behind the concept of hierarchy is that the same
data can have different levels of granularity or levels of detail and that by organizing the data in a hierarchical
fashion, it is easier to understand and perform analysis.
Types of Concept Hierarchies
Schema Hierarchy: Schema Hierarchy is a type of concept hierarchy that is used to organize the schema of a
database in a logical and meaningful way, grouping similar objects together. A schema hierarchy can be used
to organize different types of data, such as tables, attributes, and relationships, in a logical and meaningful
way. This can be useful in data warehousing, where data from multiple sources needs to be integrated into a
single database.
Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of concept hierarchy that is based on set theory,
where each set in the hierarchy is defined in terms of its membership in other sets. Set-grouping hierarchy
can be used for data cleaning, data pre-processing and data integration. This type of hierarchy can be used
to identify and remove outliers, noise, or inconsistencies from the data and to integrate data from multiple
sources.
Operation-Derived Hierarchy: An Operation-Derived Hierarchy is a type of concept hierarchy that is used to
organize data by applying a series of operations or transformations to the data. The operations are applied
in a top-down fashion, with each level of the hierarchy representing a more general or abstract view of the
data than the level below it. This type of hierarchy is typically used in data mining tasks such as clustering
and dimensionality reduction. The operations applied can be mathematical or statistical operations such as
aggregation, normalization
Rule-based Hierarchy: Rule-based Hierarchy is a type of concept hierarchy that is used to organize data by
applying a set of rules or conditions to the data. This type of hierarchy is useful in data mining tasks such as
classification, decision-making, and data exploration. It allows to the assignment of a class label or decision
to each data point based on its characteristics and identifies patterns and relationships between different
attributes of the data.
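As a simple illustration of climbing a concept hierarchy, the sketch below maps a hypothetical city level up to a country level with pandas and aggregates at the coarser level.
```python
# Sketch of a simple concept hierarchy (city -> country) used to roll data up
# one level; the mapping and the data are hypothetical.
import pandas as pd

city_to_country = {"Kolkata": "India", "Delhi": "India", "New York": "USA"}

df = pd.DataFrame({"city":  ["Kolkata", "Delhi", "New York"],
                   "sales": [100, 150, 200]})
df["country"] = df["city"].map(city_to_country)   # climb one level in the hierarchy
print(df.groupby("country")["sales"].sum())       # analysis at the coarser level
```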
ETL steps in detail?
Ans-
The ETL process is an iterative process that is repeated as new data is added to the warehouse. The process is
important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps
to ensure that the data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as Informatica, Talend, DataStage,
and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL
tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it
into the Data Warehouse system.
Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems is extracted, which can
be in various formats like relational databases, NoSQL, XML, and flat files, into the staging area. It is important to
extract the data from various source systems and store it into the staging area first and not directly into the data
warehouse because the extracted data is in various formats and can be corrupted also. Hence loading it directly into
the data warehouse may damage it and rollback will be much more difficult. Therefore, this is one of the most
important steps of ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the
extracted data to convert it into a single standard format. It may involve following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and America into USA,
etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data
warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is done at
longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from
system to system.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and
during that period some new data can be extracted. And while the transformed data is being loaded into the data
warehouse, the already extracted data can be transformed, so the three stages run in a pipeline.
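The three steps can be sketched end to end in plain Python (standard library only); the file contents and the country cleanup rule are hypothetical.
```python
# End-to-end sketch of Extract, Transform, Load with the standard library only.
import csv, io, sqlite3

# Extract: read raw rows from a source (a CSV string stands in for a flat file).
raw = "customer,country,amount\nalice,U.S.A,100\nbob,United States,200\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cleaning (map country spellings to one standard value) and type conversion.
country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
for r in rows:
    r["country"] = country_map.get(r["country"], r["country"])
    r["amount"] = float(r["amount"])

# Load: write the transformed rows into the target warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (customer TEXT, country TEXT, amount REAL)")
dw.executemany("INSERT INTO fact_sales VALUES (:customer, :country, :amount)", rows)
print(dw.execute("SELECT country, SUM(amount) FROM fact_sales GROUP BY country").fetchall())
```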
Data extraction / methods?
Ans-
Data extraction is the process of analyzing and crawling through data sources (such as databases) to recover vital
information in a specific pattern. Data is processed further, including metadata and other data integration; this is
another step in the data workflow.
Unstructured data sources and various data formats account for most data extraction. Tables, indexes, and analytics
can all be used to store unstructured data.
Data in a warehouse can come from various places, and a data warehouse must use three different approaches to
use it. Extraction, Transformation, and Loading are the terms for these procedures (ETL).
Data extraction entails retrieving information from disorganized data sources. The data extracts are subsequently
imported into the relational Database's staging area. The source system is queried for data via application
programming interfaces, and extraction logic is applied. Due to this process, the data is now ready to go through the
transformation phase of the ETL process.
Data extraction techniques
From a logical and physical standpoint, the projected amount of data to be extracted and the stage in the ETL
process (initial load or data maintenance) may also influence how to extract. Essentially, you must decide how to
conceptually and physically extract data.
Methods of Logical Extraction
Logic extraction can be divided into two types −
Full Extraction
The data is completely pulled from the source system. There is no need to keep track of changes to the data source
since the last successful extraction, because this extraction reflects all of the information currently available in the
source system.
The source data will be delivered in its current state, with no further logical information (such as timestamps)
required on the source site. An export file of a specific table or a remote SQL query scanning the entire source table
is two examples of full extractions.
Incremental Extraction
Only data that has changed since a well-defined event in the past is extracted. This event may be the last extraction run or a more complex business event, such as the last booking day of a fiscal period. To capture this delta, there must be a way to identify all the information that has changed since that point in time.
This information can be provided by the source data itself, such as an application column holding the last-changed timestamp, or by a change table in which a separate mechanism keeps track of modifications alongside the originating transactions. In most situations, the latter option entails adding extraction logic to the source system.
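A minimal sketch of timestamp-based incremental extraction in Python, assuming a hypothetical SQLite source database with an orders table that carries a last_updated column and a stored watermark from the previous run; table and column names are illustrative only.

import sqlite3
from datetime import datetime

# Assumed schema: an "orders" table with a "last_updated" timestamp column,
# and a stored watermark recording when the previous extraction finished.
last_extracted = "2024-01-01 00:00:00"   # watermark from the previous run

conn = sqlite3.connect("source_system.db")
cur = conn.cursor()
cur.execute(
    "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
    (last_extracted,),
)
changed_rows = cur.fetchall()            # only the delta since the watermark

# After a successful load, advance the watermark for the next incremental run.
new_watermark = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
conn.close()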
Methods of Physical Extraction
Physically extracting the data can be done in two ways, depending on the chosen logical extraction method and the capabilities and restrictions of the source site. The data can be extracted online from the source system or offline from a separate structure. Such an offline structure may already exist or may be created by an extraction routine.
Physical extraction can be done in the following ways:
Online Extraction
The information is taken directly from the source system. The extraction procedure can link directly to the source
system to access the source tables or connect to an intermediate system to store the data in a predefined format
(for example, snapshot logs or change tables). It's worth noting that the intermediary system doesn't have to be
physically distinct from the source system.
With online extraction, you need to consider whether the extraction works against the original source objects or against prepared source objects.
Offline Extraction
The data is staged deliberately outside the source system rather than being extracted directly from it. The data either already has an existing structure (for example, redo logs, archive logs, or transportable tablespaces) or is created by an extraction routine.
data loading/ techniques?
Ans-
The data warehouse is built by integrating data from different sources. Several factors separate the data warehouse from the operational database: because the two systems provide very different functionality and require different types of data, the warehouse must be kept separate from the operational database. A data warehouse is a repository of information gathered from multiple sources, stored under a unified schema, and usually residing at a single site. It is built through data cleaning, data integration, data transformation, data loading, and periodic data refresh. Loading is the final step of the ETL process: the extracted and transformed data is loaded into the target database. To make the load efficient, it is common practice to disable constraints and defer index maintenance during the load and rebuild them afterwards. The three ETL steps can run in parallel: while extraction is still in progress, transformation can be executed on data already extracted, which prepares it for the loading stage. As soon as some data is ready, it is loaded without waiting for the previous steps to finish.
Techniques/Methods-
Data Loading-
Data is physically moved into the data warehouse. Loading takes place within a "load window". The trend is toward near real-time updates as data warehouses are increasingly used to support operational applications.
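A minimal sketch of the loading step in Python, assuming a hypothetical SQLite warehouse with a sales_fact table; it bulk-inserts already-transformed records and builds the index only after the load, in line with the practice described above. Names and values are illustrative.

import sqlite3

# Transformed records ready for loading; schema and values are illustrative.
records = [(1, "USA", 250.0), (2, "India", 120.5)]

dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS sales_fact (id INTEGER, country TEXT, amount REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)   # bulk insert
# Build (or rebuild) indexes after the bulk load so inserts stay fast.
dw.execute("CREATE INDEX IF NOT EXISTS idx_country ON sales_fact (country)")
dw.commit()
dw.close()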
1) Classification:-
The classification technique of data mining is used to classify data into different classes. It is based on machine learning. Classification assigns each item in a data set to one of a set of predefined classes or groups. The technique uses mathematical concepts such as decision trees, linear programming, neural networks, and statistics. In classification, software learns how to classify data items into groups. Algorithms used for classification include Logistic Regression, Naive Bayes, and K-Nearest Neighbors.
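A minimal classification sketch using scikit-learn's LogisticRegression (one of the algorithms named above) on the bundled Iris dataset; this is an illustrative example, not a prescribed workflow.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=200)   # learn class boundaries from labelled data
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))   # fraction of held-out items classified correctly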
2) Clustering:-
The clustering technique of data mining is used to identify data items that are similar to each other. It helps to understand the similarities and differences between the data. Clustering forms meaningful groups and places objects with similar characteristics in the same group. Algorithms used for clustering include hierarchical clustering algorithms, among others.
3) Regression:-
Regression is a data mining technique that analyzes the relationship between variables and builds predictive models. It can analyze and predict results based on previously known data by applying formulas, making it useful for deriving new information from existing, known information. Algorithms used for regression include multivariate and multiple regression algorithms, among others.
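A minimal regression sketch using scikit-learn's LinearRegression to predict a value from previously known observations; the numbers are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Known (x, y) observations; the model predicts y for an unseen x value.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

reg = LinearRegression().fit(X, y)        # fit the relationship between x and y
print(reg.predict(np.array([[6]])))       # prediction based on the learned relationship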
4) Association:-
Association is a technique that helps find connections between two or more items. It can uncover hidden patterns in a data set; these patterns are discovered on the basis of relationships between items that appear in the same transaction. The technique is used in market basket analysis to identify products that customers frequently purchase together.
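A minimal market-basket sketch in plain Python that counts how often pairs of items appear in the same transaction and reports their support; the transactions are invented for illustration, and a real system would use a dedicated algorithm such as Apriori.

from collections import Counter
from itertools import combinations

# Illustrative market-basket transactions.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1            # count items bought together

# Support of a pair = co-occurrence count / number of transactions.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))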
6) Outlier detection:-
Outlier detection is a data mining technique that refers to the observation of data items in a data set that do not match an expected pattern or behavior. It can be used in a variety of domains, such as intrusion detection, fraud detection, and fault detection.
6) Sequential Patterns:-
Sequential pattern mining is a data mining technique that finds similar trends in transaction data over a certain period of time. It explores historical transaction data to discover recurring patterns and trends over a business period. With historical transaction data, a vendor can identify items that customers buy together at different times of the year. This technique can also help offer customers better deals on products based on their previous purchase data.
7) Prediction:-
Prediction is a data mining technique that combines other techniques such as sequential patterns, clustering, and classification. It analyzes past events to predict a future event. For example, the prediction technique can be used in sales to forecast future profit or loss.
hierarchical clustering method/TECHNIQUES/agglomerative hierarchical clustering/divisive hierarchical
clustering/diff?
Ans-
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following steps:
Identify the two clusters that are closest together, and
Merge the two most similar clusters. These steps continue until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequence of merges or splits) graphically represents this hierarchy: it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are split (top-down view).
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the
clusters in a dataset. The method starts by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. The result of hierarchical clustering is a tree-like
structure, called a dendrogram, which illustrates the hierarchical relationships among the clusters.
Types-
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
Agglomerative Clustering
Divisive clustering
1. Agglomerative Clustering
Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is treated as an individual entity or cluster. At every iteration, clusters are merged with other clusters until only one cluster remains.
The algorithm for agglomerative hierarchical clustering is (a code sketch follows the worked example below):
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Example-
Let’s say we have six data points A, B, C, D, E, and F.
Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other
clusters.
Step-2: In the second step, comparable clusters are merged to form a single cluster. Suppose cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly for clusters (D) and (E). After this step we have the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters([(DE), (F)])
together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged together to form a new
cluster. We’re now left with clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
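A minimal agglomerative clustering sketch using SciPy, standing in for the A to F example above; the six 2-D points are invented, single linkage is an arbitrary choice, and the dendrogram is cut into two clusters just to show the bottom-up merging.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six illustrative 2-D points standing in for A, B, C, D, E, F.
points = np.array([[0, 0], [1, 1], [1, 2], [5, 5], [5, 6], [9, 9]])

Z = linkage(points, method="single")               # bottom-up merging of the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)                                      # cluster label assigned to each point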
2. Divisive Hierarchical clustering
Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points in a single cluster and, in every iteration, we split off the data points that are least similar to the rest of the cluster. In the end, we are left with N clusters, each containing a single data point.
Cluster evaluation measures:
Silhouette Coefficient: Measures the compactness and separation of clusters. Values close to +1 indicate well-separated clusters, while values close to 0 indicate overlapping clusters.
Davies-Bouldin Index: Evaluates the average similarity of each cluster with its most similar cluster, where lower
values indicate better clustering.
Dunn Index: Measures the compactness and separation of clusters using the ratio of the minimum inter-cluster
distance to the maximum intra-cluster distance. Higher Dunn Index values indicate better clustering.
Cluster Stability:
Cluster Stability Index: Measures the stability of clusters by comparing clustering results on subsamples or
perturbed versions of the dataset. Higher stability indicates more robust clustering.
Visual Inspection and Interpretability:
Cluster Visualization: Visual inspection of clustering results using techniques like scatter plots, dendrograms,
heatmaps, or t-SNE projections to assess cluster separability and structure.
Interpretability: Evaluating the interpretability and meaningfulness of clusters based on domain knowledge or
expert judgment.
Cross-Validation and Resampling:
Cross-Validation: Using techniques like k-fold cross-validation or holdout validation to assess clustering performance
on different subsets of the data.
Bootstrap Resampling: Generating multiple bootstrap samples and evaluating clustering stability and consistency
across these samples.
Domain-Specific Metrics:
Application-Specific Metrics: Tailoring cluster evaluation metrics based on the specific goals and requirements of the
data mining application (e.g., customer segmentation, anomaly detection, pattern recognition).
By employing these evaluation methods and metrics, data miners can assess the quality, robustness, and
effectiveness of clustering algorithms, leading to more reliable and meaningful cluster analysis results.
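A minimal sketch of computing two of the internal measures listed above with scikit-learn on synthetic data; the blob data and the choice of k=3 are purely illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))          # closer to +1 is better
print("davies-bouldin:", davies_bouldin_score(X, labels))  # lower is better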
Measures-
Cluster evaluation measures fall into three types: external, internal, and relative. Together they provide a comprehensive framework for assessing the quality, robustness, and effectiveness of clustering algorithms and the resulting clusters. External measures validate the clustering against known ground truth, internal measures evaluate the clustering based on the characteristics of the data itself, and relative measures compare different clustering solutions or algorithms to determine the best clustering outcome.
k mean algorithm/ adv disadv?
Ans-
Unsupervised machine learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any previous training data, the machine's job in this case is to organize unsorted data according to similarities, patterns, and differences.
K-means clustering assigns data points to one of K clusters depending on their distance from the cluster centers. It starts by placing the cluster centroids randomly in the space. Each data point is then assigned to the cluster whose centroid it is nearest to. After assigning every point to a cluster, new cluster centroids are computed. This process runs iteratively until the clusters stabilize. In this analysis we assume that the number of clusters is given in advance and that every point must be placed in one of the groups.
In some cases K is not clearly defined, and we have to determine the optimal number of clusters. K-means performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is faster than many other clustering techniques and produces tight groupings of the data points. However, K-means does not provide clear information about the quality of the clusters, different initial assignments of cluster centroids may lead to different clusters, the algorithm is sensitive to noise, and it may get stuck in local minima.
What is the objective of k-means clustering?
The goal of clustering is to divide the population or set of data points into a number of groups so that the data points
within each group are more comparable to one another and different from the data points within the other groups.
It is essentially a grouping of things based on how similar and different they are to one another.
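A minimal k-means sketch with scikit-learn on synthetic, well-separated data; k is assumed to be known in advance, as stated above.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (k is assumed to be known).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)   # k-means++ initialisation by default
labels = km.fit_predict(X)

print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # within-cluster sum of squared distances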
Advantages
Simplicity and ease of use
The K-means algorithm's simplicity is a major advantage. Its straightforward concept of partitioning data into
clusters based on similarity makes it easy to understand and implement. This accessibility is especially valuable for
newcomers to the field of machine learning.
Efficiency and speed
K-means is known for its computational efficiency, making it suitable for handling large datasets. Its complexity is
relatively low, allowing it to process data quickly and efficiently. This speed is advantageous for real-time or near-
real-time applications.
Scalability
The algorithm's efficiency scales well with the increase in the number of data points. This scalability makes K-means
applicable to datasets of varying sizes, from small to large.
Unsupervised Learning
K-means operates under the unsupervised learning paradigm, requiring no labeled data for training. It autonomously
discovers patterns within data, making it valuable for exploratory data analysis and uncovering hidden insights.
Disadvantages
Sensitive to initial placement
K-means' convergence to a solution is sensitive to the initial placement of cluster centroids. Different initial
placements can result in different final clusterings. Techniques like the k-means++ initialization method help mitigate
this issue.
Assumption of equal-sized Clusters and spherical shapes
K-means assumes that clusters are of equal sizes and have spherical shapes. This assumption can lead to suboptimal
results when dealing with clusters of varying sizes or non-spherical shapes.
Dependence on number of Clusters
The algorithm's performance heavily depends on the correct choice of the number of clusters ('k'). Incorrect 'k'
values can lead to clusters that are not meaningful or informative.
Sensitive to outliers
K-means is sensitive to outliers, which can skew the placement of cluster centroids and affect the overall clustering
results.Not suitable for non-linear dataK-means assumes that clusters are separated by linear boundaries. It may not
perform well on datasets with complex or non-linear cluster structures.
Outliers?
Ans-
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining. An outlier is not simply noise or an error; rather, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
Global (or Point) Outliers
Collective Outliers
Contextual (or Conditional) Outliers
1. Global Outliers
1. Definition: Global outliers are data points that deviate significantly from the overall distribution of a dataset.
2. Causes: Errors in data collection, measurement errors, or truly unusual events can result in global outliers.
3. Impact: Global outliers can distort data analysis results and affect machine learning model performance.
4. Detection: Techniques include statistical methods (e.g., z-score, Mahalanobis distance), machine learning
algorithms (e.g., isolation forest, one-class SVM), and data visualization techniques.
5. Handling: Options may include removing or correcting outliers, transforming data, or using robust methods.
6. Considerations: Carefully considering the impact of global outliers is crucial for accurate data analysis and machine
learning model outcomes.
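A minimal sketch of detecting a global outlier with the z-score method mentioned above; the sample is synthetic, and the |z| > 3 cut-off is only a common rule of thumb.

import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=100), 200.0)  # 100 typical points plus one global outlier

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]   # common rule of thumb: |z| > 3
print(outliers)                           # the injected point (200.0) is flagged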
2. Collective Outliers
1. Definition: Collective outliers are groups of data points that collectively deviate significantly from the overall
distribution of a dataset.
2. Characteristics: Collective outliers may not be outliers when considered individually, but as a group, they exhibit
unusual behavior.
3. Detection: Techniques for detecting collective outliers include clustering algorithms, density-based methods, and
subspace-based approaches.
4. Impact: Collective outliers can represent interesting patterns or anomalies in data that may require special
attention or further investigation.
5. Handling: Handling collective outliers depends on the specific use case and may involve further analysis of the
group behavior, identification of contributing factors, or considering contextual information.
6. Considerations: Detecting and interpreting collective outliers can be more complex than individual outliers, as the
focus is on group behavior rather than individual data points. Proper understanding of the data context and domain
knowledge is crucial for effective handling of collective outliers.
3. Contextual Outliers
1. Definition: Contextual outliers are data points that deviate significantly from the expected behavior within a
specific context or subgroup.
2. Characteristics: Contextual outliers may not be outliers when considered in the entire dataset, but they exhibit
unusual behavior within a specific context or subgroup.
3. Detection: Techniques for detecting contextual outliers include contextual clustering, contextual anomaly
detection, and context-aware machine learning approaches.
4. Contextual Information: Contextual information such as time, location, and other relevant factors is crucial in identifying contextual outliers.
5. Impact: Contextual outliers can represent unusual or anomalous behavior within a specific context, which may
require further investigation or attention.
6. Handling: Handling contextual outliers may involve considering the contextual information, contextual
normalization or transformation of data, or using context-specific models or algorithms.
7. Considerations: Proper understanding of the context and domain-specific knowledge is crucial for accurate
detection and interpretation of contextual outliers, as they may vary based on the specific context or subgroup being
considered.
data mining trends?
Ans-1. Application exploration
Data mining is increasingly used to explore applications in other areas, such as financial analysis,
telecommunications, biomedicine, wireless security, and science.
2. Multimedia Data Mining
This is one of the latest methods which is catching up because of the growing ability to capture useful data
accurately. It involves data extraction from different kinds of multimedia sources such as audio, text, hypertext,
video, images, etc. The data is converted into a numerical representation in different formats. This method can be
used in clustering and classifications, performing similarity checks, and identifying associations.
3. Ubiquitous Data Mining
This method involves mining data from mobile devices to get information about individuals. Despite several challenges, such as complexity, privacy, and cost, this method has enormous potential in various industries, especially in studying human-computer interactions.
4. Distributed Data Mining
This type of data mining is gaining popularity as it involves mining a huge amount of information stored in different
company locations or at different organizations. Highly sophisticated algorithms are used to extract data from
different locations and provide proper insights and reports based on them.
5. Embedded Data Mining
Data mining features are increasingly finding their way into many enterprise software use cases, from sales
forecasting in CRM SaaS platforms to cyber threat detection in intrusion detection/prevention systems. The
embedding of data mining into vertical market software applications enables prediction capabilities for any number
of industries and opens up new realms of possibilities for unique value creation.
6. Spatial and Geographic Data Mining
This new trending type of data mining includes extracting information from environmental, astronomical, and
geographical data, including images taken from outer space. This type of data mining can reveal various aspects such
as distance and topology, which are mainly used in geographic information systems and other navigation
applications.
7. Time Series and Sequence Data Mining
The primary application of this type of data mining is the study of cyclical and seasonal trends. This practice is also helpful in analyzing even random events that occur outside the normal series of events. Retail companies mainly use this method to assess customers' buying patterns and behaviors.
8. Data Mining Dominance in the Pharmaceutical And Health Care Industries
Both the pharmaceutical and health care industries have long been innovators in the category of data mining. The
recent rapid development of coronavirus vaccines is directly attributed to advances in pharmaceutical testing data
mining techniques, specifically signal detection during the clinical trial process for new drugs.
9. Increasing Automation In Data Mining
Today's data mining solutions typically integrate ML and big data stores to provide advanced data management
functionality alongside sophisticated data analysis techniques. Earlier incarnations of data mining involved manual
coding by specialists with a deep background in statistics and programming.
10. Data Mining Vendor Consolidation
If history is any indication, significant product consolidation in the data mining space is imminent as larger database
vendors acquire data mining tooling startups to augment their offerings with new features. The current fragmented
market and a broad range of data mining players resemble the adjacent big data vendor landscape that continues to
undergo consolidation.
11. Biological data mining
Mining DNA and protein sequences, mining high dimensional microarray data, biological pathway and network
analysis, link analysis across heterogeneous biological data, and information integration of biological data by data
mining are interesting topics for biological data mining research.
web mining/content mining/structure mining/usage mining?
Ans-
Web mining can broadly be seen as the application of adapted data mining techniques to the web, whereas data mining is defined as the application of algorithms to discover patterns in mostly structured data as part of a knowledge discovery process. A distinctive property of web mining is that it deals with a variety of data types. The web has multiple aspects that yield different approaches for the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.
1. Web Content Mining:
Web content mining is used to extract useful data, information, and knowledge from web page content. In web content mining, each web page is considered as an individual document. One can take advantage of the semi-structured nature of web pages, as HTML provides information about the logical structure as well as the layout. The primary task of content mining is data extraction, where structured data is extracted from unstructured websites. The objective is to facilitate data aggregation over various websites by using the extracted structured data. Web content mining can also be utilized to distinguish topics on the web. For example, if a user searches for a specific topic on a search engine, the user will get a list of suggestions.
2. Web Structured Mining:
Web structure mining is used to discover the link structure formed by hyperlinks, that is, how web pages are linked to each other and to the wider link network. In web structure mining, the web is treated as a directed graph, with the web pages as vertices connected by hyperlinks. The most prominent application in this regard is the Google search engine, which estimates the ranking of its results primarily with the PageRank algorithm: a page is considered highly relevant when it is frequently linked to by other highly relevant pages. Structure and content mining methodologies are usually combined. For example, web structure mining can help organizations understand the network of links between two commercial sites.
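A minimal sketch of the PageRank idea using power iteration over a tiny invented link graph; the adjacency matrix and damping factor of 0.85 are illustrative, and real search engines use far more elaborate variants.

import numpy as np

# Tiny illustrative link graph: adjacency[i][j] = 1 means page i links to page j.
adjacency = np.array([
    [0, 1, 1],   # page 0 links to pages 1 and 2
    [1, 0, 1],   # page 1 links to pages 0 and 2
    [0, 1, 0],   # page 2 links to page 1
], dtype=float)

# Column-stochastic transition matrix: each page spreads its rank over its out-links.
M = (adjacency / adjacency.sum(axis=1, keepdims=True)).T

d = 0.85                          # damping factor
n = M.shape[0]
rank = np.full(n, 1.0 / n)
for _ in range(50):               # power iteration
    rank = (1 - d) / n + d * M @ rank
print(rank)                       # frequently linked pages receive higher rank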
3. Web Usage Mining:
Web usage mining is used to extract useful data, information, and knowledge from web log records, and it assists in recognizing user access patterns for web pages. When mining the usage of web resources, one works with the records of requests made by visitors to a website, which are often collected as web server logs. While the content and structure of the collection of web pages follow the intentions of their authors, the individual requests demonstrate how consumers actually use these pages. Web usage mining may therefore disclose relationships that were not intended by the creators of the pages.
Some of the methods used to identify and analyze web usage patterns are given below (a small log-parsing sketch follows this list):
Session analysis of the preprocessed data, which incorporates visitor records, days, times, sessions, etc.; this data can be used to analyze visitor behavior.
A report is created after this analysis, containing details of frequently visited web pages and common entry and exit pages.
OLAP can be applied to various parts of the log data over a specific period.
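A minimal web usage mining sketch that parses a few invented Common Log Format lines and counts page requests; the log lines and the regular expression are illustrative of the kind of preprocessing described above.

import re
from collections import Counter

# Illustrative web server log lines in the Common Log Format.
log_lines = [
    '192.168.1.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.1.7 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.168.1.5 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 5120',
]

pattern = re.compile(r'^(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)')
page_hits = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        ip, timestamp, method, path, status, size = match.groups()
        page_hits[path] += 1              # simple access-pattern statistic

print(page_hits.most_common())            # most frequently requested pages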
These tasks are crucial for various applications such as e-commerce, digital marketing, user experience optimization,
information retrieval, fraud detection, and business intelligence on the web. Effective web mining enables
organizations to make data-driven decisions, improve user engagement, enhance search relevancy, and gain
competitive insights in the digital landscape.
Tools-
SAS Enterprise Miner:
Industry: Banking, finance, healthcare, retail, marketing.
Features: Offers data mining and machine learning capabilities, including web data extraction, text mining, predictive
modeling, and cluster analysis.
Use Cases: Customer segmentation, fraud detection, churn analysis, sentiment analysis, market basket analysis.
IBM SPSS Modeler:
Industry: Retail, telecommunications, healthcare, education.
Features: Provides data mining, predictive analytics, and text analytics functionalities, including web data extraction,
social media analysis, and sentiment analysis.
Use Cases: Customer profiling, campaign optimization, risk assessment, demand forecasting, social media
monitoring.
RapidMiner:
Industry: Manufacturing, e-commerce, energy, government.
Features: Offers a visual workflow environment for data preparation, modeling, and analysis, including web scraping,
text mining, machine learning, and visualization.
Use Cases: Predictive maintenance, supply chain optimization, customer churn prediction, sentiment analysis,
anomaly detection.
KNIME Analytics Platform:
Industry: Healthcare, pharmaceuticals, finance, retail.
Features: Provides an open-source platform for data integration, analysis, and reporting, with extensions for web
data extraction, text processing, and machine learning.
Use Cases: Drug discovery, patient analytics, financial risk modeling, customer segmentation, market research.
Applications-
Web mining applications span a wide range of industries and use cases, leveraging techniques and tools to extract
valuable insights, patterns, and knowledge from web data. Here are some common applications of web mining
across various domains:
E-Commerce and Retail:
Customer Segmentation: Analyzing customer behavior, preferences, and purchase history to segment customers for
targeted marketing campaigns and personalized recommendations.
Market Basket Analysis: Identifying associations and patterns in customer shopping baskets to optimize product
placement, cross-selling, and upselling strategies.
Competitor Analysis: Monitoring competitor websites, pricing strategies, product offerings, and customer reviews to
gain competitive intelligence.
Digital Marketing and Advertising:
Social Media Monitoring: Analyzing social media platforms for brand sentiment, customer feedback, trends,
influencers, and campaign performance.
Ad Campaign Optimization: Analyzing web traffic, click-through rates, conversions, and ad performance metrics to
optimize digital advertising campaigns.
Search Engine Optimization (SEO): Analyzing search engine results, keywords, backlinks, and website traffic to
improve search engine rankings and visibility.
Finance and Banking:
Fraud Detection: Analyzing transaction data, user behavior, and account activities to detect fraudulent patterns,
anomalies, and suspicious activities.
Risk Assessment: Assessing credit risk, investment opportunities, market trends, and financial indicators using web
data and market information.
Customer Insights: Understanding customer preferences, investment behaviors, financial goals, and market
sentiments to offer personalized financial services.
Healthcare and Pharmaceuticals:
Drug Discovery: Analyzing biomedical literature, research papers, clinical trials, and drug interactions to identify
potential drug candidates and therapeutic targets.
Patient Analytics: Analyzing patient records, medical histories, treatment outcomes, and disease patterns to improve
healthcare delivery, patient care, and disease management.
text mining/types(agent based)/techniques/steps/tools?
Ans-
What is Text Mining-
Text mining is a component of data mining that deals specifically with unstructured text data. It involves the use of
natural language processing (NLP) techniques to extract useful information and insights from large amounts of
unstructured text data. Text mining can be used as a preprocessing step for data mining or as a standalone process
for specific tasks.
Text Mining in Data Mining-
In data mining, text mining is mostly used to transform unstructured text data into structured data that can then be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Techniques-
Tokenization:
Definition: Tokenization is the process of breaking down a text into smaller units, typically words, phrases, or
sentences, known as tokens.
Example: The sentence "The quick brown fox jumps over the lazy dog" can be tokenized into individual words:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
Purpose: Tokenization is a fundamental step in text processing as it allows for further analysis and processing of text
data at a granular level.
Term Frequency (TF):
Definition: Term frequency (TF) is a metric used to quantify the frequency of a term (word) within a document.
Formula: TF(term) = (Number of times term appears in a document) / (Total number of terms in the document)
Example: In the sentence "The quick brown fox jumps over the lazy dog," the TF for the term "fox" is 1/9 since "fox"
appears once in the nine-word sentence.
Purpose: TF is used in techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to weigh the
importance of terms in documents and text corpora.
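A minimal sketch of tokenization and term frequency in plain Python, reproducing the "fox" example above; lower-casing and whitespace splitting stand in for a real tokenizer.

from collections import Counter

sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.lower().split()          # simple whitespace tokenization
counts = Counter(tokens)

tf = {term: count / len(tokens) for term, count in counts.items()}
print(tf["fox"])                           # 1/9, matching the example above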
Stemming:
Definition: Stemming is the process of reducing words to their base or root form by removing suffixes or prefixes.
Example: The word "running" can be stemmed to "run" and "played" to "play"; note that stems are not always valid words (for example, "happily" may become "happili").
Purpose: Stemming helps in normalizing words and reducing variations, which can improve text processing tasks
such as search, retrieval, and indexing.
Lemmatization:
Definition: Lemmatization is the process of reducing words to their canonical or dictionary form (lemma), which
involves identifying the base form of a word based on its part of speech and context.
Example: The word "better" can be lemmatized to "good," "cats" to "cat," and "went" to "go."
Purpose: Lemmatization is more sophisticated than stemming as it considers linguistic rules and context, resulting in
more accurate transformations of words to their base forms.
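A minimal sketch of stemming and lemmatization with NLTK; it assumes NLTK is installed and downloads the WordNet data it needs, and the expected outputs in the comments follow the examples above.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)       # lemmatizer needs the WordNet corpus
nltk.download("omw-1.4", quiet=True)       # extra WordNet data required by newer NLTK versions

stemmer = PorterStemmer()
print(stemmer.stem("running"), stemmer.stem("played"))   # run play

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))                      # cat
print(lemmatizer.lemmatize("went", pos="v"))             # go
print(lemmatizer.lemmatize("better", pos="a"))           # good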
Steps-
Gathering unstructured information from various sources available in different document formats, for example plain text, web pages, and PDF files.
Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The cleansing process makes sure the genuine text is captured; it removes stop words, applies stemming (the process of identifying the root of a word), and indexes the data.
Processing and control tasks are applied to review and further clean the data set.
Pattern analysis is implemented, for example in a Management Information System.
The information processed in the above steps is used to extract important and relevant data for powerful, timely decision-making and trend analysis.
Tools-
Python Natural Language Toolkit (NLTK):
Description: NLTK is a leading platform for natural language processing (NLP) and text mining in Python. It provides
libraries and tools for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, sentiment
analysis, and more.
Features: Comprehensive NLP functionalities, easy-to-use APIs, support for various text processing tasks and
algorithms.
Apache OpenNLP:
Description: OpenNLP is an open-source NLP library that offers tools and models for tasks like tokenization, sentence
detection, named entity recognition (NER), part-of-speech tagging, chunking, parsing, and coreference resolution.
Features: Scalable and customizable, supports multiple languages, provides pre-trained models for text analysis
tasks.
GATE (General Architecture for Text Engineering):
Description: GATE is a powerful open-source text mining and NLP framework that supports various text processing
tasks, including information extraction, document classification, sentiment analysis, ontology engineering, and text
annotation.
Features: Extensible architecture, graphical development environment, integration with external tools and libraries,
support for multiple languages.
RapidMiner:
Description: RapidMiner is an integrated data science platform that offers text mining capabilities along with data
preparation, machine learning, predictive analytics, and visualization tools. It supports tasks like text preprocessing,
sentiment analysis, text classification, and topic modeling.
Features: Visual workflow environment, drag-and-drop interface, machine learning algorithms for text analysis,
deployment options.
KNIME Analytics Platform:
Description: KNIME is an open-source data analytics platform that includes text mining and NLP extensions. It
provides tools for text preprocessing, text classification, sentiment analysis, named entity recognition, topic
modeling, and text mining workflows.
Features: Graphical user interface, extensive collection of nodes for text processing tasks, integration with other
data sources and analytics tools.
IBM Watson Natural Language Understanding:
Description: IBM Watson NLU is a cloud-based text analysis service that offers advanced NLP capabilities, including
entity recognition, sentiment analysis, keyword extraction, concept tagging, emotion analysis, and document
categorization.
Features: Cognitive computing capabilities, API-based access, multilingual support, customizable models, integration
with IBM Watson ecosystem.
Lexalytics Salience:
Description: Lexalytics Salience is a text analytics and sentiment analysis software that provides tools for entity
extraction, concept extraction, sentiment scoring, language detection, and categorization of text data.
Features: Named entity recognition, entity linking, thematic extraction, summarization, industry-specific models
(e.g., finance, healthcare, social media).
data visualization/dashboard-kpi?
Ans-
Data visualization is the graphical representation of data points and information, making it easy and quick for users to understand. A data visualization is good if it has a clear meaning and purpose and is easy to interpret without requiring additional context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data by using visual elements such as charts, graphs, and maps.
Characteristics of Effective Graphical Visual :
It shows or visualizes data very clearly in an understandable manner.
It encourages viewers to compare different pieces of data.
It closely integrates statistical and verbal descriptions of the data set.
It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human brain tends to focus on visual data more than on written data.
It also helps in identifying areas that need more attention and improvement.
Using graphical representation, a story can be told more efficiently; it also takes less time to understand a picture than to understand textual data.
Categories of Data Visualization :
Data visualization is critical to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and reduces the risk of analysis paralysis. Data visualization is categorized into the following categories :
Numerical Data :
Numerical data is also known as Quantitative data. Numerical data is any data where data generally represents
amount such as height, weight, age of a person, etc. Numerical data visualization is easiest way to visualize data. It is
generally used for helping others to digest large data sets and raw numbers in a way that makes it easier to interpret
into action. Numerical data is categorized into two categories :
Continuous Data –
Data that can take any value within a range (Example: height measurements).
Discrete Data –
Data that is not continuous and takes only distinct values (Example: the number of cars or children a household has).
The type of visualization techniques that are used to represent numerical data visualization is Charts and Numerical
Values. Examples are Pie Charts, Bar Charts, Averages, Scorecards, etc.
Categorical Data :
Categorical data is also known as Qualitative data. Categorical data is any data where data generally represents
groups. It simply consists of categorical variables that are used to represent characteristics such as a person’s
ranking, a person’s gender, etc. Categorical data visualization is all about depicting key themes, establishing
connections, and lending context. Categorical data is classified into three categories :
Binary Data –
In this, classification is based on two opposing values (Example: agree or disagree).
Nominal Data –
In this, classification is based on attributes without any order (Example: male or female).
Ordinal Data –
In this, classification is based on the ordering of information (Example: a timeline or the stages of a process).
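A minimal visualization sketch with matplotlib that plots a numerical measure against categorical groups as a bar chart; the categories and sales figures are invented for illustration.

import matplotlib.pyplot as plt

# Illustrative categorical data (product categories) with numerical values (sales).
categories = ["Electronics", "Clothing", "Groceries", "Books"]
sales = [420, 310, 560, 150]

plt.bar(categories, sales)                 # bar chart: categorical x, numerical y
plt.title("Sales by product category")
plt.ylabel("Units sold")
plt.savefig("sales_by_category.png")       # write the chart to an image file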
Dashboard KPI-
A KPI dashboard displays key performance indicators in interactive charts and graphs, allowing for quick, organized
review and analysis. Key performance indicators are quantifiable measures of performance over time for specific
strategic objectives. Modern KPI dashboards allow any user to easily explore the data behind the KPIs and uncover
actionable insights. In this way, a KPI dashboard transforms massive data sets from across an organization into data-
driven decisions that can improve your business.
business intelligence/future of bi?
Ans-
Business Intelligence, in today's changing and growing world, can be defined as a set of concepts and methodologies for improving decision-making in business through the use of facts and fact-based systems. The goal of Business Intelligence is to improve decision-making through analysis of business data. Business Intelligence is not just one concept; it is a group of concepts and methodologies, and it combines analytics with the experience and judgment of decision makers.
Business intelligence refers to a collection of mathematical models and analysis methods that utilize data to produce
valuable information and insight for making important decisions.
Main Components of Business Intelligence System:
Data Source
Data Mart / Data Warehouse
Data Exploration
Data Mining
Optimization
Decisions
1.Data Source:
To begin, the first step is gathering and consolidating data from an array of primary and secondary sources. These
sources vary in origin and format, consisting mainly of operational system data but also potentially containing
unstructured documents like emails and data from external providers.
2.Data Mart / Data Warehouse:
Through the utilization of extraction and transformation tools, also known as extract, transform, load (ETL), data is
acquired from various sources and saved in databases designed specifically for business intelligence analysis. These
databases, commonly known as data warehouses and data marts, serve as a centralized location for the gathered
data.
3.Data Exploration:
The third level of the pyramid offers essential resources for conducting a passive analysis in business intelligence.
These resources include query and reporting systems, along with statistical methods. These techniques are referred
to as passive because decision makers must first develop ideas or establish criteria for data extraction before
utilizing analysis tools to uncover answers and confirm their initial theories. For example, a sales manager might
observe a decrease in revenues in a particular geographic region for a specific demographic of customers. In
response, she could utilize extraction and visualization tools to confirm her hypothesis and then use statistical
testing to validate her findings based on the data.
4.Data Mining:
The fourth level, known as active business intelligence methodologies, focuses on extracting valuable information and knowledge from data. It covers techniques such as mathematical models, pattern recognition, machine learning, and data mining. Unlike the tools of the previous level, active models do not rely on decision makers to come up with hypotheses; instead they aim to enhance the decision makers' understanding by generating patterns and predictions themselves.
5.Optimization:
As you ascend the pyramid, you’ll encounter optimization models that empower you to choose the most optimal
course of action among various alternatives, which can often be quite extensive or even endless. These models have
also been effectively incorporated in marketing and logistics.
6.Decisions:
At last, the pinnacle of the pyramid reflects the ultimate decision made and put into action, serving as the logical end
to the decision-making process. Despite the availability and effective utilization of business intelligence
methodologies, the decision still lies in the hands of the decision makers, who can incorporate informal and
unstructured information to fine-tune and revise the suggestions and outcomes generated by mathematical models.
future of bi?
Ans-
The future scope of Business Intelligence (BI) is promising, driven by technological advancements, data proliferation,
and evolving business needs. Here are key areas shaping the future of BI:
Advanced Analytics and AI: BI is moving beyond descriptive analytics to predictive and prescriptive analytics
powered by AI and machine learning. This includes predictive modeling, anomaly detection, natural language
processing (NLP), and automated decision-making.
Big Data and Real-Time Analytics: With the exponential growth of data, BI is embracing big data technologies like
Hadoop, Spark, and NoSQL databases for processing and analyzing large volumes of structured and unstructured
data in real time.
Data Visualization and Storytelling: BI tools are enhancing data visualization capabilities with interactive dashboards,
geospatial analytics, and storytelling features to convey insights effectively and facilitate data-driven decision-
making.
Self-Service BI and Citizen Data Scientists: Empowering business users with self-service BI tools, drag-and-drop
interfaces, and easy-to-use analytics capabilities to explore data, create reports, and derive insights without heavy
reliance on IT or data scientists.
Embedded BI and Integration: Integrating BI capabilities into operational systems, applications, and workflows to
provide context-aware insights, personalized recommendations, and actionable intelligence directly within business
processes.
Mobile BI and Accessibility: Enabling mobile access to BI platforms, allowing users to access, analyze, and share data
anytime, anywhere, and on any device for improved collaboration and decision-making on-the-go.
Data Governance and Privacy: Addressing data governance challenges, ensuring data quality, compliance with
regulations (e.g., GDPR, CCPA), and implementing robust security measures to protect sensitive information in BI
systems.
Cloud-Based BI and SaaS Solutions: Adoption of cloud-based BI platforms and Software-as-a-Service (SaaS) BI
solutions for scalability, agility, cost-effectiveness, and seamless integration with other cloud services and data
sources.
Industry-Specific BI Solutions: Tailoring BI solutions to specific industries (e.g., healthcare, retail, finance) with
industry-specific analytics, KPIs, and domain expertise to address unique business challenges and opportunities.
Augmented Analytics and Data Democratization: Leveraging augmented analytics, data discovery tools, and natural
language querying to democratize data access, insights, and decision-making across organizations, fostering a data-
driven culture.
Overall, the future of BI lies in harnessing data as a strategic asset, leveraging advanced technologies, fostering data-
driven cultures, and empowering users with actionable insights to drive innovation, competitiveness, and business
success.