Data Mining Complete
Data Warehouse
A Data Warehouse is a relational database management system (RDBMS) construct designed to support query, analysis, and decision-making rather than transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented to satisfy decision-making requests. It is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Data Warehousing therefore supports architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.
A Data Warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that handle the process of gathering information and delivering it to business users.
A Data Warehouse is a group of data specific to the entire organization, not only to a
particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's global
ongoing operations. This is done by excluding data that are not useful concerning the
subject and including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records. It requires performing data cleaning and integration
during data warehousing to ensure consistency in naming conventions, attribute types,
etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older periods from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and access to data. Therefore,
the DW does not require transaction processing, recovery, and concurrency capabilities,
which allows for substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
2) Store historical data: A Data Warehouse is required to store time-variable data from the past. This input is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing the data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
Data warehousing provides the capability to analyze large amounts of historical data. OLAP implements multidimensional analysis of business information and supports the capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions, including Business Performance Management, Planning, Budgeting, Forecasting, Financial Documenting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.
OLAP is applied in many areas, for example in production:
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data
model more intuitive to them than a tabular model. This model is called a Dimensional
Model.
The second purpose is to enable fast query response that is usually difficult to achieve
using tabular models.
Transaction System | Data Warehouse
Data is balanced within the scope of one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated by scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake.
The dimensions are the perspectives or entities concerning which an organization keeps
records. Each dimension has a table related to it, called a dimensional table, which
describes the dimension further. For example, a dimensional table for an item may contain
the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.
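As a rough illustration (not tied to any particular product), the fact/dimension idea can be sketched with two small pandas DataFrames: a fact table holding the numeric measures and a dimension table describing each item. The table names, keys, and figures below are invented for illustration.

```python
import pandas as pd

# Dimension table: describes the item dimension further (item_name, brand, type).
dim_item = pd.DataFrame({
    "item_key":  [1, 2, 3],
    "item_name": ["Laptop", "Phone", "Modem"],
    "brand":     ["Acme", "Acme", "Globex"],
    "type":      ["Computer", "Mobile", "Network"],
})

# Fact table: keys into the dimensions plus numerical measures (the facts).
fact_sales = pd.DataFrame({
    "item_key":     [1, 2, 2, 3],
    "location_key": [10, 10, 20, 20],
    "time_key":     [100, 100, 101, 101],
    "units_sold":   [5, 7, 3, 2],
    "dollars_sold": [5000, 4200, 1800, 400],
})

# Joining the fact table to a dimension table answers questions phrased in
# terms of the dimension, e.g. total dollars sold per brand.
report = fact_sales.merge(dim_item, on="item_key").groupby("brand")["dollars_sold"].sum()
print(report)
```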
Consider the OLAP operations which are to be performed on multidimensional data. The figure shows data cubes for sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.
Roll-Up
The roll-up operation (also known as drill-up or the aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province or state < country. The roll-up operation aggregates the data by ascending the location hierarchy from the level of the city to the level of the country.
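A minimal pandas sketch of such a roll-up, assuming the cube is held as a long-format table with invented city, country, and sales figures; climbing the location hierarchy from city to country is a group-and-aggregate.

```python
import pandas as pd

# Sales recorded per city, plus the next level of the location hierarchy (country).
sales = pd.DataFrame({
    "city":    ["Toronto", "Vancouver", "New Delhi", "Mumbai"],
    "country": ["Canada",  "Canada",    "India",     "India"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "units":   [605, 825, 400, 512],
})

# Roll-up: climb from the city level to the country level and aggregate.
rolled_up = sales.groupby(["country", "quarter"], as_index=False)["units"].sum()
print(rolled_up)
```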
Drill-Down
The drill-down operation (also called roll-down) is the reverse operation of roll-up.
Drill-down is like zooming in on the data cube. It navigates from less detailed data to
more detailed data. Drill-down can be performed by either stepping down a concept
hierarchy for a dimension or adding additional dimensions.
Figure shows a drill-down operation performed on the dimension time by stepping down
a concept hierarchy which is defined as day < month < quarter < year. Drill-down occurs
by descending the time hierarchy from the level of the quarter to a more detailed level of
the month.
Because a drill-down adds more details to the given data, it can also be performed by
adding a new dimension to a cube. For example, a drill-down on the central cubes of the
figure can occur by introducing an additional dimension, such as a customer group.
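A hedged sketch of a drill-down along the time hierarchy, assuming the cube is stored as a long-format pandas DataFrame with invented monthly figures; stepping from quarters down to months simply groups at the finer level.

```python
import pandas as pd

sales = pd.DataFrame({
    "item":    ["Mobile"] * 6,
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "units":   [120, 90, 140, 110, 95, 130],
})

by_quarter = sales.groupby(["item", "quarter"], as_index=False)["units"].sum()
by_month   = sales.groupby(["item", "quarter", "month"], as_index=False)["units"].sum()

print(by_quarter)   # less detailed view (quarter level)
print(by_month)     # drilled-down, more detailed view (month level)
```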
The following diagram illustrates how Drill-down works.
Slice
A slice is a subset of the cubes corresponding to a single value for one or more members
of the dimension. For example, a slice operation is executed when the customer wants a
selection on one dimension of a three-dimensional cube, resulting in a two-dimensional
slice. So, the slice operation performs a selection on one dimension of the given cube, thus
resulting in a subcube.
For example, if we make the selection, temperature=cool we will obtain the following cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 13 0
Day 14 0
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions. For example, implementing the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) on the original cube gives the following subcube (still two-dimensional):
Temperature cool hot
Day 3 0 1
Day 4 0 0
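A small, hedged sketch of slice and dice with pandas, assuming the cube is kept as a long-format DataFrame with invented day, temperature, and count values.

```python
import pandas as pd

cube = pd.DataFrame({
    "day":         [3, 3, 4, 4, 5, 5],
    "temperature": ["cool", "hot", "cool", "hot", "cool", "hot"],
    "count":       [0, 1, 0, 0, 1, 0],
})

# Slice: fix a single value on one dimension (temperature = cool).
slice_cool = cube[cube["temperature"] == "cool"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["day"].isin([3, 4]) & cube["temperature"].isin(["cool", "hot"])]

print(slice_cool)
print(dice)
```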
Consider the following diagram, which shows the dice operations.
The dice operation on the cube, based on the following selection criteria, involves three dimensions:
o (location = "Toronto" or
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in view to provide an alternative presentation of the data. It may involve swapping the rows and columns or moving one of the row dimensions into the column dimensions.
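A minimal sketch of a pivot with pandas, using invented sales figures; the same numbers are simply re-presented with a row dimension rotated into the columns.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "item": ["Mobile", "Modem", "Mobile", "Modem"],
    "units": [605, 825, 400, 512],
})

# Cities as rows, items as columns ...
view1 = sales.pivot_table(index="city", columns="item", values="units")
# ... and the rotated presentation: items as rows, cities as columns.
view2 = sales.pivot_table(index="item", columns="city", values="units")

print(view1)
print(view2)
```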
Other OLAP operations may include ranking the top-N or bottom-N elements in lists, as well as computing moving averages, growth rates, interest, internal rates of return, depreciation, currency conversions, and statistical tasks.
Data Processing
Data processing is carried out automatically or manually. Nowadays, most data is processed automatically with the help of computers, which is faster and gives accurate results. Thus, data can be converted into different forms; it can be graphic as well as audio. It depends on the software used as well as the data processing methods.
After that, the data collected is processed and then translated into a desirable form as per
requirements, useful for performing tasks. The data is acquired from Excel files,
databases, text file data, and unorganized data such as audio clips, images, GPRS, and
video clips.
Data processing is crucial for organizations to create better business strategies and
increase their competitive edge. By converting the data into a readable format like graphs,
charts, and documents, employees throughout the organization can understand and use
the data.
The most commonly used tools for data processing are Storm, Hadoop, HPCC, Statwing, Qubole, and CouchDB. The processing of data is a key step of the data mining process. Processing raw data directly is a complicated task, and the results can be misleading. Therefore, it is better to process data before analysis. Data processing generally follows a cycle of stages, such as:
1. Data Collection
The collection of raw data is the first step of the data processing cycle. The raw data
collected has a huge impact on the output produced. Hence, raw data should be gathered
from defined and accurate sources so that the subsequent findings are valid and usable.
Raw data can include monetary figures, website cookies, profit/loss statements of a
company, user behavior, etc.
2. Data Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw data to
remove unnecessary and inaccurate data. Raw data is checked for errors, duplication,
miscalculations, or missing data and transformed into a suitable form for further analysis
and processing. This ensures that only the highest quality data is fed into the processing
unit.
3. Data Input
In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner, or any
other input source.
4. Data Processing
In this step, the raw data is subjected to various data processing methods using machine
learning and artificial intelligence algorithms to generate the desired output. This step may
vary slightly from process to process depending on the source of data being processed
(data lakes, online databases, connected devices, etc.) and the intended use of the output.
5. Data Output
The data is finally transmitted and displayed to the user in a readable form like graphs, tables, vector files, audio, video, documents, etc. This output can be stored and further processed in the next data processing cycle.
6. Data Storage
The last step of the data processing cycle is storage, where data and metadata are stored for further use. This allows quick access and retrieval of information whenever needed. Effective data storage is also necessary for compliance with data protection legislation such as GDPR.
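As a hedged, minimal illustration of the cycle above (the fields and output file name are hypothetical), the stages can be strung together in a few lines of pandas.

```python
import pandas as pd

# 1-3: collection and input of raw records (here, a small in-memory sample)
raw = pd.DataFrame({
    "customer": ["A", "B", "B", None],
    "amount":   [120.0, 80.0, 80.0, 55.0],
})

prepared = raw.dropna().drop_duplicates()                                 # 2: preparation / cleaning
summary = prepared.groupby("customer", as_index=False)["amount"].sum()   # 4: processing
print(summary)                                                            # 5: output in readable form
summary.to_csv("customer_totals.csv", index=False)                        # 6: storage for later use
```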
Why Should We Use Data Processing?
In the modern era, most work relies on data. Therefore, large amounts of data are collected for different purposes, such as academic and scientific research, institutional use, personal and private use, commercial purposes, and more. Processing this collected data is essential so that it goes through all the above steps and gets sorted, stored, filtered, presented in the required format, and analyzed.
The amount of time consumed and the intricacy of processing will depend on the required
results. In situations where large amounts of data are acquired, the necessity of processing
to obtain authentic results with the help of data processing in data mining and data
processing in data research is inevitable.
Manual Data Processing
Data is processed manually in this method. The entire procedure of data collection, filtering, sorting, calculation, and other logical operations is carried out with human intervention, without using any electronic device or automation software. It is a low-cost methodology and does not need many tools. However, it produces high error rates and requires high labor costs and lots of time.
Mechanical Data Processing
Data is processed mechanically through the use of devices and machines. These can include simple devices such as calculators, typewriters, and printing presses. Simple data processing operations can be achieved with this method. It has far fewer errors than manual data processing, but the increase in data has made this method more complex and difficult.
Electronic Data Processing
Data is processed with modern technologies using data processing software and programs. The software gives a set of instructions to process the data and yield output. This method is the most expensive but provides the fastest processing speed with the highest reliability and accuracy of output.
Types of Data Processing
There are different types of data processing based on the source of data and the steps
taken by the processing unit to generate an output. There is no one-size-fits-all method that can be used for processing raw data.
1. Batch Processing: In this type of data processing, data is collected and processed in
batches. It is used for large amounts of data. For example, the payroll system.
2. Single User Programming Processing: It is usually done by a single person for his
personal use. This technique is suitable even for small offices.
4. Real-time Processing: This technique facilitates the user to have direct contact with the
computer system. This technique eases data processing. This technique is also known as
the direct mode or the interactive mode technique and is developed exclusively to perform
one task. It is a sort of online processing, which always remains under execution. For
example, withdrawing money from ATM.
5. Online Processing: This technique facilitates the entry and execution of data directly; so, it
does not store or accumulate first and then process. The technique is developed to reduce
the data entry errors, as it validates data at various points and ensures that only corrected
data is entered. This technique is widely used for online applications. For example, barcode
scanning.
6. Time-sharing Processing: This is another form of online data processing that facilitates
several users to share the resources of an online computer system. This technique is
adopted when results are needed swiftly. Moreover, as the name suggests, this system is
time-based. Following are some of the major advantages of time-sharing processing, such
as:
o Several users can be served simultaneously.
o All the users have an almost equal amount of processing time.
o There is a possibility of interaction with the running programs.
7. Distributed Processing: This is a specialized data processing technique in which various
computers (located remotely) remain interconnected with a single host computer making
a network of computers. All these computer systems remain interconnected with a
high-speed communication network. However, the central computer system maintains the
master database and monitors accordingly. This facilitates communication between
computers.
An example of data processing is stock trading software that converts millions of stock data points into a simple graph.
The complexity of this process is subject to the scope of data collection and the complexity
of the required results. Whether this process is time-consuming depends on steps, which
need to be made with the collected data and the type of output file desired to be received.
This issue becomes significant when there is a need to process large amounts of data. Therefore, data mining is widely used nowadays.
When data is gathered, there is a need to store it. The data can be stored in physical form
using paper-based documents, laptops and desktop computers, or other data storage
devices. With the rise and rapid development of such things as data mining and big data,
the process of data collection becomes more complicated and time-consuming. It is
necessary to carry out many operations to conduct thorough data analysis.
At present, data is stored in a digital form for the most part. It allows processing data faster
and converting it into different formats. The user has the possibility to choose the most
suitable output.
Data Cleaning
Generally, data cleaning reduces errors and improves data quality. Correcting errors in data and eliminating bad records can be a time-consuming and tedious process, but it cannot be ignored. Data mining is a key technique for data cleaning, as it is a technique for discovering interesting information in data. Data quality mining is a recent approach applying data mining techniques to identify and recover data quality problems in large databases. Data mining automatically extracts hidden and intrinsic information from collections of data and offers various techniques that are suitable for data cleaning.
In most cases, data cleaning in data mining can be a laborious process and typically
requires IT resources to help in the initial step of evaluating your data because data
cleaning before data mining is so time-consuming. But without proper data quality, your
final analysis will suffer inaccuracy, or you could potentially arrive at the wrong conclusion.
Steps of Data Cleaning
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to cleaning your data, such as:
1. Remove duplicate or irrelevant observations
For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient, minimize distraction from your primary target, and create a more manageable and performant dataset.
2. Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find both "N/A" and "Not Applicable" in a sheet, but they should be analyzed as the same category.
3. Filter unwanted outliers
Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data entry, doing so will help the performance of the data you are working with.
However, sometimes, the appearance of an outlier will prove a theory you are working on.
And just because an outlier exists doesn't mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or
is a mistake, consider removing it.
4. Handle missing data
You can't ignore missing data because many algorithms will not accept missing values.
There are a few ways to deal with missing data. None of them is optimal, but they can be considered, such as:
o You can drop observations with missing values, but this will drop or lose information, so be careful before removing them.
o You can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data because you may be operating from assumptions and not actual observations.
o You might alter how the data is used to navigate null values effectively.
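A small sketch of the first two options with pandas on a toy DataFrame; filling with the column mean is only one possible imputation rule, chosen here for illustration rather than as a recommendation.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 40], "income": [52000, 48000, None, 61000]})

dropped = df.dropna()                             # option 1: drop rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # option 2: impute with column means (an assumption)

print(dropped)
print(imputed)
```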
5. Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as
a part of basic validation, such as:
o Does the data make sense?
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory?
o Can you find trends in the data to help you form your next theory?
o If not, is that because of a data quality issue?
Because of incorrect or noisy data, false conclusions can inform poor business strategy
and decision-making. False conclusions can lead to an embarrassing moment in a
reporting meeting when you realize your data doesn't stand up to scrutiny. Before you get
there, it is important to create a culture of quality data in your organization. To do this,
you should document the tools you might use to create this strategy.
2. Fill the missing value: This approach is also not very effective or feasible. Moreover, it can
be a time-consuming method. In this approach, one has to fill in the missing value. This is
usually done manually, but it can also be done by attribute mean or using the most
probable value.
3. Binning method: This approach is very simple to understand. The smoothing of sorted
data is done using the values around it. The data is then divided into several segments of
equal size. After that, the different methods are executed to complete the task.
4. Regression: The data is made smooth with the help of using the regression function. The
regression can be linear or multiple. Linear regression has only one independent variable,
and multiple regressions have more than one independent variable.
5. Clustering: This method mainly operates on the group. Clustering groups the data in a
cluster. Then, the outliers are detected with the help of clustering. Next, the similar values
are then arranged into a "group" or a "cluster".
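A hedged sketch of the binning method with pandas, using invented price values: the sorted values are split into equal-size (equal-frequency) segments, and each segment is smoothed by replacing its members with the segment mean.

```python
import pandas as pd

df = pd.DataFrame({"price": sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])})

df["bin"] = pd.qcut(df["price"], q=3, labels=False, duplicates="drop")   # 3 equal-size segments
df["smoothed"] = df.groupby("bin")["price"].transform("mean")            # smoothing by bin means

print(df)
```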
1. Monitoring the errors: Keep a record of where the most mistakes arise. This will make it easier to identify and correct false or corrupt information, which is especially necessary when integrating another possible alternative with established management software.
2. Standardize the mining process: Standardize the point of insertion to assist and reduce the chances of duplicity.
3. Validate data accuracy: Analyze and invest in data tools to clean the records in real time. Tools that use Artificial Intelligence can better examine the data for correctness.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing data. Repeatedly processing the same data can be avoided by analyzing and investing in separate data-erasing tools that can analyze raw data in bulk and automate the operation.
5. Research on data: Before this activity, our data must be standardized, validated, and scrubbed for duplicates. There are many approved and authorized third-party sources that can capture information directly from our databases. They help us clean and compile the data to ensure completeness, accuracy, and reliability for business decision-making.
6. Communicate with the team: Keeping the team in the loop will assist in developing and strengthening client relationships and sending more targeted data to prospective customers.
o Data Integration: Since it is difficult to ensure quality in low-quality data, data integration
has an important role in solving this problem. Data Integration is the process of combining
data from different data sets into a single one. This process uses data cleansing tools to
ensure that the embedded data set is standardized and formatted before moving to the
final destination.
o Data Migration: Data migration is the process of moving one file from one system to
another, one format to another, or one application to another. While the data is on the
move, it is important to maintain its quality, security, and consistency, to ensure that the
resultant data has the correct format and structure, without any defects, at the destination.
o Data Transformation: Before the data is uploaded to a destination, it needs to be
transformed. This is only possible through data cleaning, which considers the system criteria
of formatting, structuring, etc. Data transformation processes usually include using rules
and filters before further analysis. Data transformation is an integral part of most data
integration and data management processes. Data cleansing tools help to clean the data
using the built-in transformations of the systems.
o Data Debugging in ETL Processes: Data cleansing is crucial to preparing data during
extract, transform, and load (ETL) for reporting and analysis. Data cleansing ensures that
only high-quality data is used for decision-making and analysis.
For example, a retail company receives data from various sources, such as CRM or ERP
systems, containing misinformation or duplicate data. A good data debugging tool would
detect inconsistencies in the data and rectify them. The purged data will be converted to
a standard format and uploaded to a target database.
Characteristics of Data Cleaning
Data cleaning is mandatory to guarantee the business data's accuracy, integrity, and
security. Based on the qualities or characteristics of data, these may vary in quality. Here
are the main points of data cleaning in data mining:
o Accuracy: All the data that make up a database within the business must be highly accurate. One way to corroborate their accuracy is by comparing them with different sources. If the source is not found or has errors, the stored information will have the same problems.
o Coherence: The data must be consistent with each other, so you can be sure that the information about an individual or body is the same in the different forms of storage used.
o Validity: The stored data must follow certain regulations or established restrictions. Likewise, the information has to be verified to corroborate its authenticity.
o Uniformity: The data that make up a database must have the same units or values. It is an essential aspect when carrying out the data cleansing process since it does not increase the complexity of the procedure.
o Data Verification: The process must be verified at all times, both for the appropriateness and the effectiveness of the procedure. Such verification is carried out through various iterations of the study, design, and validation stages. The drawbacks often become evident after the data has gone through a certain number of changes.
o Clean Data Backflow: After eliminating quality problems, the already clean data must replace the data not located in the original source, so that legacy applications obtain the benefits of the clean data, obviating the need for data cleaning actions afterward.
Data Cleaning Tools
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
9. TIBCO Clarity
10. Winpure
o Removal of errors when multiple sources of data are at play.
o Fewer errors make for
Data Transformation
Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline
for data analytics projects. Organizations that use on-premises data warehouses generally
use an ETL (extract, transform, and load) process, in which data transformation is the
middle step. Today, most organizations use cloud-based data warehouses to scale
compute and storage resources with latency measured in seconds or minutes. The
scalability of the cloud platform lets organizations skip preload transformations and load
raw data into the data warehouse, then transform it at query time.
Data integration, migration, data warehousing, and data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic
processes, and it enables businesses to make better data-driven decisions. During the data
transformation process, an analyst will determine the structure of the data. This could
mean that data transformation may be:
o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or any other noise form.
The concept behind data smoothing is that it will be able to identify simple changes to
help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
We have seen how the noise is removed from the data using the techniques such as
binning, regression, clustering.
o Binning: This method splits the sorted data into the number of bins and smoothens the
data values in each bin considering the neighborhood values around it.
o Regression: This method identifies the relation among two dependent attributes so that if
we have one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values into clusters. The values that lie outside a cluster are known as outliers.
2. Attribute Construction
In the attribute construction method, new attributes are constructed from the existing attributes to ease data mining. New attributes are created from the given attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
For example, suppose we have a data set referring to measurements of different plots, i.e.,
we may have the height and width of each plot. So here, we can construct a new attribute
'area' from the attributes 'height' and 'width'. This also helps understand the relations among the attributes in a data set.
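A minimal sketch of attribute construction with pandas, using invented plot measurements; the new attribute 'area' is derived from the existing 'height' and 'width'.

```python
import pandas as pd

plots = pd.DataFrame({"plot": ["P1", "P2", "P3"],
                      "height": [10, 12, 8],
                      "width":  [5, 7, 9]})

plots["area"] = plots["height"] * plots["width"]   # constructed attribute
print(plots)
```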
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data
sources into a data analysis description. This is a crucial step since the accuracy of data
analysis insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to
produce relevant results. The collection of data is useful for everything from decisions
concerning financing or business strategy of the product, pricing, operations, and
marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales
of each year. We can aggregate the data to get the enterprise's annual sales report.
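A small pandas sketch of this aggregation, with invented quarterly figures rolled up into annual totals.

```python
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [200, 250, 230, 300, 260, 270, 240, 320],
})

# Aggregate quarterly sales into an annual sales report.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```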
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range, such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below. Consider that we have a numeric attribute A with n observed values V1, V2, V3, ..., Vn.
o Min-max normalization: This method maps a value v of A to a value v' in the new range [new_min_A, new_max_A] using the formula:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
For example, suppose we have $12,000 and $98,000 as the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600. Using min-max normalization, $73,600 is transformed to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716.
o Z-score normalization: This method normalizes the value for attribute A using the mean and standard deviation:
v' = (v - Ā) / σ_A
Here Ā and σ_A are the mean and standard deviation of attribute A, respectively. For example, if the mean and standard deviation for attribute A are $54,000 and $16,000, the value $73,600 is normalized using z-score normalization to (73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the value. The movement of the decimal point depends on the maximum absolute value of A:
v' = v / 10^j
Here j is the smallest integer such that max(|v'|) < 1. For example, the observed values for attribute A range from -986 to 917, so the maximum absolute value of A is 986. To normalize each value of attribute A using decimal scaling, we divide each value by 1000, i.e., j = 3. So the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917. Normalization parameters such as the mean, standard deviation, and maximum absolute value must be preserved to normalize future data uniformly.
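A hedged NumPy sketch of the three normalization methods, reusing the worked figures from the text.

```python
import numpy as np

income = np.array([12000.0, 73600.0, 98000.0])

# Min-max normalization to the new range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min()) * (1.0 - 0.0) + 0.0
print(min_max)            # 73,600 maps to about 0.716

# Z-score normalization with the stated mean ($54,000) and std ($16,000)
z_score = (income - 54000) / 16000
print(z_score)            # 73,600 maps to 1.225

# Decimal scaling for values in the range -986 .. 917: divide by 10**j so that
# the largest absolute value becomes less than 1 (here j = 3)
a = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(a).max())))
print(a / 10**j)          # [-0.986, 0.917]
```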
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to study
and analyze. If a data mining task handles a continuous attribute, then its discrete values
can be replaced by constant quality attributes. This improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset
into a set of categorical data. Discretization also uses decision tree-based algorithms to
produce short, compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, where the class information is used, and unsupervised discretization, where it is not. Based on the direction in which the process proceeds, discretization follows either a 'top-down splitting strategy' or a 'bottom-up merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such
as (0-10, 11-20…) or (kid, youth, adult, senior).
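A minimal sketch of unsupervised discretization with pandas, where invented age values are replaced first by interval labels and then by concept labels; the bin edges are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 37, 45, 63, 71])

intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])                 # interval labels
categories = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                    labels=["kid", "youth", "adult", "senior"])          # concept labels

print(pd.DataFrame({"age": ages, "interval": intervals, "category": categories}))
```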
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer picture of the data. Data generalization can be divided into two approaches: the data cube (OLAP) approach and the attribute-oriented induction (AOI) approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a
higher conceptual level into a categorical value (young, old).
Data Transformation Process
The entire process of transforming data is known as ETL (Extract, Transform, and Load). Through the ETL process, analysts can convert data to its desired format. Here are the
steps involved in the data transformation process:
1. Data Discovery: During the first stage, analysts work to understand and identify data in its
source format. To do this, they will use data profiling tools. This step helps analysts decide
what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and ripple
through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source. These
may include structured sources such as databases or streaming sources such as customer
log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to write code to complete the transformation. Often, analysts generate this code with the help of data transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has
been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target might
be a data warehouse or a database that handles both structured and unstructured data.
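A hedged, minimal end-to-end sketch of these steps in Python; the source data, column names, target database file, and table name are all hypothetical.

```python
import sqlite3
import pandas as pd

# Extraction result (here a small in-memory sample standing in for a real source).
raw = pd.DataFrame({"name": [" Alice ", "bob", "Bob"],
                    "amount": ["100", "250", "250"]})

# Transformation: fix structure and types according to the mapping decided earlier.
clean = raw.assign(name=raw["name"].str.strip().str.title(),
                   amount=raw["amount"].astype(float)).drop_duplicates()

# Sending/loading: write the transformed data to a target database.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("payments", conn, if_exists="replace", index=False)

# Review: read the table back to confirm the formatting survived the load.
with sqlite3.connect("warehouse.db") as conn:
    print(pd.read_sql("SELECT * FROM payments", conn))
```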
o Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
o Perform Faster Queries: You can quickly and easily retrieve transformed data because it is stored and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and more sources. If there are inconsistencies in the metadata, it can be challenging to organize and understand it. Data transformation refines your metadata, so it's easier to organize and understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of that data sits around unanalyzed. Transformation makes it easier to get the most out of your data by standardizing it and making it more usable.
o Data transformation can be expensive. The cost is dependent on the specific infrastructure, software, and tools used to process data. Expenses may include licensing, computing resources, and hiring necessary personnel.
o Data transformation processes can be resource-intensive. Performing transformations in an on-premises data warehouse after loading, or transforming data before feeding it into applications, can create a computational burden that slows down other operations. If you use a cloud-based data warehouse, you can do the transformations after loading because the platform can scale up to meet demand.
o Lack of expertise and carelessness can introduce problems during transformation. Data analysts without appropriate subject matter expertise are less likely to notice incorrect data because they are less familiar with the range of accurate and permissible values.
o Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.
Data Reduction
Data reduction techniques ensure the integrity of data while reducing the data. Data
reduction is a process that reduces the volume of original data and represents it in a much
smaller volume. Data reduction techniques are used to obtain a reduced representation
of the dataset that is much smaller in volume by maintaining the integrity of the original
data. By reducing the data, the efficiency of the data mining process is improved, which
produces the same analytical results.
Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.
Data reduction aims to define it more compactly. When the data size is smaller, it is simpler
to apply sophisticated and computationally expensive algorithms. The reduction of the
data may be in terms of the number of rows (records) or terms of the number of columns
(dimensions).
1. Dimensionality Reduction
Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes required for our analysis. Dimensionality reduction eliminates such attributes from the data set under
consideration, thereby reducing the volume of original data. It reduces data size as it
eliminates outdated or redundant features. Here are three methods of dimensionality
reduction.
ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis searches for k n-dimensional orthogonal vectors (the principal components), with k ≤ n, that can best represent the data set.
In this way, the original data can be projected onto a much smaller space, and dimensionality reduction can be achieved. Principal component analysis can be applied to sparse and skewed data.
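A minimal NumPy sketch of principal component analysis on random toy data: the data is centered, the principal directions are obtained from an SVD, and only the first k components are kept.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 tuples, n = 5 attributes (toy data)

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                    # keep k principal components
X_reduced = X_centered @ Vt[:k].T        # data projected onto a much smaller space
print(X_reduced.shape)                   # (100, 2)
```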
iii. Attribute Subset Selection: The large data set has many attributes, some of which are
irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating these redundant and irrelevant attributes.
The attribute subset selection ensures that we get a good subset of original attributes even
after eliminating the unwanted attributes. The resulting probability of data distribution is
as close as possible to the original data distribution using all the attributes.
2. Numerosity Reduction
The numerosity reduction reduces the original data volume and represents it in a much
smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.
c. Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. Data reduction can then be applied by taking a simple random sample without replacement (SRSWOR) of s of these clusters, where s < M.
d. Stratified sample: The large data set D is partitioned into mutually disjoint
sets called 'strata'. A simple random sample is taken from each stratum to
get stratified data. This method is effective for skewed data.
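A small pandas sketch of simple random and stratified sampling on an invented data set D (assuming a reasonably recent pandas version that supports sampling on grouped frames).

```python
import pandas as pd

D = pd.DataFrame({"id": range(100),
                  "stratum": ["low"] * 70 + ["mid"] * 20 + ["high"] * 10})

# Simple random sample without replacement (SRSWOR) of size s = 10
srswor = D.sample(n=10, replace=False, random_state=1)

# Stratified sample: a simple random sample drawn from each stratum
stratified = D.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=1)

print(len(srswor), len(stratified))
```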
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to
represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year
2018 to the year 2022. If you want to get the annual sale per year, you just have to
aggregate the sales per quarter for each year. In this way, aggregation provides you with
the required data, which is much smaller in size, and thereby we achieve data reduction
even without losing any data.
4. Data Compression
This technique reduces the size of the files using different encoding mechanisms, such as
Huffman Encoding and run-length Encoding. We can divide it into two types based on
their compression techniques.
i. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and
minimal data size reduction. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.
ii. Lossy Compression: In lossy-data compression, the decompressed data may differ from
the original data but are useful enough to retrieve information from them. For example,
the JPEG image format is a lossy compression, but we can find the meaning equivalent to
the original image. Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this kind of compression.
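A minimal sketch of lossless compression via Run-Length Encoding in plain Python; the encoded form stores each run of repeated symbols once with its length, and the exact original string is restored on decoding.

```python
from itertools import groupby

def rle_encode(data):
    # Store each run of repeated symbols as (symbol, run length).
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    # Expand each (symbol, count) pair back into the original run.
    return "".join(symbol * count for symbol, count in pairs)

encoded = rle_encode("AAAABBBCCDAA")
print(encoded)                                  # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(encoded) == "AAAABBBCCDAA"    # lossless: the original is recovered exactly
```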
5. Discretization Operation
The data discretization technique is used to divide the attributes of the continuous nature
into data with intervals. We replace many constant values of the attributes with labels of
small intervals. This means that mining results are shown in a concise and easily
understandable way.
ii. Bottom-up discretization: If we first consider all the constant values as split points and then discard some of them by merging neighborhood values into intervals, the process is called bottom-up discretization.
Data reduction greatly increases the efficiency of a storage system and directly impacts
your total spending on capacity.
Binning
Binning refers to a data smoothing technique that helps to group a huge number of
continuous values into a smaller number of values. This technique can also be used for data discretization and the development of concept hierarchies.
Data discretization using decision tree analysis
This approach to data discretization uses decision tree analysis, in which a top-down splitting technique is applied. It is a supervised procedure. To discretize a numeric attribute, first select the split point that has the least entropy, and then run it through a recursive process. The recursive process divides the attribute into various discretized disjoint intervals, from top to bottom, using the same splitting criterion.
Data discretization using correlation analysis
Discretizing data by a linear regression technique, you can get the best neighboring interval, and then the large intervals are combined to develop a larger overlap to form the final overlapping intervals. It is a supervised procedure.
Concept Hierarchy
Let's understand the concept hierarchy for the dimension location with the help of an example.
A particular city can map to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts at the top with some general information and ends at the bottom with the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized information and ends at the top with the generalized information.
Data discretization and binarization in data mining
Data discretization is a method of converting attributes values of continuous data into a
finite set of intervals with minimum data loss. In contrast, data binarization is used to
transform the continuous and discrete attributes into binary attributes.
UNIT-II
What is Data Mining?
The process of extracting information to identify patterns, trends, and useful data that
would allow the business to take the data-driven decision from huge sets of data is called
Data Mining.
In other words, we can say that Data Mining is the process of investigating hidden patterns of information from various perspectives and categorizing it into useful data, which is collected and assembled in particular areas such as data warehouses, analyzed efficiently with data mining algorithms, and used to support decision making and other data requirements, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching for large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining utilizes
complex mathematical algorithms for data segments and evaluates the probability of
future events. Data Mining is also called Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such
as text mining, web mining, audio and video mining, pictorial data mining, and social
media mining. It is done through software that is simple or highly specific. By outsourcing
data mining, all the work can be done faster with low operation costs. Specialized firms
can also use new technologies to collect data that is impossible to locate manually. There
are tons of information available on various platforms, but very little knowledge is
accessible. The biggest challenge is to analyze the data to extract important information
that can be used to solve a problem or for company development. There are many
powerful instruments and techniques available to mine data and find better insight from
it.
Data warehouses: A Data Warehouse is the technology that collects the data from various
sources within the organization to provide meaningful business insights. The huge amount
of data comes from multiple places such as Marketing and Finance. The extracted data is
utilized for analytical purposes and helps in decision- making for a business organization.
The data warehouse is designed for the analysis of data rather than transaction processing.
Data Repositories: The Data Repository generally refers to a destination for data storage.
However, many IT professionals utilize the term more clearly to refer to a specific kind of
setup within an IT structure. For example, a group of databases, where an organization has
kept various kinds of information.
These are the following areas where data mining is widely used:
Data Mining in Healthcare: Data mining in healthcare has excellent potential to improve
the health system. It uses data and analytics for better insights and to identify best
practices that will enhance health care services and reduce costs. Analysts use data mining
approaches such as Machine learning, Multi-dimensional database, Data visualization, Soft
computing, and statistics. Data Mining can be used to forecast patients in each category.
The procedures ensure that the patients get intensive care at the right place and at the
right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Market Basket Analysis: Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, then you are more likely to buy another group of products. This technique may enable the retailer to understand
the purchase behavior of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly. Different analytical comparisons of results can be made between various stores and between customers in different demographic groups.
Data mining in Education: Education data mining is a newly emerging field, concerned
with developing techniques that explore knowledge from the data generated from
educational Environments. EDM objectives are recognized as affirming student's future
learning behavior, studying the impact of educational support, and promoting learning
science. An organization can use data mining to make precise decisions and also to predict
the results of the student. With the results, the institution can concentrate on what to teach
and how to teach.
Data Mining in Fraud Detection: Billions of dollars are lost to fraud. Traditional methods of fraud detection are time-consuming and complex. Data mining provides meaningful patterns and turns data into information.
fraud detection system should protect the data of all the users. Supervised methods
consist of a collection of sample records, and these records are classified as fraudulent or
non-fraudulent. A model is constructed using this data, and the technique is made to
identify whether the document is fraudulent or not.
Data Mining in Lie Detection: Apprehending a criminal is not a big deal, but bringing out
the truth from him is a very challenging task. Law enforcement may use data mining
techniques to investigate offenses, monitor suspected terrorist communications, etc. This
technique includes text mining also, and it seeks meaningful patterns in data, which is
usually unstructured text. The information collected from the previous investigations is
compared, and a model for lie detection is constructed.
Data Mining Financial Banking: The Digitalization of the banking system is supposed to
generate an enormous amount of data with every new transaction. The data mining
technique can help bankers by solving business-related problems in banking and finance
by identifying trends, causalities, and correlations in business information and market costs that are not instantly evident to managers or executives because the data volume is too large or the data is produced too rapidly for experts to analyze. Managers may use these findings for better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.
Performance: The data mining system's performance relies primarily on the efficiency of
algorithms and techniques used. If the designed algorithm and techniques are not up to
the mark, then the efficiency of the data mining process will be affected adversely.
Data Privacy and Security: Data mining usually leads to serious issues in terms of data
security, governance, and privacy. For example, if a retailer analyzes the details of the
purchased items, then it reveals data about buying habits and preferences of the
customers without their permission.
Data Visualization: In data mining, data visualization is a very important process because
it is the primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express. But many
times, representing the information to the end-user in a precise and easy way is difficult.
With the input data and the output information being complicated, very efficient and successful data visualization processes need to be implemented to make it successful.
What is KDD?
KDD is a computer science field specializing in extracting previously unknown and
interesting information from raw data. KDD is the whole process of trying to make sense
of data by developing appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract, and useful. This is
achieved by creating short reports, modeling the process of generating data, and
developing predictive models that can predict future cases.
Due to the exponential growth of data, especially in areas such as business, KDD has
become a very important process to convert this large wealth of data into business
intelligence, as manual extraction of patterns has become seemingly impossible in the
past few decades.
KDD is the overall process of extracting knowledge from data, while Data Mining is a step
inside the KDD process, which deals with identifying patterns in data.
And Data Mining is only the application of a specific algorithm based on the overall goal
of the KDD process.
KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed to get different and more
appropriate results.
1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. This data mining technique helps to classify data in different classes.
Data mining techniques can be classified by different criteria, as follows:
i. Classification of data mining frameworks as per the type of data sources mined: This classification is as per the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved: This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered: This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be extensive frameworks offering a few data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used: This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:
Clustering is a division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications, for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify
similar data. This technique helps to recognize the differences and similarities between
the data. Clustering is very similar to the classification, but it involves grouping chunks of
data together based on their similarities.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a
hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have various data, For example, a list of grocery
items that you have been buying for the last six months. It calculates a percentage of
items being purchased together.
These are the three major measurement techniques:
o Support: This measurement technique measures how often items A and B are purchased together, compared to the overall dataset. Support(A, B) = (transactions containing both Item A and Item B) / (Entire dataset).
o Confidence: This measurement technique measures how often item B is purchased when item A is purchased as well. Confidence(A -> B) = (transactions containing both Item A and Item B) / (transactions containing Item A).
o Lift: This measurement technique measures the observed confidence relative to how often item B is purchased overall. Lift(A -> B) = Confidence(A -> B) / ((transactions containing Item B) / (Entire dataset)).
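The three measures above can be computed directly from transaction counts. The following sketch uses a small, invented list of grocery transactions and the hypothetical rule {bread} -> {butter}; it is only meant to illustrate the formulas.

```python
# A small sketch computing support, confidence, and lift for the rule
# {bread} -> {butter} over a hypothetical list of grocery transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

count_a = sum(1 for t in transactions if "bread" in t)             # item A
count_b = sum(1 for t in transactions if "butter" in t)            # item B
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = count_ab / n              # how often A and B appear together
confidence = count_ab / count_a     # how often B appears when A appears
lift = confidence / (count_b / n)   # confidence relative to B's base rate

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```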
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets have outliers.
Outlier detection plays a significant role in the data mining field. Outlier detection is
valuable in numerous fields like network interruption identification, credit or debit card
fraud detection, detecting outlying in wireless sensor network data, etc.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over some time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
1. Classification
2. Prediction
We use classification and prediction to extract a model, representing the data classes to
predict future data trends. Classification predicts the categorical labels of data with the
prediction models. This analysis provides us with the best understanding of the data at a
large scale.
Classification models predict categorical class labels, and prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is Classification?
Classification is to identify the category or the class label of a new observation. First, a set
of data is used as training data. The set of input data and the corresponding outputs are
given to the algorithm. So, the training data set includes the input data and their
associated class labels. Using the training dataset, the algorithm derives a model or the
classifier. The derived model can be a decision tree, mathematical formula, or a neural
network. In classification, when unlabeled data is given to the model, it should find the
class to which it belongs. The new data provided to the model is the test data set.
Classification is the process of classifying a record. One simple example of classification is
to check whether it is raining or not. The answer can either be yes or no. So, there is a
particular number of choices. Sometimes there can be more than two classes to classify.
That is called multiclass classification.
The bank needs to analyze whether giving a loan to a particular customer is risky or not.
For example, based on observable data for multiple loan borrowers, a classification model
may be established that forecasts credit risk. The data could track job records,
homeownership or leasing, years of residency, number, type of deposits, historical credit
ranking, etc. The goal would be credit ranking, the predictors would be the other
characteristics, and the data would represent a case for each consumer. In this example, a
model is constructed to find the categorical label. The labels are risky or safe.
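A hedged sketch of this loan scenario is shown below: it trains a decision-tree classifier (one possible choice of model, via scikit-learn) on a handful of invented applicant records and then labels a new applicant as risky or safe.

```python
# A hedged sketch of the loan example: training a decision-tree classifier
# on hypothetical applicant features to predict a "risky"/"safe" label.
from sklearn.tree import DecisionTreeClassifier

# Columns: [income (in $1000s), years_of_residency, historical_credit_rank]
X_train = [
    [25, 1, 2], [60, 5, 8], [40, 3, 5],
    [90, 10, 9], [30, 2, 3], [75, 7, 7],
]
y_train = ["risky", "safe", "risky", "safe", "risky", "safe"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Classify a new, previously unseen applicant (the "test" record).
print(clf.predict([[55, 4, 6]]))   # e.g. ['safe']
```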
1. Developing the Classifier or model creation: This level is the learning stage or the
learning process. The classification algorithms construct the classifier in this stage.
A classifier is constructed from a training set composed of the records of databases and their corresponding class names. Each record that makes up the training set belongs to a category or class. These records may also be referred to as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this level. The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is considered acceptable, the classification rules can be extended to cover new data records. Applications include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. We can build
sentiment analysis models that read and analyze even misspelled words using advanced machine learning algorithms. Accurately trained models provide consistently accurate outcomes and deliver results in a fraction of the time.
o Document Classification: We can use document classification to organize
the documents into sections according to the content. Document
classification refers to text classification; we can classify the words in the
entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
o Image Classification: Image classification assigns an image to one of a set of trained categories. These could be based on the caption of the image, a statistical value, or a theme. You can tag images to train your model for relevant categories by applying supervised learning algorithms.
o Machine Learning Classification: It uses statistically demonstrable algorithm rules to execute analytical tasks that would take humans hundreds more hours to perform.
3. Data Classification Process: The data classification process can be categorized into
five steps:
o Create the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential data that we store.
o Label the data by applying data labelling.
o Use the outcomes to improve protection and compliance.
o Data is complex, so classification must be a continuous process.
3. Storage: Here, we store the obtained data, applying access controls and encryption.
6. Publication: Through the publication of data, it can reach customers. They can then view and download it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. Same
as in classification, the training dataset contains the inputs and corresponding numerical
output values. The algorithm derives the model or a predictor according to the training
dataset. The model should find a numerical output when the new data is given. Unlike in
classification, this method does not have a class label. The model predicts a
continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts such as the number of rooms, the total area, etc., is an example of prediction.
For example, suppose the marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case, we need to forecast a numerical value, so this data processing activity is an example of numeric prediction. Here, a model or a predictor will be developed that forecasts a continuous-valued or ordered-value function.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following methods (a small normalization sketch appears after this block).
o Normalization: The data is transformed using normalization. Normalization involves scaling all values for a given attribute to make them fall within a small specified range. Normalization is used when neural networks or methods involving measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to the higher
concept. For this purpose, we can use the concept hierarchies.
NOTE: Data can also be reduced by some other methods such as wavelet
transformation, binning, histogram analysis, and clustering.
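As a small illustration of the normalization idea mentioned above, the sketch below applies min-max scaling, one common normalization method (the text does not fix a specific formula), to an invented list of income values so that they fall in the range [0, 1].

```python
# A minimal sketch of min-max normalization: rescale an attribute so all
# of its values fall within [0, 1]. The income figures are invented.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 35000, 58000, 73000, 98000]
print(min_max_normalize(incomes))
# [0.0, 0.267..., 0.534..., 0.709..., 1.0]
```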
o Accuracy: The accuracy of the classifier can be referred to as the ability of the
classifier to predict the class label correctly, and the accuracy of the predictor can
be referred to as how well a given predictor can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of data mining, robustness is the ability of the classifier or predictor to make correct predictions from incoming unknown data.
o Scalability: Scalability refers to how well the performance of the classifier or predictor holds up as the amount of given data increases or decreases.
o Interpretability: Interpretability is how readily we can understand the reasoning
behind predictions or classification made by the predictor or classifier.
Classification vs. Prediction:
• Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
• In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.
• In classification, the model can be known as the classifier. In prediction, the model can be known as the predictor.
• In classification, a model or classifier is constructed to find the categorical labels. In prediction, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
• For example, the grouping of patients based on their medical records can be considered a classification, whereas predicting the correct treatment for a particular disease for a person can be thought of as a prediction.
Parametric Methods vs. Non-Parametric Methods:
• Parametric methods use a fixed number of parameters to build the model, whereas non-parametric methods use a flexible number of parameters.
• Parametric methods are applicable only for variables, whereas non-parametric methods are applicable for both variables and attributes.
• Parametric methods make strong assumptions about the data, whereas non-parametric methods generally make fewer assumptions about the data.
• Parametric methods require less data than non-parametric methods.
• Parametric methods assume a normal distribution, whereas non-parametric methods assume no particular distribution.
• Parametric methods handle interval or ratio data, whereas non-parametric methods handle nominal or ordinal data.
• With parametric methods, the results or outputs generated can be easily affected by outliers; with non-parametric methods, the results or outputs generated are not seriously affected by outliers.
• Parametric methods can perform well in many situations, but their performance is at its peak when the spread of each group is different; similarly, non-parametric methods can perform well in many situations, but their performance is at its peak when the spread of each group is the same.
• Parametric methods have more statistical power than non-parametric methods.
• As far as computation is concerned, parametric methods are faster than non-parametric methods.
• Examples of parametric methods: Logistic Regression, Naïve Bayes Model, etc. Examples of non-parametric methods: KNN, Decision Tree Model, etc.
Data Mining Algorithms
Data Mining Algorithms are a particular category of algorithms useful for analyzing data and developing data models to identify meaningful patterns. They are part of machine learning algorithms. These algorithms are implemented through various programming languages, such as R and Python, and through data mining tools to derive optimized data models. Some of the popular data mining algorithms are C4.5 for decision trees, k-means for cluster data analysis, the Naive Bayes algorithm, the Support Vector Machine algorithm, and the Apriori algorithm for frequent itemset and association rule mining. These algorithms are part of data analytics implementations for business. They are based upon statistical and mathematical formulas which are applied to the data set.
1. C4.5 Algorithm
C4.5 constructs classifiers, which are important tools in data mining. These systems take inputs from a collection of cases, where each case belongs to one of a small number of classes and is described by its values for a fixed set of attributes. The output classifier can accurately predict the class to which a new case belongs. C4.5 uses decision trees, where the initial tree is acquired by using a divide-and-conquer algorithm. Suppose S is the set of training cases; if all the cases in S belong to the same class, the tree is a leaf labelled with the most frequent class in S. Otherwise, a test based on a single attribute with two or more outcomes is chosen and made the root, with one branch for each outcome of the test. The partitions correspond to subsets S1, S2, etc., one for each outcome, and the procedure is applied recursively to each subset. C4.5 allows for multiway splits. C4.5 also introduces an alternative representation of decision trees, namely rulesets, which consist of a list of rules grouped together for each class. To classify a case, the first rule whose conditions are satisfied determines the class. If the case satisfies no rule, it is assigned a default class. The C4.5 rulesets are formed from the initial decision tree. C4.5 also enhances scalability through multi-threading.
2. The k-means Algorithm
This algorithm is a simple method of partitioning a given data set into a user-specified number of clusters, k. The algorithm works on d-dimensional vectors, D = {xi | i = 1, …, N}, where xi is the i-th data point. The initial cluster seeds are obtained by sampling the data at random, for example by clustering a small subset of the data or by perturbing the global mean of the data k times. The algorithm can be paired with another algorithm to describe non-convex clusters. It creates k groups from the given set of objects and explores the entire data set through its cluster analysis. It is simple and faster than other algorithms when it is used together with them. Although the user must specify the number of clusters, k-means is otherwise unsupervised: it keeps learning without any labelled information. It observes the groups and learns.
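A short, assumed example of running k-means is given below; it uses scikit-learn's KMeans on a few invented two-dimensional points and prints the cluster labels and centroids.

```python
# A short sketch (assumed example, not from the text) running k-means
# on toy two-dimensional points with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one natural group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],     # another natural group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:   ", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)
```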
3. Naive Bayes Algorithm
This algorithm is based on Bayes theorem. This algorithm is mainly used when the
dimensionality of inputs is high. This classifier can easily calculate the next possible
output. New raw data can be added during the runtime, and it provides a better
probabilistic classifier. Each class has a known set of vectors that are used to create a rule that allows the objects to be assigned to classes in the future. The vectors of variables describe the future objects. This is one of the simplest algorithms, as it is easy to construct and does not have any complicated parameter estimation schemes. It can be easily applied to massive data sets as well. Because it does not need any elaborate iterative parameter estimation schemes, even unskilled users can understand why the classifications are made.
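The sketch below shows one hedged way to apply Naive Bayes in practice, using scikit-learn's Gaussian variant on tiny, made-up numeric data; the class names and values are illustrative only.

```python
# A minimal sketch of a Naive Bayes classifier (Gaussian variant from
# scikit-learn) on small, made-up numeric data.
from sklearn.naive_bayes import GaussianNB

X_train = [[1.0, 2.1], [1.2, 1.9], [7.8, 8.1], [8.2, 7.9]]
y_train = ["low", "low", "high", "high"]

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict([[1.1, 2.0], [8.0, 8.0]]))    # expected: ['low' 'high']
print(nb.predict_proba([[1.1, 2.0]]))          # class membership probabilities
```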
4. Support Vector Machines Algorithm
If a user wants robust and accurate methods, then the Support Vector Machines algorithm should be tried. SVMs are mainly used for learning classification, regression, or ranking functions. SVM is based on structural risk minimization and statistical learning theory. The decision boundary, known as a hyperplane, must be identified; it provides the optimal separation of classes. The main job of SVM is to identify the hyperplane that maximizes the margin between the two classes, where the margin is defined as the amount of space between the two classes. A hyperplane function is like an equation for a line, y = mx + b. SVM can be extended to perform numerical (regression) calculations as well. SVM makes use of kernels so that it operates well in higher dimensions. This is a supervised algorithm, and the data set is used first to let the SVM learn about all the classes. Once this is done, the SVM is capable of classifying new data.
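As a brief, assumed illustration, the following sketch trains scikit-learn's SVC with an RBF kernel on two invented classes and classifies new points; it is not meant to reflect any specific application from the text.

```python
# A brief sketch of a support vector classifier (scikit-learn's SVC with
# an RBF kernel) on two toy classes; the data below is invented.
from sklearn.svm import SVC

X_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [3.0, 3.1], [2.9, 3.2], [3.2, 2.8]]
y_train = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(svm.predict([[0.15, 0.1], [3.1, 3.0]]))   # expected: [0 1]
```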
5. The Apriori Algorithm
The Apriori algorithm is widely used to find the frequent itemsets in a transaction data set and to derive association rules from them. Finding frequent itemsets is not easy because of the combinatorial explosion of candidate itemsets. Once we have the frequent itemsets, it is straightforward to generate association rules that meet or exceed a specified minimum confidence. Apriori is an algorithm which helps in finding frequent item sets by making use of candidate generation. It assumes that the items in an itemset are sorted in lexicographic order. After the introduction of Apriori, data mining research was significantly boosted. It is simple and easy to implement. The basic approach of this algorithm is as below (a sketch of this join/prune loop follows the list):
• Join: The whole database is scanned to find the frequent 1-itemsets.
• Prune: An itemset must satisfy the minimum support to move to the next round, where the 2-itemsets are generated.
• Repeat: This is repeated for each itemset level until the pre-defined size is reached or no further frequent itemsets are found.
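The following is an illustrative, unoptimized sketch of the join/prune loop described above; min_support is treated as an absolute count, rule generation is omitted, and the transactions are invented.

```python
# An illustrative, unoptimized sketch of the Apriori join/prune loop for
# frequent itemsets (rule generation is omitted); min_support is a count.
from itertools import combinations

def apriori(transactions, min_support=2):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Frequent 1-itemset candidates (the first pass over the whole database)
    level = {frozenset([i]) for i in items}
    frequent = {}
    while level:
        # Count support and prune itemsets below the threshold
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: build candidates one item larger from the survivors
        size = len(next(iter(level))) + 1
        union = set().union(*survivors) if survivors else set()
        level = {frozenset(c) for c in combinations(union, size)
                 if all(frozenset(s) in survivors
                        for s in combinations(c, size - 1))}
    return frequent

txns = [{"bread", "butter"}, {"bread", "milk"},
        {"bread", "butter", "milk"}, {"butter", "milk"}]
print(apriori(txns, min_support=2))
```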
Conclusion
Beyond the five algorithms discussed above, which are used prominently, other algorithms also help in mining data and learning from it. Data mining integrates different techniques, including machine learning, statistics, pattern recognition, artificial intelligence, and database systems. All of these help in analyzing large sets of data and performing other data analysis tasks. Hence, they are among the most useful and reliable analytics algorithms.
Bayes' theorem is named after Thomas Bayes, who first utilized conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter. The theorem states:
P(X/Y) = P(Y/X) * P(X) / P(Y)
where:
P(X/Y) is a conditional probability that describes the occurrence of event X given that Y is true.
P(Y/X) is a conditional probability that describes the occurrence of event Y given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are known as the marginal probabilities.
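A tiny numeric sketch of the formula is given below; the probabilities (a fraud-detection style prior, likelihood, and false-alarm rate) are invented purely for illustration.

```python
# A tiny numeric sketch of Bayes' theorem with invented probabilities:
# P(X) is the prior, P(Y|X) the likelihood, P(Y) the marginal evidence.
p_x = 0.01               # prior: P(X), e.g. probability a transaction is fraudulent
p_y_given_x = 0.90       # likelihood: P(Y|X), alert fires given fraud
p_y_given_not_x = 0.05   # false-alarm rate: P(Y|not X)

p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)   # marginal P(Y)
p_x_given_y = p_y_given_x * p_x / p_y                    # posterior P(X|Y)
print(round(p_x_given_y, 4))   # about 0.1538
```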
Bayesian interpretation:
o P(X), the prior, is the primary degree of belief in X.
o P(X/Y), the posterior, is the degree of belief having accounted for Y.
The Bayesian network:
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connection between the nodes.
The nodes here represent random variables, and the edges define the relationship
between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
Classification of Errors
Errors are classified into two types – Systematic (Determinate) and Random (Indeterminate) errors.
Systematic (Determinate) errors:
Errors which can be avoided, or whose magnitude can be determined, are called systematic errors. They are determinable and presumably can be either avoided or corrected. Systematic errors are further classified as follows.
Operational errors: Errors which occur during an operation are called operational errors, e.g. transfer of solutions, effervescence, incomplete drying, underweighing of precipitates, overweighing of precipitates, and insufficient cooling of precipitates. These errors are physical in nature and occur when sound analytical technique is not followed.
Errors of Method:
When errors occur due to the method itself, they are difficult to correct. In gravimetric analysis, errors occur due to insolubility of precipitates, co-precipitation, post-precipitation, decomposition, and volatilization. In titrimetric analysis, errors occur due to failure of the reaction, side reactions, reaction of substances other than the constituent being determined, and the difference between the observed end point and the stoichiometric equivalence point of a reaction.
A proportional error depends on the amount of the constituent, e.g. impurities in a standard compound.
Random Errors:
These errors occur accidentally or randomly, so they are called indeterminate, accidental, or random errors. The analyst has no control over them. They follow a random distribution, and a mathematical law of probability can be applied.
UNIT-III
Association Rules in Data Mining
Introduction to Association Rules in Data Mining
Association rules are typically required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time.
The generation of association rules is usually divided into two separate steps. They are:
• To find all the frequent itemsets, a minimum support threshold is applied to the database.
• A minimum confidence constraint is then applied to those frequent itemsets in order to form rules. While this second step is easy, the first step needs much attention.
An association rule has two parts:
• An antecedent (if)
• A consequent (then)
Association rules are created by thoroughly analyzing the data and looking for frequent if/then patterns. Then, depending on the following combination of parameters, the important relationships are discovered:
• Support
• Confidence
• Lift
Support indicates how frequently the if/then relationship appears within the data. Confidence tells how often these relationships are found to be true. Lift is additionally used to compare the observed confidence with the expected confidence.
There are many algorithms proposed for generating association rules. Some of these algorithms are mentioned below:
• Apriori algorithm
• Eclat algorithm
• FP-growth algorithm
1. Apriori algorithm
Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the data and extending them to larger and larger itemsets as long as those itemsets appear sufficiently often in the data.
The frequent itemsets determined by Apriori can also be used to derive association rules that highlight trends in the data. It uses a breadth-first search strategy to count the support of itemsets and a candidate generation function that exploits the downward closure property of support.
2. Eclat algorithm
Eclat stands for Equivalence Class Transformation. It is a depth-first search algorithm based on set intersection. It is suitable for both sequential and parallel execution, with locality-enhancing properties. It is an algorithm for frequent pattern mining based on a depth-first search traversal of the itemset lattice.
The basic idea is to use transaction-ID set (tidset) intersections to compute the support value of a candidate and to avoid the generation of subsets that do not exist in the prefix tree.
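A compact sketch of the tidset-intersection idea is shown below; the transactions are invented, and the code only computes the support of a single candidate itemset rather than implementing the full Eclat algorithm.

```python
# A compact sketch of the Eclat idea: represent each item by the set of
# transaction IDs (tidset) that contain it, and intersect tidsets to get
# the support of larger itemsets. The data below is invented.
transactions = {
    1: {"bread", "butter"},
    2: {"bread", "milk"},
    3: {"bread", "butter", "milk"},
    4: {"butter", "milk"},
}

# Build the vertical layout: item -> tidset
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Support of {bread, butter} = size of the intersection of their tidsets
support_bread_butter = len(tidsets["bread"] & tidsets["butter"])
print(tidsets)
print("support({bread, butter}) =", support_bread_butter)   # 2
```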
3. FP-growth algorithm
It is also known as the frequent pattern growth algorithm. It is an improvement over the Apriori algorithm. The FP-growth algorithm is used for finding frequent itemsets in a transaction database without candidate generation.
It was mainly designed to compress the database into a structure that provides the frequent sets, and it then divides the compressed data into sets of conditional databases.
Each conditional database is associated with a frequent itemset, and data mining is then applied to each conditional database.
• Construction of FP-tree
There are various categories of association rule mining. One closely related analysis, correlation analysis, is described below.
Correlation is a bivariate analysis that measures the strength of association between two
variables and the direction of the relationship. In terms of the strength of the relationship,
the correlation coefficient's value varies between +1 and -1. A value of ± 1 indicates a
perfect degree of association between the two variables.
As the correlation coefficient value goes towards 0, the relationship between the two
variables will be weaker. The coefficient sign indicates the direction of the relationship; a
+ sign indicates a positive relationship, and a - sign indicates a negative relationship.
Suppose there is a strong correlation between two variables or metrics, and one of them
is being observed acting in a particular way. In that case, you can conclude that the other
one is also being affected similarly. This helps group related metrics together to reduce
the need for individual data processing.
1. Pearson r correlation
Interpreting Results
Typically, the best way to gain a generalized but more immediate interpretation of the results of a set of data is to visualize it on a scatter graph:
1. Positive Correlation: Any score from +0.5 to +1 indicates a very strong positive correlation, which means that both variables increase simultaneously. In this case the data points trend upwards, indicating the positive correlation, and the line of best fit, or trend line, is placed to best represent the data on the graph.
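As a quick illustration, the sketch below computes the Pearson r coefficient with NumPy on invented, roughly linear data; a value near +1 confirms a strong positive correlation.

```python
# A short sketch (illustrative values) computing the Pearson r correlation
# coefficient with NumPy; values near +1 indicate a strong positive relationship.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x, so r should be close to +1

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # about 0.999
```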
1. Reduce Time to Detection: In anomaly detection, working with many metrics and
surfacing correlated anomalous metrics helps draw relationships that reduce time to
detection (TTD) and support shortened time to remediation (TTR). As data-driven
decision-making has become the norm, early and robust detection of anomalies is critical
in every industry domain, as delayed detection adversely impacts customer experience
and revenue.
2. Reduce Alert Fatigue: Another important benefit of correlation analysis in anomaly
detection is reducing alert fatigue by filtering irrelevant anomalies (based on the
correlation) and grouping correlated anomalies into a single alert. Alert storms and false
positives are significant challenges organizations face - getting hundreds, even thousands
of separate alerts from multiple systems when many of them stem from the same incident.
3. Reduce Costs: Correlation analysis helps significantly reduce the costs associated
with the time spent investigating meaningless or duplicative alerts. In addition, the time
saved can be spent on more strategic initiatives that add value to the organization.
Clustering helps to split data into several subsets. Each of these subsets contains data
similar to each other, and these subsets are called clusters. Now that the data from our
customer base is divided into clusters, we can make an informed decision about who we
think is best suited for this product.
Clustering, falling under the category of unsupervised machine learning, is one of the
problems that machine learning algorithms solve.
Clustering only utilizes input data, to determine patterns, anomalies, or similarities in its
input data.
o The intra-cluster similarity is high, which implies that the data present inside the cluster is similar to one another.
o The inter-cluster similarity is low, which means each cluster holds data that is not similar to the data in other clusters.
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any of the two objects in the
cluster is less than the distance between any object in the cluster and any object
that is not located inside it.
o A connected region of a multidimensional space with a comparatively high density
of objects.
1. Scalability
2. Interpretability
6. High dimensionality
K-Mean (A centroid based Technique): The K means algorithm takes the input
parameter K from the user and partitions the dataset containing N objects into K clusters
so that resulting similarity among the data objects inside the group (intracluster) is high
but the similarity of data objects with the data objects from outside the cluster is low
(intercluster). The similarity of the cluster is determined with respect to the mean value of
the cluster. It is a type of square error algorithm. At the start randomly k objects from the
dataset are chosen in which each of the objects represents a cluster mean(centre). For the
rest of the data objects, they are assigned to the nearest cluster based on their distance
from the cluster mean. The new mean of each cluster is then calculated with the added data objects.
Algorithm: k-means
Input: K, the number of clusters, and D, a dataset containing N objects.
Output: A set of K clusters.
1. Arbitrarily choose K objects from D as the initial cluster means (centres).
2. (Re)assign each object to the cluster to whose mean the object is most similar.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated values.
4. Repeat Steps 2 and 3 until no change occurs.
K-Medoids clustering
K-Medoids and K-Means are two types of clustering mechanisms in Partition Clustering.
First, clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits. In partitioning clustering, a group of n objects is broken down into k clusters based on their similarities.
Two statisticians, Leonard Kaufman, and Peter J. Rousseeuw came up with this method.
This tutorial explains what K-Medoids do, their applications, and the difference between
K-Means and K-Medoids.
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal. (or)
A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.
1. Choose k number of random points from the data and assign these k points to k
number of clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the
medoids)
4. Select a random point as the new medoid and swap it with the previous medoid. Repeat steps 2 and 3 (a small sketch of this swap test follows the list).
5. If the total cost of the new medoid is less than that of the previous medoid, make
the new medoid permanent and repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid,
undo the swap and repeat step 4.
7. The Repetitions have to continue until no change is encountered with new medoids
to classify data points.
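The sketch below is a simplified illustration of steps 2 to 6 above (not a full PAM implementation): it assigns invented one-dimensional points to their nearest medoid, computes the total cost, and keeps a candidate swap only if it lowers that cost.

```python
# A simplified sketch of the medoid-swap idea: assign points to the nearest
# medoid, compute the total cost, and keep a swap only if it lowers the cost.
# Invented 1-D data; this is not a complete K-Medoids implementation.
def total_cost(points, medoids):
    # Sum of each point's distance to its nearest medoid
    return sum(min(abs(p - m) for m in medoids) for p in points)

points = [1, 2, 3, 10, 11, 12]
medoids = [1, 10]                     # step 1: initial medoids (chosen here by hand)
best = total_cost(points, medoids)    # steps 2-3: current total cost

candidate = [2, 11]                   # step 4: try swapping in other data points
cost = total_cost(points, candidate)
if cost < best:                       # steps 5-6: keep the swap only if it is cheaper
    medoids, best = candidate, cost

print(medoids, best)   # [2, 11] with total cost 4
```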
Disadvantages:
1. Not suitable for Clustering arbitrarily shaped groups of data points.
2. As the initial medoids are chosen randomly, the results might vary based on the
choice in different runs.
Both algorithms group n objects into k clusters based on similar traits, where k is pre-defined. The main differences are:
• In K-Means, clustering is done based on distance from centroids; in K-Medoids, clustering is done based on distance from medoids.
• A centroid can be a data point or some other point in the cluster, whereas a medoid is always a data point in the cluster.
• K-Means can't cope with outlier data; K-Medoids can manage outlier data too.
• In K-Means, outlier sensitivity can sometimes turn out to be useful; K-Medoids has a tendency to ignore meaningful clusters in outlier data.
1. Initially, consider each data point as a single cluster.
2. Merge the two most comparable (closest) clusters. We need to continue these steps until all the clusters are merged together.
1. The ability to handle non-convex clusters and clusters of different sizes and
densities.
3. The ability to reveal the hierarchical structure of the data, which can be useful for
understanding the relationships among the clusters. However, it also has some
drawbacks, such as:
4. The need for a criterion to stop the clustering process and determine the final
number of clusters.
5. The computational cost and memory requirements of the method can be high,
especially for large datasets.
6. The results can be sensitive to the initial conditions, linkage criterion, and distance
metric used.
7. This method can handle different types of data and reveal the relationships among
the clusters. However, it can have high computational cost and results can be
sensitive to some conditions.
• Merge the clusters which are highly similar or close to each other.
Note: This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and all the proximities among the clusters are assumed. (A short code sketch of this merging process follows the steps.)
• Step-1: Consider each alphabet as a single cluster and calculate the distance of
one cluster from all the other clusters.
• Step-2: In the second step comparable clusters are merged together to form a
single cluster. Let’s say cluster (B) and cluster (C) are very similar to each other
therefore we merge them in the second step similarly to cluster (D) and (E) and at
last, we get the clusters [(A), (BC), (DE), (F)]
• Step-3: We recalculate the proximity according to the algorithm and merge the
two nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
• Step-4: Repeating the same process; The clusters DEF and BC are comparable and
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
• Step-5: At last the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
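A short sketch of the same bottom-up merging, using SciPy's hierarchical clustering on six invented one-dimensional points labelled A to F, is given below; the library choice and the data are assumptions made for illustration.

```python
# A small sketch of agglomerative (bottom-up) clustering with SciPy on six
# invented 1-D points labelled A-F, mirroring the step-by-step merging above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[0.0], [0.4], [0.8], [9.0], [9.1], [9.8]])

Z = linkage(points, method="single")             # merge the closest clusters first
groups = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print({lab: int(g) for lab, g in zip(labels, groups)})
# e.g. {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2}
```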
2. Divisive:
We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the
data points as a single cluster and in every iteration, we separate the data points from the
clusters which aren’t comparable. In the end, we are left with N clusters.
4. It is very problematic to apply hierarchical clustering when we have data with a high level of error, whereas K-means can work better than hierarchical clustering even when errors are present.
5. Hierarchical clustering is comparatively easier to read and understand, whereas the clusters produced by K-means are more difficult to read and understand compared to hierarchical clustering.
UNIT-IV
Decision Tree Induction
Decision Tree is a supervised learning method used in data mining for classification and
regression methods. It is a tree that helps us in decision-making purposes. The decision
tree creates classification or regression models as a tree structure. It separates a data set
into smaller subsets, and at the same time, the decision tree is steadily developed. The
final tree is a tree with the decision nodes and leaf nodes. A decision node has at least two
branches. The leaf nodes show a classification or decision; we can't accomplish any further splits on leaf nodes. The uppermost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called
Entropy Reduction. Building a decision tree is all about discovering attributes that return
the highest data gain.
In short, a decision tree is just like a flow chart diagram with the terminal nodes showing
decisions. Starting with the dataset, we can measure the entropy to find a way to segment
the set until the data belongs to the same class.
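A compact sketch of these two key factors is shown below: it computes the entropy of an invented set of yes/no labels and the information gain of one candidate split.

```python
# A compact sketch of the two key factors: entropy of a set of class labels
# and the information gain of a candidate split (invented yes/no labels).
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

parent = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
left   = ["yes", "yes", "yes", "no"]       # one branch of the split
right  = ["no", "no", "no", "no"]          # the other branch

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted
print(round(entropy(parent), 3), round(gain, 3))
```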
It helps us to make the best decisions based on existing data and best speculations.
In other words, we can say that a decision tree is a hierarchical tree structure that can be
used to split an extensive collection of records into smaller sets of the class by
implementing a sequence of simple decision rules. A decision tree model comprises a set of rules for partitioning a huge heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The attributes of the classes can be any type of variable, from nominal, ordinal, and binary to quantitative values; in contrast, the classes must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given data on attributes together with their classes, a decision tree creates a set of rules that can be used to identify the class. One rule is implemented after another, resulting in a hierarchy of
segments within a segment. The hierarchy is known as the tree, and each segment is called
a node. With each progressive division, the members from the subsequent sets become
more and more similar to each other. Hence, the algorithm used to build a decision tree
is referred to as recursive partitioning. The algorithm is known as CART (Classification and
Regression Trees)
Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%), which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million profit.
Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to $4 million profit, and the probability of a bad economy is 0.4, which leads to $2 million profit.
The management team needs to take a data-driven decision about whether to expand or not, based on the given data.
Net Expand = (0.6 * 8 + 0.4 * 6) - 3 = $4.2M
Net Not Expand = (0.6 * 4 + 0.4 * 2) - 0 = $3.2M
Since $4.2M > $3.2M, the factory should be expanded.
Initially, D is the entire set of training tuples and their related class labels (the input training data).
Missing values in the data also do not influence the process of building a decision tree to any considerable extent.
A decision tree model is automatic and simple to explain to the technical team as well as
stakeholders.
Compared to other algorithms, decision trees need less exertion for data preparation
during pre-processing.
A decision tree consists of a root node, several branch nodes, and several leaf nodes.
• The root node represents the top of the tree. It does not have a parent node, however,
it has different child nodes.
• Branch nodes are in the middle of the tree. A branch node has a parent node and
several child nodes.
• Leaf nodes represent the bottom of the tree. A leaf node has a parent node. It does
not have child nodes.
The color of the pruned nodes is a shade brighter than the color of unpruned nodes, and the
decision next to the pruned nodes is represented in italics. In contrast to collapsing nodes to
hide them from the view, pruning actually changes the model.
You can manually prune the nodes of the tree by selecting the check box in
the Pruned column. When the node is pruned, the lower levels of the node are collapsed. If
you expand a collapsed node by clicking on the node icon, the collapsed nodes are displayed.
You can specify the prune level. The prune level determines that all nodes with a level
smaller than the specified prune level are unpruned, and all nodes with a level equal or
greater than the specified prune level are pruned. For example, if you specify a prune level
of 3, all nodes with level 1 and 2 are unpruned, and all nodes with level 3 or greater are
pruned.
The computed prune level is the original prune state of the tree classification model. This
means that some of the branch nodes might be pruned by the Tree Classification mining
function, or none of the branch nodes might be pruned at all. Resetting to the computed
prune level removes the manual pruning that you might ever have done to the tree
classification model.
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) => (buys computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision
tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
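One hedged way to see these IF-THEN paths in practice is sketched below: a small decision tree is fitted on invented buys_computer-style data and printed with scikit-learn's export_text, where each root-to-leaf path in the printout corresponds to one rule.

```python
# A hedged sketch: fit a small decision tree on invented "buys_computer" data
# and print its structure with scikit-learn's export_text. Each root-to-leaf
# path in the printout corresponds to one IF-THEN rule.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age_is_youth (0/1), is_student (0/1)]
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 1]]
y = ["yes", "no", "no", "no", "yes", "no"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age_is_youth", "is_student"]))
```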
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
The FOIL_Prune value of a rule R is computed as FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
#1) Learning Step: The training data is fed into the system to be analyzed by a classification
algorithm. In this example, the class label is the attribute i.e. “loan decision”. The model built from
this training data is represented in the form of decision rules.
#2) Classification: The test dataset is fed to the model to check the accuracy of the classification rules. If the model gives acceptable results, then it is applied to a new dataset with unknown class variables.
Decision Tree Induction Algorithm
UNIT-V
Data Mining Software
Introduction to Data Mining Software
Data mining is a process of analyzing data, identifying patterns, and converting
unstructured data into structured data ( data organized in rows and columns) to use it for
business-related decision making. It is a process to extract extensive unstructured data from
various databases. Data mining is an interdisciplinary science that has mathematics and
computer science algorithms used by a machine. Data Mining Software helps the user to
analyze data from different databases and detect patterns. Data mining tools’ primary aim
is to find, extract, and refine data and then distribute the information.
Features of Data Mining Software
Below are the different features of Data Mining Software:
• Easy to use: Data mining software has an easy-to-use Graphical User Interface (GUI) to help the user analyze data efficiently.
• Pre-processing: Data pre-processing is a necessary step. It includes data cleaning,
data transformation, data normalization, and data integration.
• Scalable processing: Data mining software permits scalable processing, i.e., the
software is scalable on the size of the data and users.
• High Performance: Data mining software increases the performance capabilities
and creates an environment that generates results quickly.
• Anomaly Detection: They help to identify unusual data that might have errors or
need further investigation.
• Association Rule Learning: Data mining software uses association rule learning, which identifies the relationships between variables.
• Clustering: It is a process of grouping the data that are similar in some way or
other.
• Classification: It is the process of generalizing the known structure and then
applying it to new data.
• Regression: It is the task of estimating the relationships between datasets or data.
The text mining market has experienced exponential growth and adoption over the last few
years and is also expected to see significant growth and adoption in the coming future. One of the primary reasons behind the adoption of text mining is higher competition in the business market, with many organizations seeking value-added solutions to compete with other organizations. With increasing competition in business and changing customer perspectives, organizations are making huge investments to find a solution that is capable of analyzing customer and competitor data to improve competitiveness. The primary sources of data are e-commerce websites, social media platforms, published articles, surveys, and many more. The larger part of the generated data is unstructured, which makes it challenging and expensive for organizations to analyze with the help of people. This challenge, combined with the exponential growth in data generation, has led to the growth of analytical tools that are not only able to handle large volumes of text data but also help in decision-making. Text mining software empowers a user to draw useful information from a huge set of available data sources.
Text mining can be used to extract structured information from unstructured text data
such as:
Named Entity Recognition (NER): Identifying and classifying named entities such as
people, organizations, and locations in text data.
Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive, negative,
neutral) of text data.
Text Summarization: Creating a condensed version of a text document that captures the
main points.
Topic Modeling: Identifying the main topics present in a document or collection of
documents.
Text Classification: Assigning predefined categories to text data
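As a hedged illustration of the text classification task, the sketch below builds a bag-of-words Naive Bayes pipeline in scikit-learn and trains it on a few invented sentences with sentiment labels.

```python
# A hedged sketch of text classification: a bag-of-words Naive Bayes pipeline
# from scikit-learn trained on a few invented sentences with sentiment labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great product, works well", "terrible support, very slow",
        "love the new features", "waste of money, poor quality"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["slow and poor product", "works great"]))
# expected: ['negative' 'positive']
```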
Issues in Text Mining
Numerous issues happen during the text mining process:
1. The efficiency and effectiveness of decision-making.
2. Uncertainty can arise at intermediate stages of text mining. In the preprocessing stage, different rules and guidelines are defined to normalize the text, which makes the text mining process efficient. Prior to applying pattern analysis to a document, the unstructured data must be converted into an intermediate structured form.
3. Sometimes the original message or meaning can be changed due to this alteration.
4. Another issue in text mining arises when algorithms and techniques have to support multi-language text; this may create ambiguity in text meaning and can lead to false-positive results.
5. The use of synonyms, polysemy, and antonyms in document text creates issues for text mining tools that treat them in the same way. It is difficult to categorize such kinds of text/words.
ADVANTAGES OR DISADVANTAGES:
Advantages of Text Mining:
1. Large Amounts of Data: Text mining allows organizations to extract insights from
large amounts of unstructured text data. This can include customer feedback,
social media posts, and news articles.
2. Variety of Applications: Text mining has a wide range of applications, including
sentiment analysis, named entity recognition, and topic modeling. This makes it a
versatile tool for organizations to gain insights from unstructured text data.
3. Improved Decision Making: Text mining can be used to extract insights from
unstructured text data, which can be used to make data-driven decisions.
4. Cost-effective: Text mining can be a cost-effective way to extract insights from
unstructured text data, as it eliminates the need for manual data entry.
Disadvantages of Text Mining:
1. Complexity: Text mining can be a complex process that requires advanced skills in
natural language processing and machine learning.
2. Quality of Data: The quality of text data can vary, which can affect the accuracy of
the insights extracted from text mining.
3. High computational cost: Text mining requires high computational resources, and
it may be difficult for smaller organizations to afford the technology.
4. Limited to text data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.
Parser
A parser is a compiler component that breaks data into smaller elements received from the lexical analysis phase.
A parser takes input in the form of a sequence of tokens and produces output in the form of a parse tree.
Bottom up parsing
o Bottom-up parsing is also known as shift-reduce parsing.
o Bottom-up parsing is used to construct a parse tree for an input string.
o In bottom-up parsing, the parsing starts with the input string and constructs the parse tree up to the start symbol by tracing out the rightmost derivations of the string in reverse.
A soft parse is recorded when the Oracle Server checks the shared pool for a SQL statement
and finds a version of the statement that it can reuse.
This metric represents the percentage of parse requests where the cursor was already in the
cursor cache compared to the number of total parses. This ratio provides an indication as to
how often the application is parsing statements that already reside in the cache as compared
to hard parses of statements that are not in the cache.
This test checks the percentage of soft parse requests to total parse requests. If the value is
less than or equal to the threshold values specified by the threshold arguments, and the
number of occurrences exceeds the value specified in the "Number of Occurrences" parameter,
then a warning or critical alert is generated.
Hard Parses (per second)
This metric represents the number of hard parses per second during this sample period. A hard
parse occurs when a SQL statement has to be loaded into the shared pool. In this case, the
Oracle Server has to allocate memory in the shared pool and parse the statement.
Each time a particular SQL cursor is parsed, this count will increase by one. There are certain
operations that will cause a SQL cursor to be parsed. Parsing a SQL statement breaks it down
into atomic steps, which the optimizer will evaluate when generating an execution plan for the
cursor.
This test checks the number of parses of statements that were not already in the cache. If the
value is greater than or equal to the threshold values specified by the threshold arguments,
and the number of occurrences exceeds the value specified in the "Number of Occurrences"
parameter, then a warning or critical alert is generated.
Web content mining can be used to extract useful data, information, knowledge from the web
page content. In web content mining, each web page is considered as an individual document.
The individual can take advantage of the semi-structured nature of web pages, as HTML provides
information that concerns not only the layout but also logical structure. The primary task of
content mining is data extraction, where structured data is extracted from unstructured websites.
The objective is to facilitate data aggregation over various web sites by using the extracted
structured data. Web content mining can be utilized to distinguish topics on the web. For
Example, if any user searches for a specific task on the search engine, then the user will get a list
of suggestions.
Web structure mining can be used to find the link structure of hyperlinks. It is used to discover the relationship between web pages linked by information or direct link connections. In Web Structure Mining, an individual
considers the web as a directed graph, with the web pages being the vertices that are associated
with hyperlinks. The most important application in this regard is the Google search engine, which
estimates the ranking of its outcomes primarily with the PageRank algorithm. It characterizes a
page to be exceptionally relevant when frequently connected by other highly related pages.
Structure and content mining methodologies are usually combined. For example, web structured
mining can be beneficial to organizations to regulate the network between two commercial sites.
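The sketch below illustrates the basic PageRank idea with a plain power iteration over a tiny invented link graph; the real ranking used by search engines is far more elaborate, so this is only a conceptual example.

```python
# A minimal power-iteration sketch of the PageRank idea over a tiny invented
# link graph; real search-engine ranking is far more elaborate than this.
damping = 0.85
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                       # iterate until the ranks stabilise
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
# C is linked by both A and B, so it ends up with the highest rank.
```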
Web usage mining is used to extract useful data, information, and knowledge from weblog records, and assists in recognizing user access patterns for web pages. In mining the usage of web resources, the individual considers records of requests made by visitors to a website, which are often collected as web server logs. While the content and structure of the collection of web pages follow the intentions of the authors of the pages, the individual requests demonstrate how
the consumers see these pages. Web usage mining may disclose relationships that were not
proposed by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
The analysis of preprocessed data can be accomplished in session analysis, which incorporates
the guest records, days, time, sessions, etc. This data can be utilized to analyze the visitor's
behavior.
The document is created after this analysis, which contains the details of repeatedly visited web
pages, common entry, and exit.
OLAP can be accomplished on various parts of log related data in a specific period.
OLAP tools can be used to infer important business intelligence metrics
o The complexity of web pages: The site pages don't have a unifying structure. They are
extremely complicated as compared to traditional text documents. There are enormous
amounts of documents in the digital library of the web. These libraries are not organized
according to a specific order.
o The web is a dynamic data source: The data on the internet is quickly updated. For
example, news, climate, shopping, financial news, sports, and so on.
o Diversity of client networks: The client network on the web is quickly expanding. These
clients have different interests, backgrounds, and usage purposes. There are over a
hundred million workstations that are associated with the internet and still increasing
tremendously.
o The web is too broad: The size of the web is tremendous and rapidly increasing. It
appears that the web is too huge for data warehousing and data mining.
Web mining is the application of data mining techniques to extract knowledge from web data,
including web documents, hyperlinks between documents, usage logs of web sites.
The primary objective of a Web Mining process is to discover interesting patterns and rules from
data collected within the Web space. In order to adopt generic data mining techniques and
algorithms to Web data, these data must be transformed into a suitable form. The idea is to
connect specific research domains such as Information Retrieval, Information Extraction, Text
Mining and so on, and to put them together in an innovative process of workflow defining several
phases and steps moreover they can share common activities, facilitating reuse and
standardization.
Generally, Web mining is the application of data mining algorithms and techniques to large Web
data repositories. Web usage mining refers to the automatic discovery and analysis of generalized patterns which describe user navigation paths (e.g. click streams), collected or generated as a result of user interactions with a Web site. Constraint-based data mining algorithms are applied in Web usage mining, and software tools have been developed for this purpose. One of the most common algorithms applied in Web usage mining is the Apriori algorithm. Web user navigation patterns can be represented by association rules. Sequence mining can also be used to mine Web user navigation patterns. The association rules hold information about the sequence of requested pages (e.g. if a user visits page A and then page C, the user will next visit page D). Based on this, user activity can be determined and predictions of the next page can be calculated. The sequence mining algorithms inherited much from association mining algorithms for discovering patterns.