
DATA WAREHOUSING AND MINING WITH Q-GRAM AS AN APPLICATION

Abstract
A data warehouse is a “subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management decisions.” Data mining is becoming increasingly popular as a business information management tool. Data mining is oriented more towards applications than towards the basic nature of the underlying phenomena. Data mining accepts, among others, a "black box" approach to data exploration or knowledge discovery, and uses not only traditional Exploratory Data Analysis (EDA) techniques but also techniques such as neural networks, which can generate valid predictions but cannot identify the specific nature of the interrelations between the variables on which the predictions are based. q-gram matching is used for approximate substring matching problems in a wide range of application areas, including intrusion detection. All q-grams present in the text are stored in a tree structure similar to a trie.

Data warehouse
Abbreviated DW, a data warehouse is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.
Development of a data warehouse includes developing systems to extract data from operational systems and installing a warehouse database system that gives managers flexible access to the data.
The term data warehousing generally refers to the combination of many different databases across an entire enterprise. Contrast with data mart.
Data mart
A database, or collection of databases, designed to help managers make strategic
decisions about their business. Whereas a data warehouse combines databases across an
entire enterprise, data marts are usually smaller and focus on a particular subject or
department. Some data marts, called dependent data marts, are subsets of larger data
warehouses.
Data mining
Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or
Knowledge-Discovery and Data Mining, is the process of automatically searching large
volumes of data for patterns using tools such as classification, association rule mining,
clustering, etc.. Data mining is a complex topic and has links with multiple core fields
such as computer science and adds value to rich seminal computational techniques from
statistics, information retrieval, machine learning and pattern recognition
Example
A simple example of data mining, often called market basket analysis, is its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favour silk shirts over cotton ones.
Another is that of a supermarket chain that, through analysis of transactions over a long period of time, found that beer and diapers were often bought together. Although explaining this relationship may be difficult, taking advantage of it is easier, for example by placing the high-profit diapers close to the high-profit beer. (This example is questioned in "Beer and Nappies -- A Data Mining Urban Legend.")
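To make this concrete, here is a minimal sketch of market basket analysis (illustrative only; the transactions, item names, and support threshold are invented): count how often pairs of items co-occur across baskets and keep the pairs whose support clears the threshold.

    from itertools import combinations
    from collections import Counter

    def frequent_pairs(transactions, min_support):
        """Count co-occurring item pairs; keep those meeting min_support."""
        pair_counts = Counter()
        for basket in transactions:
            # Count each unordered pair of distinct items once per basket.
            for pair in combinations(sorted(set(basket)), 2):
                pair_counts[pair] += 1
        n = len(transactions)
        # Support = fraction of baskets containing both items.
        return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

    transactions = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "diapers", "milk"},
    ]
    print(frequent_pairs(transactions, min_support=0.5))
    # {('beer', 'diapers'): 0.75}

Real systems scale this idea up with algorithms such as Apriori, which prune the search space using the fact that a pair can be frequent only if both of its items are.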
Use of the term
Data mining has been defined as "the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data" [1] and "the science of extracting
useful information from large data sets or databases".
It involves sorting through large amounts of data and picking out relevant information.
It is usually used by businesses and other organizations, but is increasingly used in the
sciences to extract information from the enormous data sets generated by modern
experimentation.
Metadata, or data about a given set of data, are often expressed in a condensed, mineable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.
Although data mining is a relatively new term, the technology is not. Companies have long used powerful computers to sift through volumes of data, such as supermarket scanner data, to produce market research reports. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of analysis.
Data mining identifies trends within data that go beyond simple analysis. Through the use
of sophisticated algorithms, users have the ability to identify key attributes of business
processes and target opportunities.
Related terms
Although the term "data mining" is usually used in relation to analysis of data, like
artificial intelligence, it is an umbrella term with varied meanings in a wide range of
contexts. Unlike data analysis, data mining is not based or focused on an existing model
which is to be tested or whose parameters are to be optimized.
In statistical analyses where there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, wherein the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is smartly searched. With the advent of parallel computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all-subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of plant data.
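As a hedged sketch of all-subsets regression (not from any cited source; the data and function name are invented), the following enumerates all 2^k subsets of k candidate predictors, fits ordinary least squares to each, and selects the subset with the lowest BIC; a penalized criterion is needed because raw residual error always favors the largest model.

    from itertools import combinations
    import numpy as np

    def exhaustive_regression(X, y):
        """Fit OLS on every subset of the columns of X; return the best by BIC."""
        n, k = X.shape
        best = (None, np.inf)
        for r in range(k + 1):
            for cols in combinations(range(k), r):
                # Design matrix: intercept plus the chosen predictors.
                A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
                coef, *_ = np.linalg.lstsq(A, y, rcond=None)
                rss = float(np.sum((y - A @ coef) ** 2))
                # BIC penalizes model size, so larger subsets must earn their keep.
                bic = n * np.log(rss / n) + (r + 1) * np.log(n)
                if bic < best[1]:
                    best = (cols, bic)
        return best

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = 2.0 * X[:, 1] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=100)
    print(exhaustive_regression(X, y))  # expect columns (1, 3) to be selected

The 2^k enumeration is exactly why the text caps k at roughly 40: beyond that, the number of models becomes computationally prohibitive even in parallel.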
Data dredging
Data dredging or data fishing are terms one may use to criticize someone's data mining efforts when it is felt the patterns or causal relationships discovered are unfounded.
Data dredging is the scanning of data for any relationship and then, when one is found, coming up with an interesting explanation for it. The conclusions may be suspect because data sets with large numbers of variables contain, by chance alone, some "interesting" relationships. Fred Schwed said:
"There have always been a considerable number of people who busy themselves
examining the last thousand numbers which have appeared on a roulette wheel, in search
of some repeating pattern. Sadly enough, they have usually found it."
Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.
Some exploratory data work is always required in any applied statistical analysis to get a
feel for the data, so sometimes the line between good statistical practice and data
dredging is less than clear.
Most data mining efforts are focused on developing highly detailed models of some large
data set. Other researchers have described an alternate method that involves finding the
minimal differences between elements in a data set, with the goal of developing simpler
models that represent relevant data. [5]
When data sets contain a large number of variables, the significance level should be adjusted for the number of patterns tested. For example, if we test 100 random patterns, we expect about one of them to appear "interesting" at the 0.01 significance level purely by chance.
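A small simulation (illustrative only; all the data are random noise) makes the arithmetic concrete: testing 100 pure-noise "patterns" at the 0.01 level yields about one spurious "discovery" on average.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, n_patterns, n_obs = 0.01, 100, 50

    false_positives = 0
    for _ in range(n_patterns):
        # Two unrelated noise variables: any "relationship" found is spurious.
        x, y = rng.normal(size=n_obs), rng.normal(size=n_obs)
        r = np.corrcoef(x, y)[0, 1]
        # t statistic for testing zero correlation with n_obs - 2 = 48 df.
        t = r * np.sqrt((n_obs - 2) / (1 - r ** 2))
        if abs(t) > 2.68:  # approximate two-sided 0.01 critical value at 48 df
            false_positives += 1

    print(false_positives, "spurious discoveries; expected about", alpha * n_patterns)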
Cross validation is a common approach to evaluating the fitness of a model generated via
data mining, where the data is divided into a training subset and a test subset to
respectively build and then test the model. Common cross validation techniques include
the holdout method, k-fold cross validation, and the leave-one-out method.
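As an illustration of the procedure just described (not from the paper; the function names, toy model, and data are invented), here is a minimal k-fold cross-validation sketch: the data are split into k folds, and each fold in turn serves as the test subset while the remaining folds form the training subset.

    import numpy as np

    def k_fold_scores(X, y, fit, score, k=5, seed=0):
        """Generic k-fold CV: fit on k-1 folds, score on the held-out fold."""
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train])
            scores.append(score(model, X[test], y[test]))
        return scores

    # Toy "model": always predict the training-set mean of y.
    fit = lambda X, y: y.mean()
    score = lambda m, X, y: float(np.mean((y - m) ** 2))  # test-set MSE
    rng = np.random.default_rng(2)
    X, y = rng.normal(size=(60, 3)), rng.normal(size=60)
    print(k_fold_scores(X, y, fit, score))

Setting k to the number of observations gives the leave-one-out method; a single train/test split corresponds to the holdout method.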
Privacy concerns
There are also privacy concerns associated with data mining, specifically regarding the source of the data analyzed. For example, an employer with access to medical records might screen out people who have diabetes or who have had legal problems.
Data mining government or commercial data sets for national security or law
enforcement purposes has also raised privacy concerns.
There are many legitimate uses of data mining. For example, a database of prescription
drugs taken by a group of people could be used to find combinations of drugs exhibiting
harmful interactions. Since any particular combination may occur in only 1 out of 1000
people, a great deal of data would need to be examined to discover such an interaction. A
project involving pharmacies could reduce the number of drug reactions and potentially
save lives. Unfortunately, there is also a huge potential for abuse of such a database.
Essentially, data mining gives information that would not be available otherwise. It must
be properly interpreted to be useful. When the data collected involves individual people,
there are many questions concerning privacy, legality, and ethics.
Combinatorial game data mining
Data mining from combinatorial game oracles: since the early 1990s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g., for 3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has opened up: the extraction of human-usable strategies from these oracles. This is pattern recognition at too high an abstraction for known statistical pattern recognition algorithms or any other algorithmic approach to be applied; at least, no one knows how to do it yet (as of January 2005). The method used is the full force of the scientific method: extensive experimentation with the tablebases, combined with intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e., pre-tablebase knowledge, leading to flashes of insight. Berlekamp in dots-and-boxes and John Nunn in chess endgames are notable examples of people doing this work, though they were not and are not involved in tablebase generation.
Notable uses of data mining
Data mining has been cited as the method by which the U.S. Army unit Able Danger supposedly identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al-Qaeda cell operating in the U.S. more than a year before the attack.
See also: Able Danger; wikinews: U.S. Army intelligence had detected 9/11 terrorists year before, says officer.
It has been suggested that both the CIA and their Canadian counterpart, CSIS, have put this method of interpreting data to work for them as well [7], although they have not said how.
Structured data mining is the process of finding and extracting useful information from raw datasets. Graph mining is a special case of structured data mining and is related to molecule mining.

q-Gram Matching Using Tree Models


Abstract: q-gram matching is used for approximate substring matching problems in a wide range of application areas, including intrusion detection. All q-grams present in the text are stored in a tree structure similar to a trie. We use a tree redundancy pruning algorithm to reduce the size of the tree without losing any information, and suffix links for fast q-gram search during query matching. We compare our work with the Rabin-Karp-based hash-table technique commonly used for multiple q-gram search, and we present results of experiments on system call sequence data used for intrusion detection.
Introduction: Given a text string T and a query string Q, the problem of q-gram matching is to find all the substrings of length q in the query which are also present in the text. The problem extends easily to multiple text strings. The length of the text string is assumed to be very large, and the length of the query is much smaller than that of the text.
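To make the problem definition concrete, here is a minimal baseline sketch (illustrative; not the tree-based method of this paper): extract the q-grams of the query and check each against the set of q-grams present in the text.

    def qgrams(s, q):
        """All length-q substrings of s, in order of occurrence."""
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    def match_qgrams(text, query, q):
        """Return the query q-grams that also occur somewhere in the text."""
        text_grams = set(qgrams(text, q))  # preprocess the (large) text once
        return [g for g in qgrams(query, q) if g in text_grams]

    print(match_qgrams("abracadabra", "cadabxy", 3))
    # ['cad', 'ada', 'dab']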
q-gram matching has been used extensively in many application areas, including information retrieval, signal processing, pattern recognition, and computational biology. In computational biology, genomic sequences show a high level of matching for short substrings. In information retrieval, filtration-based approximate string matching algorithms require an efficient search of all the q-grams shared by query and text. Some word processors perform q-gram matching for approximate matching and finding misspellings. Substring matching has also been used extensively in intrusion detection.
For example, consider the problem of detecting anomalous program behavior. The normal behavior of a program can be observed via its interaction with the underlying operating system, which can be characterized by the sequence of system calls generated by the program. It has been observed that short substrings are very consistent throughout different normal executions of a program. We can build a normal profile of a program using the substrings generated by sample executions. While monitoring, we observe the substrings generated by the program and check whether they match the stored substrings; if a substring has no match, an alarm is raised. Similarly, many intrusion detection systems rely on substring matching of network traffic or host activities against normal patterns or attack patterns.

One main requirement of an intrusion detector is that it be fast, so that a possible attack is detected as early as possible, and one of the main factors in its speed is the substring matching algorithm it uses. Since the set of normal or attack patterns needs to be extensive, there are usually a huge number of text patterns in intrusion detection. Existing string or substring matching algorithms are not well suited to the q-gram matching problem and do not have good run-time efficiency. The expected run-time complexity of the best existing string matching algorithms is sublinear in the length of the text, but they do not apply well to multiple q-gram matching with huge text sizes. Some string matching algorithms for multiple texts have complexity linear in the length of the query; when used for multiple q-gram matching, they take O(qm) time, where m is the length of the query.

A simple solution to the problem is to record all the unique q-grams present in the text and store them in a hash table. While matching, we take each q-gram of the query and check whether it is present in the hash table. If the calculation of the hash function takes time t, the total time to match the query is O(mt), where m is the length of the query. The Rabin-Karp algorithm follows a similar idea and uses an efficient method to calculate hashes; it is the only existing algorithm that we are aware of which works well for the given problem. In this paper, we present a new q-gram matching algorithm that is more efficient than the Rabin-Karp algorithm. Although our work was motivated by applications in intrusion detection, the algorithm is general and applicable to other domains.
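For comparison, here is a sketch of the rolling-hash idea behind Rabin-Karp (a standard textbook formulation; not necessarily the exact variant benchmarked in this paper): the hash of each successive q-gram is derived from the previous one in constant time, so hashing all q-grams of a string takes time linear in its length.

    def rolling_hashes(s, q, base=257, mod=(1 << 61) - 1):
        """Polynomial hashes of every q-gram of s, each updated in O(1)."""
        if len(s) < q:
            return []
        h = 0
        for ch in s[:q]:
            h = (h * base + ord(ch)) % mod
        hashes = [h]
        top = pow(base, q - 1, mod)  # weight of the outgoing character
        for i in range(q, len(s)):
            h = ((h - ord(s[i - q]) * top) * base + ord(s[i])) % mod
            hashes.append(h)
        return hashes

    def match_qgrams_rk(text, query, q):
        """Query q-grams whose hash appears among the text's q-gram hashes.
        A full implementation would verify matches to rule out collisions."""
        text_hashes = set(rolling_hashes(text, q))
        return [query[i:i + q]
                for i, h in enumerate(rolling_hashes(query, q))
                if h in text_hashes]

    print(match_qgrams_rk("abracadabra", "cadabxy", 3))  # ['cad', 'ada', 'dab']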

(a) Tree redundancy pruning algorithm: subtrees S1 and S2 are similar, so the redundant subtree is removed.
(b) Matching using the pruned tree: marking previous substrings after finding a mismatch.
In both figures, the paths up to the solid and dash-dot lines are present in the pruned tree; the dotted parts are pruned or absent from the tree.
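As a rough illustration of the tree structure described above (a plain trie of q-grams; the redundancy pruning and suffix links of this paper are not reproduced here), each q-gram is stored as a root-to-leaf path, so lookup walks one node per character.

    def build_qgram_trie(text, q):
        """Store every q-gram of the text in a nested-dict trie."""
        root = {}
        for i in range(len(text) - q + 1):
            node = root
            for ch in text[i:i + q]:
                node = node.setdefault(ch, {})  # descend, creating nodes as needed
        return root

    def contains(trie, gram):
        """Check whether a q-gram was inserted: follow the path, fail on a miss."""
        node = trie
        for ch in gram:
            if ch not in node:
                return False
            node = node[ch]
        return True

    trie = build_qgram_trie("abracadabra", 3)
    print(contains(trie, "cad"), contains(trie, "xyz"))  # True False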

Conclusion
Data mining offers an important approach for extracting value from the data warehouse for use in decision support, and a data warehouse is needed for better-informed and timely decisions. Over the next few years, the growth of data warehousing is going to be enormous, with new products and technologies coming out frequently. To get the most out of this period, data warehouse planners and developers must have a clear idea of what they are looking for and choose their strategies accordingly.
To conclude, we say that performance and flexibility are the keywords.

References
Data Mining Concepts and Techniques, A. K. Raju.
IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, April 2006.

Siravuru Siva Sankar Sharath


124/A,
sector-6 UKKUNAGARAM,
VISAKHAPATNAM-530032
Ph: 0891-2758486(res)
Ph: 9441234774(off)
Email: [email protected]

Sunil Kumar Kodi


Door no. 44-37-43/3,
Akkayapalem,
Srinivas nagar,
Visakhapatnam-530016
Ph: 2517440(res)
Ph: 9885380917(off)
Email: [email protected]

GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING


COMPUTER SCIENCE (2nd year)
