Project Report For ME

Chapter -1 INTRODUCTION

1.1 Data Warehouse


A data warehouse is a database that contains cleansed data drawn from various heterogeneous sources, organized for quick querying of information from different views of the organization. Data warehouses are subject-oriented (i.e., organized around subjects such as customer, vendor, product, activity, or patient) rather than functionally oriented, such as a production planning system or human resources system. Data warehouses are integrated; therefore, the meaning and results of the information are the same regardless of organizational source. The data is non-volatile: unlike a database used for an on-line transaction processing system, where records are continually updated, deleted, and inserted, data in a data warehouse is not updated in place; instead, history accumulates over time. The analytical workload a data warehouse serves is called on-line analytical processing (OLAP). Because a data warehouse contains aggregate data, it can be analyzed to improve the company's growth.

There are several ways a data warehouse or data mart can be structured: multidimensional, star, and snowflake. An underlying concept used by all of these models is that of a dimension. A dimension is one of the different ways the information can be "cut" and summarized, such as geography, time intervals, product groups, or salespersons [6]. A star schema has one fact table and multiple dimension tables. A snowflake schema also has one fact table and multiple dimension tables, but a dimension table may in turn act as a fact table for further dimension tables. A constellation schema has multiple fact tables and multiple dimension tables. Facts are numeric or factual data representing a specific business activity to be analyzed. Dimensions are the individual perspectives on the data that determine the granularity adopted for the fact representation. The records in a single dimension table represent the levels or choices of aggregation for the given dimension.
Using the date dimension, we would be able to analyze data by a single date, or by dates aggregated by month, quarter, fiscal year, calendar year, holidays, etc. [6]. Bill Inmon has formally defined a data warehouse in the following terms: it is subject-oriented, time-variant, non-volatile, and integrated.

Subject-oriented: The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together.

Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.

Non-volatile: Data in the database is never over-written or deleted; once committed, the data is static and read-only, retained for future reporting.

Integrated: The database contains data from most or all of an organization's operational applications, and this data is made consistent.

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses evolved through several fundamental stages:

Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact the operational system's performance.

Offline Data Warehouse - Data warehouses at this stage of evolution are updated on a regular time cycle (usually daily, weekly, or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.

Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery, or a booking).

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

There are many advantages to using a data warehouse; among them, it enhances end-user access to a wide variety of data, and decision support system users can obtain specific trend reports, e.g. the item with the most sales in a particular area or country within the last two years.

Fig 1.1: Three-tier data warehousing architecture. Bottom tier: the data warehouse server with its metadata repository, fed by extract, clean, transform, load, and refresh processes from operational databases and external sources. Middle tier: the OLAP server, together with the data warehouse, data marts, and monitoring and administration components. Top tier: front-end query/report, analysis, and data mining tools.

Fig 1.2: Data Warehouse: A Multi-Tiered Architecture. Data sources (operational DBs and other sources) feed an extract, transform, load, and refresh process into the data warehouse and data marts (data storage), coordinated by a monitor and integrator with metadata; an OLAP server then serves front-end tools for analysis, querying, reporting, and data mining.

There is a difference between a data warehouse and a normal database. Every company conducting business inputs valuable information into transaction-oriented data stores. The distinguishing traits of these online transaction processing (OLTP) databases are that they handle very detailed, day-to-day segments of data, are very write-intensive by nature, and are designed to maximize data input and throughput while minimizing data contention and resource-intensive data lookups. By contrast, a data warehouse is constructed to manage aggregated, historical data records, is very read-intensive by nature, and is oriented to maximize data output. Usually, a data warehouse is fed a daily diet of detailed business data in overnight batch loads, with the intricate daily transactions being aggregated into more historical and analytically formatted database objects. Naturally, since a data warehouse is a collection of a business entity's historical information, it tends to be much larger than its OLTP counterpart.

1.2 Data Mining


Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases". A classic example of data mining is market basket analysis. Data mining is the process of posing queries and extracting previously unknown information in the form of patterns, trends, and structures from large quantities of often diverse data [2]. As data mining evolves and matures, more and more businesses are incorporating this technology into their business practices. However, data mining and decision support software is currently expensive, and selection of the wrong tools can be costly in many ways [3]. The ultimate goal of data mining is to reveal patterns that are easy to perceive, interpret, and manipulate [4]. Data mining can be applied to the following areas: business intelligence, business performance management, discovery science, loyalty card analysis, bioinformatics, intelligence services, and cheminformatics.

The goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest. Description focuses on finding patterns describing the data and their subsequent presentation for user interpretation. There are several data mining techniques fulfilling these objectives; some of these are associations, classifications, sequential patterns, and clustering. The basic premise of an association is to find all associations such that the presence of one set of items in a transaction implies the presence of the other items. Classification develops profiles of different groups. Sequential pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets or clusters. Another way to classify data mining techniques is as:

i. User-guided or verification-driven data mining.

ii. Discovery-driven or automatic discovery of rules.

Most data mining techniques have elements of both models.

1.2.1 Data Mining using Association Rules

The goal of the association rules algorithm is to detect relationships or associations between specific values of nominal attributes in large data sets. There are two important quantities measured for every association rule X => Y: support and confidence. The support is the fraction of transactions that contain both X and Y; the confidence is the fraction of transactions containing X which also contain Y. For a given transaction database T, an association rule is an expression of the form X => Y, where X and Y are subsets of the item set A. The rule X => Y has support s in T if s% of the transactions in T contain X U Y, and holds with confidence c if c% of the transactions in T that contain X also contain Y.
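The support and confidence definitions above can be illustrated with a small, self-contained Python sketch; the transaction data below is invented purely for demonstration.

```python
# Toy illustration of support and confidence for a rule X => Y.
# The transactions are invented example "market baskets".

def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Fraction of transactions containing X that also contain Y."""
    return support(transactions, x | y) / support(transactions, x)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule {bread} => {milk}: support = 2/4, confidence = 2/3.
s = support(transactions, {"bread", "milk"})
c = confidence(transactions, {"bread"}, {"milk"})
print(s, c)
```

The rule is "significant" if its support clears a minimum threshold and "strong" if its confidence does, matching the discussion of thresholds later in this dissertation.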

1.3 Related Work


1.3.1 Construction of a Data Warehouse using a Star Schema

An implementation of a data warehouse for an outpatient clinical information system can be done with a star schema. A star schema contains a fact table and dimension tables, and usually contains measures. Common to the star and snowflake methods is the fact table. The fact table is a database table that contains the data (factual history), such as sales volume, cost, or quantity, usually for the smallest time period or the lowest level of all the dimensions. For an outpatient clinical information system, a simple fact table would have the following columns (SAS variables) [6]:

PATIENT: unique patient identification number
LOCATION: where the procedure was performed
PROVIDER: doctor or facility that performed the procedure
PAYOR: the organization that pays the bill
CPTCODE: standard CPT procedure code
DIAGNOS1: primary diagnosis as an ICD9 code

DIAGNOS2: secondary diagnosis as an ICD9 code
DATESERV: date the procedure was performed
ADJUST: the adjustment to the charge for services
CHARGE: the final actual charge for the procedure
AGE: age of patient at time of procedure
COUNT: the frequency of the event

The first eight variables, from PATIENT to DATESERV, are dimension variables. The final four variables, ADJUST, CHARGE, AGE, and COUNT, are numerical variables to be used for arithmetical or statistical operations. Users of the clinical information system will want to look at the data summarized to various levels. Joining selected dimension tables to the fact table will provide the user with a dataset on which to aggregate the needed information. For example, analyzing the charge by patient, by quarter, by location would require a join of four tables: the fact table, the patient dimension table, the location dimension table, and the DATESERV dimension table. The resultant data file is then aggregated using a Proc Summary step to produce a dataset for analysis [6]. On-line Analytical Processing (OLAP) refers to the analytical capabilities provided by the data warehouse or data mart. One can view granular data or various aggregations of data for business analyses using graphical, user-friendly tools.
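The four-table join and aggregation described above can be sketched with SQLite through Python; the column names follow the text, but the dimension contents and fact rows are invented, and a SQL GROUP BY stands in for the SAS Proc Summary step.

```python
import sqlite3

# Sketch of the outpatient star join: fact table plus patient,
# location, and dateserv dimensions. All rows are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE patient_dim  (patient  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE location_dim (location INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dateserv_dim (dateserv TEXT PRIMARY KEY, quarter TEXT);
CREATE TABLE fact (patient INTEGER, location INTEGER, dateserv TEXT,
                   charge REAL);
INSERT INTO patient_dim  VALUES (1,'A'), (2,'B');
INSERT INTO location_dim VALUES (10,'Clinic-1');
INSERT INTO dateserv_dim VALUES ('2005-01-15','Q1'), ('2005-04-02','Q2');
INSERT INTO fact VALUES (1,10,'2005-01-15',100.0),
                        (1,10,'2005-04-02', 50.0),
                        (2,10,'2005-01-15', 75.0);
""")

# Charge by patient, by quarter, by location: a four-table star join
# followed by aggregation (the role Proc Summary plays in SAS).
rows = db.execute("""
    SELECT p.patient, d.quarter, l.city, SUM(f.charge)
    FROM fact f
    JOIN patient_dim  p ON p.patient  = f.patient
    JOIN location_dim l ON l.location = f.location
    JOIN dateserv_dim d ON d.dateserv = f.dateserv
    GROUP BY p.patient, d.quarter, l.city
    ORDER BY p.patient, d.quarter
""").fetchall()
print(rows)
```

Note that each dimension joins only to the fact table, never to another dimension, which is exactly the star-join shape discussed in Chapter 2.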

Fig 1.3: Star schema for Outpatient Clinical Information System

1.3.2 Predator-Miner [5]

Predator-Miner extends Predator with a relational association rule mining operator to support data mining operations. Predator-Miner allows a user to combine association rule mining queries with SQL queries. This approach towards tight integration differs from existing techniques of using user-defined functions, stored procedures, or re-expressing a mining query as several SQL queries in two aspects: (i) by encapsulating the task of association rule mining in a relational operator, it allows association rule mining to be considered part of the query plan, so that query optimization can be performed on the mining query holistically; (ii) by integrating it as a relational operator, it can leverage the mature field of relational database technology. The authors define ad hoc data mining as a flexible and interactive data mining process performed over a subset of the data without the need for data pre-processing. The motivation for ad hoc data mining is to allow the end user to get a quick analysis of the results from the mining process prior to performing mining on the entire dataset, and to reduce the hassle of a pre-processing step. The novelty of the proposed system is the tight integration of the association rule mining operation within a relational database, ad hoc association rule mining, and the integration of association rule mining queries with relational queries.

Fig1.4: System Architecture for Predator-Miner

1.4 Proposed Work



To analyze the academic data of MGM's College of Engineering, we divided this dissertation into two modules, as shown in Fig 1.5. In the first module we design the data warehouse using a star schema. In the second module, analysis is done using the proposed approaches.

Fig 1.5: Proposed work: construction of the DWH (star schema), followed by analysis of the data by the proposed approaches.

Fig 1.6: System Architecture

1.4.1 Design of the Data Warehouse

The data warehouse is implemented using a star schema, the most widely used schema in data warehouse design. Dimensional modeling, the most prevalent technique for modeling data warehouses, organizes tables into fact tables containing basic quantitative measurements of a business subject, and dimension tables that provide descriptions of the facts being stored. There are several ways a data warehouse can be structured: multidimensional, star, and snowflake. A dimension table is a database table that contains the detailed data; the fact table is a database table that contains the summary-level data. In this dissertation we propose the design of a data warehouse for educational data. For this purpose, the data is first analyzed and then put into a fact table and dimension tables accordingly. The fact table contains data that will change frequently; dimension tables contain data that will not change in the future. The fact table usually contains measures. For this data warehouse, one fact table and nine dimension tables are designed. Steps in the design of the data warehouse:

1. Analyzing the data of the educational system from different views.
2. Construction of the data warehouse using a star schema.
3. Use of a bitmap join index for fast access to the data warehouse.
4. Entry of data into the dimension tables and fact table respectively.

1.4.2 Mining Using Ad-hoc Association Rules

Mining is the process of retrieving knowledge from the data warehouse. Association rules describe the co-occurrence of patterns. There are two important measures for every association rule: support and confidence. The support is the fraction of attendance records that contain both X and Y; the confidence is the fraction of attendance records containing X which also contain Y. The support measures the significance of the rule, so we are interested in rules with relatively high support. The confidence measures the strength of the correlation, so rules with low confidence are not meaningful. Ad hoc means temporary; an ad-hoc association rule finds association rules on temporary relations, that is, on tables which are not connected directly to each other but from which we retrieve and compare data for useful knowledge. Using ad-hoc association rules, data can be retrieved from the analytical-level database. Ad-hoc association rules are able to retrieve data from multiple dimensions of the star schema and compare them. Conventional data mining tools retrieve data only from the fact table, not from the dimension tables, yet the dimensions of a star schema contain much information and cannot be avoided. Using ad-hoc association rules, data can also be retrieved from the dimensions of the star schema. Four types of approaches are suggested in this dissertation.

Type 1:


Sum of (student_present) over date = d1 ... dn and time = t1 ... tn, compared against the days and time threshold.

Sample query: Find the percentage of students present in campus between 29-Aug-05 and 29-Oct-05, between 11 A.M. and 3:15 P.M.

Type 2:
Projection over (class, dept, studentno, presence, date, subject), where presence = 1 if the student is present and 0 otherwise.

Sample query: Find the presence of students on 14-SEP-05 in subject TCPIP from the final-year computer science & engineering department.

Type 3: Let C be the classes. In a class we can see different departments D. A department can be further split into subjects S, and subjects can be further divided into different days D1.

P/A of student = (C < D < S < D1)


D1 gives data by comparison of more dimension tables; C gives data by comparison of fewer dimension tables. Sample query: Find the presence of the student whose enrollment number is Y05CSBE133 between 15-Sep-05 and 13-Nov-05 in three subjects from the final-year computer science & engineering department.


Type 4:
Projection over (class, dept, studentno, subject, presence, date), selecting students for whom attendance across the subjects (A1, A2, ..., Ai) falls below the presence threshold.

Sample query: Find the percentage of students from the third-year electronics & telecommunication branch whose presence is less than 75% in 3 different subjects.
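How a Type 4 style ad-hoc presence query might be evaluated can be sketched in Python; the attendance records and identifiers below are invented, and in the actual system the data would come from the attendance fact table and its dimensions.

```python
# Sketch of a Type 4 ad-hoc query: (student, subject) pairs whose
# presence percentage falls below a threshold. Records are invented
# (student, subject, present-flag) triples.
from collections import defaultdict

records = [
    ("Y05A", "TCPIP", 1), ("Y05A", "TCPIP", 0), ("Y05A", "TCPIP", 0),
    ("Y05A", "DBMS",  1), ("Y05A", "DBMS",  1),
    ("Y05B", "TCPIP", 1), ("Y05B", "TCPIP", 1),
]

def presence_percent(records):
    """Per-(student, subject) presence percentage."""
    total = defaultdict(int)
    present = defaultdict(int)
    for student, subject, p in records:
        total[(student, subject)] += 1
        present[(student, subject)] += p
    return {k: 100.0 * present[k] / total[k] for k in total}

def below_threshold(records, threshold=75.0):
    """(student, subject) pairs whose presence is below the threshold."""
    return sorted(k for k, v in presence_percent(records).items()
                  if v < threshold)

print(below_threshold(records))
```

The same filtering could equally be written as a SQL GROUP BY / HAVING over the star schema; the point is only to make the threshold comparison concrete.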


Chapter -2 STAR SCHEMA AND INDEXING


2.1 Star schema
A description of data in terms of a data model is called a schema. A data model in the form of a star is called a star schema because it looks like a star. The star schema is the most widely used representation of data in a data warehouse. The main advantages of the star schema:

Star schema provides direct and intuitive mapping between the business entities being analyzed by end users and the schema design.

Star schema provides highly optimized performance for typical star queries, although a star schema takes more space than a snowflake schema. Query response time of a snowflake schema is higher than that of a star schema.

Star schema is widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.

Star schema has much simpler structure than snowflake or constellation schema. Additional nodes can be added or deleted without much effort.

Star schema is the most widely used structure for a data warehouse. It supports indexing with much lower design complexity than other schemas.

The first step in designing a fact table is to determine its granularity. Determining granularity involves two steps: (i) determine which dimensions will be included; (ii) determine where along the hierarchy of each dimension the information will be kept. The determining factors usually go back to the requirements. In a star schema, dimension tables are not normalized. Star schemas are used for both simple data marts and very large data warehouses.


Fig 2.1: Star schema. A central fact table (Key1, Key2, Key3, Key4, and the measures) is joined to four dimension tables, each holding one key and its descriptive values.

A simple query against a base dimension table can provide sub-second response, but a query that involves multiple joined snowflakes can take more time. The advantage of this type of schema is its simplicity; it is understandable by end users. Other advantages are low maintenance (since the diagram is simple), the relative ease of defining new hierarchies, and the smaller number of connections. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table. A query against a star schema involves joining the fact table to a number of lookup tables (also called dimension tables). Each lookup table is joined to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other. A star join is a primary-key to foreign-key join of the dimension tables to a fact table. The fact table normally has a concatenated index on the key columns to facilitate this type of join.


2.2 Indexing
2.2.1 B-tree indexing: B-tree indexes are well suited for OLTP applications in which users' queries are relatively routine (and well tuned before deployment in production), as opposed to ad hoc queries, which are much less frequent and executed during non-peak business hours. Because data is frequently updated in and deleted from OLTP applications, bitmap indexes can cause serious locking problems in these situations. A B-tree eliminates the redundant storage of search key values. Deletion may occur in a non-leaf node, which is more complicated. Oracle supports dynamic B-tree-to-bitmap conversion, but it can be inefficient. Null values are not indexed in B-tree indexes. A B-tree index speeds up known queries and is well suited for high-cardinality columns; its space requirement is independent of the cardinality of the indexed column, and it is relatively inexpensive to update the indexed column since individual rows are locked. On the other hand, it performs inefficiently on low-cardinality data, does not support ad hoc queries well, needs more I/O operations for wide-range queries, and its indexes cannot be combined before fetching the data. A fully developed B-tree index is composed of three different types of index pages or nodes:

One root node: the root node contains pointers to branch nodes.

Two or more branch nodes: a branch node contains pointers to leaf nodes or other branch nodes.

Many leaf nodes:
A leaf node contains index items and horizontal pointers to other leaf nodes.

2.2.2 Bitmap Indexing: In a bitmap index, each distinct value of the specified column is associated with a bitmap in which each bit represents a row in the table. A bit value of 1 means the row contains that value; 0 means it does not. Bitmap indexes are a great boon to certain kinds of application; however, when there are bitmap indexes on tables, updates will take out full table locks. Bitmap indexes are good for low-cardinality columns. Bitmap index scans are more efficient than table scans even when returning a large fraction of a table. Indexes are created to allow Oracle to identify requested rows as efficiently as possible. The strategy behind bitmap indexes is very different from the strategy behind B-tree indexes. Inserts and deletes on a table result in updates to all the associated indexes. Bitmap indexes are primarily intended for data warehousing applications; they are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data. Key facts about bitmap indexes:

If a B-tree index is not an efficient mechanism for accessing data, it is unlikely to become more efficient simply because we convert it to a bitmap index.

Bitmap indexes can usually be built quickly and tend to be surprisingly small; the size of the bitmap index varies with the distribution of the data.

Updates to bitmapped columns, and general insertion/deletion of data, can cause serious lock conflicts and can degrade the quality of the indexes quite dramatically.

A bitmap index is a specialized type of index designed for querying on multiple keys; each bitmap index is built on a single key.

A bitmap is an array of bits. Bitmap indexes are generally quite small compared to the actual relation size: records are tens to hundreds of bytes long, whereas a single bit represents a record in a bitmap, so the space occupied by a single bitmap is usually less than 1% of the space occupied by the relation.

Benefits include reduced response time for large classes of ad hoc queries, reduced storage requirements compared to other indexing techniques, dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory, and efficient maintenance during parallel DML and loads.
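The core idea of a bitmap index, one bit vector per distinct column value combined with bitwise AND for multi-key queries, can be sketched in Python; sets of row positions stand in for the bit vectors, and the column data is invented for illustration.

```python
# Sketch of the bitmap-index idea: one "bitmap" per distinct column
# value; a multi-key query intersects the bitmaps instead of scanning
# the table. Sets of row positions stand in for real bit vectors.

def build_bitmap_index(column):
    """Map each distinct value to the set of row positions holding it."""
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, set()).add(row)
    return index

# Two invented low-cardinality columns of a five-row table.
dept = ["CSE", "IT", "CSE", "CSE", "IT"]
year = ["TE",  "TE", "SE",  "TE",  "SE"]

dept_idx = build_bitmap_index(dept)
year_idx = build_bitmap_index(year)

# Rows where dept = 'CSE' AND year = 'TE': intersect the two bitmaps.
matches = sorted(dept_idx["CSE"] & year_idx["TE"])
print(matches)
```

With real bit vectors the intersection is a bitwise AND, which is why bitmap indexes answer multi-key ad hoc queries on low-cardinality columns so cheaply.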

2.2.3 Bitmap Join Indexing: A bitmap join index extends the bitmap index concept such that the index contains the data needed to support a join query, allowing the query to retrieve the data from the index rather than referencing the joined tables. Since the information is compressed into a bitmap, the size of the resulting structure is significantly smaller than the corresponding materialized view. Bitmap join indexes represent the join of columns in two or more tables. With a bitmap join index, the value of a column in one table (a dimension table) is stored with the associated ROWIDs of the matching value in the other tables on which the index is defined. This provides fast join access between the tables, provided the query uses the columns of the bitmap join index. In a data warehouse environment, a bitmap join index may be a more efficient way of accessing data than a materialized-view join. When using a bitmap join index in a warehouse environment, the join is created using an equi-inner join between the primary key column(s) of the dimension tables and the foreign key column(s) of the fact table. SQL for a bitmap join index (the first statement uses the tables of this dissertation; the second shows the general form):

create bitmap index my_bitmap_index on attendence (A.studentno)
from student_no A, attendence B
where A.studentno = B.studentno;

CREATE BITMAP INDEX my_bitmap_index ON fact_table (dimension_table.col2)
FROM dimension_table, fact_table
WHERE dimension_table.col1 = fact_table.col1;

There are a few restrictions on bitmap join indexes:

The bitmap join index is built on a single table.

Oracle allows only one of the tables of a bitmap join index to be updated, inserted into, or deleted from at a time.

A bitmap join index cannot be created on an index-organized table or a temporary table.

Every column in the bitmap join index must be present in one of the associated dimension tables. The join operations in the bitmap join index must form either a star or a snowflake schema.

Primary key columns or unique constraints must exist on the columns that will be join columns in the bitmap join index. All the primary key columns of the dimension table must be part of the join criteria of the bitmap join index. All restrictions on normal bitmap indexes apply to bitmap join indexes. Parallel DML (insertion, deletion, or update) is currently supported only on the fact table; parallel DML on one of the participating dimension tables will mark the index as unusable.

Only one table can be updated concurrently by different transactions when using the bitmap join index. The columns in the index must all be columns of the dimension tables. The dimension table join columns must be either primary key columns or have unique constraints. If a dimension table has a composite primary key, each column in the primary key must be part of the join. (A primary key consisting of more than one column is called a composite primary key.)


Chapter 3 CONSTRUCTION OF DATA WAREHOUSE


Users can enter information using a GUI designed with Java Swing. The data warehouse is designed using Oracle 9i Enterprise Edition, and JDBC is used for connectivity between the GUI and the warehouse. The benefits of this integration are threefold:

1. Users cannot enter arbitrary data into the data warehouse. This reduces redundancy and avoids privacy, security, and confidentiality issues related to data movement.

2. The relational database does all query processing, so we leverage the computational and storage resources of the data warehouse.

3. Extended association rules can be mined from any set of tables within the relational database that stores the data warehouse, enabling wide-range ad-hoc mining.

3.1 Construction of data warehouse


The data warehouse is designed using a star schema. The fact table has foreign keys; the dimension tables have primary keys and the details behind those keys. Our design has one fact table and nine dimension tables. A bitmap join index is used for indexing the schema. Concurrent updating of the dimension tables is not allowed, while concurrent updating of the fact table is allowed, so any number of users can access the data warehouse without damaging it. Because the warehouse is designed using a star schema, it takes more space but gives fast query response. Query execution time is measured in seconds: the most complex query takes 58 seconds, while the response time of a query against the dimension tables is in microseconds (quick response). For our work, the storage space of the data warehouse (one fact table and nine dimension tables) is 19.8 GB. Among the dimension tables, Student_no has 560 tuples (records), Course_no has 3 tuples, Lecturer_no has 52 tuples, dept_key has 4 tuples, subject_key has 56 tuples, date_key has 108 tuples, time_key has 11 tuples, etc. The fact table (attendance) has around 200,000 tuples.


Figure 3.1 shows the complete design of the data warehouse using a star schema for MGM's College of Engineering, Nanded. Attendance is the fact table, and the dimension tables are student, course detail, faculty, department, date, subject, presence, result, room, and time. The fact table is surrounded by the dimension tables.

Fig 3.1: DWH for the Educational System. The attendance fact table carries the keys studentkey, coursekey, facultykey, deptkey, timekey, roomkey, presencekey, subjectkey, and datekey. The dimension tables hold the descriptive attributes: Student (name, address, year of study, degree), Course Detail (course name), Faculty (employee ID, name of faculty, designation, salary), Department (name, number of staff), Date (date-month-year), Subject (name of subject), Presence (present status), Room (building, location, capacity), and Time (time).

In this dissertation, the fact table and dimension tables are designed in Oracle 9i. Sample source code for a dimension table:

create table student_no (
  studentno varchar2(11) constraint student_prim primary key,
  studentname varchar2(20),
  studentdegree varchar2(10),
  yearofstudy varchar2(20));


create table course_no (
  courseno varchar2(5) constraint course_prim primary key,
  coursename varchar2(25));

create table presence_key (
  presence_k varchar2(1) constraint presence_prim primary key,
  detail varchar2(25));

Source code for the fact table:

create table attendence (
  studentno  varchar2(11) constraint fr_std  references student_no(studentno),
  courseno   varchar2(5)  constraint fr_cr   references course_no(courseno),
  lecturerno varchar2(5)  constraint fr_lect references lecturer_no(lecturerno),
  deptkey    varchar2(5)  constraint fr_dept references dept_key(deptkey),
  timekey    varchar2(2)  constraint fr_tk   references time_key(timekey),
  roomkey    varchar2(5)  constraint fr_rk   references room_key(roomkey),
  resultkey  varchar2(5)  constraint fr_rs   references result_key(resultkey),
  presence_k varchar2(1)  constraint fr_pres references presence_key(presence_k),
  subject_k  varchar2(5)  constraint fr_sub  references subject_key(subject_k),
  datekey    varchar2(5)  constraint fr_date references date_key(datekey));

Users enter daily attendance using a GUI developed in Java, shown in Fig 3.2 and Fig 3.3.


Fig 3.2: User entering daily attendance of TECSE


Fig 3.3: User entering daily attendance of SECSE


Chapter 4 DATA MINING USING AD-HOC ASSOCIATION RULE


4.1 Association Rule
Association rules detect relationships or associations between specific values of nominal attributes in large data sets. Two important quantities are measured for every association rule: support and confidence. For a given transaction database T, an association rule is an expression of the form X → Y, where X and Y are subsets of A, the set of all items. The rule X → Y holds with confidence c if c% of the transactions in T that contain X also contain Y, and has support s in T if s% of the transactions in T contain X ∪ Y. The intuitive meaning of such a rule is that a transaction of the database which contains X tends to contain Y. Association rule mining searches for interesting relationships among items in a given data set. An association rule can also be written in the form LHS → RHS, where both LHS and RHS are sets of items; the interpretation of such a rule is that if every item in LHS is purchased in a transaction, then it is likely that the items in RHS are purchased as well. The two measures are defined as follows:

Support: the support for a set of items is the percentage of transactions that contain all these items. The support for a rule LHS → RHS is the support for the item set LHS ∪ RHS.

Confidence: consider the transactions that contain all items in LHS. The confidence for a rule LHS → RHS is the percentage of such transactions that also contain all items in RHS.
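The two measures above can be computed directly; the following sketch in Java (the language of the dissertation's GUI code) uses a small made-up transaction list rather than warehouse data:

```java
import java.util.*;

public class AssocMeasures {
    // Support of an itemset: fraction of transactions containing every item of the set.
    static double support(List<Set<String>> txns, Set<String> items) {
        long hits = txns.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / txns.size();
    }

    // Confidence of lhs -> rhs: among transactions containing lhs,
    // the fraction that also contain rhs.
    static double confidence(List<Set<String>> txns, Set<String> lhs, Set<String> rhs) {
        long lhsHits = txns.stream().filter(t -> t.containsAll(lhs)).count();
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        long bothHits = txns.stream().filter(t -> t.containsAll(both)).count();
        return lhsHits == 0 ? 0.0 : (double) bothHits / lhsHits;
    }

    public static void main(String[] args) {
        // Illustrative market-basket transactions (invented)
        List<Set<String>> txns = List.of(
            Set.of("milk", "bread"),
            Set.of("milk", "bread", "butter"),
            Set.of("bread"),
            Set.of("milk", "butter"));
        System.out.printf("support(milk,bread)      = %.2f%n",
            support(txns, Set.of("milk", "bread")));       // 2 of 4 transactions
        System.out.printf("confidence(milk -> bread) = %.2f%n",
            confidence(txns, Set.of("milk"), Set.of("bread"))); // 2 of 3 milk-buyers
    }
}
```

The same arithmetic underlies the percentages reported for the attendance rules later in this chapter.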


Finding association rules has two phases. In the first phase, all sets of items with high support (often called frequent item sets) are discovered. In the second phase, the association rules with high confidence are constructed from the frequent item sets. The computational cost of the first phase dominates the total computational cost. Association rules are widely used for prediction, but it is important to recognize that such predictive use is not justified without additional analysis or domain knowledge. A limitation of regular association-rule mining is that it requires transaction-level data: standard association-rule mining discovers correlations among items within transactions and can express correlations only between values of a single dimension of the star schema. Standard association rules cannot retrieve data from multiple dimensions, yet some associations become evident only when multiple dimensions are involved. The problem of mining association rules can be decomposed into two sub-problems:
1. Find all sets of items whose support is greater than the user-specified minimum support σ; such item sets are called frequent item sets.
2. Use the frequent item sets to generate the desired rules.

There are several association-rule mining algorithms:
- Apriori algorithm
- Partition algorithm
- Pincer-Search algorithm
- Dynamic itemset counting algorithm
- FP-tree growth algorithm
- Incremental algorithm
- Generalized association rules

Apriori algorithm: Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach


known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First the set of frequent 1-itemsets, denoted L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database [26].

Partition algorithm: The partition algorithm reduces the number of scans of the database to two by splitting the database into partitions small enough to load into memory. During the second scan, only the itemsets that are frequent in at least one partition are used as candidates and are counted to determine whether they are frequent in the entire database, thus reducing the set of candidates. In the partitioning algorithm, D is divided into P partitions D1, D2, ..., Dp. Partitioning may improve the performance of finding large itemsets in several ways:
1. By the large itemset property, a large itemset must be large in at least one of the partitions.
2. Partition algorithms may adapt better to limited main memory: each partition can be created such that it fits into main memory.
3. Partitioning makes it easy to create parallel and/or distributed algorithms, where each partition can be handled by a separate machine.
4. Incremental generation of association rules may be easier to perform by treating the current state of the database as one partition and the new entries as a second partition [25].

Pincer-Search algorithm: The Pincer-Search algorithm relies on a combined approach for determining the maximum frequent set, and generates all maximal frequent itemsets [23].

Dynamic itemset counting algorithm: The dynamic itemset counting algorithm is an alternative to Apriori itemset generation. Itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent all of its subsets must also be frequent, so only those itemsets whose subsets are all frequent are examined [22].

FP-tree growth algorithm: The FP-growth algorithm is currently one of the fastest approaches to frequent itemset mining. It is based on a prefix-tree representation of the given database of transactions (called an FP-tree), which can save considerable amounts of memory for storing the transactions. The basic idea of FP-growth can be described as a recursive elimination scheme: in a preprocessing step, delete from the transactions all items that are not frequent individually, i.e., that do not appear in a user-specified minimum number of transactions. Then select all transactions that contain the least frequent item (among those that are frequent) and delete this item from them. Recurse to process the obtained reduced (also known as projected) database, remembering that the itemsets found in the recursion share the deleted item as a prefix. On return, remove the processed item from the database of all transactions and start over with the next frequent item. In these processing steps the prefix tree, enhanced by links between its branches, is exploited to quickly find the transactions containing a given item and to remove this item from the transactions after it has been processed [21].

Incremental algorithm: An incremental updating algorithm for the maintenance of previously discovered association rules is applied on data cubes. Because of the huge amounts of data usually processed, using the data cubes accelerates the job and avoids scanning the whole database after every update. This algorithm also suggests a way to perform the incremental mining process in practice without affecting the original (functioning) database [20].

Generalized association rules: Generalized association rules are rules that contain some background knowledge, therefore giving a more general view of the domain.
This knowledge is codified by a taxonomy defined over the data set items. Many researchers use taxonomies in different data mining steps to obtain generalized rules. In general, these works reduce the obtained rule set by pruning some specialized rules using a subjective measure, but rarely analyze the quality of the rules. In this context, [24] presents a quality analysis of generalized association rules in which a different objective measure is used depending on the side of the rule on which a generalized item occurs. Based on this, a grouping of measures was generated according to the generalization side; these measure groups can help specialists choose an appropriate measure to evaluate their generalized rules [24].

The standard association-rule mining question for an educational environment would be: What is the presence of students? This question involves only one dimension. A principal can ask a question like: What is the presence of third-year students from the Computer Science branch? Another typical question: What is the presence of third-year students from the Computer Science branch during 23 August 2005 to 22 September 2005? Standard association rules can express correlations between values of a single dimension of the star schema, but values of the other dimensions may also be correlated. Standard association-rule mining works with transaction-level data that reflects individual records. Several association rules can be found if more than one dimension is considered, yet standard association-rule mining cannot answer these types of query; only ad-hoc association rules can.
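The level-wise idea behind Apriori can be sketched compactly. The Java below is an illustrative simplification using toy transactions: it performs candidate generation and counting level by level, but omits Apriori's subset-pruning of candidates, so it is not the full algorithm:

```java
import java.util.*;

public class AprioriSketch {
    // Phase one of rule mining: level-wise discovery of frequent itemsets.
    // Returns one set of frequent itemsets per level (L1, L2, ...).
    static List<Set<Set<String>>> frequentItemsets(List<Set<String>> txns, int minSupport) {
        List<Set<Set<String>>> levels = new ArrayList<>();
        // L1: count single items, keep those meeting the minimum support
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : txns)
            for (String item : t) counts.merge(item, 1, Integer::sum);
        Set<Set<String>> current = new HashSet<>();
        for (var e : counts.entrySet())
            if (e.getValue() >= minSupport) current.add(Set.of(e.getKey()));
        while (!current.isEmpty()) {
            levels.add(current);
            // Candidate generation: join Lk with itself to get (k+1)-itemsets
            Set<Set<String>> candidates = new HashSet<>();
            for (Set<String> a : current)
                for (Set<String> b : current) {
                    Set<String> union = new HashSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1) candidates.add(union);
                }
            // Count candidates with one pass over the transactions; keep frequent ones
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> c : candidates) {
                long sup = txns.stream().filter(t -> t.containsAll(c)).count();
                if (sup >= minSupport) next.add(c);
            }
            current = next;
        }
        return levels;
    }

    public static void main(String[] args) {
        // Toy transactions (invented)
        List<Set<String>> txns = List.of(
            Set.of("a", "b", "c"), Set.of("a", "b"),
            Set.of("a", "c"), Set.of("b", "c"));
        System.out.println(frequentItemsets(txns, 2)); // L1 and L2 survive; {a,b,c} does not
    }
}
```

Each iteration of the while loop corresponds to one full scan of the database in the description of Apriori above.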

4.2 Mining process using ad-hoc association rule


Analysis is done using the ad-hoc association rule. The mining process starts with the user defining the extended association rule; the choices are made through a simple interface with combo boxes. Our architecture allows complete flexibility in choosing values from different dimensions, and the combo-box interface prevents entry errors. Once all parameters are chosen, the next step of the mining process is the creation of a sequence of SQL queries that implements the extended association rule specified by the user's choice of parameter values. We can then compare the students obtained in the results of the association rules.

Through the first approach, we can find association rules for students who are present during particular days and time periods, for the entire college or for a department. We can find which students are present together at a particular time, and which students prefer to attend lectures in the morning, afternoon, or evening. After analyzing the time most preferred by students, the college can change its timings, e.g., instead of 11:00 AM to 5:30 PM, the college timing can be shifted to 8:00 AM to 2:30 PM.

Through the second approach, we can find association rules for students from a given class, department, and subject who are present on a particular day, and thus different association rules for different departments, subjects, and classes, e.g., which students attend the lectures of a particular subject together. Given that the student with enrollment number Y05CSBE133 is present in a lecture, we can see which other students are present in that lecture. By determining the level of the students present in a particular faculty member's lecture, we can gauge that faculty member's expertise in the subject; in each case we can again calculate the support and confidence.

Through the third approach, we can find association rules for a student who is not present during particular days in n different subjects. We can also find the presence or absence of a student during particular days in different subjects.

Through the fourth approach, we can find association rules for students whose presence is less than x% in n different subjects from a particular class and department. Through this approach we can identify groups of students who are always together.

4.3 Ad-hoc Association Rule for academic data


Ad-hoc association rules are temporary association rules: they provide a temporary relation between data. An ad-hoc association rule can find data across multiple dimensions of the star schema and can also compare them for more information. Ad-hoc association rules are useful for retrieving data from multiple tables that are not connected directly. Ad-hoc association-rule mining requires, and can retrieve, analytical-level data.

Ad-hoc mining allows the user to mine interactively from a subset of the dataset to get a feel for the association rules generated [5]. The ad-hoc association rule is a concept that can be applied to any type of analytical data. In this dissertation, educational data of MGM's College of Engineering is used for analysis. Using ad-hoc association rules we can solve queries that span multiple dimensions, e.g., What is the presence of third-year students from the Computer Science branch during 23 August 2005 to 22 September 2005, between 11 AM and 2 PM? Another typical query: List the students whose attendance is less than 75% in three different subjects from the third-year Computer Science branch during 23 August 2005 to 22 September 2005, between 11 AM and 2 PM. These types of query can only be solved by the ad-hoc association rule. Ad-hoc association rules place no limitation on the number of dimensions, so we can get an exact answer for any query involving many dimensions; the query response time is higher for queries that involve more dimensions. A typical query has multiple dimensions and refers to data held in the dimension tables. An inner join operation is used to join the fact table with the various dimension tables, and data of one dimension table is compared with data of other dimension tables with the help of the where clause. Inner joins can be nested to join more than two dimension tables. The syntax of an inner join is:

    select column_name
    from table1 inner join
         (table2 inner join table3 on table2.param = table3.param)
         on table2.param1 = table1.param1
    where conditions;

Syntax for nested inner join is:


    select column_name
    from table1 inner join
         (table2 inner join
              (table3 inner join table4 on table3.param1 = table4.param1)
          on table2.param2 = table3.param2)
         on table1.param3 = table3.param3
    where conditions;

    select column_name
    from (table1 inner join
              (table2 inner join table3 on table2.param = table3.param)
          on table1.param = table2.param)
    where date_key.datekey between 'datekey1' and 'datekey2'
      and time_key.ltime between 'ltime1' and 'ltime2';

In this dissertation four approaches are proposed. For every association rule we are interested in its support and confidence. Support is the fraction of presence records that contain both X and Y, where X and Y are students or dates on which a student is present. Confidence is the fraction of presence records containing X that also contain Y. We are always interested in those rules that have high support and confidence.
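The effect of such a nested join can be mimicked in memory. The Java sketch below joins a tiny, made-up fact table to date and time dimensions and applies the between filters, mirroring what the generated SQL does inside the warehouse (the keys D1, T1 and the sample rows are invented):

```java
import java.time.LocalDate;
import java.util.*;

public class StarJoinSketch {
    // One fact-table row: foreign keys into the dimension tables.
    record Fact(String studentNo, String dateKey, String timeKey, String presence) {}

    // In-memory analogue of: attendence INNER JOIN date_key INNER JOIN time_key
    // WHERE datekey BETWEEN from AND to, lecture hour BETWEEN hourFrom AND hourTo,
    // and presence = 'P'.
    static List<String> presentStudents(List<Fact> facts,
                                        Map<String, LocalDate> dateDim,
                                        Map<String, Integer> timeDim,
                                        LocalDate from, LocalDate to,
                                        int hourFrom, int hourTo) {
        return facts.stream()
            .filter(f -> f.presence().equals("P"))
            .filter(f -> {                       // "join" to the date dimension
                LocalDate d = dateDim.get(f.dateKey());
                return !d.isBefore(from) && !d.isAfter(to);
            })
            .filter(f -> {                       // "join" to the time dimension
                int h = timeDim.get(f.timeKey());
                return h >= hourFrom && h <= hourTo;
            })
            .map(Fact::studentNo)
            .toList();
    }

    public static void main(String[] args) {
        Map<String, LocalDate> dateDim = Map.of(
            "D1", LocalDate.of(2005, 8, 29),
            "D2", LocalDate.of(2005, 10, 29));
        Map<String, Integer> timeDim = Map.of("T1", 11, "T2", 15); // lecture start hour
        List<Fact> facts = List.of(
            new Fact("Y05CSTE107", "D1", "T1", "P"),
            new Fact("Y05CSTE111", "D2", "T2", "P"),
            new Fact("Y05CSTE107", "D2", "T2", "A")); // absent, filtered out
        System.out.println(presentStudents(facts, dateDim, timeDim,
            LocalDate.of(2005, 8, 29), LocalDate.of(2005, 10, 29), 11, 17));
    }
}
```

In the real system this filtering happens inside Oracle; the sketch only illustrates the join-then-filter semantics of the generated queries.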

Approach 1:

Mathematical Equation:

    Σ from (date = d1, time = t1) to (date = dn, time = tn) of (student_present)  <=  days & time threshold

Algorithm:
1. Attributes will be entered by user:
   - Days
   - Times

2. Use INNER JOIN (DEPT_KEY, STUDENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTENDENCE). INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user:


    select studentno
    from presence_key inner join
         (date_key inner join
              (attendence inner join time_key
               on attendence.timekey = time_key.timekey)
          on date_key.datekey = attendence.datekey)
         on presence_key.presence_k = attendence.presence_k
    where date_key.datekey between 'datekey1' and 'datekey2'
      and time_key.ltime between 'ltime1' and 'ltime2'
      and presence_key.presence_k = 'presence_k';

3. Use a counter

    int count = 0;                                            // initialize counter to zero
    while (rs.next()) {
        String studentno = rs.getString(1);                   // get data from the data warehouse
        System.out.println("Enrollment Number:" + studentno); // print value
        count++;                                              // increment counter by 1
    }
    rs.last();                                                // close loop
    System.out.println("Total : " + count);                   // print value of total count

4. Calculate the total number of scheduled lectures and store it in a variable Y

    // For finding the total enrolled students in college
    select studentno
    from (date_key inner join
              (attendence inner join time_key
               on attendence.timekey = time_key.timekey)
          on date_key.datekey = attendence.datekey)
    where date_key.datekey between 'datekey1' and 'datekey2'
      and time_key.ltime between 'ltime1' and 'ltime2';

5. Find the percentage

    Z = (count / Y) * 100;
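One caveat for step 5: in Java, (count / Y) * 100 with int operands performs integer division before the multiplication, truncating the result. A safe sketch:

```java
public class PercentSketch {
    // Z = (count / Y) * 100 with ints silently truncates; force floating-point math.
    static double percentage(int count, int total) {
        if (total == 0) return 0.0;     // guard against division by zero
        return count * 100.0 / total;   // 100.0 promotes the expression to double
    }

    public static void main(String[] args) {
        System.out.println((16 / 38) * 100);    // integer arithmetic: prints 0
        System.out.println(percentage(16, 38)); // correct value, about 42.1
    }
}
```

The same correction applies to step 5 of the other three approaches below.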

Standard Query:
Find the percentage of students present on campus between given dates during a given time period.

Query1:
Find the percentage of students present on campus between 29-Aug-05 and 29-Oct-05 during 11 A.M. and 3:15 P.M. The result of Query 1 is shown in Fig. 4.1.

Fig: 4.1: Output of query 1

Student 'Y05CSTE107', when present during 11 AM to 1 PM, is also present during 3:30 PM to 5:30 PM: support 42.11% (16/38), confidence 100% (16/16).
Student 'Y05CSTE107', when present during 3:30 PM to 5:30 PM, is also present during 11 AM to 1 PM: support 42.11% (16/38), confidence 84.21% (16/19).
Student 'Y05CSTE111', when present during 11 AM to 1 PM, is also present during 3:30 PM to 5:30 PM: support 57.89% (22/38), confidence 70.96% (22/31).
Student 'Y05CSTE111', when present during 3:30 PM to 5:30 PM, is also present during 11 AM to 1 PM: support 57.89% (22/38), confidence 100% (22/22).


Query2:
Find the percentage of students present on campus between 29-Aug-05 and 29-Oct-05 during 11 A.M. and 5:30 P.M. The result of Query 2 is shown in Fig. 4.2.

Fig: 4.2: Output of query 2

Related query:
Find the percentage of students from a particular department present on campus between given dates during a given time period.

Query3:
Find the percentage of students present on campus between 29-Aug-05 and 29-Oct-05 during 11 A.M. and 3:15 P.M. from the CSE department. The result of Query 3 is shown in Fig. 4.3.


Fig: 4.3: Output of query 3

Approach 2:

Mathematical Equation:

    Σ over (class, dept, studentno, presence, date, subject) of [ 1 if student present, else 0 ]

It is another approach for finding the total number of students present by applying various conditions.

Algorithm:
1. Attributes will be entered by user:
   - Date
   - Class
   - Department
   - Subject


2. Use INNER JOIN (DEPT_KEY, COURSE_NO, STUDENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTENDENCE). INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user:

    select studentno
    from presence_key inner join
         (date_key inner join
              (student_no inner join
                   (attendence inner join subject_key
                    on attendence.subject_k = subject_key.subject_k)
               on student_no.studentno = attendence.studentno)
          on date_key.datekey = attendence.datekey)
         on presence_key.presence_k = attendence.presence_k
    where student_no.studentno = 'studentno'
      and date_key.datekey between 'd1' and 'd2'
      and subject_key.subject_k = 'subject_k'
      and presence_key.presence_k = 'presence_k';

3. Use a counter

    int count = 0;                                            // initialize counter to zero
    while (rs.next()) {
        String studentno = rs.getString(1);                   // get data from the data warehouse
        System.out.println("Enrollment Number:" + studentno); // print value
        count++;                                              // increment counter by 1
    }
    rs.last();                                                // close loop
    System.out.println("Total : " + count);                   // print value of total count

4. Calculate the total number of scheduled lectures and store it in a variable Y

    // For finding total enrolled students for a particular course from a particular department and class
    select studentno
    from (date_key inner join
              (student_no inner join
                   (attendence inner join subject_key
                    on attendence.subject_k = subject_key.subject_k)
               on student_no.studentno = attendence.studentno)
          on date_key.datekey = attendence.datekey)
    where student_no.studentno = 'studentno'
      and date_key.datekey between 'd1' and 'd2'
      and subject_key.subject_k = 'subject_k';

5. Find the percentage

    Z = (count / Y) * 100;

Standard Query:
Find the presence of students on a particular day, from a particular department and class, in a particular subject.

Query1:
Find the presence of students on 14-Sep-05 in the subject TCPIP for the final-year Computer Science & Engineering department. The result of Query 1 is shown in Fig. 4.4. The percentage of students present in the TCPIP lecture on 14-Sep-05 is 81.63%.

Fig: 4.4: Output of approach 2 for BECSE class with subject TCPIP on 14-SEP-05


When studentno 'Y05CSBE137' is present studentno 'Y05CSBE147' is also present. Support is 13.15 % and confidence is 12.12 %. When studentno 'Y05CSBE147' is present studentno 'Y05CSBE137' is also present. Support is 13.15 % and confidence is 100 %. When studentno 'Y05CSBE106' is present studentno 'Y05CSBE137' is also present. Support is 76.32 % and confidence is 85.29 %. When studentno 'Y05CSBE137' is present studentno 'Y05CSBE106' is also present. Support is 76.32 % and confidence is 87.87 %.

Approach 3:

Mathematical Equation:

    P/A of student = (C < D < S < D1)

where each successive condition, class C, department D, subject S, date D1, further restricts the selection. This approach splits the support into parts. We are interested in the presence or absence (P/A) of students. First we look at the presence or absence of students in a particular class, then in a particular department, then in a particular subject, and finally on a particular date. As we apply more conditions we get more accurate data (smaller in volume) for further analysis. Let C be the classes; within a class we can see different departments D; a department can be further split into subjects S; and subjects can be further divided into different days D1. C yields the most data, while D1 yields the least, and it is this smaller, focused data that we are interested in for decision making.
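The drill-down C → D → S → D1 can be illustrated in memory; in this Java sketch (records and values invented), each successive filter shrinks the data set:

```java
import java.util.*;
import java.util.function.Predicate;

public class DrillDownSketch {
    // One attendance record (all values invented for illustration).
    record Att(String clazz, String dept, String subject, String date,
               String studentNo, boolean present) {}

    // Generic narrowing step: keep only the rows matching the next condition.
    static List<Att> filter(List<Att> rows, Predicate<Att> p) {
        return rows.stream().filter(p).toList();
    }

    public static void main(String[] args) {
        List<Att> rows = List.of(
            new Att("TE", "CSE", "TCPIP", "2005-09-14", "Y05CSBE133", true),
            new Att("TE", "CSE", "DBMS",  "2005-09-15", "Y05CSBE133", false),
            new Att("TE", "ECT", "DSP",   "2005-09-14", "Y05ECTE101", true),
            new Att("BE", "CSE", "TCPIP", "2005-09-14", "Y04CSBE120", true));
        List<Att> c  = filter(rows, r -> r.clazz().equals("TE"));        // C: class
        List<Att> d  = filter(c,    r -> r.dept().equals("CSE"));        // D: department
        List<Att> s  = filter(d,    r -> r.subject().equals("TCPIP"));   // S: subject
        List<Att> d1 = filter(s,    r -> r.date().equals("2005-09-14")); // D1: date
        // Each step yields less data: here 3, 2, 1, 1 rows.
        System.out.printf("C=%d D=%d S=%d D1=%d%n",
                          c.size(), d.size(), s.size(), d1.size());
    }
}
```

In the warehouse, the same narrowing is expressed as additional conditions in the where clause of the generated SQL.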

Algorithm:
1. Attributes will be entered by user:
   - Student Number
   - Dates
   - Subject


2. Use INNER JOIN (DEPT_KEY, COURSE_NO, STUDENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTENDENCE). INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user:

    select datekey
    from presence_key inner join
         (date_key inner join
              (student_no inner join
                   (attendence inner join subject_key
                    on attendence.subject_k = subject_key.subject_k)
               on student_no.studentno = attendence.studentno)
          on date_key.datekey = attendence.datekey)
         on presence_key.presence_k = attendence.presence_k
    where student_no.studentno = 'studentno'
      and date_key.datekey between 'd1' and 'd2'
      and subject_key.subject_k = 'subject_k'
      and presence_key.presence_k = 'presence_k';

3. Use a counter

    int count = 0;                                // initialize counter to zero
    while (rs.next()) {
        String datekey = rs.getString(1);         // get data from the data warehouse
        System.out.println("Datekey:" + datekey); // print value
        count++;                                  // increment counter by 1
    }
    rs.last();                                    // close loop
    System.out.println("Total : " + count);       // print value of total count

4. Calculate the total number of scheduled lectures and store it in a variable Y

    // For finding total engaged lectures for a particular course from a particular department and class
    select datekey
    from (date_key inner join
              (student_no inner join
                   (attendence inner join subject_key
                    on attendence.subject_k = subject_key.subject_k)
               on student_no.studentno = attendence.studentno)
          on date_key.datekey = attendence.datekey)
    where student_no.studentno = 'studentno'
      and date_key.datekey between 'd1' and 'd2'
      and subject_key.subject_k = 'subject_k';

5. Find the percentage

    Z = (count / Y) * 100;


Standard Query:
Find the presence or absence of a particular student from a particular department and class between given dates, in one subject, three subjects, and five subjects.

Query1:
Find the presence of the student whose enrollment number is Y05CSBE133 between 15-Sep-05 and 13-Nov-05 in three subjects from the final-year Computer Science & Engineering department. The result of Query 1 is shown in Fig. 4.5.

Fig: 4.5: Output of approach 3 for three subjects



Query2:
Find the presence of the student whose enrollment number is Y05CSBE133 between 15-Sep-05 and 13-Nov-05 in five subjects from the final-year Computer Science & Engineering department. The result of Query 2 is shown in Fig. 4.6.

Fig: 4.6: Output of approach 3 for five subjects


Approach 4:

Mathematical Equation:

    Σ over (class, dept, studentno, subject = 1..i, presence, date) of (A1 ∧ A2 ∧ ... ∧ Ai)  <  presence threshold

It is the most powerful approach, answering typical queries with full flexibility. It splits the support into parts; the desired output is the presence or absence of students subject to a presence threshold and a subject threshold.

Algorithm:
1. Attributes will be entered by user:
   - Percentage threshold
   - Class
   - Department
   - Subject threshold

2. Use INNER JOIN (DEPT_KEY, COURSE_NO, STUDENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTENDENCE). A NESTED SELECT is used for calculating the percentage. INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user:

    select distinct k.studentID
    from (select a.studentno as studentID, a.subject_k, a.deptkey,
                 (select count(presence_k) * 100
                  from attendence b1
                  where b1.presence_k = 'P'
                    and b1.studentno = a.studentno
                    and b1.subject_k = 'subject'
                    and b1.deptkey = 'dept')
                 /
                 (select count(presence_k)
                  from attendence b2
                  where b2.studentno = a.studentno
                    and b2.subject_k = 'subject'
                    and b2.deptkey = 'dept') as PercentageAttd
          from attendence a) k
    where k.PercentageAttd <= 'percentage'
      and k.subject_k = 'subject'
      and k.deptkey = 'dept';


    select studentno
    from presence_key inner join
         (date_key inner join
              (student_no inner join
                   (attendence inner join subject_key
                    on attendence.subject_k = subject_key.subject_k)
               on student_no.studentno = attendence.studentno)
          on date_key.datekey = attendence.datekey)
         on presence_key.presence_k = attendence.presence_k
    where student_no.studentno = 'studentno'
      and date_key.datekey between 'd1' and 'd2'
      and subject_key.subject_k = 'subject_k'
      and presence_key.presence_k = 'presence_k';

3. Use a counter

    int count = 0;                                            // initialize counter to zero
    while (rs.next()) {
        String studentno = rs.getString(1);                   // get data from the data warehouse
        System.out.println("Enrollment Number:" + studentno); // print value
        count++;                                              // increment counter by 1
    }
    rs.last();                                                // close loop
    System.out.println("Total : " + count);                   // print value of total count

4. Calculate the total number of scheduled lectures and store it in a variable Y

    // For finding total enrolled students for a particular course from a particular department and class
    select studentno
    from (date_key inner join
              (student_no inner join
                   (attendence inner join subject_key
                    on attendence.subject_k = subject_key.subject_k)
               on student_no.studentno = attendence.studentno)
          on date_key.datekey = attendence.datekey)
    where student_no.studentno = 'studentno'
      and date_key.datekey between 'd1' and 'd2'
      and subject_key.subject_k = 'subject_k';

5. Find the percentage

    Z = (count / Y) * 100;
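The defaulter query of this approach can also be sketched in memory. The Java below (invented sample records) computes each student's attendance percentage per subject and reports the students below the threshold in at least N subjects, mirroring the nested-select percentage logic above:

```java
import java.util.*;

public class DefaulterSketch {
    // One per-lecture attendance record (sample data, invented).
    record Att(String studentNo, String subject, boolean present) {}

    // Students whose attendance is below pctThreshold in at least minSubjects subjects.
    static Set<String> defaulters(List<Att> rows, double pctThreshold, int minSubjects) {
        // (student -> subject -> [present count, total count])
        Map<String, Map<String, int[]>> tally = new HashMap<>();
        for (Att r : rows) {
            int[] c = tally.computeIfAbsent(r.studentNo(), k -> new HashMap<>())
                           .computeIfAbsent(r.subject(), k -> new int[2]);
            if (r.present()) c[0]++;
            c[1]++;
        }
        Set<String> result = new TreeSet<>();
        for (var e : tally.entrySet()) {
            long lowSubjects = e.getValue().values().stream()
                .filter(c -> 100.0 * c[0] / c[1] < pctThreshold)  // attendance % per subject
                .count();
            if (lowSubjects >= minSubjects) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Att> rows = List.of(
            new Att("S1", "TCPIP", true),  new Att("S1", "TCPIP", false), // 50%
            new Att("S1", "DBMS",  false), new Att("S1", "DBMS",  false), // 0%
            new Att("S2", "TCPIP", true),  new Att("S2", "TCPIP", true),  // 100%
            new Att("S2", "DBMS",  true),  new Att("S2", "DBMS",  false)); // 50%
        // S1 is below 75% in two subjects, S2 in only one
        System.out.println(defaulters(rows, 75.0, 2));
    }
}
```

The warehouse performs the same per-student grouping with the correlated subqueries over the attendence fact table shown in step 2.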

Standard Query:
Find the percentage of students whose presence is less than X% in Y different subjects in a particular class and branch.


Query1:
Find the percentage of students from the third-year Electronics & Telecommunication branch whose presence is less than 75% in 3 different subjects. The result of Query 1 is shown in Fig. 4.7.

Fig 4.7: Output of approach 4 for TEECT with three subjects & 75% presence

Query2:
Find the percentage of students from the third-year Electronics & Telecommunication branch whose presence is less than 75% in 4 different subjects. The result of Query 2 is shown in Fig. 4.8.


Fig 4.8: Output of approach 4 for TEECT with four subjects & 75% presence


Chapter 5 CONCLUSION AND FUTURE WORK


5.1 Conclusion
Many educational institutes keep their data in databases. To improve academic standards, those data need to be analyzed from different views, and if educational institutes use a data warehouse for storing the data, the analysis becomes much simpler. Data warehousing and data mining tools are costly, and academic institutes are reluctant to spend large amounts only to improve academic standards. In this dissertation a novel way of analysis is proposed; the major advantage of our framework is that data warehousing and data mining tools are not necessary. Any non-technical user can enter academic data into the star schema (the representation of data in the data warehouse) with the help of a GUI, and the data is stored in the data warehouse directly. Data once entered in the data warehouse cannot be changed or updated. Any user can fire any of the proposed query types through the GUI and will get exact data, not approximate, within microseconds. All data is retrieved through the dimension tables of the star schema. Data mining tools can retrieve data only from the fact table, yet the dimensions of a star schema carry much more information, which cannot be ignored; we retrieve data through the dimensions of the star schema and compare dimensions with one another, which is effectively impossible for any data mining tool. Even if an institute purchased mining tools, it could not utilize them as efficiently as our approaches, and it would also need a data warehousing tool for designing the data warehouse before any mining tool could be used. Our approaches work better than such tools, and we have also proposed a way of constructing the data warehouse using a star schema. Our system is fully dynamic: one user can enter data into the fact table while another retrieves it over the network, and the system is compatible with any type of network.
Our framework can be used to improve academic standards by any educational institute, whether a school, a college, or a coaching center. In this dissertation we have proposed a complete solution for improving academic standards.


In this dissertation the framework is tightly integrated with data warehousing technology. In addition to integrating mining with database technology by keeping all query processing within the data warehouse, this approach introduces the following two innovations: 1) extended association rules that use the other, non-item dimensions of the data warehouse, which results in more detailed and ultimately actionable rules for academic institutes; and 2) association rules defined over aggregated data.

5.2 Future Work


- Analysis can be extended to snowflake and fact constellation schema representations of the data warehouse.
- A further performance study can be undertaken with the result dimension, larger data sets, and various types of indexes.
- Primary keys of the fact table can be generated dynamically.
- Different visualization schemes can be used, e.g., bar charts and histograms.


References
[1] Svetlozar Nestorov and Nenad Jukic. Ad-hoc Association-Rule Mining within the Data Warehouse. Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS'03), IEEE (2002).
[2] Bhavani M. Thuraisingham and Marion G. Ceruti. Understanding Data Mining and Applying It to Command, Control, Communications and Intelligence Environments. IEEE (2000).
[3] Ken Collier, Bernard Carey, Donald Sautter, and Curt Marjaniemi. A Methodology for Evaluating and Selecting Data Mining Software. Proceedings of the 32nd Hawaii International Conference on System Sciences, IEEE (1999).
[4] Michael H. Smith and Witold Pedrycz. Expanding the Meaning of and Applications for Data Mining. IEEE (2000).
[5] Wee Hyong Tok, Twee Hee Ong, Wai Lup Low, Indriyati Atmosukarto, and Stephane Bressan. Predator-Miner: Ad hoc Mining of Association Rules Within a Database Management System. Proceedings of the 18th International Conference on Data Engineering (ICDE'02), IEEE (2002).
[6] Maria Lupetin. A Data Warehouse Implementation Using the Star Schema.
[7] Agrawal R., Imielinski T., and Swami A. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the ACM SIGMOD International Conference (1993), 207-216.
[8] Agrawal R. and Srikant R. Fast Algorithms for Mining Association Rules. Proceedings of the International Conference on Very Large Databases (VLDB) (1994), 487-499.
[9] Agrawal R., Mannila H., Srikant R., Toivonen H., and Verkamo A. Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).
[10] Hilderman R.J., Carter C.L., Hamilton H.J., and Cercone N. Mining Association Rules from Market Basket Data Using Share Measures and Characterized Itemsets. International Journal of Artificial Intelligence Tools 7(2) (1998), 189-220.
[11] Chaudhuri S. and Dayal U. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record 26(1) (1997), 65-74.

[12] Inmon W. H. Building the Data Warehouse. Wiley (1996).
[13] Kimball R., Reeves L., Ross M., and Thornthwaite W. The Data Warehouse Lifecycle Toolkit. Wiley (1998).
[14] Leavitt N. Data Mining for the Corporate Masses. IEEE Computer 35(5) (2002), 22-24.
[15] Sarawagi S., Thomas S., and Agrawal R. Integrating Mining with Relational Database Systems: Alternatives and Implications. Proceedings of the ACM SIGMOD Conference (1998), 343-354.
[16] Tsur S., Ullman J., Abiteboul S., Clifton C., Motwani R., Nestorov S., and Rosenthal A. Query Flocks: A Generalization of Association-Rule Mining. Proceedings of the ACM SIGMOD Conference (1998), 1-12.
[17] Wang K., He Y., and Han J. Mining Frequent Itemsets Using Support Constraints. Proceedings of the International Conference on Very Large Databases (VLDB) (2000), 43-52.
[18] Watson H. J., Annino D. A., and Wixom B. H. Current Practices in Data Warehousing. Information Systems Management 18(1) (2001).
[19] Berry M. and Linoff G. Data Mining Techniques for Marketing, Sales and Customer Support. Wiley (1997).
[20] Ravindra Patel, D. K. Swami, and K. R. Pardasani. Lattice Based Algorithm for Incremental Mining of Association Rules. International Journal of Theoretical and Applied Computer Sciences 1(1) (2006).
[21] Christian Borgelt. An Implementation of the FP-growth Algorithm. ACM (2005).
[22] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data.
[23] Dao-I Lin and Zvi M. Kedem. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set (1997).
[24] V.O. de Carvalho, S.O. Rezende, and M. de Castro. Evaluating Generalized Association Rules through Objective Measures. Proceedings of Artificial Intelligence and Applications (2007).
[25] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, pp. 348-354.
[26] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, pp. 230-235.

