Project Report For ME
A data warehouse has four defining characteristics:

Subject-oriented: The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together.

Time-variant: Changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.

Non-volatile: Data in the database is never over-written or deleted; once committed, the data is static and read-only, retained for future reporting.

Integrated: The database contains data from most or all of an organization's operational applications, and this data is made consistent.

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:

Offline operational databases: Data warehouses in this initial stage are developed by simply copying the database of an operational system to an offline server, where the processing load of reporting does not impact the operational system's performance.

Offline data warehouse: Data warehouses at this stage of evolution are updated on a regular time cycle (usually daily, weekly, or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.

Real-time data warehouse: Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery, or a booking).

Integrated data warehouse: Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

There are many advantages to using a data warehouse, among them: it enhances end-user access to a wide variety of data, and decision support system users can obtain specific trend reports, e.g. the item with the most sales in a particular area or country within the last two years.
(Figure: bottom tier, the data warehouse server with data sources (operational databases, external sources), data storage (data warehouse, data marts, metadata), and monitoring/administration; middle tier, the OLAP server/engine; top tier, the front-end tools.)
Fig 1.2: Data Warehouse: A Multi-Tiered Architecture

There is a difference between a data warehouse and a normal database. Every company conducting business inputs valuable information into transaction-oriented data stores. The distinguishing traits of these online transaction processing (OLTP) databases are that they handle very detailed, day-to-day segments of data, are very write-intensive by nature, and are designed to maximize data input and throughput while minimizing data contention and resource-intensive data lookups. By contrast, a data warehouse is constructed to manage aggregated, historical data records, is very read-intensive by nature, and is oriented to maximize data output. Usually, a data warehouse is fed a daily diet of detailed business data in overnight batch loads, with the intricate daily transactions being aggregated into more historical and analytically formatted database objects. Naturally, since a data warehouse is a collection of a business entity's historical information, it tends to be much larger than its OLTP counterpart.
The goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest. Description focuses on finding patterns describing the data and the subsequent presentation for user interpretation. There are several data mining techniques fulfilling these objectives. Some of these are associations, classifications, sequential patterns, and clustering. The basic premise of an association is to find all associations such that the presence of one set of items in a transaction implies the other items. Classification develops profiles of different groups. Sequential pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets or clusters. Another approach to the study of data mining techniques is to classify the techniques as:
i. User-guided or verification-driven data mining.
ii. Data-guided or discovery-driven data mining.
Most of the techniques of data mining have elements of both models.

1.2.1 Data Mining using Association Rule

The goal of the association rules algorithm is to detect relationships or associations between specific values of nominal attributes in large data sets. There are two important quantities measured for every association rule: support and confidence. The support is the fraction of transactions that contain both X and Y. The confidence is the fraction of transactions containing X which also contain Y. For a given transaction database T,

support(X -> Y) = |{t in T : X U Y is contained in t}| / |T|
confidence(X -> Y) = support(X U Y) / support(X)

An association rule is an expression of the form X -> Y, where X and Y are subsets of the item set A. The rule X -> Y holds with confidence c if c% of the transactions in T that support X also support Y, and has support s in T if s% of the transactions in T support X U Y.
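The support and confidence definitions above can be sketched directly in code. The transaction database below is a made-up toy example, not data from this dissertation:

```python
# Toy illustration of support and confidence for a rule X -> Y.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical transaction database T.
T = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

X, Y = {"bread"}, {"milk"}
print(support(X | Y, T))    # support of bread -> milk: 2/4 = 0.5
print(confidence(X, Y, T))  # confidence: 2/3
```

Note that confidence is just the support of the combined itemset normalized by the support of the antecedent, which is why rules with a rare antecedent can have high confidence but low support.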
DIAGNOS2: secondary diagnosis as an ICD9 code
DATESERV: date the procedure was performed
ADJUST: the adjustment to the charge for services
CHARGE: the final actual charge for the procedure
AGE: age of patient at time of procedure
COUNT: the frequency of the event

The first eight variables, from PATIENT to DATESERV, are dimension variables. The final four variables (ADJUST, CHARGE, AGE, and COUNT) are numerical variables to be used for arithmetical or statistical operations. Users of the Clinical Information System will want to look at the data summarized to various levels. Joining selected dimension tables to the fact table provides the user with a dataset on which to aggregate the needed information. For example, to analyze the charge by patient, by quarter, and by location would require a join of four tables: the fact table, the patient dimension table, the location dimension table, and the DATESERV dimension table. The resultant data file is then aggregated using a PROC SUMMARY step to produce a dataset for analysis [6]. On-line Analytical Processing (OLAP) refers to the analytical capabilities provided by the data warehouse or data mart. One can view granular data or various aggregations of data for business analysis using user-friendly graphical tools.
Fig 1.3: Star schema for Outpatient Clinical Information System

1.3.2 Predator-Miner [5]
Predator-Miner extends Predator with a relational-style association rule mining operator to support data mining operations. Predator-Miner allows a user to combine association rule mining queries with SQL queries. This approach towards tight integration differs from existing techniques of using user-defined functions, stored procedures, or re-expressing a mining query as several SQL queries in two aspects: (i) by encapsulating the task of association rule mining in a relational operator, it allows association rule mining to be considered part of the query plan, so query optimization can be performed on the mining query holistically; (ii) by integrating it as a relational operator, it can leverage the mature field of relational database technology. The authors define ad hoc data mining as a flexible and interactive data mining process performed over a subset of the data without the need for data pre-processing. The motivation for ad hoc data mining is to allow the end-user to get a quick analysis of the results from the mining process prior to performing mining on the entire dataset, and to reduce the hassle of a pre-processing step. The novelty of the proposed system lies in the tight integration of the association rule mining operation within a relational database, ad hoc association rule mining, and the integration of association rule mining queries with relational queries.
To analyze the academic data of MGM's College of Engineering, we divided this dissertation into two modules, as shown in Fig 1.5. In the first module, we design the data warehouse using a star schema. In the second module, the data is analyzed using the proposed approaches.

Fig 1.5: Proposed work (construction of the DWH using a star schema, followed by analysis of the data by the proposed approaches)
Fig 1.6: System Architecture

1.4.1 Design Data Warehouse

The data warehouse is implemented using a star schema, the most widely used schema in data warehouse design. Dimensional modeling, which is the most prevalent technique for modeling data warehouses, organizes tables into fact tables, containing basic quantitative measurements of a business subject, and dimension tables, which provide descriptions of the facts being stored. There are several ways a data warehouse can be structured: multidimensional, star, and snowflake. In this design, the dimension tables contain the detailed data, and the fact table contains the summary-level data. In this dissertation we propose a data warehouse design for educational data. For this purpose, the data is first analyzed and then placed into a fact table and dimension tables according to that analysis. The fact table contains data that changes frequently; the dimension tables contain data that will not change in the future. The fact table usually contains measures. For this data warehouse, one fact table and nine dimension tables are designed. Steps in designing the data warehouse:
1. Analyze the data of the educational system from different views.
2. Construct the data warehouse using a star schema.
3. Use a bitmap join index for fast access to the data warehouse.
4. Enter the data into the dimension tables and fact table respectively.

1.4.2 Mining Using Ad-hoc Association Rule

Mining is the process of retrieving knowledge from a data warehouse. Association rules describe the co-occurrence of patterns. There are two important measures for every association rule: support and confidence. Here, the support is the fraction of attendance records that contain both X and Y, and the confidence is the fraction of attendance records containing X which also contain Y. The support measures the significance of the rule, so we are interested in rules with relatively high support. The confidence measures the strength of the correlation, so rules with low confidence are not meaningful. Ad-hoc means temporary; ad-hoc association rules are rules found on temporary relations: tables that are not connected directly to each other, but from which we retrieve and compare data to obtain useful knowledge. By using ad-hoc association rules, data can be retrieved from the analytical-level database. Ad-hoc association rules are able to find data across multiple dimensions of the star schema and to compare them. Typical data mining tools retrieve data only from the fact table, not from the dimension tables, yet the dimensions of a star schema contain much information.
The dimensions of a star schema cannot be avoided. By using ad-hoc association rules, data can be retrieved from the dimensions of the star schema. Four types of approaches are suggested in this dissertation. Type 1:
Parameters: date = d1, ..., dn; time = t1, ..., tn
Sample query: Find the percentage of students present in the campus between the dates 29-Aug-05 and 29-Oct-05 during the times 11:00 A.M. and 3:15 P.M. Type 2:
Sample query: Find the presence of students on date 14-SEP-05 in subject TCPIP from the final year of the computer science & engineering department. Type 3: Let C be the set of classes. Within a class we can see different departments D. A department can be further split into subjects S, and subjects can be further divided across different days D1.
Type 4:
Parameters: class, dept, studentno, subject (1, ..., n), presence, date; the presence is compared against a threshold
Sample query: Find the percentage of students from the third year electronics & telecommunication branch whose presence is less than 75% in 3 different subjects.
Star schema provides direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
Star schema provides highly optimized performance for typical star queries, but takes more space than a snowflake schema. The query response time of a snowflake schema is higher than that of a star schema.
Star schema is widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.
Star schema has a much simpler structure than snowflake or constellation schemas. Additional nodes can be added or deleted without much effort.
Star schema is the most widely used structure for data warehouses. It supports indexing with much lower design complexity than other schemas.
The first step in designing a fact table is to determine its granularity. Determining granularity involves two steps: (i) determine which dimensions will be included; (ii) determine where along the hierarchy of each dimension the information will be kept. The determining factors usually go back to the requirements. In a star schema, dimension tables are not normalized. Star schemas are used for both simple data marts and very large data warehouses.
Fig 2.1: Star schema (a fact table with keys Key1 to Key4 and measures, joined to dimension tables on Key1 and Key3)
A simple query against the base dimension table can provide sub-second response, but a query that involves multiple joined snowflakes can take more time. The advantage of this type of schema is its simplicity: it is understandable by end users. Other advantages are low maintenance (since the diagram is simple), the relative ease of defining new hierarchies, and the smaller number of connections. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table. A query against a star schema involves joining the fact table with a number of lookup (dimension) tables. Each lookup table is joined to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other. A star join is a primary-key to foreign-key join of the dimension tables to a fact table. The fact table normally has a concatenated index on the key columns to facilitate this type of join.
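The star-join pattern described above can be sketched with SQLite (via Python's sqlite3 module); the table and column names below are illustrative only, not the schema used in this dissertation:

```python
import sqlite3

# Minimal star-join sketch: one fact table joined to two dimension tables
# via primary-key/foreign-key equi-joins; the dimensions never join each other.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_student (studentkey INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (datekey    INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE fact_attendance (
    studentkey INTEGER REFERENCES dim_student(studentkey),
    datekey    INTEGER REFERENCES dim_date(datekey),
    present    INTEGER
);
INSERT INTO dim_student VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO dim_date    VALUES (10, '2005-09-14'), (11, '2005-09-15');
INSERT INTO fact_attendance VALUES (1, 10, 1), (2, 10, 0), (1, 11, 1);
""")

# The star join itself: each dimension joins only to the fact table.
rows = con.execute("""
    SELECT s.name, d.day, f.present
    FROM fact_attendance f
    JOIN dim_student s ON s.studentkey = f.studentkey
    JOIN dim_date    d ON d.datekey    = f.datekey
    ORDER BY s.name, d.day
""").fetchall()
for r in rows:
    print(r)
```

SQLite has no bitmap or concatenated fact-table indexes, so this only demonstrates the join shape, not the physical optimizations an Oracle warehouse would use.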
2.2 Indexing
2.2.1 B-tree indexing: B-tree indexes are well suited for OLTP applications in which users' queries are relatively routine (and well tuned before deployment in production), as opposed to ad hoc queries, which are much less frequent and executed during non-peak business hours. Because data is frequently updated in and deleted from OLTP applications, bitmap indexes can cause a serious locking problem in these situations. A B-tree eliminates the redundant storage of search key values. Deletion may occur in a non-leaf node, which is more complicated. Oracle supports dynamic B-tree-to-bitmap conversion, but it can be inefficient. Null values are not indexed in B-tree indexes. A B-tree index speeds up known queries and is well suited to high-cardinality columns. Its space requirement is independent of the cardinality of the indexed column, and it is relatively inexpensive to update the indexed column since individual rows are locked. On the other hand, it performs inefficiently with low-cardinality data, does not support ad hoc queries well, needs more I/O operations for wide-range queries, and its indexes cannot be combined before fetching the data. A fully developed B-tree index is composed of three different types of index pages or nodes: a root node, which contains pointers to the nodes at the next level; intermediate nodes, which contain pointers to lower-level intermediate nodes or to leaf nodes; and leaf nodes, which contain index items and horizontal pointers to other leaf nodes.

2.2.2 Bitmap Indexing: In a bitmap index, each distinct value of the specified column is associated with a bitmap in which each bit represents a row in the table. The two bit values are 1 and 0: a bit value of 1 means the row contains that value, and 0 means it does not. Bitmap indexes are a great boon to certain kinds of application. When there are bitmap indexes on tables, then updates
will take out full table locks. Bitmap indexes are good for low-cardinality columns, and bitmap index scans are more efficient than table scans even when returning a large fraction of a table. Indexes are created to allow Oracle to identify requested rows as efficiently as possible, but the strategy behind bitmap indexes is very different from the strategy behind B-tree indexes. Inserts and deletes on a table will result in updates to all the associated indexes. Bitmap indexes are primarily intended for data warehousing applications; they are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data. Key facts about bitmap indexes: if a B-tree index is not an efficient mechanism for accessing data, it is unlikely to become more efficient simply because we convert it to a bitmap index; bitmap indexes can usually be built quickly and tend to be surprisingly small, though the size of the bitmap index varies with the distribution of the data; and updates to bitmapped columns, as well as general insertion and deletion of data, can cause serious lock conflicts and can degrade the quality of the indexes quite dramatically. A bitmap index is a specialized type of index designed for querying on multiple keys; each bitmap index is built on a single key, and a bitmap is an array of bits. Bitmap indices are generally quite small compared to the actual relation size: records are tens to hundreds of bytes long, whereas a single bit represents a record in a bitmap, so the space occupied by a single bitmap is usually less than 1% of the space occupied by the relation. The benefits include reduced response time for large classes of ad hoc queries, reduced storage requirements compared to other indexing techniques, and dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory.
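The querying-on-multiple-keys property can be sketched in a few lines: one bit-vector per distinct value, and multi-column predicates resolved by a bitwise AND. The row data below is a made-up example:

```python
# Sketch of a bitmap index on low-cardinality columns: one bit-vector per
# distinct value, one bit per row. Combining predicates is a cheap bitwise AND.

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an integer used as a bit-vector."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], 0)
        index[row[column]] |= 1 << i   # set bit i: row i holds this value
    return index

# Hypothetical rows, not the dissertation's data.
rows = [
    {"dept": "CSE", "year": 3},
    {"dept": "ECE", "year": 3},
    {"dept": "CSE", "year": 2},
    {"dept": "CSE", "year": 3},
]
by_dept = build_bitmap_index(rows, "dept")
by_year = build_bitmap_index(rows, "year")

# Rows where dept = 'CSE' AND year = 3: AND the two bitmaps together.
hits = by_dept["CSE"] & by_year[3]
matching = [i for i in range(len(rows)) if hits >> i & 1]
print(matching)  # rows 0 and 3
```

Real bitmap indexes compress the bit-vectors, but the AND-combination shown here is exactly why they answer multi-key ad hoc queries so cheaply.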
2.2.3 Bitmap Join Indexing: The bitmap join index extends the bitmap index concept so that the index contains the data needed to support a join query, allowing the query to retrieve the data from the index rather than referencing the joined tables. Since the information is compressed into a bitmap, the size of the resulting structure is significantly smaller than the corresponding materialized view. Bitmap join indexes represent the join of columns in two or more tables. With a bitmap join index, the value of a column in one table (the dimension table) is stored with the associated ROWIDs of the matching rows in the other tables on which the index is defined. This provides fast join access between the tables, if the query uses the columns of the bitmap join index. In a data warehouse environment, a bitmap join index can be a more efficient way of accessing data than a materialized-view join. When using a bitmap join index in a warehouse environment, the join is created using an equi-inner join between the primary key column(s) of the dimension tables and the foreign key column(s) of the fact table. The SQL for a bitmap join index is:

create bitmap index my_bitmap_index on attendence (a.studentno)
from student_no a, attendence b
where a.studentno = b.studentno;

CREATE BITMAP INDEX my_bitmap_index ON fact_table (dimension_table.col2)
FROM dimension_table, fact_table
WHERE dimension_table.col1 = fact_table.col1;

There are a few restrictions on bitmap join indexes: the bitmap join index is built on a single table; Oracle allows only one of the tables of a bitmap join index to be updated, inserted into, or deleted from at a time; and a bitmap join index cannot be created on an index-organized table or a temporary table.
Every column in the bitmap join index must be present in one of the associated dimension tables, and the join operations in the index must form either a star or snowflake schema. The dimension table join columns must be either primary key columns or have unique constraints, and all the primary key columns of the dimension table must be part of the join criteria of the bitmap join index. If a dimension table has a composite primary key (a primary key composed of more than one column), each column in the primary key must be part of the join. All restrictions on normal bitmap indexes apply to bitmap join indexes. Parallel DML (insertion, deletion, or update) is currently supported only on the fact table; parallel DML on one of the participating dimension tables will mark the index as unusable, and only one table can be updated concurrently by different transactions when using the bitmap join index.
Figure 3.1 shows the complete design of the data warehouse using a star schema for MGM's College of Engineering, Nanded. Attendance is the fact table, and the dimension tables are student, course detail, faculty, department, date, subject, presence, result, room, and time. The fact table is surrounded by the dimension tables.

Attendence (fact table): studentkey, coursekey, facultykey, deptkey, timekey, roomkey, presencekey, subjectkey, datekey
Student (dimension table): studentkey, student name, student address, year of study, student degree
Course Detail: coursekey, course name
Room: roomkey, building, location, capacity
Time: timekey, time
Department: deptkey, department name, no. of staff
Subject: subjectkey, name of subject
Date: datekey, date-month-year
Faculty: facultykey, employee ID, name of faculty, designation, salary
Presence: presencekey, present status

Fig 3.1: DWH for Educational System
In this dissertation, the fact table and dimension tables are designed in Oracle 9i. Sample source code for a dimension table is:

create table student_no (
  studentno varchar2(11) constraint student_prim primary key,
  studentname varchar2(20),
  studentdegree varchar2(10),
  yearofstudy varchar2(20)
);
create table course_no (
  courseno varchar2(5) constraint course_prim primary key,
  coursename varchar2(25)
);

create table presence_key (
  presence_k varchar2(1) constraint presence_prim primary key,
  detail varchar2(25)
);

Source code for the fact table is:

create table attendence (
  studentno  varchar2(11) constraint fr_std  references student_no(studentno),
  courseno   varchar2(5)  constraint fr_cr   references course_no(courseno),
  lecturerno varchar2(5)  constraint fr_lect references lecturer_no(lecturerno),
  deptkey    varchar2(5)  constraint fr_dept references dept_key(deptkey),
  timekey    varchar2(2)  constraint fr_tk   references time_key(timekey),
  roomkey    varchar2(5)  constraint fr_rk   references room_key(roomkey),
  resultkey  varchar2(5)  constraint fr_rs   references result_key(resultkey),
  presence_k varchar2(1)  constraint fr_pres references presence_key(presence_k),
  subject_k  varchar2(5)  constraint fr_sub  references subject_key(subject_k),
  datekey    varchar2(5)  constraint fr_date references date_key(datekey)
);

Users enter the daily attendance using a GUI developed in Java, shown in Fig 3.2 and Fig 3.3.
Finding association rules has two phases. In the first phase, all sets of items with high support (often called frequent itemsets) are discovered. In the second phase, the association rules among the frequent itemsets with high confidence are constructed. The computational cost of the first phase dominates the total computational cost. Association rules are widely used for prediction, but it is important to recognize that such predictive use is not justified without additional analysis or domain knowledge. A limitation of regular association-rule mining is that it requires transaction-level data. Standard association rules can express correlations between values of a single dimension of the star schema; they cannot retrieve data from multiple dimensions, yet some associations become evident only when multiple dimensions are involved. Standard association-rule mining discovers correlations among items within transactions. The problem of mining association rules can be decomposed into two subproblems: (1) find all sets of items whose support is greater than the user-specified minimum support (such itemsets are called frequent itemsets); (2) use the frequent itemsets to generate the desired rules.
There are several association rule mining algorithms:
1. Apriori algorithm
2. Partition algorithm
3. Pincer-Search algorithm
4. Dynamic itemset counting algorithm
5. FP-tree growth algorithm
6. Incremental algorithm
7. Generalized association rules
Apriori algorithm: Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach
known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database [26].

Partition algorithm: The partition algorithm reduces the number of database scans to two, and splits the database into partitions so that each partition can be loaded into memory. During the second scan, only the itemsets that are frequent in at least one partition are used as candidates and are counted to determine whether they are frequent in the entire database, thus reducing the set of candidates. In the partitioning algorithm, D is divided into p partitions D1, D2, ..., Dp. Partitioning may improve the performance of finding large itemsets in several ways: 1. By the large-itemset property, a large itemset must be frequent in at least one of the partitions. 2. Partition algorithms may adapt better to limited main memory: each partition can be created such that it fits into main memory. 3. With partitioning, parallel and/or distributed algorithms can be easily created, where each partition is handled by a separate machine. 4. Incremental generation of association rules may be easier to perform by treating the current state of the database as one partition and the new entries as a second partition [25].

Pincer-Search algorithm: The Pincer-Search algorithm relies on a combined approach for determining the maximum frequent set, and generates all maximal frequent itemsets [23].

Dynamic itemset counting algorithm: The dynamic itemset counting algorithm is an alternative to Apriori itemset generation. Itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so only those itemsets whose subsets are all frequent are examined [22].

FP-tree growth algorithm: The FP-growth algorithm is currently one of the fastest approaches to frequent itemset mining. It is based on a prefix-tree representation of the given database of transactions (called an FP-tree), which can save considerable amounts of memory for storing the transactions. The basic idea of the FP-growth algorithm can be described as a recursive elimination scheme: in a preprocessing step, delete all items from the transactions that are not frequent individually, i.e., that do not appear in a user-specified minimum number of transactions. Then select all transactions that contain the least frequent item (least frequent among those that are frequent) and delete this item from them. Recurse to process the obtained reduced (also known as projected) database, remembering that the itemsets found in the recursion share the deleted item as a prefix. On return, remove the processed item from the database of all transactions as well, and start over, i.e., process the second frequent item, etc. In these processing steps the prefix tree, which is enhanced by links between the branches, is exploited to quickly find the transactions containing a given item and also to remove this item from the transactions after it has been processed [21].

Incremental algorithm: An incremental updating algorithm for the maintenance of previously discovered association rules is applied on data cubes. Due to the huge amounts of data usually processed, using the data cubes accelerates the job and avoids scanning the whole database after every update to the data. This algorithm also suggests a way to practically perform the incremental mining process without affecting the original (functioning) database [20].

Generalized association rules: Generalized association rules are rules that contain some background knowledge, therefore giving a more general view of the domain.
This knowledge is codified by a taxonomy defined over the data set items. Many researchers use taxonomies in different data mining steps to obtain generalized rules. In general, those works reduce the obtained rule set by pruning some specialized rules using a subjective measure, but rarely analyze the quality of the rules. In this context, [24] presents a quality analysis of generalized association rules, where a different objective measure has to be used depending on the side on which a generalized item occurs. Based on this fact, a grouping of measures was generated according to the generalization side. These measure groups can help specialists to choose an appropriate measure to evaluate their generalized rules [24].

The standard association rule mining question for an educational environment would be: What is the presence of students? This question involves only one dimension. A principal can ask a question like: What is the presence of third year students from the Computer Science branch? Another typical question: What is the presence of third year students from the Computer Science branch during 23 August 2005 to 22 September 2005? Standard association rules can express correlations between values of a single dimension of the star schema, but values of the other dimensions may also be correlated. Standard association-rule mining works with transaction-level data that reflects individual records. Several association rules can be found if more than one dimension is considered, but standard association rule mining cannot be used to answer these types of query; only ad-hoc association rules can.
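The level-wise (Apriori-style) search of phase one, described earlier, can be sketched in a few lines. This is a simplified sketch (candidate generation prunes by support only, without the full subset-pruning step of the published algorithm), and the transactions and minimum support are toy values:

```python
# Minimal level-wise frequent-itemset search in the spirit of Apriori:
# frequent k-itemsets are joined to form (k+1)-itemset candidates.

def apriori(transactions, minsup):
    n = len(transactions)

    def frequent(candidates):
        # Keep only candidates whose support meets the threshold.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= minsup}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    level, result = frequent(items), set()
    while level:
        result |= level
        k = len(next(iter(level)))  # current itemset size
        # Join step: merge k-itemsets into (k+1)-itemset candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = frequent(candidates)
    return result

# Hypothetical transaction database.
T = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
freq = apriori(T, minsup=0.5)
print(sorted(sorted(s) for s in freq))
```

Each pass over `transactions` inside `frequent` corresponds to the one full database scan per level that the text attributes to Apriori.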
We can compare the association rules obtained for different parameter values. Through the second approach, we can find, for a lecture attended by the student whose enrollment number is Y05CSBE133, which other students are present in that lecture. We can also ask: in one lecture of a particular faculty member, which students are present? By determining the level of students present in that lecture, this indicates the expertise of that faculty member in the subject, and again we can calculate the support and confidence. Through the first approach, we can find association rules for students who are present during particular days and time periods, across the entire college or a department. We can find association rules for which students are present together at a particular time, and which students prefer to attend lectures in the morning, afternoon, or evening. After analyzing the time most preferred by students, the college can change the college timings, e.g., instead of 11:00 AM to 5:30 PM, the timings can be shifted to 8:00 AM to 2:30 PM. Through the second approach, we can find association rules for students from a given class, department, and subject who are present on a particular day. With this approach we can find different association rules for different departments, subjects, and classes, e.g., which students attend lectures of a particular subject together. Through the third approach, we can find association rules for a student who is not present during particular days in n different subjects; we can also find the presence or absence of a student during particular days in different subjects. Through the fourth approach, we can find association rules for students whose presence is less than x% in n different subjects from a particular class and department. Through this approach we can find the level of students who are always together.
Ad hoc mining can allow the user to mine interactively from a subset of the dataset to get a feel of the association rules generated [5]. Ad-hoc association rules are a concept that can be applied to any type of analytical data. In this dissertation, educational data from MGM's College of Engineering is used for analysis. By using ad-hoc association rules we can solve queries that have multiple dimensions, e.g.: What is the presence of third year students from the Computer Science branch during 23 August 2005 to 22 September 2005 between 11 AM and 2 PM? Another typical query: List the number of students whose attendance is less than 75% in three different subjects from the third year Computer Science branch during 23 August 2005 to 22 September 2005 between 11 AM and 2 PM. This type of query can only be solved by ad-hoc association rules. In ad-hoc association rules there is no limitation on the number of dimensions, so we can get an exact answer for any type of query, however many dimensions it involves. The query response time is higher for queries that involve more dimensions. A typical query has multiple dimensions and refers to data that is in the dimension tables. An inner join operation is used to join the fact table and the various dimension tables, and data from one dimension table is compared with data from other dimension tables with the help of the where clause. Inner joins can be nested to join more than two dimension tables. The syntax of an inner join is: select column_name from table1 inner join (table2 inner join table3 on table2.param = table3.param) on table2.param1 = table1.param1 where conditions;
select column_name from table1 inner join (table2 inner join (table3 inner join table4 on table3.param1 = table4.param1) on table2.param2 = table3.param2) on table1.param3 = table3.param3 where conditions; select column_name from (table1 inner join (table2 inner join table3 on table2.param = table3.param) on table1.param = table2.param) where date_key.datekey between 'datekey1' and 'datekey2' and time_key.ltime between 'ltime1' and 'ltime2'; In this dissertation four approaches are proposed. For every association rule we are interested in its support and confidence. Support is the fraction of presence records that contain both X and Y, where X and Y are students or dates on which a student is present. Confidence is the fraction of presence records containing X that also contain Y. We are always interested in those rules that have high support and confidence.
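These two measures can be sketched in Java (the language of the step-by-step algorithms below). This is a minimal illustration, not the dissertation's implementation: each student's presence is modeled as a hypothetical set of lecture keys in which that student was present.

```java
import java.util.HashSet;
import java.util.Set;

public class SupportConfidence {

    // Support of X -> Y: fraction (in %) of all lectures in which
    // both X and Y are present.
    static double support(Set<String> presentX, Set<String> presentY, int totalLectures) {
        Set<String> both = new HashSet<>(presentX);
        both.retainAll(presentY);               // lectures where both are present
        return 100.0 * both.size() / totalLectures;
    }

    // Confidence of X -> Y: of the lectures in which X is present,
    // the fraction (in %) in which Y is also present.
    static double confidence(Set<String> presentX, Set<String> presentY) {
        Set<String> both = new HashSet<>(presentX);
        both.retainAll(presentY);
        return 100.0 * both.size() / presentX.size();
    }
}
```

For example, with 38 lectures in total, if X is present in 16 lectures and all 16 are lectures that Y also attends, then support(X -> Y) is 16/38 ≈ 42.11% and confidence(X -> Y) is 16/16 = 100%, the shape of the figures reported for Query1 in section 4.1.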
date = dn, time = tn
Algorithm:
1. Attributes will be entered by user
2. Use INNER JOIN (DEPT_KEY, STUTENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTNDENCE). INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user.
select studentno from presence_key inner join (date_key inner join (attendence inner join time_key on attendence.timekey = time_key.timekey) on date_key.datekey = attendence.datekey) on presence_key.presence_k = attendence.presence_k where
date_key.datekey between 'datekey1' and 'datekey2' and time_key.ltime between 'ltime1' and 'ltime2' and presence_key.presence_k = 'presence_k';
3. Use counter
int count = 0; //initialize counter to zero
while (rs.next()) {
String studentno = rs.getString(1); //get data from data warehouse
System.out.println("Enrollment Number:" + studentno); //print value
count++; //increment counter by 1
}
rs.last(); //move cursor to the last row
System.out.println("Total : " + count); //print value of total count
4. Calculate total number of scheduled lectures & store it in a variable Y
//For finding total enrolled students in college
select studentno from (date_key inner join (attendence inner join time_key on attendence.timekey = time_key.timekey) on date_key.datekey = attendence.datekey) where date_key.datekey between 'datekey1' and 'datekey2' and time_key.ltime between 'ltime1' and 'ltime2';
5. Find percentage
Z = (count/Y)*100;
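One caveat for step 5: if count and Y are Java int variables, as in step 3, then (count/Y)*100 truncates to zero whenever count < Y. A minimal sketch of the fix is to force floating-point division:

```java
public class AttendancePercentage {

    // Step 5 of the algorithm: Z = (count / Y) * 100.
    // Multiplying by 100.0 first forces floating-point arithmetic;
    // with plain ints, (count / Y) would truncate to 0 for count < Y.
    static double percentage(int count, int totalLectures) {
        return (count * 100.0) / totalLectures;
    }
}
```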
Standard Query:
Find the percentage of students present in the campus between given dates during a particular time period.
Query1:
Find the percentage of students present in the campus between dates 29-Aug-05 and 29-Oct-05 during times 11 A.M. and 3:15 P.M. The result of Query1 is shown in Fig: 4.1
Fig: 4.1: Output of query 1
If studentno 'Y05CSTE107' is present during 11 AM to 01 PM, then he is also present during 3:30 PM to 05:30 PM. Support is 42.11% (16/38) & confidence is 100% (16/16).
If studentno 'Y05CSTE107' is present during 3:30 PM to 05:30 PM, then he is also present during 11 AM to 01 PM. Support is 42.11% (16/38) & confidence is 84.21% (16/19).
If studentno 'Y05CSTE111' is present during 11 AM to 01 PM, then he is also present during 3:30 PM to 05:30 PM. Support is 57.89% (22/38) & confidence is 70.96% (22/31).
If studentno 'Y05CSTE111' is present during 3:30 PM to 05:30 PM, then he is also present during 11 AM to 01 PM. Support is 57.89% (22/38) & confidence is 100% (22/22).
Query2:
Find the percentage of students present in the campus between dates 29-Aug-05 and 29-Oct-05 during times 11 A.M. and 5:30 P.M. The result of Query2 is shown in Fig: 4.2
Related query:
Find the percentage of students present in the campus from a particular department between given dates during a particular time period.
Query3:
Find the percentage of students present in the campus between dates 29-Aug-05 and 29-Oct-05 during times 11 A.M. and 3:15 P.M. from the CSE department. The result of Query3 is shown in Fig: 4.3
This is another approach for finding the total number of students present by applying various conditions.
Algorithm:
1. Attributes will be entered by user: Date, Class, Department, Subject
2. Use INNER JOIN (DEPT_KEY, COURSE_NO, STUTENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTNDENCE). INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user.
select studentno from presence_key inner join (date_key inner join (student_no inner join (attendence inner join subject_key on attendence.subject_k = subject_key.subject_k) on student_no.studentno = attendence.studentno) on date_key.datekey = attendence.datekey) on presence_key.presence_k = attendence.presence_k where student_no.studentno = 'studentno' and date_key.datekey between 'd1' and 'd2' and subject_key.subject_k = 'subject_k' and presence_key.presence_k = 'presence_k';
3. Use counter
int count = 0; //initialize counter to zero
while (rs.next()) {
String studentno = rs.getString(1); //get data from data warehouse
System.out.println("Enrollment Number:" + studentno); //print value
count++; //increment counter by 1
}
rs.last(); //move cursor to the last row
System.out.println("Total : " + count); //print value of total count
4. Calculate total number of scheduled lectures & store it in a variable Y
//For finding total enrolled students for a particular course from a particular department and class
select studentno from (date_key inner join (student_no inner join (attendence inner join subject_key on attendence.subject_k = subject_key.subject_k) on student_no.studentno = attendence.studentno) on date_key.datekey = attendence.datekey) where student_no.studentno = 'studentno' and date_key.datekey between 'd1' and 'd2' and subject_key.subject_k = 'subject_k';
5. Find percentage
Z = (count/Y)*100;
Standard Query:
Find the presence of students on a particular day from a particular department and class in a particular subject.
Query1:
Find the presence of students on date 14-SEP-05 in subject TCPIP from the final year computer science & engineering department. The result of Query1 is shown in Fig: 4.4. The percentage of students present in the lecture of TCPIP is 81.63% on 14-SEP-05.
Fig: 4.4: Output of approach 2 for BECSE class with subject TCPIP on 14-SEP-05
When studentno 'Y05CSBE137' is present, studentno 'Y05CSBE147' is also present. Support is 13.15% and confidence is 12.12%.
When studentno 'Y05CSBE147' is present, studentno 'Y05CSBE137' is also present. Support is 13.15% and confidence is 100%.
When studentno 'Y05CSBE106' is present, studentno 'Y05CSBE137' is also present. Support is 76.32% and confidence is 85.29%.
When studentno 'Y05CSBE137' is present, studentno 'Y05CSBE106' is also present. Support is 76.32% and confidence is 87.87%.
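Pairwise rules of this form can be derived directly from per-lecture presence lists. A minimal sketch (not the dissertation's code), assuming each lecture is represented as the set of studentnos present in it:

```java
import java.util.List;
import java.util.Set;

public class CoPresenceRules {

    // Rule a -> b over a list of lectures, where each lecture is the
    // set of studentnos present in it. Returns {support %, confidence %}:
    // support    = lectures with both a and b / all lectures,
    // confidence = lectures with both a and b / lectures with a.
    static double[] rule(List<Set<String>> lectures, String a, String b) {
        int both = 0, withA = 0;
        for (Set<String> present : lectures) {
            if (present.contains(a)) {
                withA++;
                if (present.contains(b)) both++;
            }
        }
        double support = 100.0 * both / lectures.size();
        double confidence = withA == 0 ? 0.0 : 100.0 * both / withA;
        return new double[] { support, confidence };
    }
}
```

Note that support(a -> b) equals support(b -> a), which is why both directions of each pair above report the same support while their confidences differ.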
Algorithm:
1. Attributes will be entered by user: Student Number, Dates, Subject
2. Use INNER JOIN (DEPT_KEY, COURSE_NO, STUTENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTNDENCE). INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user.
select datekey from presence_key inner join (date_key inner join (student_no inner join (attendence inner join subject_key on attendence.subject_k = subject_key.subject_k) on student_no.studentno = attendence.studentno) on date_key.datekey = attendence.datekey) on presence_key.presence_k = attendence.presence_k where student_no.studentno = 'studentno' and date_key.datekey between 'd1' and 'd2' and subject_key.subject_k = 'subject_k' and presence_key.presence_k = 'presence_k';
3. Use counter
int count = 0; //initialize counter to zero
while (rs.next()) {
String datekey = rs.getString(1); //get data from data warehouse
System.out.println("Datekey:" + datekey); //print value
count++; //increment counter by 1
}
rs.last(); //move cursor to the last row
System.out.println("Total : " + count); //print value of total count
4. Calculate total number of scheduled lectures & store it in a variable Y
//For finding total engaged lectures for a particular course from a particular department and class
select datekey from (date_key inner join (student_no inner join (attendence inner join subject_key on attendence.subject_k = subject_key.subject_k) on student_no.studentno = attendence.studentno) on date_key.datekey = attendence.datekey) where student_no.studentno = 'studentno' and date_key.datekey between 'd1' and 'd2' and subject_key.subject_k = 'subject_k';
5. Find percentage
Z = (count/Y)*100;
Standard Query:
Find the presence or absence of a particular student from a particular department and class between given dates in one subject, three subjects, and five subjects.
Query1:
Find the presence of the student whose enrollment number is Y05CSBE133 between dates 15-Sep-05 and 13-Nov-05 in three subjects from the final year computer science & engineering department. The result of Query1 is shown in Fig: 4.5
Query2:
Find the presence of the student whose enrollment number is Y05CSBE133 between dates 15-Sep-05 and 13-Nov-05 in five subjects from the final year computer science & engineering department. The result of Query2 is shown in Fig: 4.6
This is the most powerful approach: it answers typical queries with full flexibility. The support is split into five parts. The desired output is the presence or absence of students with a presence threshold & a subject threshold.
Algorithm:
1. Attributes will be entered by user: Percentage threshold, Class, Department, Subject threshold
2. Use INNER JOIN (DEPT_KEY, COURSE_NO, STUTENT_NO, PRESENCE_KEY, DATE_KEY, SUBJECT_KEY, ATTNDENCE). A NESTED SELECT is used for calculating the percentage. INNER JOIN will join all dimension tables to the fact table and to each other. Apply the conditions entered by the user.
select distinct k.studentID from (select a.STUDENTNO as studentID, a.subject_k, a.deptkey, (select count(PRESENCE_K)*100 from attendence b1 where b1.PRESENCE_K = 'P' and b1.STUDENTNO = a.STUDENTNO and b1.subject_k = 'subject' and b1.deptkey = 'dept') / (select count(PRESENCE_K) from attendence b2 where b2.STUDENTNO = a.STUDENTNO and b2.subject_k = 'subject' and b2.deptkey = 'dept') as PercentageAttd from attendence a) k where k.PercentageAttd <= 'percentage' and k.subject_k = 'subject' and k.deptkey = 'dept';
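Once the per-subject percentages have been fetched, the two-threshold filter of this approach (presence threshold plus subject threshold) can also be applied in memory. A minimal sketch under that assumption; the map from studentno to per-subject attendance percentages is hypothetical and stands in for the result of the nested-select query:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ThresholdFilter {

    // Approach 4: return the students whose attendance percentage is
    // below pctThreshold in at least minSubjects different subjects.
    // percentages: studentno -> (subject key -> attendance %).
    static List<String> belowThreshold(Map<String, Map<String, Double>> percentages,
                                       double pctThreshold, int minSubjects) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> student : percentages.entrySet()) {
            long lowSubjects = student.getValue().values().stream()
                                      .filter(p -> p < pctThreshold)
                                      .count();        // subjects below the presence threshold
            if (lowSubjects >= minSubjects) result.add(student.getKey());
        }
        return result;
    }
}
```

For example, with a 75% presence threshold and a subject threshold of 3, a student is reported only if his attendance falls below 75% in at least three different subjects.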
select studentno from presence_key inner join (date_key inner join (student_no inner join (attendence inner join subject_key on attendence.subject_k = subject_key.subject_k) on student_no.studentno = attendence.studentno) on date_key.datekey = attendence.datekey) on presence_key.presence_k = attendence.presence_k where student_no.studentno = 'studentno' and date_key.datekey between 'd1' and 'd2' and subject_key.subject_k = 'subject_k' and presence_key.presence_k = 'presence_k';
3. Use counter
int count = 0; //initialize counter to zero
while (rs.next()) {
String studentno = rs.getString(1); //get data from data warehouse
System.out.println("Enrollment Number:" + studentno); //print value
count++; //increment counter by 1
}
rs.last(); //move cursor to the last row
System.out.println("Total : " + count); //print value of total count
4. Calculate total number of scheduled lectures & store it in a variable Y
//For finding total enrolled students for a particular course from a particular department and class
select studentno from (date_key inner join (student_no inner join (attendence inner join subject_key on attendence.subject_k = subject_key.subject_k) on student_no.studentno = attendence.studentno) on date_key.datekey = attendence.datekey) where student_no.studentno = 'studentno' and date_key.datekey between 'd1' and 'd2' and subject_key.subject_k = 'subject_k';
5. Find percentage
Z = (count/Y)*100;
Standard Query:
Find percentage of students whose presence is less than X% in Y different subjects in a particular class having particular branch.
Query1:
Find percentage of students from third year electronics & telecommunication branch whose presence is less than 75% in 3 different subjects. Result of Query1 is shown in figure Fig: 4.7
Fig 4.7: Output of approach 4 for TEECT with three subjects & 75% presence
Query2:
Find percentage of students from third year electronics & telecommunication branch whose presence is less than 75% in 4 different subjects. Result of Query2 is shown in figure Fig: 4.8
Fig 4.8: Output of approach 4 for TEECT with four subjects & 75% presence
In this dissertation, the framework is tightly integrated with data warehousing technology. In addition to integrating mining with database technology by keeping all query processing within the data warehouse, this approach introduces the following two innovations: 1) extended association rules that use the other, non-item dimensions of the data warehouse, which results in more detailed and ultimately actionable rules for academic institutes; 2) association rules defined over aggregated data.
References
[1] Svetlozar Nestorov and Nenad Jukic. Ad-hoc Association-Rule Mining within the Data Warehouse. Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS'03), IEEE (2002).
[2] Bhavani M. Thuraisingham and Marion G. Ceruti. Understanding Data Mining and Applying it to Command, Control, Communications and Intelligence Environments. IEEE (2000).
[3] Ken Collier, Bernard Carey, Donald Sautter, and Curt Marjaniemi. A Methodology for Evaluating and Selecting Data Mining Software. Proceedings of the 32nd Hawaii International Conference on System Sciences, IEEE (1999).
[4] Michael H. Smith and Witold Pedrycz. Expanding the Meaning of and Applications for Data Mining. IEEE (2000).
[5] Wee Hyong Tok, Twee Hee Ong, Wai Lup Low, Indriyati Atmosukarto, and Stephane Bressan. Predator-Miner: Ad hoc Mining of Associations Rules Within a Database Management System. Proceedings of the 18th International Conference on Data Engineering (ICDE'02), IEEE (2002).
[6] Maria Lupetin. A Data Warehouse Implementation Using the Star Schema.
[7] Agrawal R., Imielinski T., and Swami A. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the ACM SIGMOD International Conference, (1993), 207-216.
[8] Agrawal R. and Srikant R. Fast Algorithms for Mining Association Rules. Proceedings of the International Conference on Very Large Databases (VLDB), (1994), 487-499.
[9] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).
[10] Hilderman R.J., Carter C.L., Hamilton H.J., and Cercone N. Mining Association Rules from Market Basket Data Using Share Measures and Characterized Itemsets. International Journal of Artificial Intelligence Tools, 7 (2), (1998), 189-220.
[11] Chaudhuri S. and Dayal U. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26 (1), (1997), 65-74.
[12] Inmon, W. H. Building the Data Warehouse. Wiley (1996).
[13] Kimball, R., Reeves, L., Ross, M., and Thornthwaite, W. The Data Warehouse Lifecycle Toolkit. Wiley (1998).
[14] Leavitt, N. Data Mining for the Corporate Masses. IEEE Computer, 35 (5), (2002), 22-24.
[15] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating Mining with Relational Database Systems: Alternatives and Implications. Proceedings of the ACM SIGMOD Conference, (1998), 343-354.
[16] S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal. Query Flocks: A Generalization of Association-Rule Mining. Proceedings of the ACM SIGMOD Conference, (1998), 1-12.
[17] Wang, K., He, Y., and Han, J. Mining Frequent Itemsets Using Support Constraints. Proceedings of the International Conference on Very Large Databases (VLDB), (2000), 43-52.
[18] Watson, H. J., Annino, D. A., and Wixom, B. H. Current Practices in Data Warehousing. Information Systems Management, 18 (1), (2001).
[19] Berry, M. and Linoff, G. Data Mining Techniques for Marketing, Sales and Customer Support. Wiley (1997).
[20] Ravindra Patel, D. K. Swami, and K. R. Pardasani. Lattice Based Algorithm for Incremental Mining of Association Rules. International Journal of Theoretical and Applied Computer Sciences, 1 (1), (2006).
[21] Christian Borgelt. An Implementation of the FP-growth Algorithm. ACM (2005).
[22] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data.
[23] Dao-I Lin and Zvi M. Kedem. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set (1997).
[24] V.O. de Carvalho, S.O. Rezende, and M. de Castro. Evaluating Generalized Association Rules through Objective Measures. Proceeding (549) Artificial Intelligence and Applications (2007).
[25] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, pp. 348-354.
[26] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, pp. 230-235.