Data Mining & Business Intelligence

The document outlines the syllabus for the Third Year Information Technology course at Mumbai University, focusing on Data Mining and Business Intelligence. It details course objectives, outcomes, and modules covering topics such as data warehousing, data exploration, classification, clustering, frequent pattern mining, and business intelligence applications. The course aims to equip students with the skills to analyze data and apply data mining techniques for business decision-making.
Mumbai University
Third Year of Information Technology (2019 Course)

Subject Code : ITC601
Subject Name : Data Mining and Business Intelligence
Credits : 03

Course Objectives :
1. To introduce the concept of data warehouse and data mining as an important tool for enterprise data management and as a cutting-edge technology for building competitive advantage.
2. To enable students to effectively identify sources of data and process it for data mining.
3. To make students well versed in all data mining algorithms and methods of evaluation.
4. To impart knowledge of tools used for data mining.
5. To provide knowledge on how to gather and analyze large sets of data to gain useful business understanding.
6. To impart skills that can enable students to approach business problems analytically by identifying opportunities to derive business value from data.

Course Outcomes : On successful completion of course, learner/student will be able to:
1. Demonstrate an understanding of the importance of data warehousing and data mining and the principles of business intelligence.
2. Organize and prepare the data needed for data mining using preprocessing techniques.
3. Perform exploratory analysis of the data to be used for mining.
4. Implement the appropriate data mining methods like classification, clustering or frequent pattern mining on large data sets.
5. Define and apply metrics to measure the performance of various data mining algorithms.
6. Apply BI to solve practical problems : analyze the problem domain, use the data collected in an enterprise, apply the appropriate data mining technique, interpret and visualize the results and provide decision support.

Course Modules / Contents

Module 0 : Prerequisite
Basic knowledge of databases.

Data Warehouse (DWH) Fundamentals with Introduction to Data Mining (04 periods)
DWH characteristics, Dimensional modeling : Star, Snowflakes, OLAP operation, OLTP vs OLAP, Data Mining as a step in KDD, Kind of patterns to be mined, Technologies used, Data Mining applications.
Self-learning Topics : Data Marts, Major issues in Data Mining.
(Refer Chapter 1)

Data Exploration and Data Preprocessing (06 periods)
Types of Attributes, Statistical Description of Data, Measuring Data Similarity and Dissimilarity. Why Preprocessing? Data Cleaning, Data Integration, Data Reduction : Attribute Subset Selection, Histograms, Clustering, Sampling, Data Cube aggregation, Data transformation and Data Discretization : Normalization, Binning, Histogram Analysis.
Self-learning Topics : Data Visualization, Concept hierarchy generation.
(Refer Chapter 2)

Classification (08 periods)
Basic Concepts; Classification methods : 1. Decision Tree Induction : Attribute Selection Measures, Tree pruning. 2. Bayesian Classification : Naive Bayes Classifier. Prediction : Structure of regression models; Simple linear regression. Accuracy and Error measures, Precision, Recall, Holdout, Random Sampling, Cross Validation, Bootstrap, Introduction of Ensemble methods, Bagging, Boosting, AdaBoost and Random forest.
Self-learning Topics : Multiple linear regression, logistic regression, Random forest, nearest neighbour classifier, SVM.
(Refer Chapter 3)

Clustering and Outlier Detection (08 periods)
Cluster Analysis : Basic Concepts; Partitioning Methods : K-Means, K Medoids; Hierarchical Methods : Agglomerative, Divisive, BIRCH; Density-Based Methods : DBSCAN. What are outliers? Types, Challenges; Outlier Detection Methods : Supervised, Semi Supervised, Unsupervised, Proximity based, Clustering Based.
Self-learning Topics : Hierarchical methods : Chameleon, Density based methods : OPTICS, Grid based methods : STING, CLIQUE.
(Refer Chapter 4)

Frequent Pattern Mining (08 periods)
Basic Concepts : Market Basket Analysis, Frequent Itemset, Closed Itemset, and Association Rules; Frequent Itemset Mining Methods : The Apriori Algorithm : Finding Frequent Itemset Using Candidate Generation, Generating Association Rules from Frequent Itemset, Improving the Efficiency of Apriori, A pattern growth approach for mining Frequent Itemset, Mining Frequent Itemset using vertical data formats; Introduction to Advance Pattern Mining : Mining Multilevel Association Rules and Multidimensional Association Rules.
Self-learning Topics : Association Mining to Correlation Analysis, lift, Introduction to Constraint-Based Association Mining.
(Refer Chapter 5)

Business Intelligence (04 periods)
What is BI? Business intelligence architectures; Definition of decision support system; Development of a business intelligence system using Data Mining for business Applications like Fraud Detection, Recommendation System.
Self-learning Topics : Clickstream Mining, Market Segmentation, Retail industry, Telecommunications industry, Banking & finance CRM, Epidemic prediction, Fake News Detection, Cyberbullying, Sentiment Analysis etc.
(Refer Chapter 6)

Total : 39 periods

Table of Contents

Chapter 1 : Data Warehouse (DWH) Fundamentals with Introduction to Data Mining
1.1 DWH Characteristics
1.1.1 Definition Data Warehouse
1.1.2 Benefits of Data Warehousing
1.1.3 Features of a Data Warehouse
1.2 Dimensional modelling : Star, Snowflakes
1.2.1 What is Dimensional Modelling?
1.2.2 Difference between Data Warehouse Modeling and Operational Database Modeling
1.2.3 Comparison between Dimensional Model and ER model
1.2.4 Information Package Diagram
1.2.5 Star Schema
1.2.6 STAR schema Keys
1.2.7 The Snowflake Schema
1.2.8 Star Flake Schema
1.2.9 Differentiate between Star Schema and Snowflake Schema
1.2.10 Fact Tables and Dimension Tables
1.2.11 Factless Fact Table
1.2.12 Fact Constellation Schema or Families of Star
1.2.13 Examples on Star Schema and Snowflake Schema
1.3 OLAP operation
1.3.1 OLAP operations or OLAP Techniques
1.3.1(A) Consolidation or Roll Up
1.3.1(B) Drill-down
1.3.1(C) Slicing and dicing
1.3.1(D) Dice
1.3.1(E) Pivot / Rotate
1.3.1(F) Other OLAP operations
1.3.2 Examples of OLAP
1.4 OLTP vs OLAP
1.5 Data Mining as a step in KDD
1.5.1 Definition
1.5.2 KDD Process (Knowledge Discovery in Databases)
1.5.3 Architecture of Typical Data Mining System
1.6 Kind of Patterns to be Mined
1.6.1 Data Mining Functionalities
1.7 Technologies Used
1.7.1 Statistics
1.7.2 Machine Learning
1.7.3 Information Retrieval (IR)
1.7.4 Database Systems and Data Warehouses
1.7.5 Decision Support System
1.8 Data Mining Applications
1.9 Self-learning Topics
1.9.1 Data Marts
1.9.2 Major Issues in Data Mining

Chapter 2 : Data Exploration and Data Preprocessing
2.1 Types of Attributes
2.1.1 Attributes Types
2.2 Statistical Description of Data
2.2.1 Central Tendency
2.2.2 Dispersion of Data
2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
2.3 Measuring Similarity and Dissimilarity
2.3.1 Data Matrix versus Dissimilarity Matrix
2.3.2 Proximity Measures for Nominal Attributes
2.3.3 Proximity Measures for Binary Attributes
2.3.4 Dissimilarity of Numeric Data : Minkowski Distance
2.3.5 Proximity Measures for Ordinal Attributes
2.3.6 Dissimilarity for Attributes of Mixed Types
2.3.7 Cosine Similarity
2.4 Why Preprocessing?
2.4.1 Why Pre-processing is Required?
2.4.2 Different Forms of Data Pre-processing
2.5 Data Cleaning
2.5.1 Reasons for "Dirty" Data
2.5.2 Steps in Data Cleansing
2.5.3 Missing Values
2.5.4 Noisy Data
2.5.4(A) Binning
2.5.4(B) Outlier analysis by clustering
2.5.4(C) Regression
2.5.5 Inconsistent Data
2.6 Data Integration
2.6.1 Introduction to Data Integration
2.6.1(A) Entity Identification Problem
2.6.1(B) Redundancy and Correlation Analysis
2.6.1(C) Tuple Duplication
2.6.1(D) Data Value Conflict Detection and Resolution
2.7 Data Reduction
2.7.1 Data Cube Aggregation
2.7.2 Dimensionality Reduction
2.7.2(A) Attribute subset selection
2.7.3 Data Compression
2.7.4 Numerosity Reduction
2.7.4(A) Histogram Analysis
2.7.4(B) Clustering
2.7.4(C) Sampling
2.8 Data transformation and Data Discretization
2.8.1 Data Transformation
2.8.2 Data Discretization
2.8.3 Data Transformation by Normalization
2.8.4 Discretization by Binning
2.8.5 Discretization by Histogram Analysis
2.9 Self-learning Topics
2.9.1 Data Visualisation
2.9.2 Concept Hierarchies
2.9.2(A) Concept hierarchy generation for categorical data

Chapter 3 : Classification
3.1 Basic Concept : Classification
3.1.1 Classification Problem
3.1.2 Classification Example
3.1.3 Classification is a Two Step Process
3.1.4 Difference between Classification and Prediction
3.1.5 Issues Regarding Classification and Prediction
3.2 Classification Method
3.2.1 Decision Tree Induction
3.2.1(A) Appropriate Problems for Decision Tree Learning
3.2.1(B) Decision Tree Representation
3.2.1(C) Attribute Selection Measure
3.2.1(D) Algorithm for Inducing a Decision Tree
3.2.2 Tree Pruning
3.2.3 Examples of ID3
3.3 Bayesian Classification : Naive Bayes Classifier
3.3.1 Bayes' Theorem
3.3.2 Basics of Bayesian Classification
3.3.3 Naive Bayes Classifier : Examples
3.3.4 Rule based Classification
3.3.5 Other Classification Methods
3.4.1 Structure of Regression Model
3.4.2 Linear Regression
3.4.2(A) Simple linear regression
3.5 Model Evaluation and Selection
3.5.1 Accuracy and Error Measures
3.5.2 Holdout
3.5.3 Random Sub-sampling
3.5.4 Cross-Validation (CV)
3.5.5 Bootstrapping
3.6 Introduction of Ensemble methods
3.6.1 Bagged (or Bootstrap) trees
[...] Multiple Linear Regression
3.7.3 Random forest
3.7.4 K-Nearest-Neighbor Classifiers
3.7.5 Support Vector Machine (SVM)
3.7.5(A) Tuning Hyperparameters

Chapter 4 : Clustering and Outlier Detection
4.1 Cluster Analysis
4.1.1 What is Clustering?
4.1.2 Categories of Clustering Methods
4.1.3 Different Distance Measures that can be used to Compute Distances between Two Clusters
4.1.4 Difference between Classification and Clustering
4.2 Partitioning Methods : K-Means, K Medoids
4.2.1 K-means Clustering (Centroid Based Technique)
4.2.2 K-Medoids (Representative Object based Technique)
4.2.3 Sampling Based Method
4.3 Hierarchical Methods : Agglomerative, Divisive, BIRCH
4.3.1 Agglomerative Hierarchical Clustering
4.3.2 Divisive Hierarchical Clustering
4.3.3 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
4.3.4 Advantages and Disadvantages of Hierarchical Clustering
4.4 Density-Based Methods : DBSCAN
4.4.1 DBSCAN (Density Based Methods)
4.5 What is an Outlier?
4.5.1 Applications
4.6 Types of Outlier
4.6.1 Global Outliers
4.6.2 Contextual (or Conditional) Outliers
4.6.3 Collective Outliers
4.7 Challenges of Outlier Detection
4.8 Outlier Detection Methods
4.8.1 Supervised, Semi-Supervised, Unsupervised Methods
4.8.2 Statistical Methods, Proximity-based Methods and Clustering-based Methods
4.9 Proximity based Approaches
4.9.1 Distance-based Outlier Detection and a Nested Loop Method
4.9.2 A Grid based Method
4.9.3 Density based Outlier Detection
4.10 Clustering based Approaches
4.11 Self-learning Topics
4.11.1 Hierarchical methods : Chameleon
4.11.2 Density based methods : OPTICS
4.11.3 Grid based methods : STING, CLIQUE

Chapter 5 : Frequent Pattern Mining
5.1 Basic Concept : Market Basket Analysis
5.1.1 What is Market Basket Analysis?
5.1.2 How is it Used?
5.1.3 Applications of Market Basket Analysis
5.2 Frequent Itemsets, Closed Itemsets and Association Rules
5.2.1 Frequent Itemsets
5.2.2 Closed Itemsets
5.2.3 Association Rules
5.2.3(A) Large Itemsets
5.3 Frequent Pattern Mining
5.4 Frequent Itemset Mining Method
5.4.1 Apriori Algorithm for Finding Frequent Itemsets using Candidate Generation
5.4.2 Generating Association rules from frequent itemsets
[...] Advantages and Disadvantages of Apriori Algorithm
[...] Solved Examples on Apriori Algorithm
[...] Approach for Mining Frequent Itemsets (FP-Growth)
[...] Definition of FP-tree
[...] Example of FP Tree
5.5.5 Mining Frequent Patterns from FP Tree
5.5.6 Benefits of the FP-Tree Structure
[...]
5.12 Introduction to Constraint based Association Mining

Chapter 6 : Business Intelligence
6.1 What is Business Intelligence?
6.2 Business Intelligence Architectures
6.2.1 The Three Major Components of BI Architecture
6.2.2 Different Components of a Business Intelligent System
6.3 Definition of Decision Support System
6.4 Development of a Business Intelligence System
6.5 Business Intelligence [...]
6.5.1 Business Intelligence Issues
6.6 Fraud Detection
6.7 Recommendation System
6.8 Clickstream Mining
6.8.1 Clickstream Data : Collection and Restoration
6.8.2 Clickstream Data : Visualisation and Categorisation
6.9 Market Segmentation
6.9.1 Market Segmentation for Market Trend Analysis
6.9.2 Sales Trend Analysis
6.10 Retail Industry
6.11 Telecommunications Industry
6.12 Banking and Finance
[...] CRM
[...] Data Mining Challenges and Opportunities
[...] Epidemic Prediction
Chapter 1 : Data Warehouse (DWH) Fundamentals with Introduction to Data Mining

Syllabus : DWH characteristics, Dimensional modeling : Star, Snowflakes, OLAP operation, OLTP vs OLAP, Data Mining as a step in KDD, Kind of patterns to be mined, Technologies used, Data Mining applications.
Self-learning Topics : Data Marts, Major issues in Data Mining.

1.1 DWH Characteristics

1.1.1 Definition Data Warehouse

• The term Data Warehouse was defined by Bill Inmon in 1990 in the following way : "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
• He defined the terms in the sentence as follows :
  Subject Oriented : Data that gives information about a particular subject instead of about a company's ongoing operations.
  Integrated : Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
  Time-variant : All data in the data warehouse is identified with a particular time period.
  Non-volatile : Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
• Ralph Kimball provided a much simpler definition of a data warehouse, i.e. "a data warehouse is a copy of transaction data specifically structured for query and analysis".
• This is a functional view of a data warehouse. Kimball did not address how the data warehouse is built, as Inmon did; rather, he focused on the functionality of a data warehouse.

1.1.2 Benefits of Data Warehousing

• Potential high returns on investment and enhanced business intelligence : Implementing a data warehouse requires a huge investment, running into lakhs of rupees. But it helps the organization to take strategic decisions based on past historical data, and the organization can improve the results of various processes like marketing segmentation, inventory management and sales.
• Competitive advantage : Decision makers can access data that was previously unknown or unavailable to them.
• Saves time : As the data from multiple sources is available in integrated form, business decision makers can access it directly. There is no need to retrieve the data from multiple sources.
• Better enterprise intelligence : It improves customer service and productivity.
• High quality data : Data in the data warehouse is cleaned, so data quality is high.

1.1.3 Features of a Data Warehouse

Characteristics / Features of a Data Warehouse

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse :
1. Subject Oriented
2. Integrated
3. Nonvolatile
4. Time Variant

1. Subject Oriented
• Data warehouses are designed to help analyze data. For example, to learn more about banking data, a warehouse can be built that concentrates on transactions, loans, etc.
• This warehouse can be used to answer questions like "Which customer has taken the maximum loan amount for last year?" This ability to define a warehouse by subject matter, loan in this case, makes the data warehouse subject oriented.

Fig. 1.1.1 : Data Warehouse is Subject Oriented

2. Integrated
• A data warehouse is constructed by integrating multiple, heterogeneous data sources like relational databases etc.
• The data collected is cleaned and then data integration techniques are applied, which ensures consistency in naming conventions, encoding structures, attribute measures etc. among the different data sources.

Fig. 1.1.2 : Integrated Data Warehouse

3. Non-volatile
Nonvolatile means that, once data is entered into the warehouse, it cannot be removed or changed, because the purpose of a warehouse is to analyze the data.

4. Time Variant
A data warehouse maintains historical data. For example, if a customer record has details of his job, a data warehouse would maintain all his previous jobs (historical information), whereas a transactional system only maintains the current job, due to which it is not possible to retrieve older records from it.

1.2 Dimensional modelling : Star, Snowflakes

1.2.1 What is Dimensional Modelling?

• It is a logical design technique used for data warehouses.
• The dimensional model is the underlying data model used by many of the commercial OLAP products available today in the market.
• The dimensional model uses the relational model with some important restrictions.
• It is one of the most feasible techniques for delivering data to the end users in a data warehouse.
• Every dimensional model is composed of at least one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables.

1.2.2 Difference between Data Warehouse Modeling and Operational Database Modeling

Operational Database Modeling | Data Warehouse Modeling
Data content is current values. | Data is archived, derived, summarized.
Data structure is optimized for transactions. | Data structure is optimized for complex queries.
Access frequency is high. | Access frequency is medium to low.
Data access type is read, update, delete. | Data access type is read only.
Usage is predictable, repetitive. | Usage is ad hoc, random, heuristic.
Response time is in sub-seconds. | Response time is in several seconds to minutes.
Large number of users. | Relatively small number of users.

1.2.3 Comparison between Dimensional Model and ER model

Sr. No. | Dimensional Model | ER Model
1. | Supports ad-hoc querying for business analysis and complex analyses (data warehouse and multidimensional database). | Supports OLTP and ODS (Operational Data Store).
2. | Entities are linked through a series of joins. | Entities are linked through a series of joins.
3. | Simply the view of the data model. You can rotate the data cube to see different views of the data. | The data model has only one dimension.
4. | It is asymmetric. | It is symmetric. All tables look the same.
5. | Permits redundancy. | Removes the redundancy in data.
6. | It is extensible to accommodate unexpected new data elements and new design decisions. The application is not changed. | If the model is modified, the applications are modified.
7. | It is robust. The dimensional model design can be done independent of expected query patterns. | It is variable in structure and very vulnerable to changes in the user's querying habits.
8. | The model is easy and understandable. | The model for an enterprise is very hard for people to visualize and keep in their heads.
9. | The model really models a business. It is a body of standard approaches for handling common modeling situations in the business world. | The model does not really model a business. It models the micro relationships among data elements.

1.2.4 Information Package Diagram

• The information package diagram is the approach used to determine the requirements of a data warehouse.
• It gives the metrics which specify the business units and business dimensions.
• The information package diagram defines the relationship between the subject or dimension matter and the key performance measures (facts).
• The information package diagram shows the details that users want, so it is effective for communication between the user and the technical staff.

Table 1.2.1 : Information Package for Hotel Occupancy

Hotel | Room Type | Time | Room Status
Hotel Id | Room id | Time id | Status id
Branch Name | room type | Year | Status Description
Branch Code | room size | Quarter |
Region | number of beds | Month |
Address | type of bed | Date |
city/state/zip | max occupants | day of week |
construction year | Suite | day of month |
renovation year | | holiday flag |

Facts : (a) Occupied Rooms (b) Vacant Rooms (c) Unavailable Rooms (d) No. of occupants (e) Revenue

1.2.5 Star Schema

• Star schema is the most popular schema design for a data warehouse. Dimensions are stored in a dimension table and every entry has its own unique identifier.
• Every dimension table is related to one or more fact tables.
• All the unique identifiers (primary keys) from the dimension tables make up a composite key in the fact table.

Fig. 1.2.1 : Sales Star Schema

• The fact table also contains facts. For example, a combination of store_id, date_key and product_id gives the amount of a certain product sold on a given day at a given store.
• Foreign keys for the dimension tables are contained in the fact table. For example, date_key, product_id and store_id are all three foreign keys.
• In dimensional modeling, fact tables are normalised, whereas dimension tables are not.
• The size of the fact table is large as compared to the dimension tables.
• The facts in the star schema can be classified into three types :
(a) Fully-additive : Additive facts are facts that can be summed up through all of the dimensions in the fact table.
(b) Semi-additive : Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. Example : bank balances. The current balance of an account cannot be summed across time periods, but the current balances of all accounts can be summed to get the current balance of the bank.
(c) Non-additive : Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table. E.g. ratios, averages and variance.

Advantages of Star Schema

• A star schema describes aspects of a business. It is made up of multiple dimension tables and one fact table. For example, for a business that sells books through catalogs, some of the dimension tables would be customer, book, catalog and year. The fact table would contain information about the books that are ordered from each catalog by each customer during a particular year.
• Reduced joins, faster query operation.
• It is a fully denormalized schema.
• Simplest DW schema.
• Easy to understand.
• Easy to navigate between the tables due to the smaller number of joins.
• Most suitable for query processing.

Example

Fig. 1.2.2 : Sales Star Schema

1.2.6 STAR schema Keys

1. Primary Keys
The primary key of a dimension table is the attribute whose value identifies each row in the dimension table uniquely. Example : in the Professor_Dim dimension, Prof_id is the primary key.

Fig. 1.2.3

2. Surrogate Keys
• These keys are system generated sequence numbers.
• They do not have any built-in meaning.
• All data warehouse keys must be meaningless surrogate keys.
• One must not use the original production keys.
• A four byte integer makes a good surrogate key.

3. Foreign Keys
Every dimension table has a one-to-many relationship with the fact table. So the primary key of each dimension table is a foreign key in the fact table.

1.2.7 The Snowflake Schema

• A snowflake schema is used to remove the low cardinality attributes (i.e. textual attributes having few distinct values) from a dimension table and place them in a secondary dimension table.
• For example, in the Sales schema, the product category in the product dimension table can be removed and placed in a secondary dimension table by normalizing the product dimension table. This process is carried out on large dimension tables.
• It is a normalization process carried out to manage the size of the dimension tables. But this may affect performance, as joins need to be performed.
• In a star schema, if all the dimension tables are normalised then the schema is called a snowflake schema, and if only a few of the dimensions are normalised then it is called a star flake schema.

1.2.8 Star Flake Schema

• It is a hybrid structure (i.e. star schema + snowflake schema).
• Every fact points to one tuple in each of the dimensions and has additional attributes.
• Does not capture hierarchies directly.
• Straightforward means of capturing a multiple dimension data model using relations.

Fig. 1.2.4 : Sales Snowflake Schema

1.2.9 Differentiate between Star Schema and Snowflake Schema

Sr. No. | Star Schema | Snowflake Schema
1. | A star schema contains the dimension tables mapped around one or more fact tables. | A snowflake schema contains in-depth joins because the tables are split into many pieces.
2. | It is a de-normalized model. | It is the normalized form of the star schema.
3. | No need to use complicated joins. | Complicated joins have to be used, since it has more tables.
4. | Queries return results fast. | There will be some delay in processing the query.
5. | Star schemas are usually not in BCNF form. All the primary keys of the dimension tables are in the fact table. | In a snowflake schema, dimension tables are in 3NF, so there are more dimension tables, which are linked by primary key - foreign key relations.
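The difference in joins described in the comparison above can be tried out on a toy database. The following is only a minimal sketch using Python's built-in sqlite3 module; the table names (product_dim_star, product_dim_snow, category_dim, sales_fact) and the sample rows are hypothetical and are not taken from the book's figures. It stores the product category inline for the star version and in a secondary dimension table for the snowflake version, then runs the same "sales by category" query on both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical toy schema, for illustration only.
cur.executescript("""
-- Star: product dimension is denormalized, category is stored inline.
CREATE TABLE product_dim_star (
    product_key INTEGER PRIMARY KEY,
    prod_name   TEXT,
    category    TEXT
);
-- Snowflake: category is moved to a secondary dimension table.
CREATE TABLE category_dim (
    category_key INTEGER PRIMARY KEY,
    category     TEXT
);
CREATE TABLE product_dim_snow (
    product_key  INTEGER PRIMARY KEY,
    prod_name    TEXT,
    category_key INTEGER REFERENCES category_dim(category_key)
);
-- One shared fact table keeps the sketch short.
CREATE TABLE sales_fact (
    product_key  INTEGER,
    sales_amount REAL
);
""")

cur.executemany("INSERT INTO product_dim_star VALUES (?, ?, ?)",
                [(1, "Pen", "Stationery"), (2, "Notebook", "Stationery"),
                 (3, "Lamp", "Electrical")])
cur.executemany("INSERT INTO category_dim VALUES (?, ?)",
                [(10, "Stationery"), (20, "Electrical")])
cur.executemany("INSERT INTO product_dim_snow VALUES (?, ?, ?)",
                [(1, "Pen", 10), (2, "Notebook", 10), (3, "Lamp", 20)])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [(1, 50.0), (2, 120.0), (3, 300.0), (1, 25.0)])

# Star schema: one join from the fact table to the dimension table.
star_query = """
SELECT p.category, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim_star p ON p.product_key = f.product_key
GROUP BY p.category ORDER BY p.category
"""

# Snowflake schema: an extra join is needed to reach the category name.
snow_query = """
SELECT c.category, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim_snow p ON p.product_key = f.product_key
JOIN category_dim c ON c.category_key = p.category_key
GROUP BY c.category ORDER BY c.category
"""

star_result = cur.execute(star_query).fetchall()
snow_result = cur.execute(snow_query).fetchall()
print(star_result)
print(snow_result)
```

Both queries return the same totals; the snowflake version simply pays for its smaller product dimension with one more join, which is the trade-off listed in rows 3 and 4 of the comparison.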
1.2.10 Fact Tables and Dimension Tables

• A dimensional model consists of fact tables and dimension tables.
• Each dimensional model has a primary table, the fact table, which is meant to contain the business measurements.
• Numeric and additive facts are the most useful facts.
• A fact table has many-to-many relationships; it contains a set of two or more foreign keys that join to dimension tables.
• A fact in the fact table depends on many factors. For example, in a sales schema, the sales_amount fact depends on product, location and time. These factors are called dimensions.
• Dimensions are factors on which a given fact depends.
• The sales amount fact can also be thought of as a function of three variables :
  sales_amount = f(product, location, time)
• Likewise, in a sales fact table we may include other facts like sales unit and cost.
• Dimension tables are companion tables to a fact table in a star schema.
• Each dimension table is defined by its primary key, which serves as the basis for referential integrity with any given fact table to which it is joined.
• Most dimension tables contain textual information.
• To understand the concepts of facts, dimensions and star schema, consider the following scenario : imagine standing in the marketplace, watching the products being sold and writing down the quantity sold and the sales amount each day for each product in each store.
• Note that a measurement needs to be taken at every intersection of all dimensions (day, product and store).
• The information gathered can be stored in the fact table as shown in Table 1.2.2.

Table 1.2.2 : Sales Fact Table

Date Key | Product Key | Store Key | Sales Unit | Sales Amount | Cost

• The facts are Sales Unit, Sales Amount and Cost (note that all are numeric and additive), which depend on the dimensions Date, Product and Store. The details of the dimensions are stored in dimension tables.
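The idea that additive facts can be summed along any of the dimensions can be sketched in a few lines. The example below is an illustration under stated assumptions, not the book's own implementation: it uses Python's built-in sqlite3 module, a sales_fact table whose columns follow Table 1.2.2, made-up sample rows, and a hypothetical helper named total_by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Fact table with the columns of Table 1.2.2; the three keys together
# identify one measurement (one day, one product, one store).
conn.execute("""
CREATE TABLE sales_fact (
    date_key     INTEGER,
    product_key  INTEGER,
    store_key    INTEGER,
    sales_unit   INTEGER,
    sales_amount REAL,
    cost         REAL,
    PRIMARY KEY (date_key, product_key, store_key)
)
""")

# Made-up sample measurements, for illustration only.
rows = [
    (20240101, 1, 100, 5, 50.0, 30.0),
    (20240101, 2, 100, 2, 40.0, 25.0),
    (20240102, 1, 100, 3, 30.0, 18.0),
    (20240102, 1, 200, 4, 40.0, 24.0),
]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", rows)


def total_by(dimension_key):
    """Sum the additive facts along one dimension (hypothetical helper)."""
    # Only known column names are allowed, since the name is placed in the SQL text.
    if dimension_key not in {"date_key", "product_key", "store_key"}:
        raise ValueError(f"unknown dimension key: {dimension_key}")
    sql = (
        f"SELECT {dimension_key}, SUM(sales_unit), SUM(sales_amount) "
        f"FROM sales_fact GROUP BY {dimension_key} ORDER BY {dimension_key}"
    )
    return conn.execute(sql).fetchall()


by_product = total_by("product_key")
by_store = total_by("store_key")
by_date = total_by("date_key")
print(by_product)
print(by_store)
print(by_date)
```

Whichever dimension is used for grouping, the grand total of sales_amount stays the same, which is what makes Sales Unit, Sales Amount and Cost fully additive facts. A ratio such as profit margin would not behave this way and would have to be recomputed from the summed amounts.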
1.2.11 Factless Fact Table
• A factless fact table contains only keys; no measures are available in it.
• It is used only to relate the elements of various dimensions.
• Factless fact tables are useful to describe events and coverage, i.e. the tables contain information that something has or has not happened.
• They are often used to represent many-to-many relationships.
• The only thing they contain is a concatenated key; they do, however, still represent a focal event which is identified by the combination of conditions referenced in the dimension tables.
• An example of a factless fact table can be seen in Fig. 1.2.5.

[Figure: a fact table holding only keys, each linked to its dimension table (date, customer and employee dimensions, among others)]
Fig. 1.2.5 : A Factless Fact Table

There are two main types of factless fact tables.

Event tracking tables
• Use a factless fact table to track events of interest to the organization. For example, attendance at a cultural event can be tracked by creating a fact table containing the following foreign keys (i.e. links to dimension tables): event identifier, speaker/entertainer identifier, participant identifier, event type and date. This table can then be queried to find out information such as which cultural events or event types are the most popular.
• The following example shows a factless fact table which records every time a student attends a course. It can answer questions such as: Which class has the maximum attendance? What is the average attendance of a given course?
• All such queries are based on COUNT() with GROUP BY. So we can first count and then apply other aggregate functions such as AVERAGE, MAX and MIN.

Coverage Tables
• The other type of factless fact table is called a coverage table by Ralph Kimball. It is used to support negative analysis reports, for example a store that did not sell a product for a given period.
To produce such a report, you need a fact table that captures all the possible combinations. You can then figure out what is missing.

Fig. 1.2.6 : Example of Event Tracking Tables

Common examples of factless fact tables :
• Visitors to the office.
• List of people for the web click.
• Tracking student attendance or registration events.

1.2.12 Fact Constellation Schema or Families of Star
Q. Write a short note on fact constellation.

Fact Constellation
• As its name implies, it is shaped like a constellation of stars (i.e. star schemas).
• This schema is more complex than the star or snowflake varieties because it contains multiple fact tables. This allows dimension tables to be shared amongst the fact tables.
• A schema of this type should only be used for applications that need a high level of sophistication.
• For each star schema or snowflake schema it is possible to construct a fact constellation schema.
• That solution is very flexible; however, it may be hard to manage and support.
• The main disadvantage of the fact constellation schema is a more complicated design, because many variants of aggregation must be considered.
• In a fact constellation schema, different fact tables are explicitly assigned to the dimensions which are relevant for the given facts.
• This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level.
• Use of that model is reasonable when, for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with a sales forecast which is calculated based on month, client id and product id.
• In that case, using two different fact tables at different levels of grouping is realized through a fact constellation model.

Family of stars
[Figure: several fact tables sharing common dimension tables]
Fig. 1.2.7 : Family of stars

[Figure: sales fact constellation sharing dimensions such as branch and location]
Fig. 1.2.8 : Sales Fact Constellation

1.2.13 Examples

Ex. 1.2.1 : The All Electronics company has a sales department. Sales considers four dimensions, namely time, item, branch and location. The schema contains a central fact table, sales, with two measures: dollars_sold and units_sold. Design a star schema, snowflake schema and fact constellation for the same.
Soln. :
(a) Star Schema
Fig. P.1.2.1 : Sales Star Schema
(b) Snowflake Schema
Fig. P.1.2.1(a) : Sales Snowflake Schema
(c) Fact Constellation
[Figure: a sales fact table (time, item, branch and location keys; units sold, dollars sold) and a shipping fact table (time, item and shipper keys; from and to locations; dollars cost, units shipped) sharing the time, item and location dimensions, plus a shipper dimension]
Fig. P.1.2.1(b) : Fact constellation for sales

Ex. 1.2.2 : The Mumbai University wants you to help design a star schema to record grades for courses completed by students. There are four dimension tables, namely course_section, professor, student and period, with attributes as follows :
Course_section attributes : Course_id, Section_number, Course_name, Units, Room_id, Room_capacity. During a given semester the college offers an average of 500 course sections.
Professor attributes : Prof_id, Prof_name, Title, Department_id, Department_name.
Student attributes : Student_id, Student_name, Major. Each course section has an average of 60 students.
Period attributes : Semester_id, Year. The database will contain data for 30 months.
The only fact that is to be recorded in the fact table is the course grade.
Answer the following questions :
(a) Design the star schema for this problem.
(b) Estimate the number of rows in the fact table, using the assumptions stated above, and also estimate the total size of the fact table (in bytes), assuming that each field has an average of 5 bytes.
(c) Can you convert this star schema to a snowflake schema ? Justify your answer and design the snowflake schema if it is possible.

Soln. :
(a) Star Schema
[Figure: grade fact table linked to the course section, professor, student and period dimensions with the attributes listed above]
Fig. P.1.2.2 : University Star Schema

(b) Number of rows in the fact table
• Total course sections conducted by the university in a semester = 500
• Each course section has on average 60 students.
• The university stores data for 30 months.
• Total student enrolments across all course sections in one semester = 500 × 60 = 30000
• Time dimension = 30 months = 5 semesters (assuming 1 semester = 6 months)
• Number of rows in the fact table = 30000 × 5 = 150000 (each enrolment yields one grade per semester, over 5 semesters)
• Assuming each fact row has 5 fields (four dimension keys and the grade) of 5 bytes each, a row takes 25 bytes, so the total size is about 150000 × 25 = 3750000 bytes.

(c) Snowflake Schema
• Yes, the above star schema can be converted to a snowflake schema, considering the following assumptions.
• Courses are conducted in different rooms, so the course dimension can be further normalized to a rooms dimension, as shown in Fig. P.1.2.2(a).
[Figure: the university star schema with the room attributes moved from the course dimension into a separate room dimension]
Fig. P.1.2.2(a) : University Snowflake Schema
• A professor belongs to a department, and the department dimension is not added in the star schema, so the professor dimension can be further normalized to a department dimension.
• Similarly, students can have different major subjects, so the student dimension can also be normalized, as shown in Fig. P.1.2.2(a).

Ex. 1.2.3 : Draw a star schema for "Hotel Occupancy" considering dimensions like Time, Hotel, etc.
Design star and snowflake schemas for "Hotel Occupancy" considering dimensions like Time, Hotel, Room, etc.
Calculate the maximum number of base fact table records for the values given below :
Time period : 5 years
Hotels : 150
Rooms : 750 rooms in each hotel (about 400 occupied in each hotel daily)
Information requirements are recorded for "Hotel Occupancy" considering dimensions like Hotel, Room and Time. A few facts recorded are vacant rooms, occupied rooms, number of occupants, etc. Answer the following questions for this problem :
i) Design the star schema.
ii) Can you convert this star schema to a snowflake schema ? If yes, justify and draw the snowflake schema.

Soln. :
(a) Draw the Star Schema
[Figure: hotel occupancy fact table (occupied rooms, vacant rooms, number of occupants) linked to room, time, hotel and room status dimensions]
Fig. P.1.2.3(a) : Hotel Occupancy Star Schema
[Figure: fact table linked to time, hotel, room, customer and booking dimensions]
Fig. P.1.2.3(b) : Hotel Occupancy Star Schema
[Figure: the schema of Fig. P.1.2.3(b) with room type and location normalised into their own dimensions]
Fig. P.1.2.3(c) : Hotel Occupancy Snowflake Schema

Maximum number of fact table records
Time period = 5 years × 365 days = 1825 days
There are 150 hotels, each with about 400 rooms occupied daily.
Maximum number of fact table records = 1825 × 150 × 400 = 109,500,000 (about 0.11 billion)

Ex. 1.2.4 : For a supermarket chain, consider the following dimensions, namely product, store, time and promotion. The schema contains a central fact table, sales facts, with three measures: unit_sales, dollars_sales and dollar_cost.
Design a star schema and calculate the maximum number of base fact table records for the values given below :
Time period : 5 years
Store : 300 stores reporting daily sales
Product : 40,000 products in each store (about 4000 sell in each store daily)
Promotion : a sold item may be in only one promotion in a store on a given day
Soln. :
(a) Star schema
[Figure: sales fact table linked to the product, store, time and promotion dimensions]
Fig. P.1.2.4 : Sales Promotion Star Schema
(b) Time period = 5 years × 365 days = 1825 days
There are 300 stores, each selling about 4000 products daily.
Promotion = 1
Maximum number of fact table records = 1825 × 300 × 4000 × 1 = 2,190,000,000 (about 2 billion)

Ex. 1.2.5 : Draw a star schema for a student academic fact database.
[Figure: academic fact table linked to student, project, status and time dimensions]
Fig. P.1.2.5 : Student Academic Star Schema

Ex. 1.2.6 : List the dimensions and facts for the Clinical Information System and design star and snowflake schemas.
Soln. :
Dimensions
1. Patient
2. Doctor
3. Procedure
4. Diagnose
5. Date of Service
6. Location
7. Provider
Facts
1. Adjustment
2. Charge
3. Age
[Figure: procedure fact table linked to the patient, date of service, doctor, location, provider, procedure and diagnosis dimensions]
Fig. P.1.2.6 : Clinical Information Star Schema

[Figure: the same schema with further normalised dimensions such as specialization and department]
Fig. P.1.2.6(a) : Clinical Information Snow Flake Schema

Ex. 1.2.7 : Draw a star schema for Library Management.
Soln. :
[Figure: star schema for library management]
Fig. P.1.2.7 : Star schema for Library

Ex. 1.2.8 : A manufacturing company has a huge sales network. To control the sales, it is divided into regions. Each region has multiple zones. Each zone has different cities. Each sales person is allocated different cities. The objective is to track sales figures at different granularity levels of region, and also to count the number of products sold. Create a data warehouse schema that takes into consideration the above granularity levels for region and sales person, and the quarterly, yearly and monthly sales.
Soln. :
[Figure: sales fact table linked to sales person, product, time and location dimensions]
Fig. P.1.2.8 : Star schema for Sales

Ex. 1.2.9 : A bank wants to develop a data warehouse for effective decision-making about their loan schemes. The bank provides loans to customers for various purposes like house building loan, personal loan, etc. The whole country is categorized into a number of regions. Each region consists of a set of states. Loans are disbursed to customers at interest rates that change from time to time. Also, at any given point of time, the different types of loans have different rates. The data warehouse should record an entry for each disbursement of a loan to a customer. With respect to the above business scenario :
(i) Design an information package diagram. Clearly explain all aspects of the diagram.
(ii) Draw a star schema for the data warehouse, clearly identifying the fact tables, dimension tables, their attributes and measures.
Soln. :
(i) Information package diagram

Time | Customer | Branch | Location
Time key | Customer key | Branch key | Location key
Day | Account number | Branch area | Region
Day_of_week | Account type | Branch name | State
Month | Loan type | | City
Quarter | | | Street
Year | | |
Holiday flag | | |

(ii) Star Schema for a Bank
Fig. P.1.2.9 : Star schema for Bank

Ex. 1.2.10 : Draw a star schema for video rental.
Soln. :
[Figure: rental facts table linked to item and customer dimensions, among others]

[Figure: a 3-D data cube over location (cities), time (quarters) and item type, and the 2-D slice obtained from it]
Fig. 1.3.5 : Slice operation

1.3.1(D) Dice
• The dice operation carries out selection with respect to two or more dimensions of the given cube and produces a sub-cube.
• For example, the dice operation is performed on the left cube based on three dimensions, Location, Time and Item, as shown in Fig. 1.3.6, where the criteria is (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
[Figure: the full cube on the left and the diced sub-cube on the right]
Fig. 1.3.6 : Dice operation

1.3.1(E) Pivot / Rotate
• The pivot technique is used for visualization of data. This operation rotates the data axes to give another presentation of the data.
• For example, Fig. 1.3.7 shows the pivot operation where the item and location axes in a 2-D slice are rotated.
[Figure: a 2-D slice of item type against location (cities), before and after rotation]
Fig. 1.3.7 : Pivot operation

1.3.1(F) Other OLAP operations
• Drill across : This technique is used when there is a need to execute a query involving more than one fact table.
• Drill through : This technique uses relational SQL facilities to drill through the bottom level of the data cube.

1.3.2 Examples of OLAP

Ex. 1.3.1 : Consider a data warehouse for a hospital where there are three dimensions :
(a) Doctor (b) Patient (c) Time
and two measures : i) count ii) charge, where charge is the fee that the doctor charges a patient for a visit.
Using the above example, describe the following OLAP operations :
1. Slice 2. Dice 3. Rollup 4. Drilldown 5. Pivot
Soln. :
There are four tables : 3 dimension tables and 1 fact table.
Dimension tables
1. Doctor (DID, name, phone, location, pin, specialisation)
2. Patient (PID, name, phone, state, city, location, pin)
3. Time (TID, day, month, quarter, year)
Thus it cuts the cube with more than one predicate like dice on cube with DID = 2, and DID = 01 and PID = 01 and PID = 03 and TID = 02, 03, Fig.P.1.3.1(c) 3. Roll up : It gives summary based on concept hierarchies. Assuming there exists concept hierarchy in patient table as state->city-slocation. Then roll up will summarise the charges or count in terms of city or further roll up will give charges for a particular state etc a Data Warehouse (DWH) wh 132 0 7 2 250 0 o a @, ie (M0011, 0012 and we Doro eres Of ee) cer seertor ~ |» Le, o o 00 cS , 360 200 [oo a a 7) ez) Fig. P.1.3.1(4) 4. Drill down : It 1s opposite to roll up that means if currently cube is summarized with respect to city then drill mariaation with respect to location, 06 7 0 (200 7 60. 30 Fig.P.1.3.1(¢) 5. _Plvot It rotates the cube, sub cube or rolled -up or drilled ~down cube, thus changing the view of the cube. Ex, 1.3.2; Tho collage wants 10 record the marks forthe courses completed by siudenis using the dimensions : 4) Course, b) Student, ©) Time and a measure of Aggregate marks, CCroato a Cube and desenbe following OLAP operations () Rolup ()Detldown (il) Stee (W) Dice Phot. Soln, : ‘OLAP Operations 1. _Silee For single students get the marks forall courses and all semester is slice operation as cube is generated with respect to one dimension Student Dim 2, Dice : Get the marks of single (or multiple) students for one (or more) semester but for all courses is dice ‘operation an cube Is generated with respect to two dimensions Le, Student_Dim and Time_Dim 3, Roll Up: Get the year wise marks forall students is roll up as semester wise marks are added to get year wise ‘marks and Its one level up hierarchy of dimension : Time_Dim. 4. Drill Down : Get the marks based on sectlon_td ts drill down operation as its one level down hierarchy for Course_Dim dimension. cs Oren 1 Data Mining & Business Intelligence (MU) Data Warehouse (DWH). 
5. Pivot : Getting the room-wise reports or course-wise reports is a pivot operation. In pivot one can view or rotate the information as per requirement.

[Figures for Ex. 1.3.2: cubes of aggregate marks over the Student, Course and Time dimensions, one per operation]

1.4 OLTP vs OLAP
• OLAP, or On-Line Analytical Processing, supports the multidimensional view of data.
• OLAP provides fast, steady and proficient access to the various views of information.
• Complex queries can be processed. It is easy to analyze information by processing complex queries on multidimensional views of data.
• A data warehouse is generally used to analyse information where a huge amount of historical data is stored.
• Information in a data warehouse is related to more than one dimension, like sales, market trends, buying patterns, suppliers, etc.

Definition
• Definition given by the OLAP Council (www.olapcouncil.org) : On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

Application Differences

OLTP | OLAP
Transaction oriented | Subject oriented
High Create/Read/Update/Delete (CRUD) activity | High Read activity
Many users | Few users
Continuous updates - many sources | Batch updates - single source
Real-time information | Historical information
Tactical decision-making | Strategic planning
Controlled, customized delivery | "Uncontrolled", generalized delivery
RDBMS | RDBMS and/or MDBMS
Operational database | Informational database

Modelling Objectives Differences

OLTP | OLAP
High transaction volumes using few records at a time. | Low transaction volumes using many records at a time.
Balancing needs of online v/s scheduled batch processing. | Design for on-demand online processing.
Highly volatile data. | Non-volatile data.
Data redundancy - BAD. | Data redundancy - GOOD.
Few levels of granularity. | Multiple levels of granularity.
Complex database designs used by IT personnel. | Simpler database designs with business-friendly constructs.
Single purpose model - supports Operational System. | Multiple models - support Informational Systems.
Full set of Enterprise data. | Subset of Enterprise data.
Eliminate redundancy. | Plan for redundancy.
Natural or surrogate keys. | Surrogate keys.
Validate model against business function analysis. | Validate model against reporting requirements.
Technical metadata depends on business requirements. | Technical metadata depends on data mapping results.
This moment in time is important. | Many moments in time are essential elements.

1.5 Data Mining as a step in KDD
Q. What is Data Mining ?

What is Data Mining ?
• Data mining is a new technology which helps organizations to process data through algorithms to uncover meaningful patterns and correlations from large databases that may otherwise not be found with standard analysis and reporting.
• Data mining tools can help to understand the business better and also improve future performance through predictive analytics, making organizations proactive and allowing knowledge-driven decisions.
• To address issues related to information extraction from large databases, the data mining field brings together methods from several domains like Machine Learning, Statistics, Pattern Recognition, Databases and Visualization.
• Data mining finds its application in market analysis and management, e.g. customer relationship management, cross selling and market segmentation. It can also be used in risk analysis and management for forecasting, customer retention, improved underwriting, quality control, competitive analysis and credit scoring.

1.5.1 Definition
• Data mining is processing data to identify patterns and establish relationships.
• Data mining is the process of analyzing large amounts of data stored in a data warehouse for useful information. It makes use of artificial intelligence techniques, neural networks and advanced statistical tools (such as cluster analysis) to reveal trends, patterns and relationships which otherwise may remain undetected.
• Data mining is a non-trivial process of identifying valid, novel, potentially useful and understandable patterns in data.

1.5.2 KDD Process (Knowledge Discovery in Databases)
Q. Explain data mining as a step in KDD.
Q. Explain KDD process with diagram.
• The process of discovering knowledge in data, together with the application of data mining methods, is referred to as Knowledge Discovery in Databases (KDD).
• It includes a wide variety of application domains, which include Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics and Data Visualization.
• The main goal is extracting knowledge from large databases. This goal is achieved by using data mining algorithms to identify useful patterns according to some predefined measures and thresholds.
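As a rough illustration of "patterns according to some predefined measures and thresholds", the sketch below uses support (the fraction of transactions containing an item pair) as the measure and a minimum support of 0.5 as the threshold. The choice of measure, the sample baskets and the function name frequent_pairs are all assumptions made for this example, not part of the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
]

MIN_SUPPORT = 0.5  # predefined threshold on the support measure


def frequent_pairs(baskets, min_support):
    """Return item pairs whose support is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort the items so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


patterns = frequent_pairs(transactions, MIN_SUPPORT)
print(patterns)
```

Only pairs that meet the threshold are reported as patterns; changing the measure or the threshold changes which patterns are considered useful.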
