UNIT-2: ASSOCIATION RULE MINING

Mining Frequent Patterns - Associations and Correlations - Mining Methods - Mining Various Kinds of Association Rules - Correlation Analysis - Constraint Based Association Mining - Graph Pattern Mining - Sequential Pattern Mining (SPM)

LEARNING OBJECTIVES

* Basic Concepts of Association Rules
* Different Kinds of Association Rules
* Various Algorithms for Mining Association Rules
* Relating Association Rule Mining to Correlation Analysis
* Multilevel and Multidimensional Association Rules
* Various Types of Constraints used in Constraint Based Mining
* Basic Concepts of Graph Mining
* Methods for Mining Frequent Subgraphs
* Basic Concepts of Sequential Pattern Mining (SPM)

INTRODUCTION

A collection of zero or more items in association analysis is termed an itemset. The number of transactions containing a particular itemset is termed its support count. The support of an itemset identifies how frequently a rule applies to the provided dataset, while the confidence is a measure that defines how frequently the items of the rule consequent occur in transactions that contain the rule antecedent. Association rules can be specified over both categorical and numerical attributes. Two threshold conditions set by experts (minimum support and minimum confidence) must be satisfied in order to generate relevant association rules. An association rule is mined by first searching for the itemsets that occur most frequently and then producing strong association rules from the resultant itemsets. The Apriori algorithm is the most preferred algorithm for mining frequent itemsets for single-dimensional boolean association rules, whereas the compressed form of the input data used by FP-growth is called an FP-tree.

Graph mining is an important data mining technique used for modelling complicated structures such as chemical compounds, protein structures, biological networks and XML documents, whereas Sequential Pattern Mining (SPM) refers to the process of extracting ordered events or elements that often occur as patterns.

PART-A: SHORT QUESTIONS WITH SOLUTIONS

Q1. Write about association rule mining.

Answer:

Association rules are specified over both categorical and numerical attributes. These rules carry two measures, namely utility and certainty, which represent the statistical significance and strength of the extracted rules. Two threshold conditions set by experts must be satisfied in order to generate relevant association rules.

Association rules are expressed as A1 => A2, where A1 and A2 represent conjunctions of conditions on the attributes. These conditions can either be membership conditions over categorical values or interval ranges over numerical values. The support of the rule A1 => A2 is the fraction of transactions that contain both A1 and A2, and the rule A1 => A2 states that a transaction containing A1 should also contain A2.

Q2. Mention the importance of association rule mining.

Answer:

The importance of association rule mining lies in understanding customer behaviour. It plays an important role in customer analytics, market basket analysis, product clustering, catalog design and store layout.

Q3. Define frequent sets, confidence and support.

OR

Define support and confidence measure.

OR

Write short note on support and confidence measures.
(Refer Only Topics: Support, Confidence)

Answer:

Frequent Sets

The set of items that frequently appear as a collective group in a transactional database is called a frequent itemset.

Support

The support of an itemset identifies how frequently a rule is applied to the provided dataset.

Confidence

The confidence is a measure that defines how frequently the items of the rule consequent occur in transactions that contain the rule antecedent.

Differentiate between multilevel and multi-dimensional association rule mining.    (Nov/Dec.-18(R15), Q1(b))

Answer:

Multilevel Association Rule Mining
1. It employs the following techniques: (i) uniform support, (ii) reduced support, (iii) group-based support.
2. It does not have the concept of predicates.
3. It mines data at multiple levels of abstraction and generates the association rules.

Multi-dimensional Association Rule Mining
1. It employs the technique of static discretization.
2. It considers different predicates as dimensions.
3. It mines data across multiple dimensions and generates the association rules.

Write a short note on the Apriori algorithm.    (April/May-18(R13), Q4(e))

Answer:

The Apriori algorithm is the most preferred algorithm for mining single-dimensional boolean association rules. The algorithm exploits prior knowledge about the characteristics of frequently occurring itemsets. Apriori is an iterative process that finds the itemsets present at level (n+1) by scanning the n-level itemsets. The process is initiated by scanning the frequent itemsets at level 1 and combining the item counts present at the same level. Only the items that do not violate the minimum support threshold are accumulated, and the resultant items are used to generate the frequent items at the next level. The process is repeated until a level that does not contain any itemset is reached.

The Apriori algorithm makes use of the Apriori property to enhance the efficiency of frequent itemset generation level by level; a further benefit of the property is that it decreases the size of the search space. The property states that "every subset of a frequent itemset must also occur frequently".

What is the need of the confidence measure in association rule mining?    (Dec.-19(R16), Q1(d))

Answer:

Association rules can be extracted from a given frequent itemset using a level-wise approach similar to the Apriori algorithm. Each level in this approach corresponds to the number of items belonging to the rule consequent. The first step is to extract the high-confidence rules that contain only one item in the rule consequent; these rules are then used for generating new candidate rules. For instance, if {ade} -> {c} and {ace} -> {d} are high-confidence rules, then by merging the consequents of both rules the candidate rule {ae} -> {cd} can be generated.

Conversely, if a low-confidence rule is extracted from the frequent itemset, the complete subgraph spanned by that rule can be pruned or discarded. For example, if the confidence of {acd} -> {e} is low, then every rule that contains the item 'e' in its consequent can be pruned or discarded.

[Figure: Pruning of association rules based on the confidence measure]
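As a small illustration of the confidence-based pruning just described, the following is a minimal sketch (function names and the toy transactions are my own, not the book's) that generates rules level-wise from one frequent itemset and keeps only high-confidence consequents.

```python
# Illustrative sketch only: level-wise rule generation from a single frequent
# itemset, pruning low-confidence consequents as described above.
from itertools import combinations

transactions = [set('acde'), set('acd'), set('ace'), set('ade'), set('ce')]

def support_count(itemset):
    # number of transactions that contain the itemset
    return sum(1 for t in transactions if itemset <= t)

def gen_rules(freq_itemset, min_conf):
    rules, consequents = [], [frozenset([i]) for i in freq_itemset]
    while consequents and len(consequents[0]) < len(freq_itemset):
        kept = []
        for cons in consequents:
            antecedent = freq_itemset - cons
            conf = support_count(freq_itemset) / support_count(antecedent)
            if conf >= min_conf:
                rules.append((set(antecedent), set(cons), round(conf, 2)))
                kept.append(cons)          # low-confidence consequents are pruned here
        # merge surviving consequents pairwise to form the next level of consequents
        consequents = list({a | b for a, b in combinations(kept, 2)
                            if len(a | b) == len(a) + 1})
    return rules

print(gen_rules(frozenset('acde'), min_conf=0.5))
```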
Q8. Discuss about concept hierarchy generation.

Answer:

Concept Hierarchy Generation

Concept hierarchies define a hierarchical mapping of low-level attribute values to higher-level attribute values, so that the user can easily understand the patterns. Though the generalization of data results in information loss, it increases the consistency of the data for data mining tasks. Concept hierarchies are usually applied before data mining, because mining generalized data requires fewer input/output operations and generates more efficient results than mining an ungeneralized data set. Concept hierarchies can be applied to various attributes such as age, location and job category.

Apart from generalization, concept hierarchies also allow specialization: if the generalization of the data is at a very high level, a specialization can be performed through a drill-down operation that defines values at lower levels in order to identify hidden data relationships.

Concept Hierarchy Generation for Numerical Data

This is the discretization of numerical attributes, in which data is reduced by substituting low-level concepts with high-level concepts. The different techniques for concept hierarchy generation for numerical data are,
1. Binning
2. Histogram analysis
3. Entropy-based discretization
4. Chi-square merging
5. Cluster analysis
6. Intuitive partitioning (discretization).

Concept Hierarchy Generation for Categorical Data

Categorical data is basically of discrete form, i.e., its attributes have a finite number of distinct values, which may not have any ordering among them. The different techniques used for generating a concept hierarchy for categorical data are,
(i) Specifying a partial ordering of attributes explicitly at the schema level
(ii) Specifying a portion of the concept hierarchy by explicit data grouping
(iii) Specifying a set of attributes without specifying their partial ordering
(iv) Specifying only a partial ordering on a group of attributes.

Q9. Write the pseudo-code of the Apriori algorithm.

Answer:

The pseudo-code of the Apriori algorithm is as follows,

    Apriori_Algorithm(DB, min_supp):
        M = {set of singletons}
        while |M| > 0:
            scan DB to compute the count of each itemset in M
            E = sets in M with count >= min_supp * |DB|
            display E
            M = Apriori_Gen_Candidate(E)

The above function invokes Apriori_Gen_Candidate, which is responsible for generating the candidate itemsets. A prerequisite of the function is that every item within an itemset must be arranged in some fixed (e.g., lexicographic) order. The pseudo-code of the Apriori_Gen_Candidate function is as follows,

    Apriori_Gen_Candidate(E):
        for every pair of itemsets A, B in E:
            if A and B share all items except the last:
                C = A U B            // Join
                if any immediate subset of C is not part of E:
                    Prune C

Q10. Write the FP-growth algorithm.

Answer:

FP-growth employs a divide-and-conquer strategy so that a huge database can be divided into smaller databases from which the frequent itemsets are mined. Initially, the database is compressed into a Frequent Pattern tree (FP-tree). The tree is then divided into a set of projected databases called 'conditional databases', each of which corresponds to one frequent itemset. The pseudo-code of this algorithm is as follows,

    FP_Growth_Alg(T, z):
        if T contains a single path g:
            for each combination z' of nodes in g:
                generate pattern z' U z with support = minimum support among nodes in z'
        else:
            for every x_i in the count table of T:
                generate pattern z' = x_i U z with support = x_i.support
                build z''s conditional pattern base
                build z''s conditional FP-tree T'
                if T' is not empty:
                    FP_Growth_Alg(T', z')

Q11. What are the drawbacks of the FP-growth algorithm?

Answer:

(i) The algorithm is ineffective for sparse datasets: in a large, random, sparse data set every itemset sequence has the probability of appearing in one or more records, so transactions share few common items.
(ii) The size of an FP-tree increases with the increase in the size of the dataset, due to which sufficient main memory may become unavailable.
(iii) If an item sequence is present in only a single record, it is still represented as a separate path in the FP-tree, so little compression is obtained.
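To make the FP-tree structure used by FP-growth concrete before the fuller description that follows, here is a minimal sketch (class and function names are my own, not the book's) of an FP-tree node and of inserting frequency-ordered transactions into it.

```python
# Minimal, illustrative FP-tree sketch: each node keeps an item, a count and
# child links; inserting a transaction walks or extends a path and increments
# counts, which is how common prefixes get compressed.
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def insert_transaction(root, ordered_items):
    node = root
    for item in ordered_items:                 # items already sorted by frequency ("L" order)
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)  # new branch for an unseen prefix
            node.children[item] = child
        child.count += 1                       # overlapping prefixes just bump the count
        node = child

root = FPNode()                                # null root
for t in [['B', 'A', 'D'], ['B', 'A', 'D', 'E'], ['B', 'A', 'E']]:
    insert_transaction(root, t)
print(root.children['B'].count)                # -> 3, the shared prefix 'B'
```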
FP-tree

The compressed form of the input data is called an FP-tree. The structure of the FP-tree is constructed by sequentially reading each transaction from the dataset and mapping it to a path of the FP-tree. The paths overlap whenever transactions have common items, so the level of compression depends on how much the paths overlap. Moreover, the frequent itemsets can be obtained directly from memory if the FP-tree is small enough to fit into main memory; this helps in avoiding multiple passes over the data.

Q13. Define closed frequent itemset.    (Model Paper-I, Q1(d) | Dec.-19(R16), Q1(e))

OR

Give a note on closed frequent item set.    (Nov/Dec.-17(R13), Q1(b))

Answer:

An itemset which is closed and whose support count is greater than or equal to the minimum support is said to be a closed frequent itemset. Closed frequent itemsets are basically used for determining the support counts of the non-closed frequent itemsets. The algorithm for deriving these support counts from the closed frequent itemsets is as follows,

    1:  Let J be the set of closed frequent itemsets.
    2:  Let N_max be the maximum size of the closed frequent itemsets.
    3:  FI_{N_max} = {f | f in J, |f| = N_max}
    4:  for N = N_max - 1 down to 1 do
    5:      FI_N = {f | |f| = N, f subset of some f' in FI_{N+1}} union {f | f in J, |f| = N}
    6:      for each x in FI_N do
    7:          if x not in J then
    8:              x.support = max{x'.support : x' in FI_{N+1}, x proper subset of x'}
    9:          end if
    10:     end for
    11: end for

Q14. Write short notes on graph mining.

Answer:

Graph mining is an important data mining technique used for modelling complicated structures such as chemical compounds, protein structures, biological networks, XML documents, the web and workflows. Different graph search algorithms have evolved in different fields such as video indexing, text retrieval, chemical informatics and computer vision. Frequent substructures are the most commonly used pattern among all the different graph patterns; they can be used for characterizing graphs, classifying and discriminating collections of graphs, and performing similarity search in graph databases. Many of today's graph mining techniques have been developed around the use of frequent substructures and have been put into practice.

PART-B: ESSAY QUESTIONS WITH SOLUTIONS

2.1 MINING FREQUENT PATTERNS - ASSOCIATIONS AND CORRELATIONS

Q15. Define the following,
(i) Itemset
(ii) Support count
(iii) Association rule
(iv) Support and confidence
(v) Correlation analysis.

Answer:

(i) Itemset

A collection of zero or more items in association analysis is termed an itemset. If there are n items in an itemset it is called an n-itemset, and if it contains no items it is called a null itemset.

(ii) Support Count

The number of transactions that contain a particular itemset P is termed its support count. Mathematically it is expressed as,

    sigma(P) = |{t_i | P subset of t_i, t_i in T}|

Where,
| | denotes the number of elements in a set, and
t_i denotes the set of items in each transaction of the transaction set T.

(iii) Association Rule

An implication expression of the type A -> B is called an association rule, where A and B are disjoint itemsets, i.e., A intersect B = null set.

(iv) Support and Confidence

Support: The support of an itemset identifies how frequently a rule applies to the provided dataset. The support S of an association rule A -> B is given as,

    S(A -> B) = P(A union B)

Confidence: The confidence is a measure that defines how frequently the items of B occur in transactions containing A. The confidence level C of the association rule A -> B is given as,

    C(A -> B) = P(B | A) = support_count(A union B) / support_count(A)

(v) Correlation Analysis

A correlation rule augments an association rule with a measure of the statistical correlation between the itemsets A and B, and is expressed as A => B [support, confidence, correlation].
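The following is a small worked sketch of the measures just defined (the toy transactions and variable names are my own); lift is shown as one common correlation measure.

```python
# Illustrative computation of support, confidence and lift for a single rule.
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'milk', 'butter', 'bread'},
    {'milk', 'tea'},
    {'bread', 'milk', 'tea'},
]
N = len(transactions)

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

A, B = {'bread'}, {'milk'}
support    = support_count(A | B) / N                    # S(A -> B) = P(A u B)
confidence = support_count(A | B) / support_count(A)     # C(A -> B) = P(B | A)
lift       = confidence / (support_count(B) / N)         # > 1 indicates positive correlation

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```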
Q16. Explain association rule mining and the criteria followed for classifying association rules.    (Model Paper-II, Q4(a))

Answer:

Association Rule Mining

Association rules are specified over both categorical and numerical attributes. These rules carry two measures, namely utility and certainty, which represent the statistical significance and strength of the extracted rules. Two threshold conditions set by experts must be satisfied in order to generate relevant association rules.

Association rules are expressed as A1 => A2, where A1 and A2 represent conjunctions of conditions on attributes. These conditions can either be membership conditions over categorical values or interval ranges over numerical values. The support of the rule A1 => A2 is the fraction of transactions that contain both A1 and A2, and the rule A1 => A2 means that a transaction containing A1 should also contain A2.

An association rule can be mined by considering the following steps,

1. Search for the Itemsets that Occur Most Frequently
   Frequent itemsets are defined as the sets of items whose support exceeds a predefined minimum support threshold.

2. Produce Strong Association Rules from the Resultant Itemsets
   Association rules that satisfy the minimum support as well as the minimum confidence threshold conditions are generated; such rules are referred to as strong association rules. The major issue while mining association rules is that it is very difficult to extract rules whose utility and certainty measures are higher than the predefined minimum support and confidence threshold levels.

Classifying Association Rules

The following are the different criteria based on which association rules can be classified,
(i) Type of values
(ii) Dimensions of data
(iii) Levels of abstraction
(iv) Extensions to association mining.

(i) Type of Values

Based on the type of data values handled within the rule,
(a) If a rule specifies a relationship (association) between the presence and absence of items, it is referred to as a Boolean Association Rule.
(b) If a rule specifies a relationship between quantitative items or attributes, it is referred to as a Quantitative Association Rule.

(ii) Dimensions of Data

Based on the number of dimensions involved in the rule,
(a) Single-dimensional Association Rule: a rule whose items/attributes refer to only a single dimension.
(b) Multidimensional Association Rule: a rule whose items/attributes refer to at least two dimensions.

(iii) Levels of Abstraction

Based on the levels of abstraction involved in the rule set,
(a) Multilevel Association Rule: a rule that refers to items/attributes present at multiple levels of abstraction.
(b) Single-level Association Rule: a rule that refers to items/attributes that are not present at different levels of abstraction.

(iv) Extensions to Association Mining

In addition to being extended to correlation analysis, association mining can also be extended to mine the following itemsets,
(a) Max Patterns (Maximal Frequent Patterns): a frequent pattern whose proper super-patterns are all infrequent.
(b) Frequent Closed Itemsets: an itemset is a frequent closed itemset if it is frequent and closed, i.e., if it has no proper superset with the same support.

Q17. Explain market basket analysis and its relevance to association rule.

Answer:

Market Basket Analysis

Market basket analysis is the process of determining groups or sets of items which customers are likely to purchase together. It is the process of finding relationships or associations between the various items that are found in a single transaction. The discovery of associations and correlations takes place while mining frequent itemsets among items in large transactional or relational data sets. Since massive volumes of data are collected and stored continuously, many companies and industries are interested in mining such patterns from their databases; market basket analysis is a typical example of mining frequent itemsets.

This technique helps in finding out the buying habits of customers by finding associations among the different items that are placed in the "shopping basket" of customers. The discovery of association rules depends completely on the discovery of frequent sets.

Relevance to Association Rules

Association rules describe the probability of a customer buying one product when he or she purchases some other product. They are mainly used to show the relationships among various data items and are also called rules for detecting common usage of items. The discovery of association rules in market basket analysis helps retail stores in marketing, advertising, designing the store layout, inventory control and sales promotions. For instance, if customers are buying bread, how likely are they to also buy tea (and what kind of tea) on the same trip to the store? With this information the retailer can plan selective marketing and shelf space, which increases sales.

Example for Association Rule Mining

Consider an example of association rule mining based on market basket analysis. If the manager of an electronics branch wants to learn about the buying habits of the customers, he or she performs market basket analysis to answer the query "which items are frequently purchased together by the customers?". Market basket analysis is carried out on the retail data associated with the customer transactions, and the results are used for planning marketing or advertising strategies. One such strategy is placing the items that are frequently purchased together in close proximity so as to increase their sales; for instance, if a customer purchases a computer, he or she is also likely to purchase office software, so placing the hardware and software items near each other helps the customer to easily locate the desired items and thereby increases sales. In addition, market basket analysis helps the retailer to plan which items should be put on sale at reduced prices: if customers tend to purchase a computer and a UPS together, then reducing the price of the UPS automatically increases the sales of both the UPS and the computers.

Let us represent the set of items present in the store as a universe. Every item is then a boolean variable (specifying whether the item is present in a basket or not) and every basket is a boolean vector of values assigned to these variables. This boolean vector can be analyzed against the buying patterns to find the items that are frequently purchased (associated) together. Such patterns are represented using association rules.
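As an illustrative sketch of the boolean-vector representation just described (toy baskets and names are my own), each basket becomes a row of 0/1 flags over the item universe; the concrete rule example continues below.

```python
# Encode baskets as boolean vectors over the item "universe".
baskets = [
    {'computer', 'office_software'},
    {'computer', 'ups'},
    {'computer', 'office_software', 'ups'},
    {'printer'},
]
items = sorted(set().union(*baskets))            # the universe of items

matrix = [[int(item in b) for item in items] for b in baskets]
for row in matrix:
    print(dict(zip(items, row)))
```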
For instance, consider the following association rule, which states that if a customer purchases a computer then he or she will also buy Microsoft Office software at the same time,

    Computer => Microsoft Office    [support = 3%, confidence = 50%]

Here, a support of 3% means that in 3% of all the transactions under analysis the computer and the Microsoft Office software are purchased together, and a confidence of 50% means that 50% of the customers who purchased a computer also purchased the Microsoft Office software.

Q18. Explain the different types of association rules.

Answer:

Types of Association Rules

The different types of association rules that can be mined from transactional and relational databases are,
1. Boolean association rules
2. Quantitative association rules
3. Single-dimensional association rules
4. Multi-dimensional association rules
5. Multilevel association rules.

1. Boolean Association Rule
   A rule that describes an association between the presence and absence of items.

2. Quantitative Association Rule
   A rule that describes associations between quantitative items or attributes. The discretization process employed here is dynamic and is used to fulfil certain mining requirements such as the required certainty level. Since this approach treats numerical attributes as quantities, such rules are also called dynamic multidimensional association rules.

3. Single-dimensional Association Rule
   A rule that involves only a single dimension (predicate), which may be referenced multiple times.

4. Multi-dimensional Association Rule
   A rule that comprises at least two dimensions (predicates).

5. Multilevel Association Rule
   When data mining is performed at multiple levels of abstraction, the association rules generated are called multilevel association rules.

Q19. Explain the Apriori algorithm for mining association rules.

Answer:

Apriori Algorithm

The Apriori algorithm is the most preferred algorithm for mining single-dimensional boolean association rules. The algorithm exploits prior knowledge about the characteristics of frequently occurring itemsets. Apriori is an iterative process that finds the itemsets present at level (n+1) by scanning the n-level itemsets. The process is initiated by scanning the frequent itemsets at level 1 and combining the item counts present at the same level. Only the items that do not violate the minimum support threshold are accumulated, and the resultant items are used to generate the frequent items present at the next level. The process is repeated until a level that does not contain any itemset is reached.

The Apriori algorithm makes use of the Apriori property so as to enhance the efficiency of frequent itemset generation level by level. Another benefit of the Apriori property is that it decreases the size of the search space. The property states that "every subset of a frequent itemset must also occur frequently".

Pseudo-code of the Apriori Algorithm

    Apriori_Algorithm(DB, min_supp):
        M = {set of singletons}
        while |M| > 0:
            scan DB to compute the count of each itemset in M
            E = sets in M with count >= min_supp * |DB|
            display E
            M = Apriori_Gen_Candidate(E)

The above function invokes Apriori_Gen_Candidate, which is responsible for generating the candidate itemsets. A prerequisite of this function is that every item within an itemset is arranged in some fixed order. Its pseudo-code is as follows,

    Apriori_Gen_Candidate(E):
        for every pair of itemsets A, B in E:
            if A and B share all items except the last:
                C = A U B            // Join
                if any immediate subset of C is not part of E:
                    Prune C

This function in turn performs the following two steps,
(i) Join
(ii) Prune.

(i) Join

In this step, a join operation is performed on the frequent (n-1)-itemsets for generating the candidate n-itemsets J_n. This operation assists in searching for the itemsets present at level M_n. Consider m_1 and m_2 as itemsets of level M_{n-1}; m_i[j] indicates the j-th item of the i-th itemset. The items in an itemset m_i are arranged in lexicographic order, i.e., m_i[1] < m_i[2] < ... < m_i[n-1]. The itemsets m_1 and m_2 are joined only if their first (n-2) items are identical, i.e., m_1[1] = m_2[1], m_1[2] = m_2[2], ..., m_1[n-2] = m_2[n-2] and m_1[n-1] < m_2[n-1].

(ii) Prune

In this step, any candidate n-itemset that has an (n-1)-item subset not present among the frequent (n-1)-itemsets is removed, since by the Apriori property such a candidate cannot be frequent.
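Before the worked example on TV-brand transactions that continues below, here is a compact, runnable sketch of the join-and-prune candidate generation and the level-wise loop described above (function names and the toy database are my own, not the book's).

```python
# Minimal Apriori sketch: candidate generation (join + prune) and level-wise mining.
from itertools import combinations

def gen_candidates(freq_prev):
    """Join frequent (n-1)-itemsets that share their first n-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in freq_prev)
    prev_set = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:                       # join condition
            c = tuple(sorted(set(a) | set(b)))
            # prune: every immediate subset must itself be frequent
            if all(tuple(sorted(set(c) - {x})) in prev_set for x in c):
                candidates.add(frozenset(c))
    return candidates

def apriori(transactions, min_count):
    singletons = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in singletons
             if sum(1 for t in transactions if s <= t) >= min_count}
    all_frequent = set(level)
    while level:
        cands = gen_candidates(level)
        level = {c for c in cands
                 if sum(1 for t in transactions if c <= t) >= min_count}
        all_frequent |= level
    return all_frequent

db = [{'SONY', 'BPL', 'ONIDA'}, {'SONY', 'BPL', 'LG'}, {'SONY', 'BPL'}, {'BPL', 'SAMSUNG'}]
print(sorted(tuple(sorted(s)) for s in apriori(db, min_count=2)))
```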
Example

Consider a transactional database over the TV brands SONY, BPL, ONIDA, SAMSUNG and LG. Scanning the transactions and retaining only the candidate pairs whose support count satisfies the minimum support threshold gives the frequent 2-itemsets,

    M2 = { {SONY, BPL}, {SONY, ONIDA}, {SONY, LG}, {BPL, ONIDA}, {BPL, SAMSUNG}, {BPL, LG} }

To compute the candidate 3-itemsets J3, the join M2 with M2 is performed; two 2-itemsets are joined only if they satisfy the join condition, i.e., their first items are identical. This gives,

    J3 = { {SONY, BPL, ONIDA}, {SONY, BPL, LG}, {SONY, ONIDA, LG}, {BPL, ONIDA, SAMSUNG}, {BPL, ONIDA, LG}, {BPL, SAMSUNG, LG} }

Next, the Apriori property is applied to remove the candidate itemsets that cannot occur frequently. Consider each candidate in turn,

1. {SONY, BPL, ONIDA}
   Its 2-item subsets are {SONY, BPL}, {BPL, ONIDA} and {SONY, ONIDA}. Since all of these belong to the frequent 2-itemset M2, the candidate 3-itemset {SONY, BPL, ONIDA} is retained in J3.

2. {SONY, BPL, LG}
   Its 2-item subsets are {SONY, BPL}, {SONY, LG} and {BPL, LG}. As these subsets also belong to the frequent itemset M2, {SONY, BPL, LG} is retained in J3.

3. {SONY, ONIDA, LG}
   Its 2-item subsets are {SONY, ONIDA}, {SONY, LG} and {ONIDA, LG}. Since the 2-item subset {ONIDA, LG} does not exist in the frequent 2-itemset M2, the itemset {SONY, ONIDA, LG} is removed from J3.

4. {BPL, ONIDA, SAMSUNG}
   Its 2-item subsets are {BPL, ONIDA}, {BPL, SAMSUNG} and {ONIDA, SAMSUNG}. Since {ONIDA, SAMSUNG} does not exist in M2, {BPL, ONIDA, SAMSUNG} is removed from J3.

5. {BPL, ONIDA, LG}
   Its 2-item subsets are {BPL, ONIDA}, {BPL, LG} and {ONIDA, LG}. Since {ONIDA, LG} does not exist in M2, {BPL, ONIDA, LG} is removed from J3.

6. {BPL, SAMSUNG, LG}
   Its 2-item subsets are {BPL, SAMSUNG}, {BPL, LG} and {SAMSUNG, LG}. Since {SAMSUNG, LG} does not exist in M2, {BPL, SAMSUNG, LG} is removed from J3.

Therefore, J3 contains {SONY, BPL, ONIDA} and {SONY, BPL, LG}. Now all the transactions are scanned to find the support counts of the 3-itemset candidates in J3; the Apriori property has already removed the candidates whose subsets are not frequent, thereby eliminating the unnecessary work of counting them. The frequent 3-itemsets M3 are obtained by comparing the support counts of the candidates in J3 with the minimum support count.

To compute the candidate set of 4-itemsets J4, the join M3 with M3 is performed, which gives {SONY, BPL, ONIDA, LG}. But this itemset is removed because its 3-item subset {BPL, ONIDA, LG} is not in the frequent itemset M3. Therefore J4 is empty, and the Apriori algorithm terminates.

Drawbacks of the Apriori Algorithm

1. The algorithm generates many candidate itemsets, which results in a heavy load on memory and CPU utilization.
2. The algorithm requires many database scans, which results in an I/O bottleneck.

Q20. Discuss generating association rules from frequent itemsets.

Answer:

The frequent itemsets identified from a transactional database can be used for deriving strong association rules, i.e., rules that satisfy both the minimum support and the minimum confidence threshold conditions. The generation of association rules from frequent itemsets uses the following equation for confidence,

    conf(X => Y) = P(Y | X) = supp_count(X union Y) / supp_count(X)        ... (1)

Here, the numerator is the number of transactions that contain both itemsets X and Y, and the denominator is the number of transactions that contain only the itemset X. Equation (1) is the conditional probability of Y given X, expressed in terms of support counts. Based on this equation, the association rules can be derived with the following step-wise procedure,

Step 1: For every frequent itemset m, derive all the possible non-empty subsets of m.

Step 2: For every non-empty subset s, generate the rule "s => (m - s)" only if the following condition is satisfied,

    supp_count(m) / supp_count(s) >= min_conf

It is not necessary to verify whether the generated association rules satisfy the minimum support threshold, because these rules are derived from frequent itemsets, which already satisfy the desired support condition.

Q21. What is a more efficient method for generating association rules from frequent itemsets? Explain.    (May/June-19(R15), Q7)

OR

Explain how association rules are generated from frequent item sets.    (April/May-18(R13), Q6)

Answer:

An association rule can be extracted from a given frequent itemset using a level-wise approach similar to the Apriori algorithm. Each level in this approach corresponds to the number of items belonging to the rule consequent. The first step in this approach is to extract the high-confidence rules that contain only one item in the rule consequent; these rules are later used for generating new candidate rules. For instance, if {ade} -> {c} and {ace} -> {d} are high-confidence rules, then by merging the consequents of both rules the candidate rule {ae} -> {cd} can be generated. The figure below shows the extraction of association rules from the frequent itemset {a, c, d, e}.

[Figure: Pruning of association rules based on the confidence measure]

If a low-confidence rule is extracted from the frequent itemset, then the complete subgraph spanned by that rule can be pruned or discarded. For example, from the figure it can be inferred that the confidence of {acd} -> {e} is low; hence all rules that contain 'e' in their consequents can be pruned or discarded.

The algorithm for generating association rules in this way is as follows,

    Step 1: for each frequent n-itemset f_k, n >= 2 do
    Step 2:     H_1 = {one-item consequents of rules derived from f_k}
    Step 3:     call ap_genrules(f_k, H_1)
    Step 4: end for

    Algorithm ap_genrules(f_k, H_m)
    Step 1:  k = |f_k|                           // size of the frequent itemset
    Step 2:  m = size of the rule consequents in H_m
    Step 3:  if k > m + 1 then
    Step 4:      H_{m+1} = apriori_gen(H_m)
    Step 5:      for each h_{m+1} in H_{m+1} do
    Step 6:          confidence = supp_count(f_k) / supp_count(f_k - h_{m+1})
    Step 7:          if confidence >= min_conf then
    Step 8:              output the rule (f_k - h_{m+1}) -> h_{m+1}
    Step 9:          else
    Step 10:             delete h_{m+1} from H_{m+1}
    Step 11:         end if
    Step 12:     end for
    Step 13:     call ap_genrules(f_k, H_{m+1})
    Step 14: end if

Q22. Explain how you can improve the performance of the Apriori algorithm.

Answer:

The techniques for improving the efficiency of the Apriori algorithm are as follows,

1. Hashing the Itemsets

While the transactions are being scanned to determine the frequent 1-itemsets from the candidate 1-itemsets, it is possible to generate the 2-itemsets of every transaction and hash (map) them into the buckets of a hash table, incrementing the count of the corresponding bucket on every mapping. A 2-itemset can then be deleted from the candidate set of the Apriori algorithm if the count of the bucket to which it belongs is less than the support threshold value, because in that case the 2-itemset cannot be frequent. In this way the number of candidate k-itemsets (for k > 1) can be reduced.
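One way to realize this bucket-count idea (in the spirit of the PCY refinement of Apriori; bucket count, table size and names are my own choices, not the book's) is sketched below; the remaining improvement techniques are described next.

```python
# Illustrative hash-bucket pruning of candidate 2-itemsets. Bucket counts only
# over-estimate true pair counts, so a truly frequent pair is never discarded.
from itertools import combinations

NUM_BUCKETS = 7
transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c', 'd'}, {'a', 'b', 'c'}]
min_count = 2

def bucket(pair):
    return hash(frozenset(pair)) % NUM_BUCKETS

# first pass: count single items and hash every 2-itemset into a bucket
bucket_counts = [0] * NUM_BUCKETS
item_counts = {}
for t in transactions:
    for i in t:
        item_counts[i] = item_counts.get(i, 0) + 1
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

# a candidate pair is kept only if both items are frequent AND its bucket is frequent
frequent_items = {i for i, c in item_counts.items() if c >= min_count}
candidates = [p for p in combinations(sorted(frequent_items), 2)
              if bucket_counts[bucket(p)] >= min_count]
print(candidates)
```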
2. Reducing the Number of Transactions

In each scan, a transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset. Such a transaction can therefore be marked or removed from the database, because the subsequent scans for j-itemsets, where j > k, do not need to consider it.

3. Partitioning the Data

In this method, the frequent itemsets are mined by performing only two scans of the database. The method comprises the following two phases,

Phase I: Every transaction of the database D is assigned to one of m non-intersecting partitions, and each partition is assigned a minimum support count computed using the following formula,

    minimum support count of a partition = minimum support x total number of transactions in that partition

A partition is scanned, and a frequent itemset found in that partition is known as a "local frequent itemset". All the local n-frequent itemsets, where n = 1, 2, ..., are found in just one scan of the partition. An identified local frequent itemset may or may not be frequent with respect to the other partitions of database D; an itemset is said to be potentially frequent if it appears as a frequent itemset in one or more partitions. All the local frequent itemsets are combined to form the global candidate itemsets with respect to database D. The size of the partitions and the total number of partitions are chosen in such a way that every partition can be stored in main memory.

Phase II: The database is scanned a second time so as to determine the global frequent itemsets from the candidate itemsets (the local frequent itemsets). This is done by calculating the actual support of every individual candidate itemset in D.

4. Sampling the Data

In this method, a sample S of the database D is selected randomly. The size of the sample is selected in such a way that the search for frequent itemsets in S can be done in main memory, due to which only one scan of the transactions in S is required. Though this method is efficient, it is less accurate, because only the sample is used for determining the frequent itemsets, and such scanning may miss some of the global frequent itemsets. To reduce this problem, a support threshold lower than the minimum support value is used for determining the frequent itemsets local to the sample, denoted L^S. The portion of the database D other than the sample S is then used for computing the actual frequencies of every itemset in L^S. If L^S contains all the frequent itemsets of D, the database needs to be scanned only once; otherwise, the database is scanned a second time to determine the frequent itemsets that were missed in the first pass.

5. Dynamically Counting the Itemsets

In this method, the entire database is divided into blocks, and each block is marked with a start point at which it is possible to add new candidate itemsets. The method is dynamic because it calculates the support of every itemset counted so far and determines whether all the subsets of a new candidate are frequent; if they are, the new candidate itemset is added and its counting begins immediately. When compared to the Apriori algorithm, dynamic itemset counting requires fewer database scans.

Q23. Make a comparison of the Apriori and ECLAT algorithms for frequent itemset mining in transactional databases. Apply these algorithms to the following data,

    TID | List of Items
    1   | Bread, Milk, Sugar, TeaPowder, Cheese, Tomato
    2   | Onion, Tomato, Chillies, Sugar, Milk
    3   | Milk, Cake, Biscuits, Cheese, Onion
    4   | Chillies, Potato, Milk, Cake, Sugar, Bread
    5   | Bread, Jam, Milk, Butter, Chillies
    6   | Butter, Cheese, Paneer, Curd, Milk, Biscuits
    7   | Onion, Paneer, Chillies, Garlic, Milk
    8   | Bread, Jam, Cake, Biscuits, Tomato

Answer:

Comparison of the Apriori and ECLAT Algorithms

Apriori Algorithm
1. The data format followed for mining frequent itemsets is horizontal.
2. It consumes more time for generating frequent itemsets.
3. It is suitable only for small datasets.
4. It is less efficient when the minimum support decreases.
5. It generates a large number of candidates.
6. Its major drawback is iterative processing with repeated database scans.
7. It does not support partitioning.

ECLAT Algorithm
1. The data format followed for mining frequent itemsets is vertical.
2. It consumes less time for generating frequent itemsets.
3. It is suitable for large datasets.
4. It is more efficient than the Apriori algorithm.
5. It generates a smaller number of candidates.
6. Its major drawback is the large space needed to store the TID sets in memory.
7. It can support partitioning.

For the above dataset, let the support threshold be 40%. This implies that the minimum support count is 0.4 x 8 = 3.2, taken as 3.

Applying the Apriori Algorithm

1. In the first step, all the transactions are scanned to obtain the number of occurrences of each item,

    Milk 7, Bread 4, Chillies 4, Sugar 3, Cheese 3, Tomato 3, Onion 3, Cake 3, Biscuits 3, Butter 2, Paneer 2, Jam 2, Potato 1, Curd 1, Garlic 1, TeaPowder 1.

2. The support counts of all items are compared with the minimum support count (i.e., 3). The items Jam, Butter, Paneer, TeaPowder, Potato, Curd and Garlic are discarded because their support counts are less than 3. The frequent 1-itemsets are therefore,

    Milk 7, Bread 4, Chillies 4, Sugar 3, Cheese 3, Tomato 3, Onion 3, Cake 3, Biscuits 3.

3. In the next step, the support count of every candidate 2-itemset is determined,

    {Milk, Bread} 3, {Milk, Chillies} 4, {Milk, Sugar} 3, {Milk, Cheese} 3, {Milk, Tomato} 2, {Milk, Onion} 3, {Milk, Cake} 2, {Milk, Biscuits} 2, {Bread, Chillies} 2, {Bread, Sugar} 2, {Bread, Cheese} 1, {Bread, Tomato} 2, {Bread, Onion} 0, {Bread, Cake} 2, {Bread, Biscuits} 1, {Chillies, Sugar} 2, {Chillies, Cheese} 0, {Chillies, Tomato} 1, {Chillies, Onion} 2, {Chillies, Cake} 1, {Chillies, Biscuits} 0, {Sugar, Cheese} 1, {Sugar, Tomato} 2, {Sugar, Onion} 1, {Sugar, Cake} 1, {Sugar, Biscuits} 0, {Cheese, Tomato} 1, {Cheese, Onion} 1, {Cheese, Cake} 1, {Cheese, Biscuits} 2, {Tomato, Onion} 1, {Tomato, Cake} 1, {Tomato, Biscuits} 1, {Onion, Cake} 1, {Onion, Biscuits} 1, {Cake, Biscuits} 2.

4. The support counts of the pairs are compared with the minimum support count, and every pair whose count is less than the minimum is discarded. The resultant frequent 2-itemsets are,

    {Milk, Bread} 3, {Milk, Chillies} 4, {Milk, Sugar} 3, {Milk, Cheese} 3, {Milk, Onion} 3.

5. In the next step, the candidate 3-itemsets are determined by applying the Apriori property,

    {Milk, Bread, Chillies} 2, {Milk, Bread, Sugar} 2, {Milk, Bread, Cheese} 1, {Milk, Bread, Onion} 0, {Milk, Chillies, Sugar} 2, {Milk, Chillies, Cheese} 0, {Milk, Chillies, Onion} 2, {Milk, Sugar, Cheese} 1, {Milk, Sugar, Onion} 1, {Milk, Cheese, Onion} 1.

Since the support count of no 3-itemset is greater than or equal to the minimum support count, the 2-itemsets generated in step 4 are the final frequent itemsets.

Applying the ECLAT Algorithm

1. In the first step, the horizontal data set is transformed into a vertical data set of TID sets,

    Milk {1,2,3,4,5,6,7} (7), Bread {1,4,5,8} (4), Chillies {2,4,5,7} (4), Sugar {1,2,4} (3), Cheese {1,3,6} (3), Tomato {1,2,8} (3), Onion {2,3,7} (3), Cake {3,4,8} (3), Biscuits {3,6,8} (3), Jam {5,8} (2), Butter {5,6} (2), Paneer {6,7} (2), TeaPowder {1} (1), Potato {4} (1), Curd {6} (1), Garlic {7} (1).

   The support count of an itemset is simply the number of TIDs in the set in which that itemset appears. As the minimum support count is 3, the above table reduces to,

    Milk {1,2,3,4,5,6,7}, Bread {1,4,5,8}, Chillies {2,4,5,7}, Sugar {1,2,4}, Cheese {1,3,6}, Tomato {1,2,8}, Onion {2,3,7}, Cake {3,4,8}, Biscuits {3,6,8}.

2. The candidate (k+1)-itemsets are generated by intersecting the TID sets of the k-itemsets,

    {Milk, Bread} {1,4,5}, {Milk, Chillies} {2,4,5,7}, {Milk, Sugar} {1,2,4}, {Milk, Cheese} {1,3,6}, {Milk, Onion} {2,3,7}, and the remaining pairs, whose TID sets contain fewer than three transactions.

   As the minimum support count is 3, the frequent 2-itemsets obtained are {Milk, Bread}, {Milk, Chillies}, {Milk, Sugar}, {Milk, Cheese} and {Milk, Onion}.

3. The process is iterated until the support counts of the generated itemsets fall below the minimum support count,

    {Milk, Bread, Chillies} {4,5}, {Milk, Bread, Sugar} {1,4}, {Milk, Bread, Cheese} {1}, {Milk, Bread, Onion} { }, {Milk, Chillies, Sugar} {2,4}, {Milk, Chillies, Cheese} { }, {Milk, Chillies, Onion} {2,7}, {Milk, Sugar, Cheese} {1}, {Milk, Sugar, Onion} {2}, {Milk, Cheese, Onion} {3}.

   As the support counts of all the itemsets in the above table are less than 3, the 2-itemsets generated in the previous step are the final frequent itemsets.

Q24. Apply the Apriori algorithm to find frequent itemsets from the following transactional database. Let min_sup = 30%.    (Dec.-19(R16), Q5)

    TID | Items bought
    1   | Pen, notebook, ruler
    2   | Pencil, eraser, sharpener
    3   | Pen, ruler, chart, sharpener
    4   | Pencil, eraser
    5   | Ruler, pin, story book, pen
    6   | Marker, chart, sketch pens

Answer:

For the given dataset, the minimum support is 30%. This implies that the minimum support count is (30/100) x 6 = 1.8, i.e., 2. The Apriori algorithm is performed as follows,

1. In the first step, all the transactions are scanned to determine the number of occurrences of each item,

    Pen 3, Ruler 3, Sharpener 2, Eraser 2, Pencil 2, Chart 2, Notebook 1, Pin 1, Story book 1, Marker 1, Sketch pens 1.

2. The support count of each item is compared with the minimum support count (i.e., 2). The items Notebook, Pin, Story book, Marker and Sketch pens are discarded because their support counts are less than 2. The frequent 1-itemsets are therefore Pen 3, Ruler 3, Sharpener 2, Eraser 2, Pencil 2 and Chart 2.

3. In the next step, the support count of every candidate 2-itemset is determined,

    {Pen, Ruler} 3, {Pen, Sharpener} 1, {Pen, Eraser} 0, {Pen, Pencil} 0, {Pen, Chart} 1, {Ruler, Sharpener} 1, {Ruler, Eraser} 0, {Ruler, Pencil} 0, {Ruler, Chart} 1, {Sharpener, Eraser} 1, {Sharpener, Chart} 1, {Sharpener, Pencil} 1, {Eraser, Pencil} 2, {Eraser, Chart} 0, {Pencil, Chart} 0.

4. The support counts of the pairs are compared with the minimum support count, and the pairs whose counts are less than the minimum are discarded. The resultant frequent 2-itemsets are,

    {Pen, Ruler} 3, {Eraser, Pencil} 2.

5. In the next step, the candidate 3-itemsets are determined by applying the Apriori property,

    {Pen, Ruler, Sharpener} 1, {Pen, Eraser, Pencil} 0, {Ruler, Sharpener, Eraser} 0, {Sharpener, Eraser, Pencil} 1.

6. Since the support count of no 3-itemset is greater than or equal to the minimum support count, the 2-itemsets generated in step 4 are the final frequent itemsets.

Q25. Explain market basket analysis and its relevance to association rules. Explain the Apriori algorithm using the following transactional data, assuming that the support threshold is 22%. Illustrate with an example.    (Nov/Dec.-18(R15), Q7)

    TID | List of Items
    001 | Milk, dal, sugar, bread
    002 | Dal, sugar, wheat, jam
    003 | Milk, bread, curd, paneer
    004 | Wheat, paneer, dal, sugar
    005 | Milk, paneer, bread
    006 | Wheat, dal, paneer, bread

Answer:

Market Basket Analysis and its Relevance

For this part of the answer, refer Unit-II, Q17.

Problem

With a support threshold of 22%, the minimum support count would be (22/100) x 6 = 1.32. For the dataset above, assume instead a support threshold of 50%, which implies a minimum support count of (50/100) x 6 = 3. The Apriori algorithm is then performed as follows,

1. In the first step, all the transactions are scanned to determine the number of occurrences of each item,

    Dal 4, Bread 4, Paneer 4, Milk 3, Sugar 3, Wheat 3, Jam 1, Curd 1.

2. The support count of each item is compared with the minimum support count (i.e., 3). The items Jam and Curd are discarded because their support counts are less than 3. The frequent 1-itemsets are therefore Dal 4, Bread 4, Paneer 4, Milk 3, Sugar 3 and Wheat 3.

3. In this step, the support count of every candidate 2-itemset is determined,

    {Dal, bread} 2, {Dal, paneer} 2, {Dal, milk} 1, {Dal, sugar} 3, {Dal, wheat} 3, {Bread, paneer} 3, {Bread, milk} 3, {Bread, sugar} 1, {Bread, wheat} 1, {Paneer, milk} 2, {Paneer, sugar} 1, {Paneer, wheat} 2, {Milk, sugar} 1, {Milk, wheat} 0, {Sugar, wheat} 2.

4. In this step, the support counts of the pairs are compared with the minimum support count, and the pairs whose counts are less than the minimum are discarded. The resultant frequent 2-itemsets are,

    {Dal, wheat}, {Bread, paneer}, {Bread, milk}, {Dal, sugar}.

5. In this step, the candidate 3-itemsets are determined by applying the Apriori property,

    {Dal, wheat, paneer} 2, {Bread, paneer, milk} 2, {Dal, sugar, wheat} 2.

6. Since the support count of no 3-itemset is equal to or greater than the minimum support count, the 2-itemsets generated in step 4 are the final frequent itemsets.

On the generated frequent itemsets, multilevel association mining using a concept hierarchy can be applied. With this, different support counts can be applied at different levels of the hierarchy; these include uniform support, reduced support and group-based support.

Q26. Explain in detail the candidate generation procedure.

Answer:

Requirements

An effective candidate generation procedure demands the following requirements,

(i) It should not generate unnecessary and inappropriate candidates. A candidate is unnecessary if at least one of its subsets is infrequent, because such a candidate is guaranteed to be infrequent by the anti-monotone property of support.

(ii) It must ensure that the set of candidate itemsets is complete, that is, no frequent itemset is ignored by the candidate generation procedure. To ensure this, the set of candidate itemsets must include the set of all frequent itemsets, i.e., for every k, F_k is a subset of C_k.
6 ‘Support Count LG, Voltas, LG, Videocon, Allwyn T : Global Frequent Htemsots Drawbacks of Partition Algorithm 1. The algorithm degrades the performanice level of the systems since, itrequires double amotint of CPU's effort ‘when compared to apriori algorithm, 2. Thealgorithm is susceptible to distortion while distribut- ig data across the partition: Q28. With an example, explain the frequent itemset generation in FP- Growth algorithm. OR Write FP-growth algorithm. Answer: FP-Growth Algorithm FP-growth, employs divide and conquer strategy so that huge database can be divided into smaller databases and then frequent itemsets can be mined. Initially, the database is ‘compressed into a Frequent Pattern Tree (FP-Iree). The tree is then divided into a set of projected databases called, ‘conditional databases’. Each conditional database consists of exactly one frequent itemset. Divide and conquer strategy is applied in the following step-wise manner, WARNING: ‘ApritMay-18(R13), 26(a) DATA MINING UNTU-HYDER, a FP-Tree |: Const Step ‘The tree construction starts withthe creat rode, which is assigned a NULL value, The gn fy, transactions are processed 10 adda new branches gn’ gt Every transaction results in the creation ofa brangh, 9 My Gf i in each transaction is accorgy, fpracessing the items in eacl rah Theat ete graphic) order, rather than the order of itemsagttt in the database. Step 2: Creating a Count Table ‘The countofeach node is maintained ina sepa, ‘The table consists of columns: itém-10, coun, noe jg node link is used to point to the occurrence of the nog. om tree. ‘Step 3: Mining the FP-Tree After the construction of FP-tre, itis mined gy frequent pattern itemsets. This mining process is per by applying a recursive algorithm. The pseudocode gr algorithm is as follows, FP_Growth_Alg (T, 2): if T contains a path g for each combination z! of nodes in g Generate pattern 2! U z with min_supp among nodes in! else for every x, in count table of T generate pattern 2! =x, U 2 with x, support build z's conditional pattern base build z's condition FP-tree T! tT #0 FP_Growth_Alg(T', 2!) Example Findig Frequent Itemsets Using FP-growth Algorithn Given transactional database is as follows, ‘Table: Transaction Database Depending on the ing of fequent frequency of items and ‘L’ ordet# luent itemsets that are sorted in order ined. The ‘> : ae ona ‘The ‘L’ order of given transaction das L=(B:5), 424), (D4), (6:3), (C12), (C9) Pe minima snp coum is 6% i 5") Only those items are to i be considered whose frequency ¥2™ Seiler than orequalto3 ie, L= ((B:5), (4 :4)D:4vE™ XerowPhotocopying of this book Isa CRIMINAL ack. an H one found guily is LABLE to face LEGAL proceedings sole elem ee ee af2. Association Rule Mining IT Constructing an FP Tree step! ‘Thetree construction starts with the ereation of root node jeassigned with a “NULL” value, : Root = NULL{ } which The items in each transaction are processed to add pew banches into the tee, Every transaction, results in the mation ofa branch. The order of processing the items in each tonsoction is according tothe °L" order rather than the order of jens appearing in the transaction database, step? The first transaction, 7, = {4, D, B) is processed seoording to the L orderie.,L= {B, A, D). 
Hence, the order of 7,=B,4,D.Now anew branch is ereated forthe first transaction ‘wtere, “Bis connected to the root node as a left child, “A” is faked to °B” &8 left child and “Dis linked to “A” as its left chil Aflerevery transaction, the support count ofthe respéetive sransaction item is incremented by 1 (ie, B=1,4= 1, D= 1), NULL Figure (1): FP-tree for T,... ; Step 3 z The second transaction, T,= (D, A, £,B} is considered in the L order ie., {B, A, D, E}. Since, the path B— A — D already exists, there is no need to create a new branch for these items. But, the support count value of the respective items, must be incremented by one (i.e., B= 2, A = 2, D=2). Now, the item “E” is linked as the left child of ‘D" and *E”is initialized the support count value equal to I(i. Nutt Figure (2: FP-tree for T, peat SPECTRUM ALLIN-ONE JOURNAL FOR ENGINEERING STUDENTS =. “< Step 4 it? ls 1 ‘The third transaction, 7, = (4, B, E}, is considered in the L order ie, (B, A, £). Since, the path BA already exist, there is no need to create a new braiich for these items. But, the Support count value of the respective items must be incremented by one i.c., now the support count values of B, A is (3, 3) respectively. The item ‘E’ is linked as the right child of “A” and Bis ized the support count value equals to 1. NULL Figure Bk FP forT, Step 5 “The transaction ‘7,’ with L order {B, D, E} results in creation of new branctt from B-to-D where, “D" is linked as a right child of “B°. The item ‘E” is now linked as the left child of ‘D’. The support count values of the items B, D, E are (4,1, 1) respectively NULL igure (4): FP-tree of T,60 Step 6 , emrent of the support & The fifth transaction, results because, the path Ba-D already exists Figure (5k: FP-tree of T, table consists of the columns, item-ID, count and a ng Creating a Count Table ‘The count of each node is maintained in a separate table. The NULL. link. The node link is used to point to the occurrence of node in the tree, ItemID Count B s A “4 , D 4 E 3 Figure (6): FP-tree for Entire Transaction Database juent pattern sets, ] Mining the FP-tree Erequent Patterns ° After the construction of FP-tree, it is mined to find freq {B,A,D= 1}, {B,A7 1}, (B,D: 1) {B:3) a2 {B,As 1}, (BA: 1}, (Bi 1), (BAT ] © {B24}, (A:3} (B,D:4}, (A,D:3) {Bi 4) (BLA: 4} $B: 1), {B: 1), (Bi 1), {B: 1) Table: Mining of the FP-tr the node with the least number of ocurrences# The table shows the frequent patterns generated for each node. Initially, the FP-tree or the item that is the last item in the Z order is selected, '. It occurs thrice in the FP-tree and hence it has three of av sui, therefore after removing the sufix from ke eaka kel }. To form a conditional FP-tree for “E* the jtem 4D Te sess than the assumed minimum support count, There® only one frequent pattem is generated ie, {BE ! ‘nt Patterns. The last node ‘B” is not consile™ Path from the root node, The paths are {B, 4s BF Eis conside Here, the item is. 11), (B,D, £21). 
Here, 21), (BA, conditional pattern base would be (B, A.D: 1}, 1B, removed because, their support-counts are (2, 2) respectively, whi the conditional FP-tree consists of only a single node i., (B:3) and hence, Similarly, the remaining items are mined, Sos 0 generate respective frequey since it is directly cqnnected to the root node whose value is NULL, WARNING: xorouPhotoonig os bok's @ CRIMINAL et one oun Uys LUABLE ong ee § proceedings —<$ ea {B,D are pene creer e Ser SCE erEae gaa eau OOIC TIE Ce CEST EIEEE COTE ETO ESOT OEETESTOI yr-2_ Association Rule Mining ar unit EP Growth Algorith Sanacts oF a as ()—Best-cave Seenario se lgorim is netTective for sparse datasets because i petting e808 In best-case scenario, the FP-tree contain only single afte ® ranch node as cacl saction in it often share same rhe size ofan FP-ce increases withthe inerease inthe Mauch omic at eich aie eee () {geotdata set due to which memory space may become ee -ase Scenario unavailable, i) We tn large random sparse data set, every itemset sequence In worst (jas the probability of appearing in one or more records, se scenario, the size of FP-tree is same as the ction contains only unique original data as each trans: fan item c ent in a single record, then it set of items. isrey Deside this, the size of FP-tce is also determined based Gp. Explain in dotall tho construction of FP this; he sie OU oe ¢ of itemset, If the ordering of itemset _ the resultant FP-tree on the ordering sche answer # Mode! Papert, a4(p) | is from lowest to highest support ite pee appear to be denser. In addition to this, an FP-tree also provide an access to individual items in a tree by maintaining a list of ‘The compressed form of input data is ealled as FP-tre ‘pesrcture of FP-tce is basically constructed by sequentially reding each transaction from the dataset and then mapping | Q30. Construct the FP-Tree from tho given Transac- cach of it to 7 rath of F tee ~ te will get orcliprel tional Database. Explain the procedure in detall een ransations have common items. However, the level of ith mi = wiresion depends on the path of overlapping, Moreover, vith calnliguen SUP PCE Se the requentitemset can be obtained directly from the memory fut size of frequent itemset is small to be adjusted into the ‘main memory. This helps in avoiding multiple passes over the PACD CLE da available in memory. A,B,C, FL, M,O B, FH, J, 0,W B,C,K,S,P PM, Construction of FP-tree ‘The root node isthe initial node in FP-tree which is denoted tysull symbol. It can be extended in the following ways. 1. The support count of all the items is defined by | Answer + Nov/Dec.-18(R15) Q6 scanning the data set. The frequent items are arranged in decreasing order whereas infrequent items are deleted. ee eee aaa 2. FP-tree is constructed by making a second pass over the _ Items data. Two nodes are created, one to the leftand one to the 100_[RA,CD,GLMP right of null and a path is formed from null to the newly 200 [A,B,C FL, M,0 ‘eated nodes, To encode the transaction, each node in this path will be given a frequency count of 1. This is 300_| B, FH, J, 0, W. the second transaction. 400_[B, C.K, SP 500_]ALF.C.E. LP. M,N 3. Nowanew set of nodes are created for node at right side ana path is created by connecting them. This is second *SMinimum support = transaction. 4. Inthe third transaction, a common frequent item is ane oes based on the descending order of shared from the frst trangaction. 
Q30. Construct the FP-Tree from the given Transactional Database. Explain the procedure in detail with minimum support = 3.

TID | Items
100 | F, A, C, D, G, I, M, P
200 | A, B, C, F, L, M, O
300 | B, F, H, J, O, W
400 | B, C, K, S, P
500 | A, F, C, E, L, P, M, N

*Minimum support = 3

Answer :                                                    Nov./Dec.-18(R15), Q6

The items are arranged based on the descending order of frequency and the L order is as follows,

L = {(F:4), (C:4), (A:3), (B:3), (M:3), (P:3), (L:2), (O:2), ...}, where each of the remaining items has a support count of 1.

As the minimum support is 3, only the items whose frequency value is greater than or equal to 3 are considered. Therefore,

L = {(F:4), (C:4), (A:3), (B:3), (M:3), (P:3)}

The items of the transactions are now processed according to the order of list L rather than the order of items in the dataset. Therefore, the frequent itemsets will be,

TID | Frequent Itemset (ordered)
100 | F, C, A, M, P
200 | F, C, A, B, M
300 | F, B
400 | C, B, P
500 | F, C, A, M, P

Construction Steps of FP-tree

Step 1
The tree construction starts with the creation of the root node, which is assigned a NULL value.
Root = NULL { }

Step 2
A new branch (F → C → A → M → P) is created for the first transaction, where 'F' is connected to the root node, 'C' is connected to 'F' and so on. After every transaction, the support count of each item involved is incremented by 1.

Figure: FP-tree for Transaction '100'

Step 3
A new branch (B, M) is created for the second transaction from node 'A', resulting in the path F → C → A → B → M. The values of the nodes involved in this transaction are incremented by 1.

Figure: FP-tree for Transaction '200'

Step 4
Another branch is created for the third transaction from 'F', resulting in the path F → B. The values of the nodes involved in this transaction are incremented by 1.

Figure: FP-tree for Transaction '300'

Step 5
A new branch is created from the root node, resulting in the path C → B → P, and the values of the nodes involved in this transaction are incremented by 1.

Figure: FP-tree for Transaction '400'

Step 6
For the transaction '500', the structure of the tree is not changed as the path F → C → A → M → P already exists. Therefore, the values of the nodes along this path are incremented by 1.

Figure: FP-tree for Transaction '500'

Creating a Count Table

The support count of each node is maintained in a separate table. The table consists of the columns Item-ID, Count and Node-link. The node-link points to the occurrences of the node in the tree (the pointers themselves are shown in the figure).

Item-ID | Count
F | 4
C | 4
A | 3
B | 3
M | 3
P | 3

Q31. Enumerate on the FP-Tree representations with neat diagrams.

Answer :

FP-tree Representation

For answer refer Unit-II, Q29. Consider the following data set,

TID | Items bought
1 | {p, q}
2 | {q, r, s}
3 | {p, q, s, t}
4 | {p, s}
5 | {p, q, r, s}
6 | {p, q, r, s}
7 | {p}
8 | {p, q, s}
9 | {p, q, s}
10 | {q, r, t}

The FP-tree for the above data set is constructed in the same manner.

Figure: FP-tree for TID-2
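As a simple illustrative in-memory representation (one possible layout, not the textbook's diagram), an FP-tree can be held as nested dictionaries of counts. The sketch below rebuilds the tree of Q30 from the ordered frequent itemsets derived above; the insert helper and the nested-dictionary layout are assumptions made for illustration.

```python
import json

# Ordered frequent itemsets of transactions T100-T500 (from the table above).
ordered_transactions = [
    ["F", "C", "A", "M", "P"],
    ["F", "C", "A", "B", "M"],
    ["F", "B"],
    ["C", "B", "P"],
    ["F", "C", "A", "M", "P"],
]

def insert(tree, items):
    """Insert one ordered transaction, sharing any existing prefix path."""
    for item in items:
        node = tree.setdefault(item, {"count": 0, "children": {}})
        node["count"] += 1
        tree = node["children"]

fp_tree = {}                      # children of the NULL root
for t in ordered_transactions:
    insert(fp_tree, t)

print(json.dumps(fp_tree, indent=2))
# The output shows the shared path F -> C -> A (counts 4, 3, 3), the branches
# M -> P (2, 2) and B -> M (1, 1) below A, the branch F -> B (1), and the
# separate branch C -> B -> P (1, 1, 1) directly under the root.
```

Summing the per-node counts of each item in this structure reproduces the count table derived above (F: 4, C: 4, A: 3, B: 3, M: 3, P: 3).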