Data Mining: Concepts and Techniques (3rd ed.), Chapter 5

Sep 13, 2014

Data Cube Computation: Preliminary Concepts; Data Cube Computation Methods; Processing Advanced Queries by Exploring Data Cube Technology; Multidimensional Data Analysis in Cube Space; Summary. By Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at Urbana-Champaign & Simon Fraser University. ©2013 Han, Kamber & Pei. All rights reserved.

Chapter 5: Data Cube
Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data
Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary

Data Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier

Data Cube: A Lattice of Cuboids
 Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land)
2. (9/15, milk, Urbana, *)
3. (*, milk, Urbana, *)
4. (*, milk, Urbana, *)
5. (*, milk, Chicago, *)
6. (*, milk, *, *)

Cube Materialization: Full Cube vs. Iceberg Cube
 Full cube vs. iceberg cube
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_support
(the having clause is the iceberg condition)
 Computing only the cuboid cells whose measure satisfies the
iceberg condition
 Only a small portion of cells may be “above the water” in a sparse cube
 Avoid explosive growth: A cube with 100 dimensions
 2 base cells: (a1, a2, …, a100), (b1, b2, …, b100)
 How many aggregate cells if “having count >= 1”?
 What about “having count >= 2”?
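The two questions above can be checked by brute force on a small analogue. The sketch below is only an illustration: the helper names are hypothetical, each base cell is assumed to have count 1, and 4 dimensions stand in for 100. For two base cells with no value in common, it suggests 2^(n+1) - 3 aggregate cells for count >= 1 (so 2^101 - 3 when n = 100) and a single cell, the apex, for count >= 2.

```python
from collections import Counter
from itertools import product

def cube_cells(base_cells):
    """Map every cell (base or aggregate) to its count; each base cell counts as 1."""
    counts = Counter()
    for cell in base_cells:
        # Each dimension either keeps its value or is generalized to '*'.
        for mask in product((False, True), repeat=len(cell)):
            counts[tuple('*' if m else v for v, m in zip(cell, mask))] += 1
    return counts

def count_aggregate_cells(base_cells, min_count):
    """Number of aggregate cells (at least one '*') whose count is >= min_count."""
    cells = cube_cells(base_cells)
    return sum(1 for cell, c in cells.items() if '*' in cell and c >= min_count)

n = 4  # small stand-in for the 100 dimensions on the slide
base = [tuple(f"a{i}" for i in range(1, n + 1)),
        tuple(f"b{i}" for i in range(1, n + 1))]
print(count_aggregate_cells(base, 1), count_aggregate_cells(base, 2))
```

The two base cells share only the apex cell (*, …, *), which is why it is the only cell reaching count 2.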

Iceberg Cube, Closed Cube & Cube Shell
 Is iceberg cube good enough?
 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
 How many cells will the iceberg cube have if having count(*) >=
10? Hint: A huge but tricky number!
 Closed cube:
 Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c.
 Closed cube: a cube consisting of only closed cells
 What is the closed cube of the above base cuboid? Hint: only 3 cells
 Cube Shell
 Precompute only the cuboids involving a small # of dimensions, e.g., 3
 More dimension combinations will need to be computed on the fly
 For (A1, A2, …, A10), how many combinations to compute?
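The closed-cell definition can be tried out on a 4-dimensional analogue of the example with two base cells. This is a minimal sketch with hypothetical helper names, using count as the measure: the base cells (a1, a2, a3, a4):10 and (a1, a2, b3, b4):10 share their first two values, as in the 100-dimensional case.

```python
from collections import defaultdict
from itertools import product

def cube(base):
    """Aggregate the measure of every base cell into all of its generalizations."""
    totals = defaultdict(int)
    for cell, measure in base.items():
        for mask in product((False, True), repeat=len(cell)):
            totals[tuple('*' if m else v for v, m in zip(cell, mask))] += measure
    return dict(totals)

def is_descendant(d, c):
    """True if d is a proper specialization of c (c has '*' where they differ)."""
    return d != c and all(cv == '*' or cv == dv for cv, dv in zip(c, d))

def closed_cells(cells):
    """Keep the cells that have no descendant with the same measure value."""
    return {c: m for c, m in cells.items()
            if not any(is_descendant(d, c) and cells[d] == m for d in cells)}

base = {("a1", "a2", "a3", "a4"): 10, ("a1", "a2", "b3", "b4"): 10}
cells = cube(base)
closed = closed_cells(cells)
print(len(cells), sorted(closed.items()))
```

With count >= 10 every one of the 28 non-empty cells qualifies for the iceberg cube, yet only three are closed: the two base cells and (a1, a2, *, *):20.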

Roadmap for Efficient Computation
 General cube computation heuristics (Agarwal et al.’96)
 Computing full/iceberg cubes: 3 methodologies
 Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande &
Naughton, SIGMOD’97)
 Top-down:
 BUC (Beyer & Ramakrishnan, SIGMOD’99)
 H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)
 Integrating Top-Down and Bottom-Up:
 Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)
 High-dimensional OLAP: A Minimal Cubing Approach (Li, et al.
VLDB’04)
 Computing alternative kinds of cubes:
 Partial cube, closed cube, approximate cube, etc.

General Heuristics (Agarwal et al. VLDB’96)
 Sorting, hashing, and grouping operations are applied to the dimension
attributes in order to reorder and cluster related tuples
 Aggregates may be computed from previously computed aggregates,
rather than from the base fact table
 Smallest-child: computing a cuboid from the smallest, previously
computed cuboid
 Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os
 Amortize-scans: computing as many cuboids as possible at the
same time to amortize disk reads
 Share-sorts: sharing sorting costs across multiple cuboids when
a sort-based method is used
 Share-partitions: sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used

Chapter 5: Data Cube
Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 High-Dimensional OLAP
 Processing Advanced Queries by Exploring Data
Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary

Multi-Way Array Aggregation
 Array-based “bottom-up” algorithm
 Using multi-dimensional chunks
 No direct tuple comparisons
 Simultaneous aggregation on multiple
dimensions
 Intermediate aggregate values are re-used
for computing ancestor cuboids
 Cannot do Apriori pruning: No iceberg
optimization

Multi-way Array Aggregation for Cube
Computation (MOLAP)
 Partition arrays into chunks (a small subcube which fits in memory).
 Compressed sparse array addressing: (chunk_id, offset)
 Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces memory
access and storage cost.
What is the best traversing order to do multi-way aggregation?
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
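A minimal sketch of the simultaneous aggregation idea, assuming a small dense array held in memory as nested lists (a real MOLAP engine would read compressed chunks from disk). The function name and the toy data are illustrative only. Chunks are visited with A varying fastest, then B, then C, and each cell visited contributes to the AB, AC, and BC planes in the same pass.

```python
def multiway_aggregate(cube, size, chunk):
    """Aggregate a dense size^3 array into its AB, AC and BC planes in one chunked scan.

    Assumes size is a multiple of chunk.
    """
    ab = [[0] * size for _ in range(size)]
    ac = [[0] * size for _ in range(size)]
    bc = [[0] * size for _ in range(size)]
    steps = range(0, size, chunk)
    # Chunk order 1, 2, 3, ...: A varies fastest, then B, then C.
    for c0 in steps:
        for b0 in steps:
            for a0 in steps:
                for a in range(a0, a0 + chunk):
                    for b in range(b0, b0 + chunk):
                        for c in range(c0, c0 + chunk):
                            v = cube[a][b][c]
                            ab[a][b] += v  # summed over C
                            ac[a][c] += v  # summed over B
                            bc[b][c] += v  # summed over A
    return ab, ac, bc

size, chunk = 4, 2
cube = [[[a * 16 + b * 4 + c for c in range(size)] for b in range(size)]
        for a in range(size)]
ab, ac, bc = multiway_aggregate(cube, size, chunk)
print(ab[1][2], ac[1][3], bc[0][0])
```

No cell is read more than once, which is the point of the technique; the choice of scan order then decides how much of each plane must stay in memory at a time.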

Multi-way Array Aggregation for Cube
Computation (3-D to 2-D)
[Figure: cuboid lattice with levels all; A, B, C; AB, AC, BC; ABC]
 The best order is the one that minimizes the memory requirement and reduces I/Os

Multi-way Array Aggregation for Cube
Computation (2-D to 1-D)

Multi-Way Array Aggregation for Cube
Computation (Method Summary)
 Method: the planes should be sorted and computed
according to their size in ascending order
 Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
 Limitation of the method: it computes well only for a small
number of dimensions
 If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods
can be explored

Bottom-Up Computation (BUC)
 BUC (Beyer & Ramakrishnan,
SIGMOD’99)
 Bottom-up cube computation
(Note: top-down in our view!)
 Divides dimensions into partitions
and facilitates iceberg pruning
 If a partition does not satisfy
min_sup, its descendants can
be pruned
 If minsup = 1 ⇒ compute full CUBE!
 No simultaneous aggregation
[Figure: the 4-D cuboid lattice (all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD) and BUC's processing order: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D]
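The recursion and pruning described above can be sketched as follows. This is a simplified in-memory illustration for the count measure, not the published implementation: the function name, the dictionary output, and the toy rows are assumptions, and the apex cell is output without a support check.

```python
def buc(rows, dims, min_sup, cell=None, start=0, out=None):
    """Sketch of BUC for count: rows is a list of tuples with `dims` columns."""
    if cell is None:
        cell = ['*'] * dims
    if out is None:
        out = {}
    out[tuple(cell)] = len(rows)              # aggregate of the current partition
    for d in range(start, dims):
        partitions = {}
        for row in rows:                      # partition on dimension d
            partitions.setdefault(row[d], []).append(row)
        for value, part in partitions.items():
            if len(part) >= min_sup:          # iceberg pruning: skip small partitions
                cell[d] = value
                buc(part, dims, min_sup, cell, d + 1, out)
        cell[d] = '*'
    return out

rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
iceberg = buc(rows, 2, min_sup=2)
full = buc(rows, 2, min_sup=1)
print(iceberg)
```

With min_sup = 2 the partition for a2 is never expanded, so none of its descendants are computed; with min_sup = 1 nothing is pruned and the full cube is produced.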

BUC: Partitioning
 Usually, entire data set can’t
fit in main memory
 Sort distinct values
 partition into blocks that fit
 Continue processing
 Optimizations
 Partitioning
 External Sorting, Hashing, Counting Sort
 Ordering dimensions to encourage pruning
 Cardinality, Skew, Correlation
 Collapsing duplicates
 Can’t do holistic aggregates anymore!

High-Dimensional OLAP? — The Curse
of Dimensionality
 None of the previous cubing methods can handle high
dimensionality!
 A database of 600k tuples. Each dimension has a
cardinality of 100 and a Zipf factor of 2.

Motivation of High-D OLAP
 X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP:
A Minimal Cubing Approach, VLDB'04
 Challenge to current cubing methods:
 The “curse of dimensionality’’ problem
 Iceberg cube and compressed cubes: only delay the
inevitable explosion
 Full materialization: still significant overhead in
accessing results on disk
 High-D OLAP is needed in applications
 Science and engineering analysis
 Bio-data analysis: thousands of genes
 Statistical surveys: hundreds of variables

Fast High-D OLAP with Minimal
Cubing
 Observation: OLAP occurs only on a small subset of
dimensions at a time
 Semi-Online Computational Model
1. Partition the set of dimensions into shell
fragments
2. Compute data cubes for each shell fragment while
retaining inverted indices or value-list indices
3. Given the pre-computed fragment cubes,
dynamically compute cube cells of the high-dimensional
data cube online

Properties of Proposed Method
 Partitions the data vertically
 Reduces high-dimensional cube into a set of lower
dimensional cubes
 Online re-construction of original high-dimensional
space
 Lossless reduction
 Offers tradeoffs between the amount of pre-processing
and the speed of online computation

Example: Computing a 5-D Cube with
Two Shell Fragments
 Let the cube aggregation function be
count
tid  A   B   C   D   E
1    a1  b1  c1  d1  e1
2    a1  b2  c1  d2  e1
3    a1  b2  c1  d1  e2
4    a2  b1  c1  d1  e2
5    a2  b1  c1  d1  e3
 Divide the 5-D table into 2 shell
fragments: (A, B, C) and (D, E)
 Build a traditional inverted index or RID list
Attribute Value  TID List         List Size
a1               {1, 2, 3}        3
a2               {4, 5}           2
b1               {1, 4, 5}        3
b2               {2, 3}           2
c1               {1, 2, 3, 4, 5}  5
d1               {1, 3, 4, 5}     4
d2               {2}              1
e1               {1, 2}           2
e2               {3, 4}           2
e3               {5}              1

Shell Fragment Cubes: Ideas
 Generalize the 1-D inverted indices to multi-dimensional
ones in the data cube sense
 Compute all cuboids for data cubes ABC and DE while
retaining the inverted indices
 For example, shell
fragment cube ABC
contains 7 cuboids:
 A, B, C
 AB, AC, BC
 ABC
 This completes the offline
computation stage
Cell    Intersection           TID List  List Size
a1 b1   {1, 2, 3} ∩ {1, 4, 5}  {1}       1
a1 b2   {1, 2, 3} ∩ {2, 3}     {2, 3}    2
a2 b1   {4, 5} ∩ {1, 4, 5}     {4, 5}    2
a2 b2   {4, 5} ∩ {2, 3}        ∅         0

Shell Fragment Cubes: Size and Design
 Given a database of T tuples, D dimensions, and F shell
fragment size, the fragment cubes’ space requirement is:
$O\left(T \left\lceil \frac{D}{F} \right\rceil (2^F - 1)\right)$
 For F < 5, the growth is sub-linear
 Shell fragments do not have to be disjoint
 Fragment groupings can be arbitrary to allow for
maximum online performance
 Known common combinations (e.g., <city, state>)
should be grouped together.
 Shell fragment sizes can be adjusted for optimal balance
between offline and online computation

ID_Measure Table
 If measures other than count are present, store in
ID_measure table separate from the shell fragments
tid count sum
1 5 70
2 3 10
3 8 20
4 5 40
5 2 30

The Frag-Shells Algorithm
1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk).
2. Scan the base table once and do the following:
3.     insert <tid, measure> into the ID_measure table.
4.     for each attribute value ai of each dimension Ai
5.         build inverted index entry <ai, tidlist>
6. For each fragment partition Pi
7.     build local fragment cube Si by intersecting tid-lists in bottom-up fashion.
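The steps above can be sketched on the 5-tuple example table from the earlier slide. This is a simplified illustration with hypothetical function names: it intersects the 1-D TID lists directly for every cuboid rather than reusing lower-dimensional results bottom-up, and it leaves out the ID_measure table because count can be read from the TID-list size.

```python
from itertools import combinations, product

TABLE = {1: ("a1", "b1", "c1", "d1", "e1"),
         2: ("a1", "b2", "c1", "d2", "e1"),
         3: ("a1", "b2", "c1", "d1", "e2"),
         4: ("a2", "b1", "c1", "d1", "e2"),
         5: ("a2", "b1", "c1", "d1", "e3")}
DIMS = "ABCDE"

def build_inverted_index(table, dims):
    """One scan of the base table: (dimension, value) -> set of tids."""
    index = {}
    for tid, row in table.items():
        for dim, value in zip(dims, row):
            index.setdefault((dim, value), set()).add(tid)
    return index

def fragment_cube(index, fragment):
    """All cuboids of one fragment, each cell keeping its TID list."""
    cube = {}
    for k in range(1, len(fragment) + 1):
        for cuboid in combinations(fragment, k):
            values = [sorted(v for (d, v) in index if d == dim) for dim in cuboid]
            for cell in product(*values):
                tids = set.intersection(*(index[(d, v)] for d, v in zip(cuboid, cell)))
                if tids:  # keep only non-empty cells
                    cube[tuple(zip(cuboid, cell))] = tids
    return cube

index = build_inverted_index(TABLE, DIMS)
abc = fragment_cube(index, "ABC")
de = fragment_cube(index, "DE")
print(abc[(("A", "a1"), ("B", "b2"))])
```

The cell (a1, b2) comes out with TID list {2, 3}, and the empty cell (a2, b2) is not stored, matching the intersection table for the AB cuboid.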

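Frag-Shells
[Figure: the dimensions A B C D E F … are grouped into fragments; an ABC cube and a DEF cube are precomputed; the DEF cube contains the D cuboid, the EF cuboid, the DE cuboid, …]
DE cuboid:
Cell    Tuple-ID List
d1 e1   {1, 3, 8, 9}
d1 e2   {2, 4, 6, 7}
d2 e1   {5, 10}
…       …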

Online Query Computation: Query
 A query has the general form $\langle a_1, a_2, \ldots, a_n : M \rangle$
 Each ai has 3 possible values
1. Instantiated value
2. Aggregate * function
3. Inquire ? function
 For example, $\langle 3\ ?\ ?\ *\ 1 : \text{count} \rangle$ returns a 2-D data
cube.

Online Query Computation: Method
 Given the fragment cubes, process a query as
follows
1. Divide the query into fragments, following the shell partition
2. Fetch the corresponding TID list for each
fragment from the fragment cube
3. Intersect the TID lists from each fragment to
construct instantiated base table
4. Compute the data cube using the base table with
any cubing algorithm
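The four steps can be sketched on the same 5-tuple example table with fragments (A, B, C) and (D, E). This is a minimal illustration with hypothetical names: the per-fragment lookup is simulated by filtering the table instead of reading a precomputed fragment cube, and the final cubing step is a naive count over all group-bys of the inquired dimensions.

```python
from collections import Counter
from itertools import combinations

TABLE = {1: ("a1", "b1", "c1", "d1", "e1"),
         2: ("a1", "b2", "c1", "d2", "e1"),
         3: ("a1", "b2", "c1", "d1", "e2"),
         4: ("a2", "b1", "c1", "d1", "e2"),
         5: ("a2", "b1", "c1", "d1", "e3")}
DIMS = ("A", "B", "C", "D", "E")
FRAGMENTS = (("A", "B", "C"), ("D", "E"))

def fragment_tids(fragment, instantiated):
    """Stand-in for a fragment-cube lookup: tids matching this fragment's instantiated values."""
    tids = set(TABLE)
    for dim in fragment:
        if dim in instantiated:
            col = DIMS.index(dim)
            tids &= {t for t, row in TABLE.items() if row[col] == instantiated[dim]}
    return tids

def answer(instantiated, inquired):
    # Steps 1-3: one TID list per fragment, intersected across fragments.
    tids = set(TABLE)
    for fragment in FRAGMENTS:
        tids &= fragment_tids(fragment, instantiated)
    # Step 4: cube the instantiated base table over the inquired dimensions (count).
    cols = [DIMS.index(d) for d in inquired]
    result = Counter()
    for t in tids:
        values = [TABLE[t][c] for c in cols]
        for k in range(len(cols) + 1):
            for keep in combinations(range(len(cols)), k):
                cell = tuple(v if i in keep else "*" for i, v in enumerate(values))
                result[cell] += 1
    return dict(result)

# Query <a1, ?, *, d1, ? : count>: A and D instantiated, B and E inquired.
result = answer({"A": "a1", "D": "d1"}, ["B", "E"])
print(result)
```

Here the TID lists {1, 2, 3} for a1 and {1, 3, 4, 5} for d1 intersect to {1, 3}, and the 2-D cube over B and E is computed from those two tuples only.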

Online Query Computation: Sketch
[Figure: from the dimensions A B C D E F G H I J K L M N …, the relevant fragments yield an instantiated base table, from which the online cube is computed]

Experiment: Size vs. Dimensionality
(50 and 100 cardinality)
 (50-C): 10^6 tuples, 0 skew, 50 cardinality, fragment size 3.
 (100-C): 10^6 tuples, 2 skew, 100 cardinality, fragment size 2.

Experiments on Real World Data
 UCI Forest CoverType data set
 54 dimensions, 581K tuples
 Shell fragments of size 2 took 33 seconds and 325MB
to compute
 3-D subquery with 1 instantiated D: 85ms~1.4 sec.
 Longitudinal Study of Vocational Rehab. Data
 24 dimensions, 8818 tuples
 Shell fragments of size 3 took 0.9 seconds and 60MB
to compute
 5-D query with 0 instantiated D: 227ms~2.6 sec.

Chapter 5: Data Cube
Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cube: X. Li, J. Han, Z. Yin, J.-G. Lee, Y.
Sun, “Sampling Cube: A Framework for Statistical
OLAP over Sampling Data”, SIGMOD’08
 Multidimensional Data Analysis in Cube Space
 Summary

Statistical Surveys and OLAP
 Statistical survey: A popular tool to collect information
about a population based on a sample
 Ex.: TV ratings, US Census, election polls
 A common tool in politics, health, market research,
science, and many more
 An efficient way of collecting information (Data collection
is expensive)
 Many statistical tools available, to determine validity
 Confidence intervals
 Hypothesis tests
 OLAP (multidimensional analysis) on survey data
 highly desirable but can it be done well?

Surveys: Sample vs. Whole Population
Data is only a sample of population
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

Problems for Drilling in Sampling Cube
 OLAP on Survey (i.e., Sampling) Data
 Semantics of query is unchanged, but input data is changed
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]
Data is only a sample of population, but samples could be
small when drilling to certain multidimensional space

Challenges for OLAP on Sampling
Data
Q: What is the average income of 19-year-old high-school
students?
A: Returns not only query result but also confidence interval
 Computing confidence intervals in OLAP context
 No data?
 Not exactly. No data in subspaces in cube
 Sparse data
 Causes include sampling bias and query selection bias
 Curse of dimensionality
 Survey data can be high dimensional
 Over 600 dimensions in real world example
 Impossible to fully materialize

Confidence Interval
 Confidence interval: $\bar{x} \pm t_c \hat{\sigma}_{\bar{x}}$
 x is a sample of the data set; $\bar{x}$ is the mean of the sample
 $t_c$ is the critical t-value, calculated by a look-up
 $\hat{\sigma}_{\bar{x}}$ is the estimated standard error of the mean
 Example: $50,000 ± $3,000 with 95% confidence
 Treat points in cube cell as samples
 Compute confidence interval as traditional sample set
 Return answer in the form of confidence interval
 Indicates quality of query answer
 User selects desired confidence interval

Efficient Computing Confidence Interval
Measures
 Efficient computation in all cells in data cube
 Both mean and confidence interval are algebraic
 Why is the confidence interval measure algebraic?
 The standard error $\hat{\sigma}_{\bar{x}} = s / \sqrt{l}$ is algebraic,
where both s and l (count) are algebraic
 Thus one can calculate cells efficiently at more general
cuboids without having to start at the base cuboid each
time
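A minimal sketch of why the measure rolls up, assuming each cell stores the triple (count, sum, sum of squares); that choice of summary, the function names, and the income figures are illustrative assumptions. The small table holds standard two-sided 95% critical t-values for 1 to 5 degrees of freedom.

```python
import math

# Two-sided 95% critical t-values by degrees of freedom (standard table values).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def summarize(samples):
    """Per-cell summary: (count l, sum, sum of squares)."""
    return (len(samples), sum(samples), sum(x * x for x in samples))

def merge(a, b):
    """Roll up two cells by adding their summaries component-wise."""
    return tuple(x + y for x, y in zip(a, b))

def confidence_interval(summary, t_table=T_95):
    """Return (mean, half-width) of the interval mean ± t_c * s / sqrt(l)."""
    l, total, total_sq = summary
    mean = total / l
    variance = (total_sq - l * mean * mean) / (l - 1)  # sample variance s^2
    std_err = math.sqrt(variance / l)                   # s / sqrt(l)
    return mean, t_table[l - 1] * std_err

cell_a = summarize([40, 50])   # hypothetical incomes, in thousands
cell_b = summarize([60, 50])
parent = merge(cell_a, cell_b)
mean, half_width = confidence_interval(parent)
print(mean, round(half_width, 2))
```

The parent cell's interval is obtained from the two child summaries alone, without revisiting the base tuples.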

Boosting Confidence by Query Expansion
 From the example: The queried cell “19-year-old college
students” contains only 2 samples
 Confidence interval is large (i.e., low confidence). Why?
 Small sample size
 High standard deviation with samples
 Small sample sizes can occur at relatively low
dimensional selections
 Collect more data?― expensive!
 Use data in other cells? Maybe, but have to be careful

Query Expansion: Intra-Cuboid Expansion
Combine other cells’ data into own to “boost”
confidence
 If they share semantic and cube similarity
 Use only if necessary
 A bigger sample size will narrow the
confidence interval
Cell segment similarity
 Some dimensions are clear: Age
 Some are fuzzy: Occupation
 May need domain knowledge
Cell value similarity
 How to determine if two cells’ samples
come from the same population?
 Two-sample t-test (confidence-based)
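A small sketch of a two-sample t-test statistic for comparing two cells' samples. The slide does not say which variant is used; Welch's unequal-variance form is assumed here for illustration, and the function name and sample values are hypothetical. The resulting t and degrees of freedom would then be compared against a critical value at the chosen confidence level.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na   # squared standard error of each mean
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3], [2, 3, 4])
print(round(t, 4), round(df, 2))
```

A small |t| relative to the critical value gives no evidence that the two cells differ, which is the condition under which merging their samples is considered.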

Intra-Cuboid Expansion
What is the average income of 19-year-old college students?
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]
Expand query to include 18 and 20 year olds? Vs. expand
query to include high-school and graduate students?

Query Expansion: Inter-Cuboid Expansion
 If a query dimension is
 Not correlated with cube value
 But is causing small sample size by
drilling down too much
 Remove dimension (i.e., generalize to
*) and move to a more general cuboid
 Can use two-sample t-test to
determine similarity between two cells
across cuboids
 Can also use a different method to be
shown later

Chapter 5: Data Cube
Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data
Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary

Data Mining in Cube Space
 Data cube greatly increases the analysis bandwidth
 Four ways for OLAP-style analysis and data mining to interact
 Using cube space to define data space for mining
 Using OLAP queries to generate features and targets for
mining, e.g., multi-feature cube
 Using data-mining models as building blocks in a multi-step
mining process, e.g., prediction cube
 Using data-cube computation techniques to speed up
repeated model construction
 Cube-space data mining may require building a
model for each candidate data space
 Sharing computation across model-construction for
different candidates may lead to efficient mining

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
 Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities
 Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
 Continuing the last example, among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max price tuples
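The first example query can be mimicked in plain Python to make the dependent aggregate concrete. This is a naive sketch, not a cubing algorithm: the function name, the row layout, and the sample purchases are all hypothetical. For every group-by of {item, region, month} it finds the maximum price, then sums sales over only the tuples that reach that price.

```python
from collections import defaultdict
from itertools import product

def multi_feature_cube(purchases, dims=("item", "region", "month"), year=2010):
    """For each group: (max price, total sales among the max-price tuples)."""
    groups = defaultdict(list)
    for row in purchases:
        if row["year"] != year:
            continue
        # Add the row to every group-by it belongs to ('*' means aggregated).
        for mask in product((False, True), repeat=len(dims)):
            key = tuple("*" if m else row[d] for d, m in zip(dims, mask))
            groups[key].append(row)
    result = {}
    for key, rows in groups.items():
        top = max(r["price"] for r in rows)
        result[key] = (top, sum(r["sales"] for r in rows if r["price"] == top))
    return result

purchases = [
    {"item": "tv", "region": "east", "month": 1, "year": 2010, "price": 900, "sales": 5},
    {"item": "tv", "region": "west", "month": 1, "year": 2010, "price": 900, "sales": 3},
    {"item": "tv", "region": "west", "month": 2, "year": 2010, "price": 700, "sales": 4},
    {"item": "tv", "region": "east", "month": 1, "year": 2009, "price": 950, "sales": 9},
]
cube = multi_feature_cube(purchases)
print(cube[("tv", "*", "*")], cube[("tv", "west", 2)])
```

The second aggregate depends on the first within each group, which is what distinguishes a multi-feature cube from a cube of independent aggregates.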

Discovery-Driven Exploration of Data
Cubes
 Hypothesis-driven
 exploration by user, huge search space
 Discovery-driven (Sarawagi, et al.’98)
 Effective navigation of large OLAP data cubes
 pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of aggregation
 Exception: significantly different from the value
anticipated, based on a statistical model
 Visual cues such as background color are used to
reflect the degree of exception of each cell

Kinds of Exceptions and their
Computation
 Parameters
 SelfExp: surprise of cell relative to other cells at same
level of aggregation
 InExp: surprise beneath the cell
 PathExp: surprise beneath cell for each drill-down
path
 Computation of exception indicators (model fitting and
computing SelfExp, InExp, and PathExp values) can be
overlapped with cube construction
 Exceptions themselves can be stored, indexed and
retrieved like precomputed aggregates

Examples: Discovery-Driven Data
Cubes

Data Cube Technology: Summary
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 MultiWay Array Aggregation
 BUC
 High-Dimensional OLAP with Shell-Fragments
 Processing Advanced Queries by Exploring Data Cube Technology
 Sampling Cubes
 Ranking Cubes
 Multidimensional Data Analysis in Cube Space
 Discovery-Driven Exploration of Data Cubes
 Multi-feature Cubes
 Prediction Cubes

Ref.(I) Data Cube Computation Methods
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses.
SIGMOD’97
 K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD’99
 M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries
efficiently. VLDB’98
 J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.
Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals.
Data Mining and Knowledge Discovery, 1:29–54, 1997.
 J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures.
SIGMOD’01
 L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data
Cube, VLDB'02
 X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04
 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous
multidimensional aggregates. SIGMOD’97
 K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-
Up Integration, VLDB'03
 D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-
Based Checking, ICDE'06

Ref. (II) Advanced Applications with Data Cubes
 D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan.
OLAP over uncertain and imprecise data. VLDB’05
 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical
OLAP over Sampling Data”, SIGMOD’08
 C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for
multidimensional text database analysis. ICDM’08
 D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data
warehouses. SSTD’01
 N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for
efficient implementation of spatial data cubes. IEEE Trans. Knowledge and Data
Engineering, 12:938–958, 2000.
 T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space.
VLDB’09
 T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially
materialized data cubes. SIGMOD’08
 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional
selections: The ranking cube approach. VLDB’06
 J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via
wavelets. CIKM’98
 D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional
text databases. SDM’09

Ref. (III) Knowledge Discovery with Data Cubes
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases.
ICDE’97
 B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05
 B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis:
Predicting global aggregates from local regions. VLDB’06
 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression
Analysis of Time-Series Data Streams, VLDB'02
 G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained
Gradients in Data Cubes. VLDB’ 01
 R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural
databases. PODS’05
 J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–
107, 1998
 T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association
rules. Data Mining & Knowledge Discovery, 6:219–258, 2002.
 R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and
Knowledge Discovery, 15:29–54, 2007.
 K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple
granularities. EDBT'98
 S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data
cubes. EDBT'98
 G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01

Unused Slides
for this Class

Chapter 5: Data Cube
Technology
 Efficient Methods for Data Cube Computation
 Preliminary Concepts and General Strategies for Cube
Computation
 Multiway Array Aggregation for Full Cube Computation
 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
 Precomputing Shell Fragments for Fast High-Dimensional OLAP
 Data Cubes for Advanced Applications
 Sampling Cubes: OLAP on Sampling Data
 Ranking Cubes: Efficient Computation of Ranking Queries
 Knowledge Discovery with Data Cubes
 Discovery-Driven Exploration of Data Cubes
 Complex Aggregation at Multiple Granularity: Multi-feature Cubes
 Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
 Summary

H-Cubing: Using H-Tree Structure
 Bottom-up computation
 Exploring an H-tree
structure
 If the current computation
of an H-tree cannot pass
min_sup, do not proceed
further (pruning)
 No simultaneous
aggregation
[Figure: the 4-D cuboid lattice: all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]

H-tree: A Prefix Hyper-tree
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hhd TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
… … … … … …
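[Figure: H-tree with a root; Cust_grp nodes edu, hhd, bus; month nodes Jan, Mar, Jan, Feb; city nodes Tor, Van, Tor, Mon; each leaf holds quant-info (e.g., Sum: 1765, Cnt: 2, bins). A header table lists each attribute value (Edu, Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with its quant-info (e.g., Edu Sum: 2285) and side-link.]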

Computing Cells Involving “City”
[Figure: the H-tree and header table of the previous slide, with a local header table HTor for the value Tor that lists quant-info and side-links for Edu, Hhd, Bus, …, Jan, Feb, …]
From (*, *, Tor) to (*, Jan, Tor)

Computing Cells Involving Month But No City
1. Roll up quant-info
2. Compute cells involving month but no city
[Figure: the H-tree with quant-info rolled up from the city level (Tor., Van., Tor., Mont.) to the month level (Jan., Mar., Jan., Feb.), alongside the header table]
Top-k OK mark: if Q.I. in a child passes the top-k avg threshold, so do its parents.
No binning is needed!

Computing Cells Involving Only Cust_grp
Check header table directly
[Figure: the H-tree and its header table; the Cust_grp entries (Edu Sum: 2285, Hhd, Bus) are read from the header table]

Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP

Star-Cubing: An Integrating Method
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration, VLDB'03
 Explore shared dimensions
 E.g., dimension A is the shared dimension of ACD and AD
 ABD/AB means cuboid ABD has shared dimensions AB
 Allows for shared computations
 e.g., cuboid AB is computed simultaneously as ABD
[Figure: cuboid lattice annotated with shared dimensions, e.g., ABCD/all, ABC/ABC, ABD/AB, ACD/A, BCD, AC/AC, BC/BC, C/C, D]
 Aggregate in a top-down
manner but with the bottom-up
sub-layer underneath which will
allow Apriori pruning
 Shared dimensions grow in
bottom-up fashion

Iceberg Pruning in Shared Dimensions
 Anti-monotonic property of shared dimensions
 If the measure is anti-monotonic, and if the
aggregate value on a shared dimension does not
satisfy the iceberg condition, then all the cells
extended from this shared dimension cannot
satisfy the condition either
 Intuition: if we can compute the shared dimensions
before the actual cuboid, we can use them to do
Apriori pruning
 Problem: how to prune while still aggregate
simultaneously on multiple dimensions?

Cell Trees
 Use a tree structure similar
to H-tree to represent
cuboids
 Collapses common prefixes
to save memory
 Keep count at node
 Traverse the tree to retrieve
a particular tuple

Star Attributes and Star Nodes
 Intuition: If a single-dimensional
aggregate on an attribute value p
does not satisfy the iceberg
condition, it is useless to distinguish
such values during the iceberg computation
 E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3
 Solution: Replace such attribute values by
a *. Such attributes are star
attributes, and the corresponding
nodes in the cell tree are star nodes
A B C D Count
a1 b1 c1 d1 1
a1 b1 c4 d3 1
a1 b2 c2 d2 1
a2 b3 c3 d4 1
a2 b4 c3 d4 1

Example: Star Reduction
 Suppose minsup = 2
 Perform one-dimensional
aggregation. Replace attribute
values whose count < 2 with *, and
collapse all *’s together
 The resulting table has all such
attributes replaced with the star-attribute
 With regard to the iceberg
computation, this new table is a
lossless compression of the original
table
A B C D Count
a1 b1 * * 1
a1 b1 * * 1
a1 * * * 1
a2 * c3 d4 1
a2 * c3 d4 1
A B C D Count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2
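The reduction above can be reproduced with a short sketch. The function name is hypothetical; the input is the 5-row table from the star attributes slide, and minsup = 2 as in the example.

```python
from collections import Counter

def star_reduce(rows, min_sup):
    """Replace values whose 1-D count is below min_sup with '*' and merge equal rows."""
    dims = len(rows[0])
    # One-dimensional aggregation: count each value per dimension.
    counts = [Counter(row[d] for row in rows) for d in range(dims)]
    reduced = Counter()
    for row in rows:
        starred = tuple(v if counts[d][v] >= min_sup else "*"
                        for d, v in enumerate(row))
        reduced[starred] += 1
    return dict(reduced)

rows = [("a1", "b1", "c1", "d1"),
        ("a1", "b1", "c4", "d3"),
        ("a1", "b2", "c2", "d2"),
        ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]
table = star_reduce(rows, min_sup=2)
print(table)
```

The output has the three rows of the compressed table: (a1, b1, *, *) with count 2, (a1, *, *, *) with count 1, and (a2, *, c3, d4) with count 2.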

Star Tree
 Given the new compressed
table, it is possible to
construct the corresponding
cell tree—called star tree
 Keep a star table at the side
for easy lookup of star
attributes
 The star tree is a lossless
compression of the original
cell tree
A B C D Count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2

Star-Cubing Algorithm: DFS on Lattice Tree
[Figure: the cuboid lattice annotated with shared dimensions (e.g., ABC/ABC, ABD/AB, ACD/A, BCD, AC/AC, BC/BC, AB/AB, C/C, D/D), next to the base star tree with counts: root:5; a1:3 with children b*:1 (c*:1, d*:1) and b1:2 (c*:2, d*:2); a2:2 with child b*:2 (c3:2, d4:2); and the derived BCD tree with root BCD:5]
Multi-Way Aggregation
[Figure: the trees BCD, ACD/A, ABD/AB, and ABC/ABC derived from the base tree ABCD]

Star-Cubing Algorithm: DFS on Star-Tree

Multi-Way Star-Tree Aggregation
 Start depth-first search at the root of the base star tree
 At each new node in the DFS, create the corresponding star
trees that are descendants of the current tree according to
the integrated traversal ordering
 E.g., in the base tree, when DFS reaches a1, the
ACD/A tree is created
 When DFS reaches b*, the ABD/AB tree is created
 The counts in the base tree are carried over to the new
trees

Multi-Way Aggregation (2)
 When DFS reaches a leaf node (e.g., d*), start
backtracking
 On every backtracking branch, the count in the
corresponding trees are output, the tree is destroyed,
and the node in the base tree is destroyed
 Example
 When traversing from d* back to c*, the
a1b*c*/a1b*c* tree is output and destroyed
 When traversing from c* back to b*, the
a1b*D/a1b* tree is output and destroyed
 When at b*, jump to b1 and repeat similar process

Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes

Prediction Cubes
 Prediction cube: A cube structure that stores prediction
models in multidimensional data space and supports
prediction in OLAP manner
 Prediction models are used as building blocks to define
the interestingness of subsets of data, i.e., to answer
which subsets of data indicate better prediction

How to Determine the Prediction
Power of an Attribute?
 Ex. A customer table D:
 Two dimensions Z: Time (Month, Year ) and Location
(State, Country)
 Two features X: Gender and Salary
 One class-label attribute Y: Valued Customer
 Q: “Are there times and locations in which the value of a
customer depended greatly on the customers gender (i.e.,
Gender: predictiveness attribute V)?”
 Idea:
 Compute the difference in accuracy between the model
built using X to predict Y and the model built using X – V
to predict Y
 If the difference is large, V must play an important role
in predicting Y
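
The idea above can be sketched for a single cube cell as follows. This is a minimal illustration, not the method from the slides: the classifier is a stand-in (a majority-vote lookup table scored on its own training data), and the names `lookup_accuracy`, `predictiveness`, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def lookup_accuracy(rows, features, label):
    """Training accuracy of a majority-vote lookup table built on `features`."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        votes[key][row[label]] += 1
    # Each distinct feature combination predicts its most common label.
    correct = sum(counter.most_common(1)[0][1] for counter in votes.values())
    return correct / len(rows)


def predictiveness(rows, features, v, label):
    """Accuracy using X minus accuracy using X - V."""
    reduced = [f for f in features if f != v]
    return lookup_accuracy(rows, features, label) - lookup_accuracy(rows, reduced, label)


# Hypothetical customers falling into one cell (one time period, one location).
cell = [
    {"gender": "F", "salary": "high", "valued": "yes"},
    {"gender": "F", "salary": "low", "valued": "yes"},
    {"gender": "M", "salary": "high", "valued": "no"},
    {"gender": "M", "salary": "low", "valued": "no"},
]

score = predictiveness(cell, ["gender", "salary"], "gender", "valued")
print(score)
```

In this made-up cell the label follows gender exactly, so dropping Gender from X lowers the accuracy and the score is large. Repeating the computation for every cell and granularity corresponds to the naive, fully materialized approach.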

78
Efficient Computation of Prediction Cubes
 Naïve method: Fully materialize the prediction
cube, i.e., exhaustively build models and evaluate
them for each cell and for each granularity
 Better approach: Explore score function
decomposition that reduces prediction cube
computation to data cube computation

79
Chapter 5: Data Cube
Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cube
 Ranking Cube
 Multidimensional Data Analysis in Cube Space
 Summary

80
Processing Advanced Queries by
Exploring Data Cube Technology
 Sampling Cube
 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling
Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
 Ranking Cube
 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k
queries with multi-dimensional selections: The ranking
cube approach. VLDB’06
 Other advanced cubes for processing data and queries
 Stream cube, spatial cube, multimedia cube, text
cube, RFID cube, etc. — to be studied in volume 2

81
Ranking Cubes – Efficient Computation of
Ranking Queries
 Data cubes help not only OLAP but also ranked search
 A (top-k) ranking query returns only the best k results
according to a user-specified preference, consisting of (1)
a selection condition and (2) a ranking function
 Ex.: Search for apartments with expected price 1000 and
expected square feet 800
 Select top 1 from Apartment
 where City = “LA” and Num_Bedroom = 2
 order by [price – 1000]^2 + [sq feet - 800]^2 asc
 Efficiency question: Can we only search what we need?
 Build a ranking cube on both selection dimensions and
ranking dimensions
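
As a baseline for the efficiency question, the query above can be answered without any cube by filtering on the selection condition and then ranking every surviving tuple. The sketch below assumes the eight Apartment tuples shown on the "Materialize Ranking-Cube" slide; the function name `top_k` and the tuple layout are illustrative only.

```python
import heapq

# (tid, city, bedrooms, price, sq_feet), copied from the Apartment example in this deck.
APARTMENTS = [
    ("t1", "SEA", 1, 500, 600),
    ("t2", "CLE", 2, 700, 800),
    ("t3", "SEA", 1, 800, 900),
    ("t4", "CLE", 3, 1000, 1000),
    ("t5", "LA", 1, 1100, 200),
    ("t6", "LA", 2, 1200, 500),
    ("t7", "LA", 2, 1200, 560),
    ("t8", "CLE", 3, 1350, 1120),
]


def top_k(rows, k, city, bedrooms, target_price=1000, target_sqft=800):
    """Naive top-k: apply the selection condition, then score every match."""

    def score(row):
        _, _, _, price, sqft = row
        return (price - target_price) ** 2 + (sqft - target_sqft) ** 2

    selected = [r for r in rows if r[1] == city and r[2] == bedrooms]
    return heapq.nsmallest(k, selected, key=score)


best = top_k(APARTMENTS, 1, "LA", 2)
print(best)
```

This version touches every tuple that satisfies the selection, which is the cost a ranking cube is meant to avoid.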

82
Ranking Cube: Partition Data on Both
Selection and Ranking Dimensions
Figure: one single data partition (the partition for all data) serves as the template; the partition is sliced by selection conditions, giving a sliced partition for city=“LA” and a sliced partition for BR=2.

83
Materialize Ranking-Cube
tid   City   BR   Price   Sq feet   Block ID
t1    SEA    1    500     600       5
t2    CLE    2    700     800       5
t3    SEA    1    800     900       2
t4    CLE    3    1000    1000      6
t5    LA     1    1100    200       15
t6    LA     2    1200    500       11
t7    LA     2    1200    560       11
t8    CLE    3    1350    1120      4
Step 1: Partition data on ranking dimensions (price and sq feet)
into a 4 x 4 grid of blocks:
1   2   3   4
5   6   7   8
9   10  11  12
13  14  15  16
Step 2: Group data by selection dimensions
Cuboids: City (SEA, LA, CLE); BR (1, 2, 3, 4); City & BR
Step 3: Compute measures for each group
For the cell (LA)
Block-level: {11, 15}
Data-level: {11: t6, t7; 15: t5}
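
Steps 2 and 3 can be sketched as a grouping pass over the table above. The block IDs are taken as given from the table rather than computed from bin boundaries, and the helper name `materialize` is hypothetical.

```python
from collections import defaultdict

# (tid, city, bedrooms, block_id) from the table above; block IDs are given, not computed.
ROWS = [
    ("t1", "SEA", 1, 5), ("t2", "CLE", 2, 5), ("t3", "SEA", 1, 2), ("t4", "CLE", 3, 6),
    ("t5", "LA", 1, 15), ("t6", "LA", 2, 11), ("t7", "LA", 2, 11), ("t8", "CLE", 3, 4),
]


def materialize(rows, key):
    """Map each cell of a cuboid to its data-level measure {block_id: [tids]}."""
    cuboid = defaultdict(lambda: defaultdict(list))
    for row in rows:
        tid, block = row[0], row[3]
        cuboid[key(row)][block].append(tid)
    return {cell: dict(blocks) for cell, blocks in cuboid.items()}


city_cuboid = materialize(ROWS, key=lambda r: r[1])
br_cuboid = materialize(ROWS, key=lambda r: r[2])
city_br_cuboid = materialize(ROWS, key=lambda r: (r[1], r[2]))

la = city_cuboid["LA"]
print(sorted(la))  # block-level measure: the blocks that contain LA tuples
print(la)          # data-level measure: the tuples inside each of those blocks
```

The block-level measure is just the key set of the data-level measure, so a query can first look at block IDs and fetch tuple lists only for the blocks it decides to open.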

84
Search with Ranking-Cube:
Simultaneously Push Selection and Ranking
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
Figure: grid of blocks over price and sq feet, with the query point at price = 1000, sq feet = 800 and blocks 11 and 15 marked; annotations “Without ranking-cube: start search from here” and “With ranking-cube: start search from here”
 Given the bin boundaries, locate the block with top score
 Measure for LA: {11, 15}
{11: t6, t7; 15: t5}
 Bin boundary for price: [500, 600, 800, 1100, 1350]
 Bin boundary for sq feet: [200, 400, 600, 800, 1120]

85
Processing Ranking Query: Execution Trace
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
f = [price – 1000]^2 + [sq feet – 800]^2
 Measure for LA: {11, 15}
{11: t6, t7; 15: t5}
 Bin boundary for price: [500, 600, 800, 1100, 1350]
 Bin boundary for sq feet: [200, 400, 600, 800, 1120]
 Execution trace:
1. Retrieve high-level measure for LA: {11, 15}
2. Estimate lower-bound scores for blocks 11 and 15:
f(block 11) = 40,000, f(block 15) = 160,000
3. Retrieve block 11
4. Retrieve low-level measure for block 11
5. f(t6) = 130,000, f(t7) = 97,600
Output t7, done!
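
The execution trace above can be replayed with a small best-first search over blocks. This is a sketch under stated assumptions: the block lower bounds (40,000 and 160,000) are copied from the trace rather than derived from the bin boundaries, and the names `top1`, `LA_BLOCKS`, and `LOWER_BOUND` are hypothetical.

```python
import heapq

# Data-level measure for the cell (LA): block_id -> [(tid, price, sq_feet)].
LA_BLOCKS = {
    11: [("t6", 1200, 500), ("t7", 1200, 560)],
    15: [("t5", 1100, 200)],
}
# Lower-bound scores per block, copied from the trace (not computed here).
LOWER_BOUND = {11: 40_000, 15: 160_000}


def f(price, sqft):
    """Ranking function from the query: squared distance to (1000, 800)."""
    return (price - 1000) ** 2 + (sqft - 800) ** 2


def top1(blocks, lower_bound):
    """Open blocks in order of lower bound; stop once no unopened block can win."""
    queue = [(lower_bound[b], b) for b in blocks]
    heapq.heapify(queue)
    best = None      # (score, tid) of the best tuple seen so far
    retrieved = []   # blocks actually opened
    while queue:
        bound, block = heapq.heappop(queue)
        if best is not None and best[0] <= bound:
            break  # every remaining block scores at least `bound`
        retrieved.append(block)
        for tid, price, sqft in blocks[block]:
            candidate = (f(price, sqft), tid)
            if best is None or candidate < best:
                best = candidate
    return best, retrieved


best, retrieved = top1(LA_BLOCKS, LOWER_BOUND)
print(best, retrieved)
```

Block 15 is never opened: its lower bound of 160,000 already exceeds f(t7) = 97,600, which matches the trace stopping after block 11.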

86
Ranking Cube: Methodology and
Extension
 Ranking cube methodology
 Push selection and ranking simultaneously
 It works for many sophisticated ranking functions
 How to support high-dimensional data?
 Materialize only those atomic cuboids that contain
single selection dimensions
 Uses an idea similar to high-dimensional OLAP
 Achieves low space overhead and high
performance in answering ranking queries with a
high number of selection dimensions
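
One way to picture the high-dimensional extension is sketched below: only single-dimension (atomic) cuboids are stored, and a multi-dimensional selection is answered by intersecting their per-block tuple-ID sets at query time. The intersection step is an assumption made for illustration, borrowed from the tid-list intersection used in shell-fragment OLAP; the slides do not spell out the mechanism, and the names `atomic_cuboid` and `intersect` are hypothetical.

```python
from collections import defaultdict

# (tid, city, bedrooms, block_id) from the Apartment example; block IDs are given.
ROWS = [
    ("t1", "SEA", 1, 5), ("t2", "CLE", 2, 5), ("t3", "SEA", 1, 2), ("t4", "CLE", 3, 6),
    ("t5", "LA", 1, 15), ("t6", "LA", 2, 11), ("t7", "LA", 2, 11), ("t8", "CLE", 3, 4),
]


def atomic_cuboid(rows, column):
    """Single-dimension cuboid: value -> {block_id: set of tids}."""
    cuboid = defaultdict(lambda: defaultdict(set))
    for row in rows:
        cuboid[row[column]][row[3]].add(row[0])
    return cuboid


def intersect(cell_a, cell_b):
    """Derive the cell for a conjunction of two single-dimension selections."""
    result = {}
    for block in cell_a.keys() & cell_b.keys():
        tids = cell_a[block] & cell_b[block]
        if tids:
            result[block] = tids
    return result


by_city = atomic_cuboid(ROWS, 1)
by_br = atomic_cuboid(ROWS, 2)

# Cell for City = "LA" AND BR = 2, assembled at query time.
la_2br = intersect(by_city["LA"], by_br[2])
print(la_2br)
```

Storage then grows with the number of selection dimensions rather than with the number of dimension combinations, at the price of extra intersection work per query.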

Recommended

Data mining :Concepts and Techniques Chapter 2, data

Data mining :Concepts and Techniques Chapter 2, data

Data mining :Concepts and Techniques Chapter 2, dataSalah Amean

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007

The document describes Chapter 6 of the book "Data Mining: Concepts and Techniques" which covers the topics of classification and prediction. It defines classification and prediction and discusses key issues in classification such as data preparation, evaluating methods, and decision tree induction. Decision tree induction creates a tree model by recursively splitting the training data on attributes and their values to make predictions. The chapter also covers other classification methods like Bayesian classification, rule-based classification, and support vector machines. It describes the process of model construction from training data and then using the model to classify new, unlabeled data.

Data Mining: Concepts and Techniques (3rd ed.)— Chapter _04 olap

Data Mining: Concepts and Techniques (3rd ed.)— Chapter _04 olap

Data Mining: Concepts and Techniques (3rd ed.)— Chapter _04 olapSalah Amean

Chapter 1. Introduction

Chapter 1. Introduction

Chapter 1. Introductionbutest

The document provides an overview of the data mining concepts and techniques course offered at the University of Illinois at Urbana-Champaign. It discusses the motivation for data mining due to abundant data collection and the need for knowledge discovery. It also describes common data mining functionalities like classification, clustering, association rule mining and the most popular algorithms used.

Data cubesMohammed

This document discusses data cubes, which are multidimensional data structures used in online analytical processing (OLAP) to enable fast retrieval of data organized by dimensions and measures. Data cubes can have 2-3 dimensions or more and contain measures like costs or units. Key concepts are slicing to select a 2D page, dicing to define a subcube, and rotating to change dimensional orientation. Data cubes represent categories through dimensions and levels, and store facts as measures in cells. They can be pre-computed fully, not at all, or partially to balance query speed and memory usage. Totals can also be stored to improve performance of aggregate queries.

4.2 spatial data mining

4.2 spatial data mining

4.2 spatial data miningKrish_ver2

This document discusses spatial data mining and its applications. Spatial data mining involves extracting knowledge and relationships from large spatial databases. It can be used for applications like GIS, remote sensing, medical imaging, and more. Some challenges include the complexity of spatial data types and large data volumes. The document also covers topics like spatial data warehouses, dimensions and measures in spatial analysis, spatial association rule mining, and applications in fields such as earth science, crime mapping, and commerce.

Data Mining: Concepts and Techniques — Chapter 2 —

Data Mining: Concepts and Techniques — Chapter 2 —

Data Mining: Concepts and Techniques — Chapter 2 —Salah Amean

lazy learners and other classication methods

lazy learners and other classication methods

lazy learners and other classication methodsrajshreemuthiah

Lazy learning is a machine learning method where generalization of training data is delayed until a query is made, unlike eager learning which generalizes before queries. K-nearest neighbors and case-based reasoning are examples of lazy learners, which store training data and classify new data based on similarity. Case-based reasoning specifically stores prior problem solutions to solve new problems by combining similar past case solutions.

DDBMS Paper with Solution

DDBMS Paper with Solution

DDBMS Paper with SolutionGyanmanjari Institute Of Technology

The document summarizes some of the key potential problems with distributed database management systems (DDBMS), including: 1) Distributed database design issues around how to partition and replicate the database across sites. 2) Distributed directory management challenges in maintaining consistency across global or local directories. 3) Distributed query processing difficulties in determining optimal strategies for executing queries across network locations. 4) Distributed concurrency control complications in synchronizing access to multiple copies of the database across sites while maintaining consistency.

Multidimentional data model

Multidimentional data model

Multidimentional data modeljagdish_93

The document presents on multidimensional data models. It discusses the key components of multidimensional data models including dimensions and facts. It describes different types of multidimensional data models such as data cube model, star schema model, snowflake schema model, and fact constellations. The star schema model and snowflake schema model are explained in more detail through examples and their benefits are highlighted.

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean

3. mining frequent patterns

3. mining frequent patterns

3. mining frequent patternsAzad public school

The document discusses frequent pattern mining and the Apriori algorithm. It introduces frequent patterns as frequently occurring sets of items in transaction data. The Apriori algorithm is described as a seminal method for mining frequent itemsets via multiple passes over the data, generating candidate itemsets and pruning those that are not frequent. Challenges with Apriori include multiple database scans and large number of candidate sets generated.

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingSalah Amean

Data cube computation

Data cube computation

Data cube computationRashmi Sheikh

Data cube computation involves precomputing aggregations to enable fast query performance. There are different materialization strategies like full cubes, iceberg cubes, and shell cubes. Full cubes precompute all aggregations but require significant storage, while iceberg cubes only store aggregations that meet a threshold. Computation strategies include sorting and grouping to aggregate similar values, caching intermediate results, and aggregating from smallest child cuboids first. The Apriori pruning method can efficiently compute iceberg cubes by avoiding computing descendants of cells that do not meet the minimum support threshold.

Coda file system

Coda file system

Coda file systemSneh Pahilwani

Parallel Database

Parallel Database

Parallel DatabaseVESIT/University of Mumbai

The document discusses parallel databases and their architectures. It introduces parallel databases as systems that seek to improve performance through parallelizing operations like loading data, building indexes, and evaluating queries using multiple CPUs and disks. It describes three main architectures for parallel databases: shared memory, shared disk, and shared nothing. The shared nothing architecture provides linear scale-up and speed-up but is more difficult to program. The document also discusses measuring performance improvements from parallelization through speed-up and scale-up.

EDA-Unit 1.pdfNirmalavenkatachalam

This document provides an overview of exploratory data analysis (EDA). It discusses the key stages of EDA including data requirements, collection, processing, cleaning, exploration, modeling, products, and communication. The stages involve examining available data to discover patterns and relationships. EDA is the first step in data mining projects to understand data without assumptions. The document also outlines the problem definition, data preparation, analysis, and result development and representation steps of EDA. Finally, it discusses different types of data like numeric, categorical, and the importance of understanding data types for analysis.

Clustering for Stream and Parallelism (DATA ANALYTICS)

Clustering for Stream and Parallelism (DATA ANALYTICS)

Clustering for Stream and Parallelism (DATA ANALYTICS)DheerajPachauri

Distributed DBMS - Unit 5 - Semantic Data Control

Distributed DBMS - Unit 5 - Semantic Data Control

Distributed DBMS - Unit 5 - Semantic Data ControlGyanmanjari Institute Of Technology

k medoid clustering.pptx

k medoid clustering.pptx

k medoid clustering.pptxRoshan86572

K-medoids is a clustering algorithm that groups similar data points into K clusters by selecting representative data points called medoids. It iteratively assigns data points to the closest medoid and updates the medoids to minimize distances between points and clusters. K-medoids is more robust to outliers than K-means and can handle non-Euclidean distances, making it useful for clustering categorical or nonlinear data. It has various applications but is more computationally expensive than K-means.

02 dataphakhwan22

This chapter discusses getting to know data through analysis and visualization. It covers data objects and attribute types, statistical descriptions of data including measures of central tendency and dispersion, visualization techniques like histograms and scatter plots, and measuring similarity between data objects. The goal is to better understand data characteristics before applying more advanced mining techniques.

2.2 decision tree

2.2 decision tree

2.2 decision treeKrish_ver2

This document discusses decision tree induction and attribute selection measures. It describes common measures like information gain, gain ratio, and Gini index that are used to select the best splitting attribute at each node in decision tree construction. It provides examples to illustrate information gain calculation for both discrete and continuous attributes. The document also discusses techniques for handling large datasets like SLIQ and SPRINT that build decision trees in a scalable manner by maintaining attribute value lists.

Data warehouse architecture

Data warehouse architecture

Data warehouse architecturepcherukumalla

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALASaikiran Panjala

This document discusses data warehouses, including what they are, how they are implemented, and how they can be further developed. It provides definitions of key concepts like data warehouses, data cubes, and OLAP. It also describes techniques for efficient data cube computation, indexing of OLAP data, and processing of OLAP queries. Finally, it discusses different approaches to data warehouse implementation and development of data cube technology.

Lecture #01Konpal Darakshan

This document provides an overview of the introductory lecture to the BS in Data Science program. It discusses key topics that were covered in the lecture, including recommended books and chapters to be covered. It provides a brief introduction to key terminologies in data science, such as different data types, scales of measurement, and basic concepts. It also discusses the current landscape of data science, including the difference between roles of data scientists in academia versus industry.

Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and CorrelationsJustin Cletus

This document summarizes Chapter 6 of the book "Data Mining: Concepts and Techniques" which discusses frequent pattern mining. It introduces basic concepts like frequent itemsets and association rules. It then describes several scalable algorithms for mining frequent itemsets, including Apriori, FP-Growth, and ECLAT. It also discusses optimizations to Apriori like partitioning the database and techniques to reduce the number of candidates and database scans.

Data warehouse design

Data warehouse design

Data warehouse designines beltaief

The document discusses three papers related to data warehouse design. Paper 1 presents the X-META methodology, which addresses developing a first data warehouse project and integrates metadata creation and management into the development process. It proposes starting with a pilot project and defines three iteration types. Paper 2 proposes extending the ER conceptual data model to allow modeling of multi-dimensional aggregated entities. It includes entity types for basic dimensions, simple aggregations, and multi-dimensional aggregated entities. Paper 3 presents a comprehensive UML-based method for designing all phases of a data warehouse, from source data to implementation. It defines four schemas - operational, conceptual, storage, and business - and the mappings between them. It also provides steps

Data Mining & Data Warehousing Lecture Notes

Data Mining & Data Warehousing Lecture Notes

Data Mining & Data Warehousing Lecture NotesFellowBuddy.com

FellowBuddy.com is an innovative platform that brings students together to share notes, exam papers, study guides, project reports and presentation for upcoming exams. We connect Students who have an understanding of course material with Students who need help. Benefits:- # Students can catch up on notes they missed because of an absence. # Underachievers can find peer developed notes that break down lecture and study material in a way that they can understand # Students can earn better grades, save time and study effectively Our Vision & Mission – Simplifying Students Life Our Belief – “The great breakthrough in your life comes when you realize it, that you can learn anything you need to learn; to accomplish any goal that you have set for yourself. This means there are no limits on what you can be, have or do.” Like Us - https://www.facebook.com/FellowBuddycom

Data Mining: Concepts and techniques: Chapter 13 trend

Data Mining: Concepts and techniques: Chapter 13 trend

Data Mining: Concepts and techniques: Chapter 13 trendSalah Amean

Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...

Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...

Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean

Ad

More Related Content

What's hot (20)

DDBMS Paper with Solution

DDBMS Paper with Solution

DDBMS Paper with SolutionGyanmanjari Institute Of Technology

The document summarizes some of the key potential problems with distributed database management systems (DDBMS), including: 1) Distributed database design issues around how to partition and replicate the database across sites. 2) Distributed directory management challenges in maintaining consistency across global or local directories. 3) Distributed query processing difficulties in determining optimal strategies for executing queries across network locations. 4) Distributed concurrency control complications in synchronizing access to multiple copies of the database across sites while maintaining consistency.

Multidimentional data model

Multidimentional data model

Multidimentional data modeljagdish_93

The document presents on multidimensional data models. It discusses the key components of multidimensional data models including dimensions and facts. It describes different types of multidimensional data models such as data cube model, star schema model, snowflake schema model, and fact constellations. The star schema model and snowflake schema model are explained in more detail through examples and their benefits are highlighted.

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean

3. mining frequent patterns

3. mining frequent patterns

3. mining frequent patternsAzad public school

The document discusses frequent pattern mining and the Apriori algorithm. It introduces frequent patterns as frequently occurring sets of items in transaction data. The Apriori algorithm is described as a seminal method for mining frequent itemsets via multiple passes over the data, generating candidate itemsets and pruning those that are not frequent. Challenges with Apriori include multiple database scans and large number of candidate sets generated.

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingSalah Amean

Data cube computation

Data cube computation

Data cube computationRashmi Sheikh

Data cube computation involves precomputing aggregations to enable fast query performance. There are different materialization strategies like full cubes, iceberg cubes, and shell cubes. Full cubes precompute all aggregations but require significant storage, while iceberg cubes only store aggregations that meet a threshold. Computation strategies include sorting and grouping to aggregate similar values, caching intermediate results, and aggregating from smallest child cuboids first. The Apriori pruning method can efficiently compute iceberg cubes by avoiding computing descendants of cells that do not meet the minimum support threshold.

Coda file system

Coda file system

Coda file systemSneh Pahilwani

Parallel Database

Parallel Database

Parallel DatabaseVESIT/University of Mumbai

The document discusses parallel databases and their architectures. It introduces parallel databases as systems that seek to improve performance through parallelizing operations like loading data, building indexes, and evaluating queries using multiple CPUs and disks. It describes three main architectures for parallel databases: shared memory, shared disk, and shared nothing. The shared nothing architecture provides linear scale-up and speed-up but is more difficult to program. The document also discusses measuring performance improvements from parallelization through speed-up and scale-up.

EDA-Unit 1.pdfNirmalavenkatachalam

This document provides an overview of exploratory data analysis (EDA). It discusses the key stages of EDA including data requirements, collection, processing, cleaning, exploration, modeling, products, and communication. The stages involve examining available data to discover patterns and relationships. EDA is the first step in data mining projects to understand data without assumptions. The document also outlines the problem definition, data preparation, analysis, and result development and representation steps of EDA. Finally, it discusses different types of data like numeric, categorical, and the importance of understanding data types for analysis.

Clustering for Stream and Parallelism (DATA ANALYTICS)

Clustering for Stream and Parallelism (DATA ANALYTICS)

Clustering for Stream and Parallelism (DATA ANALYTICS)DheerajPachauri

Distributed DBMS - Unit 5 - Semantic Data Control

Distributed DBMS - Unit 5 - Semantic Data Control

Distributed DBMS - Unit 5 - Semantic Data ControlGyanmanjari Institute Of Technology

k medoid clustering.pptx

k medoid clustering.pptx

k medoid clustering.pptxRoshan86572

K-medoids is a clustering algorithm that groups similar data points into K clusters by selecting representative data points called medoids. It iteratively assigns data points to the closest medoid and updates the medoids to minimize distances between points and clusters. K-medoids is more robust to outliers than K-means and can handle non-Euclidean distances, making it useful for clustering categorical or nonlinear data. It has various applications but is more computationally expensive than K-means.

02 dataphakhwan22

This chapter discusses getting to know data through analysis and visualization. It covers data objects and attribute types, statistical descriptions of data including measures of central tendency and dispersion, visualization techniques like histograms and scatter plots, and measuring similarity between data objects. The goal is to better understand data characteristics before applying more advanced mining techniques.

2.2 decision tree

2.2 decision tree

2.2 decision treeKrish_ver2

This document discusses decision tree induction and attribute selection measures. It describes common measures like information gain, gain ratio, and Gini index that are used to select the best splitting attribute at each node in decision tree construction. It provides examples to illustrate information gain calculation for both discrete and continuous attributes. The document also discusses techniques for handling large datasets like SLIQ and SPRINT that build decision trees in a scalable manner by maintaining attribute value lists.

Data warehouse architecture

Data warehouse architecture

Data warehouse architecturepcherukumalla

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALASaikiran Panjala

This document discusses data warehouses, including what they are, how they are implemented, and how they can be further developed. It provides definitions of key concepts like data warehouses, data cubes, and OLAP. It also describes techniques for efficient data cube computation, indexing of OLAP data, and processing of OLAP queries. Finally, it discusses different approaches to data warehouse implementation and development of data cube technology.

Lecture #01Konpal Darakshan

This document provides an overview of the introductory lecture to the BS in Data Science program. It discusses key topics that were covered in the lecture, including recommended books and chapters to be covered. It provides a brief introduction to key terminologies in data science, such as different data types, scales of measurement, and basic concepts. It also discusses the current landscape of data science, including the difference between roles of data scientists in academia versus industry.

Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and CorrelationsJustin Cletus

This document summarizes Chapter 6 of the book "Data Mining: Concepts and Techniques" which discusses frequent pattern mining. It introduces basic concepts like frequent itemsets and association rules. It then describes several scalable algorithms for mining frequent itemsets, including Apriori, FP-Growth, and ECLAT. It also discusses optimizations to Apriori like partitioning the database and techniques to reduce the number of candidates and database scans.

Data warehouse design

Data warehouse design

Data warehouse designines beltaief

The document discusses three papers related to data warehouse design. Paper 1 presents the X-META methodology, which addresses developing a first data warehouse project and integrates metadata creation and management into the development process. It proposes starting with a pilot project and defines three iteration types. Paper 2 proposes extending the ER conceptual data model to allow modeling of multi-dimensional aggregated entities. It includes entity types for basic dimensions, simple aggregations, and multi-dimensional aggregated entities. Paper 3 presents a comprehensive UML-based method for designing all phases of a data warehouse, from source data to implementation. It defines four schemas - operational, conceptual, storage, and business - and the mappings between them. It also provides steps

Data Mining & Data Warehousing Lecture Notes

Data Mining & Data Warehousing Lecture Notes

Data Mining & Data Warehousing Lecture NotesFellowBuddy.com

FellowBuddy.com is an innovative platform that brings students together to share notes, exam papers, study guides, project reports and presentation for upcoming exams. We connect Students who have an understanding of course material with Students who need help. Benefits:- # Students can catch up on notes they missed because of an absence. # Underachievers can find peer developed notes that break down lecture and study material in a way that they can understand # Students can earn better grades, save time and study effectively Our Vision & Mission – Simplifying Students Life Our Belief – “The great breakthrough in your life comes when you realize it, that you can learn anything you need to learn; to accomplish any goal that you have set for yourself. This means there are no limits on what you can be, have or do.” Like Us - https://www.facebook.com/FellowBuddycom

DDBMS Paper with Solution

DDBMS Paper with Solution

DDBMS Paper with SolutionGyanmanjari Institute Of Technology

Multidimentional data model

Multidimentional data model

Multidimentional data modeljagdish_93

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean

3. mining frequent patterns

3. mining frequent patterns

3. mining frequent patternsAzad public school

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing

Data Mining: Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingSalah Amean

Data cube computation

Data cube computation

Data cube computationRashmi Sheikh

Coda file system

Coda file system

Coda file systemSneh Pahilwani

Parallel Database

Parallel Database

Parallel DatabaseVESIT/University of Mumbai

EDA-Unit 1.pdfNirmalavenkatachalam

Clustering for Stream and Parallelism (DATA ANALYTICS)

Clustering for Stream and Parallelism (DATA ANALYTICS)

Clustering for Stream and Parallelism (DATA ANALYTICS)DheerajPachauri

Distributed DBMS - Unit 5 - Semantic Data Control

Distributed DBMS - Unit 5 - Semantic Data Control

Distributed DBMS - Unit 5 - Semantic Data ControlGyanmanjari Institute Of Technology

k medoid clustering.pptx

k medoid clustering.pptx

k medoid clustering.pptxRoshan86572

02 dataphakhwan22

2.2 decision tree

2.2 decision tree

2.2 decision treeKrish_ver2

Data warehouse architecture

Data warehouse architecture

Data warehouse architecturepcherukumalla

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA

DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALASaikiran Panjala

Lecture #01Konpal Darakshan

Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and CorrelationsJustin Cletus

Data warehouse design

Data warehouse design

Data warehouse designines beltaief

Data Mining & Data Warehousing Lecture Notes

Data Mining & Data Warehousing Lecture Notes

Data Mining & Data Warehousing Lecture NotesFellowBuddy.com







Recently uploaded (20)

Web and Graphics Designing Training in Rajpura

Web and Graphics Designing Training in Rajpura

Web and Graphics Designing Training in RajpuraErginous Technology

Web & Graphics Designing Training at Erginous Technologies in Rajpura offers practical, hands-on learning for students, graduates, and professionals aiming for a creative career. The 6-week and 6-month industrial training programs blend creativity with technical skills to prepare you for real-world opportunities in design. The course covers Graphic Designing tools like Photoshop, Illustrator, and CorelDRAW, along with logo, banner, and branding design. In Web Designing, you’ll learn HTML5, CSS3, JavaScript basics, responsive design, Bootstrap, Figma, and Adobe XD. Erginous emphasizes 100% practical training, live projects, portfolio building, expert guidance, certification, and placement support. Graduates can explore roles like Web Designer, Graphic Designer, UI/UX Designer, or Freelancer. For more info, visit erginous.co.in , message us on Instagram at erginoustechnologies, or call directly at +91-89684-38190 . Start your journey toward a creative and successful design career today!

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul

Artificial intelligence is changing how businesses operate. Companies are using AI agents to automate tasks, reduce time spent on repetitive work, and focus more on high-value activities. Noah Loul, an AI strategist and entrepreneur, has helped dozens of companies streamline their operations using smart automation. He believes AI agents aren't just tools—they're workers that take on repeatable tasks so your human team can focus on what matters. If you want to reduce time waste and increase output, AI agents are the next move.

Greenhouse_Monitoring_Presentation.pptx.

Greenhouse_Monitoring_Presentation.pptx.

Greenhouse_Monitoring_Presentation.pptx.hpbmnnxrvb

SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf

SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf

SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfPrecisely

Build Your Own Copilot & Agents For Devs

Build Your Own Copilot & Agents For Devs

Build Your Own Copilot & Agents For DevsBrian McKeiver

Unlocking the Power of IVR: A Comprehensive Guide

Unlocking the Power of IVR: A Comprehensive Guide

Unlocking the Power of IVR: A Comprehensive Guidevikasascentbpo

Dev Dives: Automate and orchestrate your processes with UiPath Maestro

Dev Dives: Automate and orchestrate your processes with UiPath Maestro

Dev Dives: Automate and orchestrate your processes with UiPath MaestroUiPathCommunity

This session is designed to equip developers with the skills needed to build mission-critical, end-to-end processes that seamlessly orchestrate agents, people, and robots. 📕 Here's what you can expect: - Modeling: Build end-to-end processes using BPMN. - Implementing: Integrate agentic tasks, RPA, APIs, and advanced decisioning into processes. - Operating: Control process instances with rewind, replay, pause, and stop functions. - Monitoring: Use dashboards and embedded analytics for real-time insights into process instances. This webinar is a must-attend for developers looking to enhance their agentic automation skills and orchestrate robust, mission-critical processes. 👨‍🏫 Speaker: Andrei Vintila, Principal Product Manager @UiPath This session streamed live on April 29, 2025, 16:00 CET. Check out all our upcoming Dev Dives sessions at https://community.uipath.com/dev-dives-automation-developer-2025/.

Top 10 IT Help Desk Outsourcing Services

Top 10 IT Help Desk Outsourcing Services

Top 10 IT Help Desk Outsourcing ServicesInfrassist Technologies Pvt. Ltd.

Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...

Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...

Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...BookNet Canada

Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next. Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/ Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.

MINDCTI revenue release Quarter 1 2025 PR

MINDCTI revenue release Quarter 1 2025 PR

MINDCTI revenue release Quarter 1 2025 PRMIND CTI

Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights

Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights

Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell

Role of Data Annotation Services in AI-Powered Manufacturing

Role of Data Annotation Services in AI-Powered Manufacturing

Role of Data Annotation Services in AI-Powered ManufacturingAndrew Leo

Rusty Waters: Elevating Lakehouses Beyond Spark

Rusty Waters: Elevating Lakehouses Beyond Spark

Rusty Waters: Elevating Lakehouses Beyond Sparkcarlyakerly1

Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark? At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍 Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀

Mastering Advance Window Functions in SQL.pdf

Mastering Advance Window Functions in SQL.pdf

Mastering Advance Window Functions in SQL.pdfSpiral Mantra

Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdf

Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdf

Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdfTelecoms Supermarket

Electronic_Mail_Attacks-1-35.pdf by xploit

Electronic_Mail_Attacks-1-35.pdf by xploit

Electronic_Mail_Attacks-1-35.pdf by xploitniftliyevhuseyn

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada

Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next. Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/ Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.

Quantum Computing Quick Research Guide by Arthur Morgan

Quantum Computing Quick Research Guide by Arthur Morgan

Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan

This is a Quick Research Guide (QRG). QRGs include the following: - A brief, high-level overview of the QRG topic. - A milestone timeline for the QRG topic. - Links to various free online resource materials to provide a deeper dive into the QRG topic. - Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic. QRGs planned for the series: - Artificial Intelligence QRG - Quantum Computing QRG - Big Data Analytics QRG - Spacecraft Guidance, Navigation & Control QRG (coming 2026) - UK Home Computing & The Birth of ARM QRG (coming 2027) Any questions or comments? - Please contact Arthur Morgan at [email protected]. 100% human made.

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB

Want to learn practical tips for designing systems that can scale efficiently without compromising speed? Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development. As you explore key principles of designing low-latency systems with Rust, you will learn how to: - Create and compile a real-world app with Rust - Connect the application to ScyllaDB (NoSQL data store) - Negotiate tradeoffs related to data modeling and querying - Manage and monitor the database for consistently low latencies

Linux Professional Institute LPIC-1 Exam.pdf

Linux Professional Institute LPIC-1 Exam.pdf

Linux Professional Institute LPIC-1 Exam.pdfRHCSA Guru

Web and Graphics Designing Training in Rajpura

Web and Graphics Designing Training in Rajpura

Web and Graphics Designing Training in RajpuraErginous Technology

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul

Greenhouse_Monitoring_Presentation.pptx.

Greenhouse_Monitoring_Presentation.pptx.

Greenhouse_Monitoring_Presentation.pptx.hpbmnnxrvb

SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf

SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf

SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfPrecisely

Build Your Own Copilot & Agents For Devs

Build Your Own Copilot & Agents For Devs

Build Your Own Copilot & Agents For DevsBrian McKeiver

Unlocking the Power of IVR: A Comprehensive Guide

Unlocking the Power of IVR: A Comprehensive Guide

Unlocking the Power of IVR: A Comprehensive Guidevikasascentbpo

Dev Dives: Automate and orchestrate your processes with UiPath Maestro

Dev Dives: Automate and orchestrate your processes with UiPath Maestro

Dev Dives: Automate and orchestrate your processes with UiPath MaestroUiPathCommunity

Top 10 IT Help Desk Outsourcing Services

Top 10 IT Help Desk Outsourcing Services

Top 10 IT Help Desk Outsourcing ServicesInfrassist Technologies Pvt. Ltd.

Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...

Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...

Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...BookNet Canada

MINDCTI revenue release Quarter 1 2025 PR

MINDCTI revenue release Quarter 1 2025 PR

MINDCTI revenue release Quarter 1 2025 PRMIND CTI

Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights

Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights

Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell

Role of Data Annotation Services in AI-Powered Manufacturing

Role of Data Annotation Services in AI-Powered Manufacturing

Role of Data Annotation Services in AI-Powered ManufacturingAndrew Leo

Rusty Waters: Elevating Lakehouses Beyond Spark

Rusty Waters: Elevating Lakehouses Beyond Spark

Rusty Waters: Elevating Lakehouses Beyond Sparkcarlyakerly1

Mastering Advance Window Functions in SQL.pdf

Mastering Advance Window Functions in SQL.pdf

Mastering Advance Window Functions in SQL.pdfSpiral Mantra

Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdf

Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdf

Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdfTelecoms Supermarket

Electronic_Mail_Attacks-1-35.pdf by xploit

Electronic_Mail_Attacks-1-35.pdf by xploit

Electronic_Mail_Attacks-1-35.pdf by xploitniftliyevhuseyn

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada

Quantum Computing Quick Research Guide by Arthur Morgan

Quantum Computing Quick Research Guide by Arthur Morgan

Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB

Linux Professional Institute LPIC-1 Exam.pdf

Linux Professional Institute LPIC-1 Exam.pdf

Linux Professional Institute LPIC-1 Exam.pdfRHCSA Guru

Data Mining: Concepts and Techniques (3rd ed.)— Chapter 5

1. Data Mining: Concepts and Techniques (3rd ed.), Chapter 5
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.

2. 09/14/14 Data Mining: Concepts and Techniques 2

3. Chapter 5: Data Cube Technology
• Data Cube Computation: Preliminary Concepts
• Data Cube Computation Methods
• Processing Advanced Queries by Exploring Data Cube Technology
• Multidimensional Data Analysis in Cube Space
• Summary

4. Data Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)

5. Data Cube: A Lattice of Cuboids
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land)
  2. (9/15, milk, Urbana, *)
  3. (*, milk, Urbana, *)
  4. (*, milk, Urbana, *)
  5. (*, milk, Chicago, *)
  6. (*, milk, *, *)
(Figure: the lattice of cuboids over time, item, location, and supplier, from the 0-D apex cuboid "all" to the 4-D base cuboid.)

6. Cube Materialization: Full Cube vs. Iceberg Cube
• Full cube vs. iceberg cube
    compute cube sales_iceberg as
    select month, city, customer_group, count(*)
    from salesInfo
    cube by month, city, customer_group
    having count(*) >= min_support    (the iceberg condition)
• Compute only the cuboid cells whose measure satisfies the iceberg condition
• Only a small portion of cells may be “above the water” in a sparse cube
• Avoid explosive growth: consider a cube with 100 dimensions
  - 2 base cells: (a1, a2, …, a100), (b1, b2, …, b100)
  - How many aggregate cells if “having count >= 1”?
  - What about “having count >= 2”?
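The two questions on this slide can be checked by brute force on a small cube. The sketch below is only an illustration: it uses 4 dimensions instead of 100, gives each base cell a count of 1, and the function names are made up for this example. Counting the same way, two base cells that share no values yield 2^(n+1) − 3 aggregate cells for "count >= 1" (2^101 − 3 when n = 100) and a single cell, the apex, for "count >= 2".

```python
from itertools import product


def cube_cells(base_cells):
    """Return {cell: count} for every cell of the cube built from base_cells.

    base_cells maps a tuple of dimension values to its count; '*' marks an
    aggregated dimension.
    """
    counts = {}
    for cell, count in base_cells.items():
        # Each base cell contributes to all 2^n of its generalizations.
        for mask in product((False, True), repeat=len(cell)):
            general = tuple('*' if hide else v for v, hide in zip(cell, mask))
            counts[general] = counts.get(general, 0) + count
    return counts


def iceberg_aggregate_cells(base_cells, min_support):
    """Count aggregate (non-base) cells whose count meets min_support."""
    return sum(1 for cell, count in cube_cells(base_cells).items()
               if '*' in cell and count >= min_support)


def two_disjoint_base_cells(n):
    """Two base cells (a1..an) and (b1..bn), each with count 1."""
    return {tuple(f"a{i}" for i in range(1, n + 1)): 1,
            tuple(f"b{i}" for i in range(1, n + 1)): 1}


base = two_disjoint_base_cells(4)  # small stand-in for the 100 dimensions
print(iceberg_aggregate_cells(base, 1))  # 29, i.e. 2^(4+1) - 3
print(iceberg_aggregate_cells(base, 2))  # 1, only the apex cell (*, *, *, *)
```

Enumerating cells this way is exponential in the number of dimensions, which is exactly the growth the iceberg condition is meant to avoid.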

7. Iceberg Cube, Closed Cube & Cube Shell
• Is the iceberg cube good enough?
  - 2 base cells: {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10}
  - How many cells will the iceberg cube have if having count(*) >= 10? Hint: a huge but tricky number!
• Closed cube:
  - Closed cell c: there exists no cell d such that d is a descendant of c and d has the same measure value as c
  - Closed cube: a cube consisting of only closed cells
  - What is the closed cube of the above base cuboid? Hint: only 3 cells
• Cube shell
  - Precompute only the cuboids involving a small number of dimensions, e.g., 3
  - For (A1, A2, …, A10), how many combinations to compute?
  - More dimension combinations will need to be computed on the fly
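The closed-cell definition can be tested directly on a scaled-down version of the slide's example. This is a brute-force sketch with 4 dimensions instead of 100, and the helper names are illustrative only. The cube has 2^(n+1) − 4 cells in total (28 here, and by the same count 2^101 − 4 for 100 dimensions), all with count >= 10, but only 3 of them are closed.

```python
from itertools import product


def cube_cells(base_cells):
    """Return {cell: measure} for all cells generated by the base cells."""
    cells = {}
    for cell, measure in base_cells.items():
        for mask in product((False, True), repeat=len(cell)):
            general = tuple('*' if hide else v for v, hide in zip(cell, mask))
            cells[general] = cells.get(general, 0) + measure
    return cells


def is_descendant(d, c):
    """True if d is strictly more specific than c (c generalizes d)."""
    return d != c and all(cv == '*' or cv == dv for cv, dv in zip(c, d))


def closed_cells(base_cells):
    """Keep the cells that have no descendant with the same measure value."""
    cells = cube_cells(base_cells)
    return {c: m for c, m in cells.items()
            if not any(is_descendant(d, c) and cells[d] == m for d in cells)}


base = {("a1", "a2", "a3", "a4"): 10, ("a1", "a2", "b3", "b4"): 10}
print(len(cube_cells(base)))  # 28, i.e. 2^(4+1) - 4
for cell, measure in sorted(closed_cells(base).items()):
    print(cell, measure)
# ('a1', 'a2', '*', '*') 20
# ('a1', 'a2', 'a3', 'a4') 10
# ('a1', 'a2', 'b3', 'b4') 10
```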

8. Roadmap for Efficient Computation
• General cube computation heuristics (Agarwal et al.’96)
• Computing full/iceberg cubes: 3 methodologies
  - Bottom-up: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97)
  - Top-down:
    BUC (Beyer & Ramakrishnan, SIGMOD’99)
    H-cubing technique (Han, Pei, Dong & Wang, SIGMOD’01)
  - Integrating top-down and bottom-up:
    Star-cubing algorithm (Xin, Han, Li & Wah, VLDB’03)
• High-dimensional OLAP: a minimal cubing approach (Li, et al., VLDB’04)
• Computing alternative kinds of cubes:
  - Partial cube, closed cube, approximate cube, etc.

9. General Heuristics (Agarwal et al., VLDB’96)
• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
• Aggregates may be computed from previously computed aggregates, rather than from the base fact table
  - Smallest-child: compute a cuboid from the smallest previously computed cuboid
  - Cache-results: cache the results of a cuboid from which other cuboids are computed, to reduce disk I/O
  - Amortize-scans: compute as many cuboids as possible at the same time, to amortize disk reads
  - Share-sorts: share sorting costs across multiple cuboids when a sort-based method is used
  - Share-partitions: share the partitioning cost across multiple cuboids when hash-based algorithms are used

10. Chapter 5: Data Cube Technology
• Data Cube Computation: Preliminary Concepts
• Data Cube Computation Methods
  - Multi-Way Array Aggregation
  - BUC
  - High-Dimensional OLAP
• Processing Advanced Queries by Exploring Data Cube Technology
• Multidimensional Data Analysis in Cube Space
• Summary

11. Multi-Way Array Aggregation
• Array-based “bottom-up” algorithm
• Uses multi-dimensional chunks
• No direct tuple comparisons
• Simultaneous aggregation on multiple dimensions
• Intermediate aggregate values are re-used for computing ancestor cuboids
• Cannot do Apriori pruning: no iceberg optimization

12. Multi-way Array Aggregation for Cube Computation (MOLAP)
• Partition arrays into chunks (a chunk is a small subcube that fits in memory)
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in a “multiway” fashion by visiting cube cells in the order that minimizes the number of visits to each cell and reduces memory access and storage cost
• What is the best traversal order for multi-way aggregation?
(Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.)

13. Multi-way Array Aggregation for Cube Computation (3-D to 2-D)
(Figure: the cuboid lattice all; A, B, C; AB, AC, BC; ABC.)
• The best order is the one that minimizes the memory requirement and reduces I/O

14. Multi-way Array Aggregation for Cube Computation (2-D to 1-D)

15. Multi-Way Array Aggregation for Cube Computation (Method Summary)
• Method: the planes should be sorted and computed according to their size in ascending order
  - Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
• Limitation of the method: it performs well only for a small number of dimensions
  - If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored
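The core idea of simultaneous aggregation can be sketched on a small dense array. This is a simplified illustration, not the chunked algorithm from the slides: it holds the whole array in memory and ignores chunk ordering, and the function name is made up for this example. It does show the two properties listed earlier: one scan of the base cells updates the AB, AC, and BC planes together, and the 1-D cuboids and the apex are derived from those planes rather than from the base array.

```python
def multiway_aggregate(cube):
    """Aggregate a dense 3-D array cube[a][b][c] (sum measure) into all cuboids."""
    na, nb, nc = len(cube), len(cube[0]), len(cube[0][0])
    ab = [[0] * nb for _ in range(na)]
    ac = [[0] * nc for _ in range(na)]
    bc = [[0] * nc for _ in range(nb)]
    # One pass over the base cells feeds all three 2-D planes at once.
    for a in range(na):
        for b in range(nb):
            for c in range(nc):
                value = cube[a][b][c]
                ab[a][b] += value
                ac[a][c] += value
                bc[b][c] += value
    # 1-D cuboids and the apex reuse the 2-D planes, not the base array.
    a_totals = [sum(row) for row in ab]
    b_totals = [sum(ab[a][b] for a in range(na)) for b in range(nb)]
    c_totals = [sum(ac[a][c] for a in range(na)) for c in range(nc)]
    return {"AB": ab, "AC": ac, "BC": bc,
            "A": a_totals, "B": b_totals, "C": c_totals,
            "all": sum(a_totals)}


# A 2 x 2 x 2 array holding the values 1..8.
cube = [[[4 * a + 2 * b + c + 1 for c in range(2)] for b in range(2)]
        for a in range(2)]
result = multiway_aggregate(cube)
print(result["AB"])   # [[3, 7], [11, 15]]
print(result["A"])    # [10, 26]
print(result["all"])  # 36
```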

16. Bottom-Up Computation (BUC)
• BUC (Beyer & Ramakrishnan, SIGMOD’99)
• Bottom-up cube computation (note: top-down in our view!)
• Divides dimensions into partitions and facilitates iceberg pruning
  - If a partition does not satisfy min_sup, its descendants can be pruned
  - If min_sup = 1 ⇒ compute the full CUBE!
• No simultaneous aggregation
(Figure: the cuboid lattice over dimensions A, B, C, D, annotated with the BUC processing order: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D.)
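The recursive partitioning and pruning described above can be sketched as follows. This is a minimal in-memory illustration with count as the measure, not the published BUC implementation; the function name, argument layout, and sample rows are assumptions made for the example. Cells are emitted in the same order as the numbering in the figure (all, A, AB, ABC, ...), and a partition smaller than min_sup is never expanded.

```python
def buc(rows, num_dims, min_sup, cell=None, start=0, out=None):
    """Return {cell: count} for all cells with count >= min_sup.

    rows is a list of tuples of dimension values; '*' marks aggregation.
    """
    if out is None:
        cell = ['*'] * num_dims
        out = {}
    if len(rows) < min_sup:
        return out  # iceberg pruning: no descendant can reach min_sup
    out[tuple(cell)] = len(rows)
    for d in range(start, num_dims):
        # Partition the current rows on dimension d.
        partitions = {}
        for row in rows:
            partitions.setdefault(row[d], []).append(row)
        for value, part in sorted(partitions.items()):
            if len(part) >= min_sup:
                cell[d] = value
                buc(part, num_dims, min_sup, cell, d + 1, out)
        cell[d] = '*'  # restore before moving on to the next dimension
    return out


rows = [("a1", "b1", "c1"), ("a1", "b2", "c1"), ("a1", "b2", "c1"),
        ("a2", "b1", "c1"), ("a2", "b1", "c1")]
iceberg = buc(rows, 3, min_sup=2)
print(iceberg[("*", "*", "*")])       # 5
print(iceberg[("a1", "b2", "c1")])    # 2
print(("a1", "b1", "*") in iceberg)   # False: this partition has only 1 row
```

With min_sup=1 nothing is pruned and the same call returns every non-empty cell of the full cube.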

17. BUC: Partitioning
• Usually, the entire data set can’t fit in main memory
• Sort distinct values
  - Partition into blocks that fit
• Continue processing
• Optimizations
  - Partitioning: external sorting, hashing, counting sort
  - Ordering dimensions to encourage pruning: cardinality, skew, correlation
  - Collapsing duplicates: can’t do holistic aggregates anymore!

18. High-Dimensional OLAP? The Curse of Dimensionality
• None of the previous cubing methods can handle high dimensionality!
• Example: a database of 600k tuples, where each dimension has a cardinality of 100 and a Zipf skew of 2

19. Motivation of High-D OLAP
• X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04
• Challenge to current cubing methods:
  - The “curse of dimensionality” problem
  - Iceberg cube and compressed cubes: only delay the inevitable explosion
  - Full materialization: still significant overhead in accessing results on disk
• High-D OLAP is needed in applications
  - Science and engineering analysis
  - Bio-data analysis: thousands of genes
  - Statistical surveys: hundreds of variables

20. Fast High-D OLAP with Minimal Cubing
• Observation: OLAP occurs only on a small subset of dimensions at a time
• Semi-online computational model
  1. Partition the set of dimensions into shell fragments
  2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices
  3. Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional data cube online

21. Properties of the Proposed Method
• Partitions the data vertically
• Reduces a high-dimensional cube into a set of lower-dimensional cubes
• Online reconstruction of the original high-dimensional space
• Lossless reduction
• Offers tradeoffs between the amount of pre-processing and the speed of online computation

22. Example: Computing a 5-D Cube with Two Shell Fragments
• Let the cube aggregation function be count

    tid  A   B   C   D   E
    1    a1  b1  c1  d1  e1
    2    a1  b2  c1  d2  e1
    3    a1  b2  c1  d1  e2
    4    a2  b1  c1  d1  e2
    5    a2  b1  c1  d1  e3

• Divide the 5-D table into 2 shell fragments: (A, B, C) and (D, E)
• Build a traditional inverted index or RID list

    Attribute Value  TID List    List Size
    a1               1 2 3       3
    a2               4 5         2
    b1               1 4 5       3
    b2               2 3         2
    c1               1 2 3 4 5   5
    d1               1 3 4 5     4
    d2               2           1
    e1               1 2         2
    e2               3 4         2
    e3               5           1
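The inverted index in this example can be rebuilt with a few lines of code. The sketch below is illustrative: the table is copied from the slide, while the names TABLE and build_inverted_index are made up here, and attribute values are used directly as keys, which only works because the values in this example are unique across dimensions.

```python
from collections import defaultdict

# Base table from the slide: (tid, A, B, C, D, E).
TABLE = [
    (1, "a1", "b1", "c1", "d1", "e1"),
    (2, "a1", "b2", "c1", "d2", "e1"),
    (3, "a1", "b2", "c1", "d1", "e2"),
    (4, "a2", "b1", "c1", "d1", "e2"),
    (5, "a2", "b1", "c1", "d1", "e3"),
]


def build_inverted_index(table):
    """Map each attribute value to the list of tuple IDs that contain it."""
    index = defaultdict(list)
    for tid, *values in table:
        for value in values:
            index[value].append(tid)
    return dict(index)


index = build_inverted_index(TABLE)
for value in sorted(index):
    print(value, index[value], len(index[value]))
# a1 [1, 2, 3] 3
# a2 [4, 5] 2
# b1 [1, 4, 5] 3
# b2 [2, 3] 2
# c1 [1, 2, 3, 4, 5] 5
# d1 [1, 3, 4, 5] 4
# d2 [2] 1
# e1 [1, 2] 2
# e2 [3, 4] 2
# e3 [5] 1
```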

23. Shell Fragment Cubes: Ideas
• Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense
• Compute all cuboids for data cubes ABC and DE while retaining the inverted indices
• For example, shell fragment cube ABC contains 7 cuboids:
  - A, B, C
  - AB, AC, BC
  - ABC
• This completes the offline computation stage

    Cell    Intersection           TID List   List Size
    a1 b1   {1, 2, 3} ∩ {1, 4, 5}  {1}        1
    a1 b2   {1, 2, 3} ∩ {2, 3}     {2, 3}     2
    a2 b1   {4, 5} ∩ {1, 4, 5}     {4, 5}     2
    a2 b2   {4, 5} ∩ {2, 3}        ∅          0

24. Shell Fragment Cubes: Size and Design
• Given a database of T tuples, D dimensions, and shell fragment size F, the fragment cubes’ space requirement is: $O\left(T \left\lceil \frac{D}{F} \right\rceil (2^F - 1)\right)$
  - For F < 5, the growth is sub-linear
• Shell fragments do not have to be disjoint
• Fragment groupings can be arbitrary to allow for maximum online performance
  - Known common combinations (e.g., <city, state>) should be grouped together
• Shell fragment sizes can be adjusted for optimal balance between offline and online computation
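The space bound can be read as a product of two factors: the number of cuboids kept across all fragments, ⌈D/F⌉(2^F − 1), and the at most T tuple IDs stored per cuboid, since each tuple falls into exactly one cell of each cuboid. The short sketch below only evaluates the cuboid count; the function name and the sample values of D and F are illustrative assumptions, not figures from the slides.

```python
from math import ceil


def fragment_cuboid_count(num_dims, fragment_size):
    """Number of cuboids kept by disjoint shell fragments of a given size."""
    return ceil(num_dims / fragment_size) * (2 ** fragment_size - 1)


# 60 dimensions split into fragments of size 3: 20 fragments, 7 cuboids each.
print(fragment_cuboid_count(60, 3))  # 140
# A full cube over the same 60 dimensions has 2^60 cuboids.
print(2 ** 60)                       # 1152921504606846976
```

For a fixed F the count grows linearly in D, which is what makes the shell-fragment approach feasible where full materialization is not.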

25. ID_Measure Table
• If measures other than count are present, store them in an ID_measure table separate from the shell fragments

    tid  count  sum
    1    5      70
    2    3      10
    3    8      20
    4    5      40
    5    2      30

26. The Frag-Shells Algorithm
  1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk)
  2. Scan the base table once and do the following:
  3.     insert <tid, measure> into the ID_measure table
  4.     for each attribute value ai of each dimension Ai
  5.         build inverted index entry <ai, tidlist>
  6. For each fragment partition Pi
  7.     build local fragment cube Si by intersecting tid-lists in bottom-up fashion
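The seven steps above can be sketched in a few dozen lines. This is an illustrative in-memory version, not the authors' implementation: the function name, the data layout, and the choice to drop cells with an empty TID list are assumptions. The sample rows combine the 5-D table from the earlier example with the (count, sum) pairs from the ID_measure slide, a pairing made here purely for demonstration.

```python
from itertools import combinations


def frag_shells(rows, dim_names, fragments):
    """rows: iterable of (tid, measure, values); fragments: tuples of dimension names.

    Returns (id_measure, fragment_cubes); each fragment cube maps a cell,
    written as a tuple of (dimension, value) pairs, to its set of TIDs.
    """
    id_measure = {}
    index = {}  # 1-D inverted index: (dimension, value) -> set of TIDs
    for tid, measure, values in rows:              # steps 2-5: one scan
        id_measure[tid] = measure
        for dim, value in zip(dim_names, values):
            index.setdefault((dim, value), set()).add(tid)

    fragment_cubes = {}
    for fragment in fragments:                     # steps 6-7
        cube = {}
        for k in range(1, len(fragment) + 1):      # bottom-up: 1-D, 2-D, ...
            for cuboid in combinations(fragment, k):
                if k == 1:
                    parents = {(): None}
                else:
                    # Cells of the (k-1)-D cuboid that this cuboid extends.
                    parents = {c: t for c, t in cube.items()
                               if tuple(d for d, _ in c) == cuboid[:-1]}
                for parent, parent_tids in parents.items():
                    for (dim, value), tids in index.items():
                        if dim != cuboid[-1]:
                            continue
                        merged = tids if parent_tids is None else parent_tids & tids
                        if merged:  # assumption: empty cells are not stored
                            cube[parent + ((dim, value),)] = set(merged)
        fragment_cubes[fragment] = cube
    return id_measure, fragment_cubes


ROWS = [
    (1, (5, 70), ("a1", "b1", "c1", "d1", "e1")),
    (2, (3, 10), ("a1", "b2", "c1", "d2", "e1")),
    (3, (8, 20), ("a1", "b2", "c1", "d1", "e2")),
    (4, (5, 40), ("a2", "b1", "c1", "d1", "e2")),
    (5, (2, 30), ("a2", "b1", "c1", "d1", "e3")),
]
measures, cubes = frag_shells(ROWS, "ABCDE", [("A", "B", "C"), ("D", "E")])
print(cubes[("A", "B", "C")][(("A", "a1"), ("B", "b2"))])  # {2, 3}
print(cubes[("D", "E")][(("D", "d1"), ("E", "e2"))])        # {3, 4}
```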

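27. Frag-Shells
(Figure: the dimensions A, B, C, D, E, F, … are partitioned into fragments, giving an ABC cube and a DEF cube. The DEF cube contains cuboids such as D, DE, and EF. The DE cuboid is shown as a table of cells and tuple-ID lists.)

    Cell    Tuple-ID List
    d1 e1   {1, 3, 8, 9}
    d1 e2   {2, 4, 6, 7}
    d2 e1   {5, 10}
    …       …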

28. Online Query Computation: Query
• A query has the general form $\langle a_1, a_2, \ldots, a_n : M \rangle$
• Each ai has 3 possible values
  1. Instantiated value
  2. Aggregate * function
  3. Inquire ? function
• For example, $\langle 3 \; ? \; ? \; * \; 1 : \mathit{count} \rangle$ returns a 2-D data cube

29. Online Query Computation: Method
• Given the fragment cubes, process a query as follows
  1. Divide the query into fragments, matching the shell fragments
  2. Fetch the corresponding TID list for each fragment from the fragment cube
  3. Intersect the TID lists from each fragment to construct the instantiated base table
  4. Compute the data cube using the base table with any cubing algorithm
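The four steps can be sketched on the 5-tuple example table used earlier in the deck. This is a simplified illustration: fragment_tids stands in for a lookup into a precomputed fragment cube (it recomputes the TID set on the fly), the final cubing step is a brute-force enumeration with count as the measure, and all names are made up for the example.

```python
from itertools import product

DIMS = ("A", "B", "C", "D", "E")
FRAGMENTS = (("A", "B", "C"), ("D", "E"))
ROWS = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}


def fragment_tids(fragment, query):
    """Stand-in for a fragment-cube lookup: TIDs matching the instantiated
    dimensions of one fragment ('*' and '?' place no restriction)."""
    tids = set(ROWS)
    for dim in fragment:
        wanted = query[dim]
        if wanted not in ("*", "?"):
            i = DIMS.index(dim)
            tids &= {tid for tid, row in ROWS.items() if row[i] == wanted}
    return tids


def answer(query):
    """Return {cell: count} over the inquired ('?') dimensions of the query."""
    # Steps 1-3: split by fragment, fetch each TID set, intersect them.
    tids = set(ROWS)
    for fragment in FRAGMENTS:
        tids &= fragment_tids(fragment, query)
    inquired = [i for i, dim in enumerate(DIMS) if query[dim] == "?"]
    base_table = [tuple(ROWS[tid][i] for i in inquired) for tid in sorted(tids)]
    # Step 4: cube the instantiated base table (brute force, count measure).
    cube = {}
    for row in base_table:
        for mask in product((False, True), repeat=len(row)):
            cell = tuple("*" if hide else v for v, hide in zip(row, mask))
            cube[cell] = cube.get(cell, 0) + 1
    return cube


# Query <a1, ?, *, ?, * : count>: a 2-D cube over B and D for A = a1.
result = answer({"A": "a1", "B": "?", "C": "*", "D": "?", "E": "*"})
print(result[("b2", "*")])  # 2
print(result[("*", "d1")])  # 2
print(result[("*", "*")])   # 3
```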

30. Online Query Computation: Sketch
(Figure: dimensions A, B, C, …, N, … feed an instantiated base table, from which the online cube is computed.)

31. Experiment: Size vs. Dimensionality (50 and 100 cardinality)
• (50-C): 10^6 tuples, 0 skew, 50 cardinality, fragment size 3
• (100-C): 10^6 tuples, 2 skew, 100 cardinality, fragment size 2

32. Experiments on Real World Data
• UCI Forest CoverType data set
  - 54 dimensions, 581K tuples
  - Shell fragments of size 2 took 33 seconds and 325MB to compute
  - 3-D subquery with 1 instantiated dimension: 85 ms to 1.4 sec
• Longitudinal Study of Vocational Rehab. Data
  - 24 dimensions, 8818 tuples
  - Shell fragments of size 3 took 0.9 seconds and 60MB to compute
  - 5-D query with 0 instantiated dimensions: 227 ms to 2.6 sec

33. Chapter 5: Data Cube Technology
• Data Cube Computation: Preliminary Concepts
• Data Cube Computation Methods
• Processing Advanced Queries by Exploring Data Cube Technology
  - Sampling Cube: X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08
• Multidimensional Data Analysis in Cube Space
• Summary

34. Statistical Surveys and OLAP
• Statistical survey: a popular tool to collect information about a population based on a sample
  - Ex.: TV ratings, US Census, election polls
• A common tool in politics, health, market research, science, and many more
• An efficient way of collecting information (data collection is expensive)
• Many statistical tools available to determine validity
  - Confidence intervals
  - Hypothesis tests
• OLAP (multidimensional analysis) on survey data
  - Highly desirable, but can it be done well?

35. Surveys: Sample vs. Whole Population
• Data is only a sample of the population
(Figure: a grid of Age (18, 19, 20, …) by Education (High-school, College, Graduate).)

36. Problems for Drilling in Sampling Cube
• OLAP on survey (i.e., sampling) data
• The semantics of the query are unchanged, but the input data is changed
• Data is only a sample of the population, and samples can be small when drilling down to certain multidimensional subspaces
(Figure: a grid of Age (18, 19, 20, …) by Education (High-school, College, Graduate).)

37. Challenges for OLAP on Sampling Data
• Q: What is the average income of 19-year-old high-school students?
• A: Returns not only the query result but also a confidence interval
• Computing confidence intervals in an OLAP context
• No data?
  - Not exactly: no data in subspaces of the cube
  - Sparse data
  - Causes include sampling bias and query selection bias
• Curse of dimensionality
  - Survey data can be high dimensional
  - Over 600 dimensions in a real-world example
  - Impossible to fully materialize

38. Confidence Interval
• Confidence interval for the sample mean: $\bar{x} \pm t_c \hat{\sigma}_{\bar{x}}$
  - $x$ is a sample of the data set; $\bar{x}$ is the mean of the sample
  - $t_c$ is the critical t-value, obtained by a look-up
  - $\hat{\sigma}_{\bar{x}}$ is the estimated standard error of the mean
• Example: $50,000 ± $3,000 with 95% confidence
  - Treat points in a cube cell as samples
  - Compute the confidence interval as for a traditional sample set
• Return the answer in the form of a confidence interval
  - Indicates the quality of the query answer
  - User selects the desired confidence interval
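The interval can be computed for the samples in one cell as follows. This is a minimal sketch: the function name and the income values are hypothetical, the standard error is taken as the sample standard deviation divided by the square root of the sample count, and the critical value t_c is passed in by the caller (4.303 is the two-sided 95% value for 2 degrees of freedom) rather than looked up in code.

```python
from math import sqrt
from statistics import mean, stdev


def confidence_interval(samples, t_c):
    """Return (low, high) = mean -/+ t_c * estimated standard error of the mean."""
    count = len(samples)
    x_bar = mean(samples)
    std_err = stdev(samples) / sqrt(count)  # sample std. dev. / sqrt(count)
    half_width = t_c * std_err
    return x_bar - half_width, x_bar + half_width


# Hypothetical incomes of three sampled people falling into one cube cell.
incomes = [40_000, 50_000, 60_000]
low, high = confidence_interval(incomes, t_c=4.303)
print(f"{(low + high) / 2:.0f} +/- {(high - low) / 2:.0f}")
```

With only three samples the interval is very wide, which is the situation the later query-expansion slides address.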

39. Efficiently Computing Confidence Interval Measures
• Efficient computation in all cells of the data cube
  - Both the mean and the confidence interval are algebraic
  - Why is the confidence interval measure algebraic? $\hat{\sigma}_{\bar{x}} = s / \sqrt{l}$ is algebraic, where both $s$ (the sample standard deviation) and $l$ (the count) are algebraic
• Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time
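One way to see why the measure is algebraic: the mean and the standard error can both be derived from three distributive quantities per cell (count, sum, and sum of squares), and those simply add when cells are merged into a more general cuboid. The sketch below illustrates that idea; the function names are made up here, and the slides do not prescribe this particular choice of sufficient statistics.

```python
from math import sqrt


def cell_stats(samples):
    """Distributive summary of one cell: (count, sum, sum of squares)."""
    return (len(samples), sum(samples), sum(x * x for x in samples))


def merge(stats_a, stats_b):
    """Summary of the union of two cells: add the three quantities."""
    return tuple(a + b for a, b in zip(stats_a, stats_b))


def mean_and_std_err(stats):
    """Mean and estimated standard error from a summary (needs count >= 2)."""
    count, total, squares = stats
    mean = total / count
    variance = (squares - total * total / count) / (count - 1)
    return mean, sqrt(max(variance, 0.0) / count)


# Two child cells roll up into one parent cell without rescanning samples.
parent = merge(cell_stats([40.0, 50.0]), cell_stats([60.0]))
print(parent)                    # (3, 150.0, 7700.0)
print(mean_and_std_err(parent))  # mean 50.0, standard error about 5.77
```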

40. Boosting Confidence by Query Expansion
• From the example: the queried cell “19-year-old college students” contains only 2 samples
• The confidence interval is large (i.e., low confidence). Why?
  - Small sample size
  - High standard deviation within the samples
• Small sample sizes can occur at relatively low-dimensional selections
  - Collect more data? Expensive!
  - Use data in other cells? Maybe, but one has to be careful

41. Query Expansion: Intra-Cuboid Expansion
• Intra-cuboid expansion
  - Combine other cells’ data into the queried cell to “boost” confidence
  - Only if the cells share semantic and cube similarity
  - Use only if necessary
  - A bigger sample size will narrow the confidence interval
• Cell segment similarity
  - Some dimensions are clear: Age
  - Some are fuzzy: Occupation
  - May need domain knowledge
• Cell value similarity
  - How to determine whether two cells’ samples come from the same population?
  - Two-sample t-test (confidence-based)
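A cell-value similarity check based on a two-sample t-test could look like the sketch below. The slides do not say which variant of the test is used; this example assumes Welch's unequal-variance statistic, takes the critical value t_c from the caller instead of computing it, and uses made-up sample values and function names.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's two-sample t statistic for the difference of means."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def similar(x, y, t_c):
    """Treat two cells as similar if |t| does not exceed the critical value."""
    return abs(welch_t(x, y)) <= t_c


cell_19 = [1.0, 2.0, 3.0]     # hypothetical samples in the queried cell
cell_18 = [1.5, 2.5, 3.5]     # hypothetical neighbour with a close mean
cell_20 = [11.0, 12.0, 13.0]  # hypothetical neighbour with a distant mean

# 2.776 is the two-sided 95% critical value for 4 degrees of freedom.
print(similar(cell_19, cell_18, t_c=2.776))  # True: candidate for expansion
print(similar(cell_19, cell_20, t_c=2.776))  # False: keep the cells apart
```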

42. Intra-Cuboid Expansion
• What is the average income of 19-year-old college students?
• Expand the query to include 18- and 20-year-olds?
• Or expand the query to include high-school and graduate students?
(Figure: a grid of Age (18, 19, 20, …) by Education (High-school, College, Graduate).)

43. Query Expansion: Inter-Cuboid Expansion
• If a query dimension is
  - Not correlated with the cube value
  - But is causing a small sample size by drilling down too much
• Remove the dimension (i.e., generalize to *) and move to a more general cuboid
• Can use a two-sample t-test to determine similarity between two cells across cuboids
• Can also use a different method, to be shown later

44. Chapter 5: Data Cube Technology
• Data Cube Computation: Preliminary Concepts
• Data Cube Computation Methods
• Processing Advanced Queries by Exploring Data Cube Technology
• Multidimensional Data Analysis in Cube Space
• Summary

45. Data Mining in Cube Space
• The data cube greatly increases the analysis bandwidth
• Four ways for OLAP-style analysis and data mining to interact
  - Using cube space to define the data space for mining
  - Using OLAP queries to generate features and targets for mining, e.g., multi-feature cube
  - Using data-mining models as building blocks in a multi-step mining process, e.g., prediction cube
  - Using data-cube computation techniques to speed up repeated model construction
    Cube-space data mining may require building a model for each candidate data space
    Sharing computation across model construction for different candidates may lead to efficient mining

46. Complex Aggregation at Multiple Granularities: Multi-Feature Cubes  Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities  Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum price tuples: select item, region, month, max(price), sum(R.sales) from purchases where year = 2010 cube by item, region, month: R such that R.price = max(price)  Continuing the last example, among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max price tuples
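The first multi-feature query above can be sketched in plain Python as follows. This is an illustration under assumptions: the function name `multi_feature_cube`, the record layout, and the sample purchases are made up, and a real system would evaluate the query inside the cube engine rather than by brute force.

```python
from itertools import combinations

DIMS = ("item", "region", "month")

def multi_feature_cube(purchases, year):
    """For every group-by over subsets of DIMS, return
    {cell: (max price, total sales of the tuples having that max price)}."""
    rows = [p for p in purchases if p["year"] == year]
    result = {}
    for k in range(len(DIMS) + 1):
        for group_dims in combinations(DIMS, k):
            groups = {}
            for p in rows:
                # Dimensions not grouped on are generalized to '*'.
                cell = tuple(p[d] if d in group_dims else "*" for d in DIMS)
                groups.setdefault(cell, []).append(p)
            for cell, members in groups.items():
                max_price = max(m["price"] for m in members)
                # Dependent aggregate: restricted to the max-price tuples (R).
                sales = sum(m["sales"] for m in members if m["price"] == max_price)
                result[cell] = (max_price, sales)
    return result

# Made-up sample data.
purchases = [
    {"item": "TV", "region": "East", "month": 1, "year": 2010, "price": 500, "sales": 3},
    {"item": "TV", "region": "East", "month": 2, "year": 2010, "price": 500, "sales": 4},
    {"item": "TV", "region": "West", "month": 1, "year": 2010, "price": 450, "sales": 10},
    {"item": "PC", "region": "East", "month": 1, "year": 2010, "price": 900, "sales": 2},
    {"item": "TV", "region": "East", "month": 1, "year": 2009, "price": 999, "sales": 1},
]

cube = multi_feature_cube(purchases, 2010)
print(cube[("TV", "*", "*")])  # max TV price in 2010 and sales among those tuples
```

The second aggregate depends on the first (it is computed only over tuples whose price equals the group maximum), which is what distinguishes a multi-feature cube from a plain group-by.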

47. Discovery-Driven Exploration of Data Cubes  Hypothesis-driven: exploration by user, huge search space  Discovery-driven (Sarawagi, et al.’98)  Effective navigation of large OLAP data cubes  Pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation  Exception: significantly different from the value anticipated, based on a statistical model  Visual cues such as background color are used to reflect the degree of exception of each cell

48. Kinds of Exceptions and their Computation  Parameters  SelfExp: surprise of cell relative to other cells at same level of aggregation  InExp: surprise beneath the cell  PathExp: surprise beneath cell for each drill-down path  Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction  Exceptions themselves can be stored, indexed and retrieved like precomputed aggregates

49. Examples: Discovery-Driven Data Cubes

50. Chapter 5: Data Cube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary

51. Data Cube Technology: Summary  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  MultiWay Array Aggregation  BUC  High-Dimensional OLAP with Shell-Fragments  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cubes  Ranking Cubes  Multidimensional Data Analysis in Cube Space  Discovery-Driven Exploration of Data Cubes  Multi-feature Cubes  Prediction Cubes

52. Ref. (I) Data Cube Computation Methods  S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96  D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97  K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD’99  M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB’98  J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29–54, 1997.  J. Han, J. Pei, G. Dong, and K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01  L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB’02  X. Li, J. Han, and H. Gonzalez. High-Dimensional OLAP: A Minimal Cubing Approach. VLDB’04  Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD’97  K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97  D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB’03  D. Xin, J. Han, Z. Shao, and H. Liu. C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking. ICDE’06

53. Ref. (II) Advanced Applications with Data Cubes  D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. VLDB’05  X. Li, J. Han, Z. Yin, J.-G. Lee, and Y. Sun. Sampling Cube: A Framework for Statistical OLAP over Sampling Data. SIGMOD’08  C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for multidimensional text database analysis. ICDM’08  D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses. SSTD’01  N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938–958, 2000.  T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09  T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized data cubes. SIGMOD’08  D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06  J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. CIKM’98  D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text databases. SDM’09

54. Ref. (III) Knowledge Discovery with Data Cubes  R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97  B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05  B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global aggregates from local regions. VLDB’06  Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB’02  G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB’01  R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural databases. PODS’05  J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998.  T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Mining & Knowledge Discovery, 6:219–258, 2002.  R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge Discovery, 15:29–54, 2007.  K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT’98  S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT’98  G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB’01

56. Unused Slides for this Class

57. Chapter 5: Data Cube Technology  Efficient Methods for Data Cube Computation  Preliminary Concepts and General Strategies for Cube Computation  Multiway Array Aggregation for Full Cube Computation  BUC: Computing Iceberg Cubes from the Apex Cuboid Downward  Precomputing Shell Fragments for Fast High-Dimensional OLAP  Data Cubes for Advanced Applications  Sampling Cubes: OLAP on Sampling Data  Ranking Cubes: Efficient Computation of Ranking Queries  Knowledge Discovery with Data Cubes  Discovery-Driven Exploration of Data Cubes  Complex Aggregation at Multiple Granularity: Multi-feature Cubes  Prediction Cubes: Data Mining in Multi-Dimensional Cube Space  Summary

58. H-Cubing: Using H-Tree Structure  Bottom-up computation  Exploring an H-tree structure  If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)  No simultaneous aggregation [Figure: lattice of cuboids over dimensions A, B, C, D, from all up to ABCD]

59. H-tree: A Prefix Hyper-tree
Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hhd       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
…      …     …         …        …     …
[Figure: H-tree. Root; level 1: edu, hhd, bus; level 2: Jan and Mar (under edu), Jan (under hhd), Feb (under bus); level 3: Tor, Van, Tor, Mon. Each leaf holds Quant-Info (Q.I.), e.g., Sum: 1765, Cnt: 2, bins. Header table columns: Attr. Val., Quant-Info, Side-link; entries Edu (Sum: 2285 …), Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …]
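A rough sketch of building an H-tree-like prefix tree from the five rows shown on the H-tree slide, using Price as the measure and the dimension order Cust_grp, Month, City. The dictionary layout, the function names, and the choice to keep quant-info on every node are assumptions made for illustration; side-links are omitted.

```python
from collections import defaultdict

# (month, city, cust_grp, prod, cost, price): the rows visible on the slide.
ROWS = [
    ("Jan", "Tor", "Edu", "Printer", 500, 485),
    ("Jan", "Tor", "Hhd", "TV", 800, 1200),
    ("Jan", "Tor", "Edu", "Camera", 1160, 1280),
    ("Feb", "Mon", "Bus", "Laptop", 1500, 2500),
    ("Mar", "Van", "Edu", "HD", 540, 520),
]

def new_node():
    return {"sum": 0, "cnt": 0, "children": {}}

def build_h_tree(rows):
    """Insert each tuple along the path cust_grp -> month -> city.
    Returns (root, header); header maps (attribute, value) to its quant-info."""
    root = new_node()
    header = defaultdict(lambda: {"sum": 0, "cnt": 0})
    for month, city, cust_grp, _prod, _cost, price in rows:
        node = root
        path = (("cust_grp", cust_grp), ("month", month), ("city", city))
        for attr, value in path:
            # Shared prefixes reuse the same node, which is what saves space.
            node = node["children"].setdefault(value, new_node())
            node["sum"] += price
            node["cnt"] += 1
            header[(attr, value)]["sum"] += price
            header[(attr, value)]["cnt"] += 1
    return root, header

root, header = build_h_tree(ROWS)
leaf = root["children"]["Edu"]["children"]["Jan"]["children"]["Tor"]
print(leaf["sum"], leaf["cnt"])          # quant-info of the (Edu, Jan, Tor) leaf
print(header[("cust_grp", "Edu")]["sum"])  # header-table entry for Edu
```

With only these five rows, the leaf for (Edu, Jan, Tor) and the header entry for Edu reproduce the sums printed on the slide (1765 with count 2, and 2285).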

60. Computing Cells Involving “City” [Figure: the H-tree and its header table; following the side-links of Tor yields a local header table H_Tor with entries Edu, Hhd, Bus, …, Jan, Feb, …]  From (*, *, Tor) to (*, Jan, Tor)

61. Computing Cells Involving Month But No City  1. Roll up quant-info  2. Compute cells involving month but no city [Figure: H-tree with quant-info rolled up from the city level (Tor., Van., Tor., Mont.) to the month level, beside the header table]  Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!

62. Computing Cells Involving Only Cust_grp  Check header table directly [Figure: H-tree and header table with entries Edu (Sum: 2285 …), Hhd, Bus, …, Jan, Feb, Mar, …, Tor, Van, Mon, …]

63. Data Cube Computation Methods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP

64. Star-Cubing: An Integrating Method  D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03  Explore shared dimensions  E.g., dimension A is the shared dimension of ACD and AD  ABD/AB means cuboid ABD has shared dimensions AB  Allows for shared computations  E.g., cuboid AB is computed simultaneously with ABD  Aggregate in a top-down manner but with the bottom-up sub-layer underneath, which allows Apriori pruning  Shared dimensions grow in bottom-up fashion [Figure: cuboid lattice annotated with shared dimensions, e.g., C/C, AC/AC, BC/BC, AD/A, BD/B, ABC/ABC, ABD/AB, ACD/A, ABCD/all]

65. Iceberg Pruning in Shared Dimensions  Anti-monotonic property of shared dimensions  If the measure is anti-monotonic, and if the aggregate value on a shared dimension does not satisfy the iceberg condition, then all the cells extended from this shared dimension cannot satisfy the condition either  Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning  Problem: how to prune while still aggregating simultaneously on multiple dimensions?

66. Cell Trees  Use a tree structure similar to H-tree to represent cuboids  Collapses common prefixes to save memory  Keep count at node  Traverse the tree to retrieve a particular tuple

67. Star Attributes and Star Nodes  Intuition: If a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish such values during the iceberg computation  E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3  Solution: Replace such attribute values by a *. Such attributes are star attributes, and the corresponding nodes in the cell tree are star nodes
A   B   C   D   Count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
a2  b3  c3  d4  1
a2  b4  c3  d4  1

68. Example: Star Reduction  Suppose min_sup = 2  Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *, and collapse all *’s together  The resulting table has all such attributes replaced with the star attribute  With regard to the iceberg computation, this new table is a lossless compression of the original table
After star replacement:
A   B   C   D   Count
a1  b1  *   *   1
a1  b1  *   *   1
a1  *   *   *   1
a2  *   c3  d4  1
a2  *   c3  d4  1
After collapsing identical rows:
A   B   C   D   Count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2
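The star-reduction example can be reproduced with a short sketch. The function name `star_reduce` and the tuple-based table layout are assumptions for illustration; the input rows are the five base tuples from the Star Attributes and Star Nodes slide.

```python
from collections import Counter

# Base table from the slides (each row has count 1).
BASE = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),
    ("a2", "b4", "c3", "d4"),
]

def star_reduce(rows, min_sup):
    """Replace values whose one-dimensional count is below min_sup with '*',
    then collapse identical rows and sum their counts."""
    n_dims = len(rows[0])
    # One-dimensional aggregation: a value counter per dimension.
    counts = [Counter(row[d] for row in rows) for d in range(n_dims)]
    reduced = Counter()
    for row in rows:
        starred = tuple(
            value if counts[d][value] >= min_sup else "*"
            for d, value in enumerate(row)
        )
        reduced[starred] += 1
    return dict(reduced)

compressed = star_reduce(BASE, min_sup=2)
for cell, count in compressed.items():
    print(cell, count)
```

With min_sup = 2 this yields the three rows of the compressed table: (a1, b1, *, *) with count 2, (a1, *, *, *) with count 1, and (a2, *, c3, d4) with count 2.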

69. Star Tree  Given the new compressed table, it is possible to construct the corresponding cell tree, called the star tree  Keep a star table at the side for easy lookup of star attributes  The star tree is a lossless compression of the original cell tree
A   B   C   D   Count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2

70. Star-Cubing Algorithm: DFS on Lattice Tree [Figure: cuboid lattice annotated with shared dimensions (e.g., ABD/AB, ACD/A, ABC/ABC) beside the base star tree (root: 5; a1: 3 with branches b*: 1, c*: 1, d*: 1 and b1: 2, c*: 2, d*: 2; a2: 2 with branch b*: 2, c3: 2, d4: 2) and the BCD tree (BCD: 5) with nodes numbered in traversal order]

71. Multi-Way Aggregation [Figure: base tree ABCD with child trees BCD, ACD/A, ABD/AB, ABC/ABC]

72. Star-Cubing Algorithm: DFS on Star-Tree

73. Multi-Way Star-Tree Aggregation  Start depth-first search at the root of the base star tree  At each new node in the DFS, create the corresponding star trees that are descendants of the current tree according to the integrated traversal ordering  E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created  When DFS reaches b*, the ABD/AB tree is created  The counts in the base tree are carried over to the new trees

74. Multi-Way Aggregation (2)  When DFS reaches a leaf node (e.g., d*), start backtracking  On every backtracking branch, the counts in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed  Example  When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed  When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed  When at b*, jump to b1 and repeat a similar process

75. Multidimensional Data Analysis in Cube Space  Prediction Cubes: Data Mining in Multi-Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes

76. Prediction Cubes  Prediction cube: A cube structure that stores prediction models in multidimensional data space and supports prediction in an OLAP manner  Prediction models are used as building blocks to define the interestingness of subsets of data, i.e., to answer which subsets of data indicate better prediction

77. How to Determine the Prediction Power of an Attribute?  Ex. A customer table D:  Two dimensions Z: Time (Month, Year) and Location (State, Country)  Two features X: Gender and Salary  One class-label attribute Y: Valued Customer  Q: “Are there times and locations in which the value of a customer depended greatly on the customer’s gender (i.e., Gender: predictiveness attribute V)?”  Idea:  Compute the difference between the model built using X to predict Y and the model built using X – V to predict Y  If the difference is large, V must play an important role in predicting Y
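The idea of comparing a model built on X with one built on X – V can be sketched for a single cube cell as below. Everything here is an assumption chosen for brevity: the "model" is a majority-vote lookup table, accuracy is measured on the training data itself, and the customer rows are made up. A real prediction cube would plug in an actual classifier and a held-out evaluation.

```python
from collections import Counter, defaultdict

def training_accuracy(rows, features, label):
    """Accuracy of a majority-vote lookup model keyed on the given features."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[f] for f in features)][row[label]] += 1
    # Each feature combination predicts its most frequent label.
    correct = sum(max(counter.values()) for counter in votes.values())
    return correct / len(rows)

def predictiveness(rows, features, v, label):
    """Accuracy with all features X minus accuracy with X - {v}."""
    reduced = [f for f in features if f != v]
    return (training_accuracy(rows, features, label)
            - training_accuracy(rows, reduced, label))

# Made-up customers falling into one (time, location) cell.
cell_rows = [
    {"gender": "F", "salary": "high", "valued": "yes"},
    {"gender": "F", "salary": "low", "valued": "yes"},
    {"gender": "M", "salary": "high", "valued": "no"},
    {"gender": "M", "salary": "low", "valued": "no"},
]

score = predictiveness(cell_rows, ["gender", "salary"], "gender", "valued")
print(score)  # a large value suggests gender matters in this cell
```

Repeating this per cell and per granularity is the naive materialization; the score would be stored as the cell's measure.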

78. Efficient Computation of Prediction Cubes  Naïve method: Fully materialize the prediction cube, i.e., exhaustively build models and evaluate them for each cell and for each granularity  Better approach: Explore score function decomposition that reduces prediction cube computation to data cube computation

79. Chapter 5: Data Cube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cube  Ranking Cube  Multidimensional Data Analysis in Cube Space  Summary

80. Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cube  X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08  Ranking Cube  D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06  Other advanced cubes for processing data and queries  Stream cube, spatial cube, multimedia cube, text cube, RFID cube, etc. (to be studied in volume 2)

81. Ranking Cubes: Efficient Computation of Ranking Queries  Data cube helps not only OLAP but also ranked search  (top-k) ranking query: only returns the best k results according to a user-specified preference, consisting of (1) a selection condition and (2) a ranking function  Ex.: Search for apartments with expected price 1000 and expected square feet 800: Select top 1 from Apartment where City = “LA” and Num_Bedroom = 2 order by [price – 1000]^2 + [sq feet – 800]^2 asc  Efficiency question: Can we search only what we need?  Build a ranking cube on both selection dimensions and ranking dimensions

82. Ranking Cube: Partition Data on Both Selection and Ranking Dimensions  One single data partition as the template  Slice the data partition by selection conditions [Figure: partition for all data; sliced partition for city = “LA”; sliced partition for BR = 2]

83. Materialize Ranking-Cube
tid  City  BR  Price  Sq feet  Block ID
t1   SEA   1   500    600      5
t2   CLE   2   700    800      5
t3   SEA   1   800    900      2
t4   CLE   3   1000   1000     6
t5   LA    1   1100   200      15
t6   LA    2   1200   500      11
t7   LA    2   1200   560      11
t8   CLE   3   1350   1120     4
Step 1: Partition data on ranking dimensions [Figure: grid of blocks numbered 1 to 16 over price and sq feet]
Step 2: Group data by selection dimensions: City (SEA, LA, CLE), BR (1, 2, 3, 4), City & BR
Step 3: Compute measures for each group. For the cell (LA): Block-level: {11, 15}; Data-level: {11: t6, t7; 15: t5}
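Steps 2 and 3 of the materialization can be sketched as follows. The block IDs are copied from the slide's table rather than recomputed from bin boundaries, and the function name `materialize` and the tuple layout are illustrative assumptions.

```python
from collections import defaultdict

# (tid, city, br, price, sq_feet, block_id): rows and block IDs as on the slide.
APARTMENTS = [
    ("t1", "SEA", 1, 500, 600, 5),
    ("t2", "CLE", 2, 700, 800, 5),
    ("t3", "SEA", 1, 800, 900, 2),
    ("t4", "CLE", 3, 1000, 1000, 6),
    ("t5", "LA", 1, 1100, 200, 15),
    ("t6", "LA", 2, 1200, 500, 11),
    ("t7", "LA", 2, 1200, 560, 11),
    ("t8", "CLE", 3, 1350, 1120, 4),
]
COLUMN = {"city": 1, "br": 2}

def materialize(rows, selection_dims):
    """Group tuples by the chosen selection dimensions.
    Each cell maps block_id -> list of tids (the data-level measure);
    the cell's keys form the block-level measure."""
    cuboid = defaultdict(lambda: defaultdict(list))
    for row in rows:
        cell = tuple(row[COLUMN[d]] for d in selection_dims)
        cuboid[cell][row[5]].append(row[0])
    return cuboid

city_cuboid = materialize(APARTMENTS, ["city"])
la = city_cuboid[("LA",)]
print(sorted(la))  # block-level measure for the cell (LA)
print(dict(la))    # data-level measure for the cell (LA)
```

Calling `materialize(APARTMENTS, ["city", "br"])` gives the City & BR cuboid in the same way.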

84. Search with Ranking-Cube: Simultaneously Push Selection and Ranking
Query: Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet – 800]^2 asc
Given the bin boundaries, locate the block with top score
Measure for LA: {11, 15}; {11: t6, t7; 15: t5}
Bin boundary for price: [500, 600, 800, 1100, 1350]
Bin boundary for sq feet: [200, 400, 600, 800, 1120]
[Figure: block grid with the query point at price = 1000, sq feet = 800; without the ranking cube the search starts at the query point, with the ranking cube it starts at blocks 11 and 15]

85. Processing Ranking Query: Execution Trace
Query: Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet – 800]^2 asc
f = [price – 1000]^2 + [sq feet – 800]^2
Measure for LA: {11, 15}; {11: t6, t7; 15: t5}
Bin boundary for price: [500, 600, 800, 1100, 1350]
Bin boundary for sq feet: [200, 400, 600, 800, 1120]
Execution trace:
1. Retrieve high-level measure for LA: {11, 15}
2. Estimate lower-bound score for blocks 11 and 15: f(block 11) = 40,000, f(block 15) = 160,000
3. Retrieve block 11
4. Retrieve low-level measure for block 11
5. f(t6) = 130,000, f(t7) = 97,600
Output t7, done!
[Figure: block grid with the query point at price = 1000, sq feet = 800; with the ranking cube the search starts at blocks 11 and 15]
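The execution trace can be followed with a small sketch. The block lower bounds (40,000 and 160,000) are taken from the trace as given rather than derived, and the names `score`, `top1`, `BLOCKS`, and `LOWER_BOUND` are illustrative assumptions.

```python
def score(price, sq_feet):
    """Ranking function f from the query (smaller is better)."""
    return (price - 1000) ** 2 + (sq_feet - 800) ** 2

# Data-level measure for the cell (LA): block -> {tid: (price, sq_feet)}.
BLOCKS = {
    11: {"t6": (1200, 500), "t7": (1200, 560)},
    15: {"t5": (1100, 200)},
}
# Lower-bound scores per block, as stated in step 2 of the trace.
LOWER_BOUND = {11: 40_000, 15: 160_000}

def top1(block_ids):
    """Visit blocks in order of lower bound; stop as soon as the best tuple
    found so far cannot be beaten by any remaining block."""
    best_tid, best_score, retrieved = None, float("inf"), []
    for block in sorted(block_ids, key=LOWER_BOUND.get):
        if best_score <= LOWER_BOUND[block]:
            break  # no tuple in this or later blocks can score lower
        retrieved.append(block)
        for tid, (price, sq_feet) in BLOCKS[block].items():
            s = score(price, sq_feet)
            if s < best_score:
                best_tid, best_score = tid, s
    return best_tid, best_score, retrieved

print(top1([11, 15]))  # best tuple, its score, and the blocks actually read
```

Block 15 is never read: after block 11 the best score is 97,600, which is already below block 15's lower bound of 160,000.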

86. Ranking Cube: Methodology and Extension  Ranking cube methodology  Push selection and ranking simultaneously  It works for many sophisticated ranking functions  How to support high-dimensional data?  Materialize only those atomic cuboids that contain single selection dimensions  Uses an idea similar to high-dimensional OLAP  Achieves low space overhead and high performance in answering ranking queries with a high number of selection dimensions

Editor's Notes

#3: Tower bridge of London
#7: 2*(2^{100}-1)-1, 1. Explanation: one cell, such as (a1, a2, …, a100), generates 2^100 - 1 aggregate cells, because choose(100, 1) + choose(100, 2) + ... + choose(100, 100) = 2^100 - 1. For the two-cell question, the two base cells generate 2*(2^100-1) - 1 distinct aggregate cells, because the apex cell (*, *, …, *) generated by (a1, a2, …, a100) and by (b1, b2, …, b100) is merged into one cell: (*, *, …, *): 2. Hence we have 2*(2^{100}-1)-1.
#8: If the two base cells were: {(a1, a2, a3 . . . , a100):10, (b1, b2, b3, . . . , b100):10}, the total # of non-base cells should be 2 * 2^{100} – 3. But for {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}, the total # of non-base cells should be 2 * 2^{100} – 6.
#74: Refer to the aggregation diagram.
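The counting arguments in notes #7 and #8 can be checked by brute force on a small number of dimensions. This sketch uses n = 4 instead of 100; the function names and the cell values are illustrative assumptions.

```python
from itertools import product

def aggregate_cells(base_cell):
    """All cells obtained by replacing at least one value with '*'."""
    options = [(value, "*") for value in base_cell]
    return {cell for cell in product(*options) if cell != tuple(base_cell)}

def distinct_aggregates(cell_a, cell_b):
    """Number of distinct aggregate (non-base) cells generated by two base cells."""
    return len(aggregate_cells(cell_a) | aggregate_cells(cell_b))

n = 4
a = ("a1", "a2", "a3", "a4")
b_disjoint = ("b1", "b2", "b3", "b4")
b_shared = ("a1", "a2", "b3", "b4")  # shares the first two values with a

print(len(aggregate_cells(a)))             # 2^n - 1
print(distinct_aggregates(a, b_disjoint))  # 2 * 2^n - 3
print(distinct_aggregates(a, b_shared))    # 2 * 2^n - 6
```

When the two base cells share their first two values, four aggregate cells coincide (the combinations of a1 or * and a2 or * with all other dimensions starred), which accounts for the drop from 2 * 2^n - 3 to 2 * 2^n - 6.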