TIS Notes
Disclaimer!!
This document is not intended to be a source of truth for preparing the exam. The materials included here are taken from the official course slides; I'm not responsible for wrong answers or errors reported here (even though I've done my best to provide valid material covering all the topics discussed in the course).
These notes are an extension of this document: Appunti Riassuntivi del Corso
Contents

1 Data Integration
   1.1 Introduction
      1.1.1 Interoperability
      1.1.2 4 Vs of Big Data in data integration
      1.1.3 Heterogeneity
   1.2 The Steps of Data Integration
      1.2.1 Design Steps for Data Integration
   1.3 Conflict Analysis
      1.3.1 Conflict Types
   1.4 Mapping between the Global Logical Schema and the Single Source Schemata (logical view definition)
   1.5 Inconsistencies in the Data
      1.5.1 Record Linkage (aka Entity Resolution)
      1.5.2 Data Fusion
   1.6 Data Model Heterogeneity

5 Data Warehouse
   5.1 What is a Data Warehouse
      5.1.1 Evolution of Data Warehouses
   5.2 Data Model for OLAP
      5.2.1 OLAP Operations
      5.2.2 OLAP LOGICAL MODELS
   5.3 Data Warehouse Design
      5.3.1 Aggregate Operators
      5.3.2 Star Schema vs. Snowflake Schema
      5.3.3 Conceptual Design

6 Big Data Architectures and Data Ethics
   6.1 NoSQL Databases
      6.1.1 Transactional Systems
      6.1.2 Big Data and the Cloud
      6.1.3 NoSQL databases
      6.1.4 NoSQL for Data Warehouses?
      6.1.5 Data Model
      6.1.6 CAP Theorem
   6.2 Data Ethics
      6.2.1 Ethical Dimensions

7 Data Quality
   7.1 Introduction
   7.2 Data Quality Management
      7.2.1 Quality Dimensions
      7.2.2 Assessment Techniques
      7.2.3 Analysis Techniques
      7.2.4 Data Quality Improvement
      7.2.5 Data Fusion
1 Data Integration
1.1 Introduction
Data Integration is the problem of combining data coming from different data sources, providing the user with a unified view of the data, detecting correspondences between similar concepts that come from different sources, and resolving possible conflicts.
The aim of Data Integration is to set up a system where it is possible to query different data sources as if they were a single one (through a global schema).
1.1.1 Interoperability
Data integration is needed because of the need for interoperability among software applications, services and information managed by different organizations, in order to:
• find information and processing tools, when they are needed, independently of physical lo-
cation
• understand and employ the discovered information and tools, no matter what platform sup-
ports them, whether local or remote
• evolve a processing environment for commercial use without being constrained to a single
vendor’s offerings.
1.1.2 4 Vs of Big Data in data integration
• Velocity (data in motion): as a direct consequence of the rate at which data is being collected and continuously made available, many of the data sources are very dynamic.
• Variety (data in many forms): data sources (even in the same domain) are extremely heterogeneous, both at the schema level and at the instance level.
• The Veracity Dimension: data quality is the most general and most widely used term, and covers a number of quality aspects besides veracity:
– Completeness (essential fields are present)
– Validity (soundness)
– Consistency (no contradiction)
– Timeliness (up-to-date)
– Accuracy (registered in an accurate way)
1.1.3 Heterogeneity
Heterogeneity derives from various forms of AUTONOMY.
• Design (representation) autonomy: each source designs its dataset in its own way, differently from the others
• Communication (querying) autonomy: each source decides the way in which its data can be queried
• Execution (algorithmic) autonomy: each DBMS has its way of extracting data
There is a need for interoperability among software applications, services and information
(databases, and others) managed by different organizations that need to reuse legacy applications,
existing data repositories (e.g., deep web), and reconcile the different points of view adopted by
the various players using the information.
VARIETY (HETEROGENEITY)
Variety among several data collections to be used together:
• Different platforms: technological heterogeneity
• Different data models at the participating DBMS: model heterogeneity
• Different query languages: language heterogeneity
• Different data schemas and different conceptual representations in DBs previ-
ously developed: schema heterogeneity
• Different values for the same info (due to errors or to different knowledge): instance
(semantic) heterogeneity
VERACITY
Main Data Quality dimensions:
• Completeness
• Validity
• Consistency
• Timeliness
• Accuracy
Figure 1: Materialized Integration (left): a physical view aggregating different sources into one single structure. Every X days we need to synchronize/update the materialized DB, which is a separate physical DB in the system (offline approach).
Virtual Integration (right): virtual view over different sources providing one single structure.
No need to install a new physical DB. The view is always up to date (online approach).
MATERIALIZED INTEGRATION
Large common stores came to be known as warehouses, and the software to access, scrape,
transform, and load data into warehouses, became known as extract, transform, and load (ETL)
systems.
In a dynamic environment, one must perform ETL periodically (e.g., once a day or once a week),
thereby building up a history of the enterprise. The main purpose of a data warehouse is
to allow systematic or ad-hoc data analysis and mining. Warehouses are instead not appropriate when one needs to integrate the operational systems (i.e., when data must be kept up-to-date).
VIRTUAL INTEGRATION
The virtual integration approach leaves the requested information in the local sources, so it always returns a fresh answer to the query. The query posed to the global schema is reformulated into the formats of the local information systems, and the retrieved information needs to be combined to answer the query.
RATIONALE
The conventional wisdom is to use data warehousing and ETL products to perform data
integration. However, there is a serious flaw in one aspect of this wisdom.
Suppose one wants to integrate current (operational) data rather than historical information.
Consider, for example, an e-commerce web site which wishes to sell hotel rooms over the web.
The actual inventory of available hotel rooms exists in 100 or so information systems since
Hilton, Hyatt, and Marriott all run their own reservation systems.
Applying ETL and warehouses to this problem will create a copy of hotel availability data, which
is quickly out of date. If a website sells a hotel room, based on this data, it has no way of
guaranteeing delivery of the room, because somebody else may have sold the room in the
meantime.
The simplest case of a non-centralized database is the DISTRIBUTED DB, where the distribution design decisions concern fragmentation (horizontal or vertical), allocation, and replication.
• Key conflicts: the same entity has different keys in different sources (e.g., in one source a person is identified by the SSN and in another by the email address).
1.4 Mapping between the Global Logical Schema and the Single
Source Schemata (logical view definition)
There are two approaches for schema mapping:
• GAV (Global As View): the global schema is derived from the integration of the data source schemata, and is thus expressed in terms of the data source schemata. This approach is appropriate when the data sources are stable; it is in fact difficult to extend the system with a new data source. Mapping quality depends on how well we have compiled the sources into the global schema through the mapping. Whenever a source changes or a new one is added, the global schema needs to be reconsidered.
• LAV (Local As View): the global schema is designed independently of the data source
schemata. The relationship (mapping) between sources and global schema is obtained by
defining each data source as a view over the global schema. This approach is appropriate if
the global schema is stable, and it favors extensibility. On the other hand, query processing is much more complex. Mapping quality depends on how well we have characterized the sources.
A third approach combines the previous two:
• GLAV (Global and Local As View): the relationship (mapping) between sources and
global schema is obtained by defining a set of views, some over the global schema and some
over the data sources.
The most useful integration operators to write relational GAV views are:
• union
• outerunion: used with sources having different schemas, putting null for the information we don't have; the result contains all the attributes of the two sources (without repetitions)
• outerjoin: it does not admit different values for the same attribute
• generalization: we keep only common attributes
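As a minimal illustration of these operators, the following Python sketch (relation and attribute names are invented for the example) computes an outer union and a generalization of two toy source relations represented as lists of dictionaries; a GAV mapping would define the global relation as a view of this kind.

# Toy source relations, represented as lists of dictionaries (invented data).
source_a = [{"ssn": "123", "name": "Ada", "dept": "R&D"}]
source_b = [{"ssn": "456", "name": "Bob", "email": "bob@example.org"}]

def outer_union(r1, r2):
    """All attributes of both sources; missing values are padded with None."""
    attrs = sorted(set().union(*(t.keys() for t in r1 + r2)))
    return [{a: t.get(a) for a in attrs} for t in r1 + r2]

def generalization(r1, r2):
    """Keep only the attributes common to both sources."""
    common = set(r1[0]) & set(r2[0])
    return [{a: t[a] for a in common} for t in r1 + r2]

# A GAV-style global relation defined as a view over the two sources:
for row in outer_union(source_a, source_b):
    print(row)                                  # missing attributes appear as None
print(generalization(source_a, source_b))       # only ssn and name survive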
A mapping defined over some data source is SOUND when it provides a subset of the data that
is available in the data source that corresponds to the definition.
A mapping is COMPLETE if it provides a superset of the available data in the data source
that corresponds to the definition.
A mapping is EXACT if it provides all and only data corresponding to the definition: it is both
sound and complete.
With the GAV approach, the mapping can be exact or only sound.
With the LAV approach, the mapping can be exact or only complete, due to the incompleteness
of one or more sources (they do not cover the data “expected” from the global schema, which has
been defined independently of the source contents).
• Data Fusion: once recognized that two items refer to the same entity, how do we reconcile
inconsistent information?
– Sequence-based:
∗ edit-distance: based on the minimal number of operations that are needed to
transform string a to string b
– Set-based:
∗ Jaccard: divide the strings into tokens, and compute the measure on the two sets of tokens (size of the intersection divided by the size of the union); a sketch of both edit distance and Jaccard follows this list
– Phonetic:
∗ Soundex: calculates a four-character code from a word based on the pronunciation
and considers two words as similar if their codes are equal. Similar sounding
letters are assigned the same soundex code.
• Record Matching:
– Rule-based matching: manually written rules that specify when two tuples match (e.g.,
two tuples refer to the same person if they have the same SSN)
– Learning Matching Rules:
∗ Supervised: learn how to match from training data, then apply it to match new
tuple pairs (requires a lot of training data)
∗ Unsupervised: clustering records based on similar values
– Probabilistic Matching: model the matching domain using a probability distribution.
Provides a principled framework that can naturally incorporate a variety of domain
knowledge. However, it is computationally expensive.
• Often the correct value may be obtained as a function of the original ones
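A minimal Python sketch of the sequence-based and set-based measures above (edit distance and Jaccard); the example strings and the whitespace tokenization are illustrative assumptions.

def edit_distance(a: str, b: str) -> int:
    """Minimal number of insert/delete/substitute operations turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (0 if equal)
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Tokenize on whitespace and compare the two token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

print(edit_distance("Marriot", "Marriott"))      # 1: one insertion
print(jaccard("John A. Smith", "Smith John"))    # 2 shared tokens out of 3 -> about 0.67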
3. Choice of the target logical data model and translation of the global conceptual schema
4. Definition of the language translation (wrapping)
5. Definition of the data views.
2 Semistructured Data Integration
For semistructured data there is some form of structure, but it is not as prescriptive, regular and
complete as in traditional DBMSs.
• text
• trees
• graphs
They are all different and do not lend themselves to easy integration.
The goal is to integrate data with different structures, including semistructured data, as if they were all structured. An overall data representation should be progressively built, as we discover and explore new information sources.
2.1 Mediators
A mediator has the same purpose as an integration system. Mediators are interfaces specialized in a certain domain which stand between applications and wrappers. They accept queries written in the application's language, decompose them and send them to each specific wrapper. They also send the responses back to the application, providing a unified view of the data.
The term mediation includes:
2.1.1 Tsimmis
TSIMMIS was the first system based on the mediator/wrapper paradigm, proposed in the 1990s at Stanford.
In this system:
• unique, graph-based data model
• data model managed by the mediator
2.2 Ontologies
2.2.1 Problem in information extraction from HTML docs
• Web sites change very frequently
• a layout change may affect the extraction rules
• human-based maintenance of an ad-hoc wrapper is very expensive → automatic wrapper generation is better, especially when many pages share the same structure or when pages are dynamically generated from a DB.
We can only use automatic wrapper generation when pages are regular to some extent.
An ontology is commonly defined as a formal, explicit specification of a shared conceptualization:
• Formal specification: allows the use of a common vocabulary for automatic knowledge sharing.
• Shared: an ontology captures knowledge which is common.
• Conceptualization: gives a unique meaning to the terms that define the knowledge about a given domain.
Ontology Types:
• Taxonomic ontologies: definition of concepts through terms, their hierarchical
organization, and additional (predefined) relationships
• Descriptive ontologies: definition of concepts through data structures and their
interrelationships. Used for alignment of existing data structures or to design new
specialized ontologies (domain ontologies). Descriptive ontologies require rich models to
enable representations close to human perception.
An ontology consists of:
• concepts: generic concepts (express general world categories), specific concepts (describe a
particular application domain, domain ontologies)
• concept definition (in natural language or via a formal language)
• relationships between concepts: taxonomies, user-defined associations, synonymies,...
An ontology is composed of:
• a T-Box: contains all the concept and role definitions, and also all the axioms of our logical theory (e.g., ”A father is a Man with a Child”).
• an A-Box: contains all the basic assertions of the logical theory (e.g., ”Tom is a father” is
represented as Father(Tom)).
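The toy Python sketch below mirrors the example above: the T-Box axiom "a Father is a Man with a Child" is encoded as a check over an A-Box of assertions. It is only meant to illustrate the T-Box/A-Box distinction, not to stand in for a description-logic reasoner; the individuals are invented.

# A-Box: basic assertions about individuals (invented example data).
abox = {
    "Man": {"Tom", "Bob"},
    "hasChild": {("Tom", "Ann")},   # role assertions as (parent, child) pairs
}

# T-Box axiom: Father is defined as a Man with at least one child.
def is_father(individual: str) -> bool:
    has_child = any(parent == individual for parent, _ in abox["hasChild"])
    return individual in abox["Man"] and has_child

print(is_father("Tom"))   # True: Father(Tom) follows from the assertions
print(is_father("Bob"))   # False: no hasChild assertion for Bob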
2.3 Semantic Web
The Semantic Web is a vision for the future of the Web in which information is given explicit
meaning, making it easier for machines to automatically process and integrate information
available on the Web. It is built on XML’s ability to define customized tagging schemes and
RDF’s flexible approach to representing data.
Linked Data consists of connecting datasets across the Web. It describes a method of
publishing structured data so that it can be interlinked and become more useful. It builds upon
standard Web technologies such as HTTP, RDF and URIs, but extends them to share
information in a way that can be read automatically by computers, enabling data from different sources to be connected and queried.
• XML: provides a syntax for structured documents, but imposes no semantic constraints on
the meaning of these documents.
• XML Schema: is a language for restricting the structure of XML documents and also
extends XML with data types.
• RDF: is a data model for objects (resources) and relations between them, provides a simple
semantic for this data model, and can be represented in an XML syntax.
• RDF Schema: is a vocabulary for describing properties and classes of RDF resources
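As a small illustration of the RDF data model, the sketch below stores resources and relations as subject-predicate-object triples and answers a basic pattern query; the URIs are invented placeholders, and a real application would rather use an RDF library and SPARQL.

# RDF-style triples: (subject, predicate, object); the URIs are invented placeholders.
EX = "http://example.org/"
triples = [
    (EX + "Tom", EX + "type", EX + "Person"),
    (EX + "Tom", EX + "hasChild", EX + "Ann"),
    (EX + "Ann", EX + "type", EX + "Person"),
]

def match(pattern):
    """Return the triples matching a pattern; None acts as a wildcard (like a query variable)."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Who are Tom's children?"
print(match((EX + "Tom", EX + "hasChild", None)))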
OWL adds more vocabulary for describing properties and classes. The OWL (Web Ontology
Language) is designed for use by applications that need to process the content of information
instead of just presenting information to humans. It facilitates greater machine interpretability of
Web content than that supported by XML, RDF and RDF Schema by providing additional
vocabulary along with a formal semantics.
It has three increasingly expressive sublanguages:
• OWL Lite: supports users primarily needing a classification hierarchy and simple
constraints. It has a lower formal complexity than OWL DL.
• OWL DL: supports users who want maximum expressiveness while all conclusions are
guaranteed to be computed (computational completeness) and all computations will finish
in a finite time (decidability).
• OWL Full: meant for users who want maximum expressiveness and the syntactic freedom
of RDF: no computational guarantees.
Services for the T-Box:
• Consistency: verifies that there exists at least one interpretation which satisfies the given
Tbox
• Local Satisfiability: verifies for a given concept that there exists at least one
interpretation in which it is true.
Services for the A-Box:
• Consistency: verifies that an A-Box is consistent w.r.t a given Tbox
• Discovery of equivalent concepts (mapping): what does equivalent mean? (we look for some
kind of similarity)
• Formal representation of these mappings: how are these mapping represented?
• Reasoning of these mappings: how do we use the mappings within our reasoning and
query-answering process?
Ontology matching is the process of finding pairs of resources coming from different ontologies
which can be considered equal in meaning. We need some kind of similarity measure, this time
taking into account also semantics (i.e., not only the structure of words).
As already seen, similarity is strictly related to distance. Three main categories:
Ontology mapping is the process of relating similar concepts or relations of two or more
information sources using equivalence relations or order relations.
• Model coverage and granularity: a mismatch in the part of the domain that is covered by
the ontology, or the level of detail to which that domain is modeled (e.g., ontology for all
the animals vs. ontology for birds)
• Paradigm: different paradigms can be used to represent concepts such as time (e.g., temporal representations → continuous intervals vs. discrete sets of time points)
• Encoding
• Concept description: a distinction between two classes can be modeled using a qualifying attribute or by introducing a separate class, or by the way in which the is-a hierarchy is built.
2.5.1 How can ontologies support integration?
An ontology can be a schema integration support tool. Ontologies are used to represent
the semantics of schema elements (if the schema exists).
An ontology can be used instead of a global schema:
• schema-level representation only in terms of ontologies
• ontology mapping, merging etc. instead of schema integration
An ontology can be a support for content interpretation and wrapping (e.g., of HTML pages).
An ontology can be a mediation support tool for content inconsistency detection and resolution (record linkage and data fusion).
When we use ontologies to interact with databases we have to take care of:
• Transformation of the ontological query into the language of the data source, and vice versa
• Different semantics (CWA vs. OWA)
• What has to be processed and where (e.g., push of the relational operators to the relational
engine)
3 New trends in Data Integration Research and
Development
3.1 Uncertainty in Data Integration
Databases are assumed to represent certain data (a tuple in the database is true). But real life is
not as certain. Uncertain databases attempt to model uncertain data and to answer queries in an
uncertain world.
Moreover:
• Data itself may be uncertain (e.g. extracted from an unreliable source);
• Mappings might be approximate
• Reconciliation is approximate
• Imprecise queries are approximate
Whatever the semantics of uncertainty (e.g., fuzzy or probabilistic), an uncertain database
describes a set of possible worlds.
Example: assign each tuple a probability. Then the probability of a possible world is the product
of the probabilities for the tuples.
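A minimal sketch of this example under the usual tuple-independence assumption: each tuple carries its own probability (invented values below), a possible world is a subset of the tuples, and its probability multiplies the probabilities of the tuples it contains and the complements of those it leaves out.

from itertools import combinations

# Uncertain relation: tuple -> probability that the tuple is true (invented values).
tuples = {("Hilton", "Milan"): 0.9, ("Hyatt", "Rome"): 0.4}

def world_probability(world):
    """Probability of the possible world containing exactly the given tuples."""
    p = 1.0
    for t, prob in tuples.items():
        p *= prob if t in world else (1 - prob)
    return p

# Enumerate all possible worlds (all subsets of the uncertain tuples).
worlds = [set(c) for r in range(len(tuples) + 1) for c in combinations(tuples, r)]
for w in worlds:
    print(sorted(w), round(world_probability(w), 3))
# The probabilities of the four worlds sum to 1.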
3.2 Building Large-Scale Structured Web Databases
Develop methodologies for designing a fully integrated database coming from heterogeneous data
sources.
3.2.2 Mashup
Mashup is a paradigm for lightweight integration.
It is an application that integrates two or more mashup components at any of the application
layers (data, application logic, presentation layer) possibly putting them in communication with
each other.
Key elements:
• Mashup component: any piece of data, application logic and/or user interface that can be reused and that is accessible either locally or remotely (e.g., Craigslist and Google Maps)
• Mashup logic: is the internal logic of operation of a mashup; it specifies the invocation of
components, the control flow, the data flow, the data transformations and the UI of the
mashup.
Mashups introduce integration at the presentation layer and typically focus on
non-mission-critical applications.
Problems:
Mashup development is not easy. Luckily, mashups typically work on the ”surface”.
• Reuse of existing components
• Composition of the outputs of software systems
TYPES OF MASHUPS
• Data mashups: retrieve data from different resources, process them and return an
integrated result set. Data mashups are a Web-based, lightweight form of data integration,
intended to solve different problems.
• Logic mashups: integrate functionality published by logic or data components. The
output is a process that manages the components, in turn published as a logical component.
• User interface mashups: combine the component’s native UIs into an integrated UI. The
output is a Web application the user can interact with. It is mostly client-side, generally
short-lived.
• Hybrid mashups: span multiple layers of the application stack, bringing together different types of components inside one and the same application.
• Schema mapping
• Uniform search over multiple types of data
• Combining structured, semi-structured and unstructured data
• Approximate query processing
4 Data Analysis and Exploration
4.1 Analysis of Data
Data Analysis is a process of inspecting, cleaning, transforming and modeling data with the
goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Data mining is a particular data analysis technique that focuses on modeling and knowledge
discovery for predictive rather than purely descriptive purposes.
Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed. A computer program is said to learn from experience E w.r.t. some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
• Summary statistics: are numbers that summarize properties of the data, such as
frequency (the frequency of an attribute value is the percentage of times the value occurs in
the data set). Most summary statistics can be calculated in a single pass through the data.
– Mean: the most common measure of the location of a set of points. However, it is very sensitive to outliers.
– Median: also commonly used (the 50th percentile)
– Range: the difference between the max and the min
– Variance and Standard Deviation: most common measures of the spread of a set of
points. This is also sensitive to outliers.
– For continuous data, it is useful to know the notion of percentile. Given a continuous attribute x and a number p between 0 and 100, the p-th percentile is a value x_p of x such that p% of the observed values of x are less than x_p (a sketch of these statistics follows this list).
• Visualization: is the conversion of data into a visual or tabular format so that the
characteristics of the data and the relationships among data items or attributes can be
analyzed or reported.
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Selection is the elimination of certain objects and attributes. It may involve choosing a
subset of attributes.
– Dimensionality Reduction is often used to reduce the number of dimensions to two or
three
Visualization Techniques:
– Histograms:
∗ usually shows the distribution of values of a single variable
∗ divide the values into bins and show a bar plot of the number of objects in each bin
∗ the height of each bar indicates the number of objects
– Box Plots
∗ displays the distribution of data (over percentiles)
∗ can be used to compare attributes
• Online Analytical Processing (OLAP)
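The sketch referenced in the summary-statistics bullet above: the basic statistics computed with the Python standard library on an invented list of observations, plus a simple nearest-rank percentile.

import statistics

values = [12, 15, 15, 18, 22, 22, 22, 95]      # invented observations; 95 is an outlier

mean = statistics.mean(values)                  # 27.625: pulled up by the outlier
median = statistics.median(values)              # 20.0: robust 50th percentile
value_range = max(values) - min(values)         # 83
stdev = statistics.pstdev(values)               # population standard deviation (also outlier-sensitive)

def percentile(xs, p):
    """p-th percentile via a simple nearest-rank rule: roughly p% of the observations fall below it."""
    xs = sorted(xs)
    k = max(0, min(len(xs) - 1, round(p / 100 * len(xs)) - 1))
    return xs[k]

print(mean, median, value_range, round(stdev, 2), percentile(values, 90))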
4.3.1 Methods
• Classification (predictive): given a collection of records, each record contains a set of
attributes, one of which is the class. The goal is to find a model for the ”class” attribute as
a function of the values of other attributes, in order to assign a class to a previously unseen
record as accurately as possible. The accuracy of the model is then evaluated over a set of
unseen records called the test set.
(Examples: predict fraudulent cases in credit card transactions)
• Clustering (descriptive): tries to divide data points into clusters such that data points in
one cluster are more similar to one another and data points in separate clusters are less
similar to one another.
(Examples: market segmentation into distinct subsets of customers, find group of
documents that are similar to each other based on the important terms appearing in them).
• Frequent Itemset:
– an itemset is a collection of one or more items (e.g., Milk, Bread, Diaper)
– the support count is the number of occurrences of an item set in a list of transactions
(e.g., 2)
– the support is a fraction of transactions that contain an item set (e.g., 2/5)
– a frequent itemset is an itemset whose support is greater than or equal to a minsup threshold (see the sketch after this list).
• Association Rule An implication expression of the form X → Y, where X and Y are
itemsets.
• Association Rule Discovery (descriptive): given a set of records each of which contains
some number of items from a given collection, produce dependency rules which will predict
the occurrence of an item based on the occurrences of other items (e.g., if a customer buys diapers
and milk, then she is very likely to buy beer). (Examples: marketing and sales promotions).
• Sequential pattern discovery (descriptive):
• Data Quality
• Data Distribution
• Privacy Detection
• Streaming Data
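The sketch referenced in the frequent-itemset bullet above: support and confidence computed over an invented list of market-basket transactions; the minsup threshold is illustrative.

transactions = [                       # invented market-basket transactions
    {"Milk", "Bread", "Diaper"},
    {"Bread", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper", "Beer"},
    {"Bread", "Diaper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    count = sum(1 for t in transactions if itemset <= t)   # support count
    return count / len(transactions)

minsup = 0.5                                               # illustrative threshold
itemset = {"Milk", "Diaper"}
print(support(itemset, transactions))                      # 3/5 = 0.6 -> frequent w.r.t. minsup

# Confidence of the rule {Milk, Diaper} -> {Beer}: support(X union Y) / support(X)
print(support({"Milk", "Diaper", "Beer"}, transactions) / support(itemset, transactions))   # 0.4 / 0.6, about 0.67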
5 Data Warehouse
5.1 What is a Data Warehouse
• As a dataset: decision support database maintained separately from the organization’s
operational database. A Data Warehouse is a single, complete and consistent store of data
obtained from a variety of different sources made available to end users, so that they can
understand and use it in a business context.
• As a process: technique for assembling data from various sources with the purpose of
answering business questions. A Data Warehouse is a process for transforming data into
information and for making it available to users in a timely enough manner to make a
difference.
A data warehouse is a
• subject-oriented,
• integrated,
• time-varying,
• non-volatile
collection of data that is used primarily in organizational decision making.
Data Warehouses (DWs) are very large databases (from Terabytes, 10^12 bytes, to Zettabytes, 10^21 bytes).
5.1.1 Evolution of Data Warehouses
• Offline DW: periodically updated from data in the operational systems and the DW data
are stored in a data structure designed to facilitate reporting
• Online DW: data in the warehouse is updated for every transaction performed on the source data (e.g., by triggers)
• Integrated DW: data assembled from different data sources, so users can look up the
information they need across other systems.
The dimensional fact model allows one to describe a set of fact schemata.
The components of a fact schema are:
• Facts:
• Dimensions:
– a dimension is a fact property defined w.r.t a finite domain
– it describes an analysis coordinate for the fact.
• Dimension Hierarchy: is a directed tree whose nodes are dimensional attributes and whose edges model many-to-one associations between them.
– Store chain:
∗ Fact: sales
∗ Measures: sold quantity, gross income
∗ Dimensions: product, time, zone
• Drill-down: de-aggregates data at the lower level (e.g., for a given product category and a
given region, show daily sales)
• Slice-and-dice: applies selections and projections, which reduce data dimensionality
• Pivoting: selects two dimensions to re-aggregate data
5.2.2 OLAP LOGICAL MODELS
• MOLAP (Multidimensional On-Line Analytical Processing):
– stores data by using a multidimensional data structure
– the storage is not in the relational database, but in proprietary formats
– Advantages:
∗ Excellent performance, fast data retrieval, optimal for slicing and dicing operations
∗ Can perform complex calculations (pre-generated when the cube is created)
– Disadvantages:
∗ Limited in the amount of data it can handle (calculations are performed when the
cube is built)
∗ Requires additional investment: cube technologies are often proprietary and may not already exist in the organization (human and capital resources are needed)
• ROLAP (Relational On-Line Analytical Processing):
– Uses the relational data model to represent multidimensional data.
– Advantages:
∗ Can handle large amounts of data
∗ Can leverage functionalities inherent in the relational databases
– Disadvantages:
∗ Performance can be slow because ROLAP report is essentially a SQL query (or
multiple queries) in the relational database (query time is long if data size is large)
∗ Limited by SQL functionalities
• HOLAP (Hybrid On-Line Analytical Processing):
– combines the advantages of MOLAP and ROLAP.
– for summary-type information it leverages cube technology for faster performance
– when detail information is needed it can ”drill through” from the cube into the
underlying relational data
• WITH CUBE:
– generates a result set that shows aggregates for all combinations of values in the
selected columns
– evaluates the aggregate expressions for all possible combinations of the columns specified in the GROUP BY clause
• WITH ROLLUP:
– generates a result set that shows aggregates for a hierarchy of values in the selected
columns
– evaluates the aggregate expressions only for prefixes of the column list specified in the GROUP BY clause
– compared with CUBE, it omits the combinations in which a column is rolled up (shown as ALL) while a column that follows it in the list is not
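To make the difference concrete, the Python sketch below (invented sales rows) sums a measure over every combination of the grouping columns, which mirrors WITH CUBE, and only over prefixes of the column list, which mirrors WITH ROLLUP; ALL marks a rolled-up column.

from itertools import combinations
from collections import defaultdict

# Invented sales rows: (product, zone, quantity).
rows = [("Brillo", "North", 10), ("Brillo", "South", 5), ("Soap", "North", 7)]
group_cols = ("product", "zone")

def aggregate(keep):
    """Sum the quantity, grouping only on the columns in `keep`; the others collapse to 'ALL'."""
    totals = defaultdict(int)
    for product, zone, qty in rows:
        key = (product if "product" in keep else "ALL",
               zone if "zone" in keep else "ALL")
        totals[key] += qty
    return dict(totals)

# WITH CUBE: aggregates for every subset of the grouping columns.
cube = [aggregate(set(c)) for r in range(len(group_cols) + 1)
        for c in combinations(group_cols, r)]

# WITH ROLLUP: aggregates only for prefixes of the column list: (), (product,), (product, zone).
rollup = [aggregate(set(group_cols[:r])) for r in range(len(group_cols) + 1)]

print(cube)     # 4 groupings, including the (ALL, zone) subtotals
print(rollup)   # 3 groupings: the (ALL, zone) subtotals are omitted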
5.3 Data Warehouse Design
A primary event is an occurrence of a fact. It is represented by means of a tuple of values (e.g.,
On 10/10/2001, ten ”Brillo” detergent packets were sold at the BigShop for a total amount of 25
euros).
A hierarchy describes how it is possible to group and select primary events. The root of a
hierarchy corresponds to the primary event, and represents the finest aggregation granularity.
Given a set of dimensional attributes, each tuple of their values identifies a secondary event
that aggregates (all) the corresponding primary events.
For example the sales can be grouped by Product and Month: ”in October 2001, 230 ”Brillo”
detergent packets were sold at the BigShop for a total amount of 575 euros”.
Non-additive measures:
– e.g., the unit price at a given instant, the exchange rate, the interest rate
– they are evaluated at particular time instants; being relative measures, they can be aggregated with AVG, MIN and MAX, but not with SUM
– the unit price at a given instant cannot be aggregated by sum over the category/type or the shop/city hierarchies, nor over the time hierarchy
5.3.1 Aggregate Operators
• Distributive operator: aggregate data can be computed starting from partially aggregated data (e.g., sum, max, min)
• Algebraic operator: requires further information to aggregate data (e.g., avg, which needs the partial sums and counts)
• Holistic operator: it is not possible to obtain aggregate data starting from partially aggregated data (e.g., mode, median)
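A short Python sketch of why SUM is distributive while AVG is algebraic: the global average cannot be derived from the partial averages alone, but it can be derived from the partial (sum, count) pairs. The partition values are invented.

partitions = [[10, 20, 30], [40, 50]]      # invented partial data (e.g., two shops)

# Distributive (SUM): the global sum is just the sum of the partial sums.
partial_sums = [sum(p) for p in partitions]
total_sum = sum(partial_sums)                                             # 150

# Algebraic (AVG): the partial averages alone are not enough...
wrong_avg = sum(sum(p) / len(p) for p in partitions) / len(partitions)    # 32.5, not the true average

# ...but the partial (sum, count) pairs are.
partial = [(sum(p), len(p)) for p in partitions]
global_avg = sum(s for s, _ in partial) / sum(c for _, c in partial)      # 30.0

print(total_sum, wrong_avg, global_avg)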
SNOWFLAKE SCHEMA
The Snowflake Schema reduces the de-normalization of the dimensional tables DTi of a Star
Schema (removal of some transitive dependencies), which avoids wasting space.
Dimension tables of a Snowflake schema are composed of:
• A primary key d_{i,j}
• A subset of the DT_i attributes that directly depend on d_{i,j}
• Zero or more foreign keys that allow the entire information to be reconstructed
In a Snowflake Schema:
• Primary dimension tables: their keys are imported in the fact table
• Secondary dimension table
Benefits:
• Reduction of memory space
• New surrogate keys
• Advantages in the execution of queries on attributes contained in the fact and primary dimension tables
5.3.3 Conceptual Design
Conceptual design takes into account the documentation related to the reconciled database.
1. Fact definition: facts correspond to events that dynamically happen in the organization
2. For each fact:
(a) Design of the attribute tree
(b) Attribute tree editing: used to remove irrelevant attributes
• pruning: the subtree rooted in v is deleted. Delete a leaf node or a leaf subtree.
• grafting: the children of v are directly connected to the father of v. Delete an
internal node, and attach its children to the parent of the internal node.
(c) Dimension definition: dimensions can be chosen among the attributes that are
children of the root of the tree (time should always be a dimension)
(d) Measure definition: if the fact identifier is included in the set of dimensions, then
numerical attributes that are children of the root (fact) are measures. More measures
are defined by applying aggregate functions to numerical attributes of the tree.
(e) Fact schema creation: the attribute tree is translated into a fact schema including
dimensions and measures. The fact name corresponds to the name of the selected
entity, dimension hierarchies correspond to subtrees having as roots the different
dimensions (with the least granularity)
In the glossary, an expression is associated with each measure. The expression describes how we
obtain the measure at the different levels of aggregation starting from the attributes of the source
schema.
6 Big Data Architectures and Data Ethics
6.1 NoSQL Databases
6.1.1 Transactional Systems
ACID properties are a consistency model for data in transactions that is common in traditional relational databases. They guarantee safe operations on data at any time.
The ACID acronym stands for:
• Atomicity: a transaction is an indivisible unit of execution
• Consistency: the execution of a transaction must not violate the integrity constraints
defined on the database
• Isolation: the execution of a transaction is not affected by the execution of other
concurrent transactions
• Durability (Persistence): the effects of a successful transaction must be permanent.
The classical DBMSs (also distributed) are transactional systems: they provide a mechanism for
the definition and execution of transactions. In the execution of a transaction the ACID
properties must be guaranteed. A transaction represents the typical elementary unit of work of a
Database Server, performed by an application.
Because ACID properties are not really required in certain domains, new DBMSs have been proposed that are not transactional systems.
6.1.4 NoSQL for Data Warehouses?
– Ability to query: get or filter only on row and column keys
– Semi-structured schema:
∗ Row: columns indexed within each row by a row-key
∗ Column-family: a set of columns, normally similar in structure to optimize
compaction
∗ Columns in the same column family will be ”close” (stored in the same block on disk)
∗ Columns: have a name and may contain a value for each row
• Graph-based
FAMOUS IMPLEMENTATIONS
• Amazon DynamoDB:
– Key-value
– CAP: AP, guarantees Availability and Partition tolerance, relaxing consistency
• Google BigTable:
– Column-oriented
– CAP: CP; if there is a network partition, Availability is lost, since strict consistency may be required
• Cassandra:
– Column-oriented
– CAP: AP, consistency is configurable
• MongoDB:
– Document-based
– CAP: CP
6.2 Data Ethics
As data have an impact on almost every aspect of our lives, it is more and more important to
understand the nature of this effect. With search and recommendation engines, the web can influence our lives.
Big Data processing is based on algorithms, and thus it is expected to be objective. Unfortunately:
• algorithms are based on data, and data may contain errors
Data quality is a typical ethical requirement: we could never trust a piece of information if it
did not have the typical data quality properties.
Data should conform to a high ethical standard, for it to be considered of good quality.
Hence, the satisfaction of the ethical requirements is actually necessary to assert the quality of a
result. It is the responsibility of the system designer and of the person/company that ordered the
job, to ensure that the necessary ethical properties are satisfied.
7 Data Quality
7.1 Introduction
Data preparation is important because real-world data is often incomplete, inconsistent, and contains many errors. Data preparation, cleaning, and transformation comprise the majority of the work in a data mining application (around 90%).
Data Quality is the ability of a data collection to meet user requirements. Causes of poor
quality:
• Historical changes: the importance of data might change over time
• Data usage: data relevance depends on the process in which data are used
7.2.1 Quality Dimensions
The most used objective dimensions are:
• accuracy: the extent to which data are correct, reliable and certified.
• completeness: the degree to which a given data collection includes the data describing the
corresponding set of real-world objects.
• consistency: the satisfaction of semantic rules defined over a set of data items; it is assessed with respect to the violations of such rules.
• timeliness: the extent to which data are sufficiently up-to-date for a task. It can be measured as the average age of the data in a source.
Data Quality Rules are the requirements that businesses set on their data; they are associated with the data quality dimensions and are designed to check the validity of data.
• Plotting Data
DATA CLEANING
Data Cleaning is the process of identifying and eliminating inconsistencies, discrepancies and
errors in data in order to improve quality.
Cleaning tasks
• Standardization/normalization:
– Datatype conversion
– Discretization
– Domain Specific
• Missing Values:
– Detection
– Imputing
• Outlier Detection:
– Model
– Distance
• Duplicate detection:
– discovery of multiple representations of the same real-world object and merging.
Similarity measures:
• String-Based Distance Functions
– Jaccard distances (intersection of two sets of words divided by the union of them)
– Cosine similarity
In order to check the existence of duplicates we should compare all pairs of instances. Too many
comparisons → (n^2 − n)/2.
Partitioning is the solution to this problem: records are partitioned through a selection, and comparisons are performed only among pairs of records inside the same partition.
Example: Sorted neighborhood
1. Creation of key: compute a key for each record by extracting relevant fields
2. Sort data: using the key
3. Merge data: move a fixed size window through the sequential list of records. This limits the
comparisons to the records in the window.
4. Compare data: records inside the window are compared using a rule and a similarity function
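A minimal Python sketch of this method, reducing the number of comparisons by only matching records that fall inside a sliding window over the key-sorted list. The records, the key construction, the window size and the Jaccard threshold are all illustrative assumptions.

records = [   # invented person records
    {"name": "John Smith", "city": "Milan"},
    {"name": "Jon Smith", "city": "Milano"},
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "A. Lovelace", "city": "London"},
]

def key(r):
    """Step 1: build a key from relevant fields (first letters of surname and city)."""
    return r["name"].split()[-1][:3].lower() + r["city"][:3].lower()

def similar(r1, r2, threshold=0.2):
    """Step 4: compare two records with a token-based Jaccard similarity."""
    t1 = set((r1["name"] + " " + r1["city"]).lower().split())
    t2 = set((r2["name"] + " " + r2["city"]).lower().split())
    return len(t1 & t2) / len(t1 | t2) >= threshold

# Steps 2-3: sort by key, then slide a fixed-size window over the sorted list,
# comparing each record only with the next (window - 1) records.
window = 2
ordered = sorted(records, key=key)
candidates = [(a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:i + window]]
print([(a["name"], b["name"]) for a, b in candidates if similar(a, b)])

A larger window finds more duplicates at the cost of more comparisons, which is the usual trade-off of this blocking strategy.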
2. Entity reconciliation
3. Data fusion
Data Fusion resolves uncertainties and contradictions. Given duplicate records (previously
identified), it creates a single object representation while resolving conflicting data values. The tuples to be fused can be:
• complementary tuples
• identical tuples
• subsumed tuples
• conflicting tuples
DATA FUSION ANSWERS
The result of a query to an integrated information system is called ”answer”.
Properties of an answer:
• Complete: the answer should contain all the objects and attributes that have been present
in the sources
• Concise: all the objects and attributes are described only once
• Consistent: all the tuples that are consistent w.r.t. a specified set of integrity constraints
are present
• Complete and concise: it additionally fulfills a key constraint on some real-world ID (it contains all the attributes from the sources and combines semantically equivalent ones into only one attribute)