Unit-5: Database System Concepts, 6 Ed

The document discusses different types of database systems including centralized, client-server, and parallel systems. Centralized systems involve a single computer while client-server systems separate the front-end and back-end functionality. Parallel database systems utilize multiple processors and disks to improve performance.

Uploaded by

Sujy Cau


UNIT- 5

Database System Concepts, 6th Ed.

Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use

Centralized Systems
Run on a single computer system and do not interact with other computer systems.
General-purpose computer system: one to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory.
Single-user system (e.g., personal computer or workstation): desktop unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serves a large number of users who are connected to the system via terminals. Often called server systems.

Database System Concepts - 6th Edition

22.2

Silberschatz, Korth and Sudarshan

A Centralized Computer System


Client-Server Systems
Server systems satisfy requests generated at m client systems, whose general structure is shown below:

[Figure: general structure of a client-server system]

Client-Server Systems (Cont.)

Database functionality can be divided into:
  Back-end: manages access structures, query evaluation and optimization, concurrency control and recovery.
  Front-end: consists of tools such as forms, report-writers, and graphical user interface facilities.
The interface between the front-end and the back-end is through SQL or through an application program interface.

Client-Server Systems (Cont.)

Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines:
  better functionality for the cost
  flexibility in locating resources and expanding facilities
  better user interfaces
  easier maintenance

Server System Architecture

Server systems can be broadly categorized into two kinds:
  transaction servers, which are widely used in relational database systems, and
  data servers, used in object-oriented database systems

Transaction Servers
Also called query server systems or SQL server systems
  Clients send requests to the server
  Transactions are executed at the server
  Results are shipped back to the client.
Requests are specified in SQL, and communicated to the server through a remote procedure call (RPC) mechanism.
Transactional RPC allows many RPC calls to form a transaction.
Open Database Connectivity (ODBC) is a C language application program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.
JDBC standard is similar to ODBC, for Java
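The request/result cycle described for transaction servers can be sketched with Python's standard DB-API, which plays a role loosely comparable to ODBC or JDBC. This is only an illustration: the in-memory SQLite database stands in for a remote server, and the account table and the run_transaction helper are hypothetical names, not part of the slides.

```python
import sqlite3

def run_transaction(conn, statements):
    """Send several SQL requests as one transaction: commit all or none."""
    try:
        cur = conn.cursor()
        for sql, params in statements:
            cur.execute(sql, params)
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # undo every statement of the failed transaction
        raise

# Assumption: an in-memory database replaces the remote server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.commit()

run_transaction(conn, [
    ("INSERT INTO account VALUES (?, ?)", (1, 100)),
    ("INSERT INTO account VALUES (?, ?)", (2, 50)),
])

# Results are shipped back to the client as rows.
rows = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
```

Grouping the statements in one helper mirrors the idea that several calls can form a single transaction; a real client would connect to a server through a driver rather than an in-process library.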

Transaction Server Process Structure

A typical transaction server consists of multiple processes accessing data in shared memory.
Server processes
  These receive user queries (transactions), execute them and send results back
  Processes may be multithreaded, allowing a single process to execute several user queries concurrently
  Typically multiple multithreaded server processes
Lock manager process
  More on this later
Database writer process
  Output modified buffer blocks to disks continually

Transaction Server Processes (Cont.)

Log writer process
  Server processes simply add log records to log record buffer
  Log writer process outputs log records to stable storage.
Checkpoint process
  Performs periodic checkpoints
Process monitor process
  Monitors other processes, and takes recovery actions if any of the other processes fail
  E.g., aborting any transactions being executed by a server process and restarting it

Transaction System Processes (Cont.)


Transaction System Processes (Cont.)

Shared memory contains shared data
  Buffer pool
  Lock table
  Log buffer
  Cached query plans (reused if same query submitted again)
All database processes can access shared memory
To ensure that no two processes are accessing the same data structure at the same time, database systems implement mutual exclusion using either
  Operating system semaphores
  Atomic instructions such as test-and-set
To avoid overhead of interprocess communication for lock request/grant, each database process operates directly on the lock table instead of sending requests to lock manager process
  Lock manager process still used for deadlock detection
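A minimal sketch of mutual exclusion on a shared lock table follows. It uses threads and threading.Lock as stand-ins for database processes and an operating system semaphore; the LockTable class, the item name "page-7", and the worker function are hypothetical and only illustrate the idea of operating directly on the table inside a short critical section.

```python
import threading

class LockTable:
    """Toy lock table shared by worker threads (stand-ins for database processes)."""

    def __init__(self):
        self._mutex = threading.Lock()  # protects the table structure itself
        self._owners = {}               # data item -> owning transaction id

    def try_lock(self, item, txn):
        """Grant an exclusive lock on item to txn if the item is free."""
        with self._mutex:               # short critical section
            if item in self._owners and self._owners[item] != txn:
                return False
            self._owners[item] = txn
            return True

    def unlock(self, item, txn):
        """Release the lock on item if txn holds it."""
        with self._mutex:
            if self._owners.get(item) == txn:
                del self._owners[item]

table = LockTable()
granted = []

def worker(txn):
    # Each worker updates the table directly; no request is sent to a manager.
    if table.try_lock("page-7", txn):
        granted.append(txn)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one of the eight workers obtains the lock. Deadlock detection is deliberately left out, matching the slide's note that a separate lock manager process still handles it.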

Data Servers
Used in high-speed LANs, in cases where
  The clients are comparable in processing power to the server
  The tasks to be executed are compute intensive.
Data are shipped to clients where processing is performed, and then results are shipped back to the server.
This architecture requires full back-end functionality at the clients.
Used in many object-oriented database systems
Issues:
  Page-Shipping versus Item-Shipping
  Locking
  Data Caching
  Lock Caching

Data Servers (Cont.)

Page-shipping versus item-shipping
  Smaller unit of shipping means more messages
  Worth prefetching related items along with requested item
  Page shipping can be thought of as a form of prefetching
Locking
  Overhead of requesting and getting locks from server is high due to message delays
  Can grant locks on requested and prefetched items; with page shipping, transaction is granted lock on whole page.
  Locks on a prefetched item can be called back by the server, and returned by client transaction if the prefetched item has not been used.
  Locks on the page can be deescalated to locks on items in the page when there are lock conflicts. Locks on unused items can then be returned to server.

Data Servers (Cont.)

Data Caching
  Data can be cached at client even in between transactions
  But check that data is up-to-date before it is used (cache coherency)
  Check can be done when requesting lock on data item
Lock Caching
  Locks can be retained by client system even in between transactions
  Transactions can acquire cached locks locally, without contacting server
  Server calls back locks from clients when it receives conflicting lock request. Client returns lock once no local transaction is using it.
  Similar to deescalation, but across transactions.

Parallel Systems
Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network.
A coarse-grain parallel machine consists of a small number of powerful processors
A massively parallel or fine-grain parallel machine utilizes thousands of smaller processors.
Two main performance measures:
  throughput --- the number of tasks that can be completed in a given time interval
  response time --- the amount of time it takes to complete a single task from the time it is submitted

Speed-Up and Scale-Up

Speedup: a fixed-sized problem executing on a small system is given to a system which is N-times larger.
  Measured by: speedup = (small system elapsed time) / (large system elapsed time)
  Speedup is linear if equation equals N.
Scaleup: increase the size of both the problem and the system
  N-times larger system used to perform N-times larger job
  Measured by: scaleup = (small system small problem elapsed time) / (big system big problem elapsed time)
  Scaleup is linear if equation equals 1.
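The two ratios can be written as a small sketch. The function names and the elapsed times (in seconds) are made-up illustrations, not measurements from the slides.

```python
def speedup(small_system_time, large_system_time):
    """Same problem on both systems: elapsed time on small / elapsed time on large."""
    return small_system_time / large_system_time

def scaleup(small_problem_small_system_time, big_problem_big_system_time):
    """Problem and system both grow N-fold: small-case time / big-case time."""
    return small_problem_small_system_time / big_problem_big_system_time

def is_linear_speedup(value, n, tol=1e-9):
    """Speedup is linear when it equals N."""
    return abs(value - n) <= tol

def is_linear_scaleup(value, tol=1e-9):
    """Scaleup is linear when it equals 1."""
    return abs(value - 1.0) <= tol

# Hypothetical numbers: a job takes 120 s on 1 node and 40 s on 4 nodes.
s = speedup(120.0, 40.0)    # 3.0, sublinear for N = 4
# Hypothetical numbers: a 4x larger job on the 4-node system takes 150 s.
c = scaleup(120.0, 150.0)   # 0.8, sublinear
```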

Speedup


Scaleup


Batch and Transaction Scaleup

Batch scaleup:
  A single large job; typical of most decision support queries and scientific simulation.
  Use an N-times larger computer on N-times larger problem.
Transaction scaleup:
  Numerous small queries submitted by independent users to a shared database; typical transaction processing and timesharing systems.
  N-times as many users submitting requests (hence, N-times as many requests) to an N-times larger database, on an N-times larger computer.
  Well-suited to parallel execution.

Factors Limiting Speedup and Scaleup

Speedup and scaleup are often sublinear due to:
Startup costs: Cost of starting up multiple processes may dominate computation time, if the degree of parallelism is high.
Interference: Processes accessing shared resources (e.g., system bus, disks, or locks) compete with each other, thus spending time waiting on other processes, rather than performing useful work.
Skew: Increasing the degree of parallelism increases the variance in service times of tasks executing in parallel. Overall execution time is determined by the slowest of the tasks executing in parallel.
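The effect of skew and startup cost can be illustrated with a toy model in which parallel tasks start together and the job finishes when the slowest task does. The parallel_elapsed function and the task times are hypothetical; interference is not modelled.

```python
def parallel_elapsed(task_times, startup_cost=0.0):
    """Elapsed time of tasks run in parallel: startup plus the slowest task."""
    return startup_cost + max(task_times)

serial_time = 100.0                      # total work, in arbitrary time units
balanced = [25.0, 25.0, 25.0, 25.0]      # work split evenly over 4 processors
skewed = [10.0, 10.0, 10.0, 70.0]        # same total work, unevenly split

balanced_speedup = serial_time / parallel_elapsed(balanced)          # 4.0
skewed_speedup = serial_time / parallel_elapsed(skewed)              # about 1.43
with_startup = serial_time / parallel_elapsed(balanced, startup_cost=25.0)  # 2.0
```

Both splits do the same total work, but the skewed split is limited by its 70-unit task, so four processors give well under a 2x speedup.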

Parallel Database Architectures

Shared memory -- processors share a common memory
Shared disk -- processors share a common disk
Shared nothing -- processors share neither a common memory nor common disk
Hierarchical -- hybrid of the above architectures

Parallel Database Architectures


Shared Memory
Processors and disks have access to a common memory, typically via a bus or through an interconnection network.
Extremely efficient communication between processors: data in shared memory can be accessed by any processor without having to move it using software.
Downside: architecture is not scalable beyond 32 or 64 processors since the bus or the interconnection network becomes a bottleneck
Widely used for lower degrees of parallelism (4 to 8).

Shared Disk
All processors can directly access all disks via an interconnection network, but the processors have private memories.
  The memory bus is not a bottleneck
  Architecture provides a degree of fault-tolerance: if a processor fails, the other processors can take over its tasks since the database is resident on disks that are accessible from all processors.
Examples: IBM Sysplex and DEC clusters (now part of Compaq) running Rdb (now Oracle Rdb) were early commercial users
Downside: bottleneck now occurs at interconnection to the disk subsystem.
Shared-disk systems can scale to a somewhat larger number of processors, but communication between processors is slower.

Shared Nothing
Node consists of a processor, memory, and one or more disks. Processors at one node communicate with another processor at another node using an interconnection network. A node functions as the server for the data on the disk or disks the node owns.
Examples: Teradata, Tandem, Oracle-nCUBE
Data accessed from local disks (and local memory accesses) do not pass through interconnection network, thereby minimizing the interference of resource sharing.
Shared-nothing multiprocessors can be scaled up to thousands of processors without interference.
Main drawback: cost of communication and non-local disk access; sending data involves software interaction at both ends.

Hierarchical
Combines characteristics of shared-memory, shared-disk, and shared-nothing architectures.
Top level is a shared-nothing architecture: nodes are connected by an interconnection network, and do not share disks or memory with each other.
Each node of the system could be a shared-memory system with a few processors.
Alternatively, each node could be a shared-disk system, and each of the systems sharing a set of disks could be a shared-memory system.
Reduce the complexity of programming such systems by distributed virtual-memory architectures
  Also called non-uniform memory architecture (NUMA)

Hybrid architecture
Hybrid architecture includes:
  Non-Uniform Memory Architecture (NUMA), which involves Non-Uniform Memory Access.
  Cluster (shared nothing + shared disk: SAN/NAS), which is formed by a group of connected computers.
Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than nonlocal memory (memory local to another processor or memory shared between processors).
NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures.

XML
Using an example, explain the distinction between an attribute and a subelement. Explain the purpose and use of namespaces.
Give the DTD for an XML representation of the following nested-relational schema.
  Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))
  Children = (name, Birthday)
  Birthday = (day, month, year)
  Skills = (type, ExamsSet setof(Exams))
  Exams = (year, city)
Explain the limitations of DTD. Describe the alternative to overcome these limitations.

Introduction
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML
Documents have tags giving extra information about sections of the document
  E.g. <title> XML </title> <slide> Introduction </slide>
Extensible, unlike HTML
  Users can add new tags, and separately specify how the tag should be handled for display

Comparison with Relational Data

Inefficient: tags, which in effect represent schema information, are repeated
Better than relational tuples as a data-exchange format
  Unlike relational tuples, XML data is self-documenting due to presence of tags
  Non-rigid format: tags can be added
  Allows nested structures
  Wide acceptance, not only in database systems, but also in browsers, tools, and applications

Structure of XML Data

Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with matching </tagname>
Elements must be properly nested
  Proper nesting
    <course> <title> ... </title> </course>
  Improper nesting
    <course> <title> ... </course> </title>
  Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.
Every document must have a single top-level element
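The nesting and single-root rules can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the is_well_formed helper and the sample strings (including the title text "Databases") are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a single, properly nested XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<course> <title> Databases </title> </course>"
improper = "<course> <title> Databases </course> </title>"  # tags cross
two_roots = "<course/> <course/>"                            # no single top-level element

results = {
    "proper": is_well_formed(proper),
    "improper": is_well_formed(improper),
    "two_roots": is_well_formed(two_roots),
}
```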

Structure of XML Data (Cont.)

Mixture of text with sub-elements is legal in XML.
  Example:
    <course>
      This course is being offered for the first time in 2009.
      <course_id> BIO-399 </course_id>
      <title> Computational Biology </title>
      <dept_name> Biology </dept_name>
      <credits> 3 </credits>
    </course>
  Useful for document markup, but discouraged for data representation

Attributes
Elements can have attributes
  <course course_id="CS-101">
    <title> Intro. to Computer Science </title>
    <dept_name> Comp. Sci. </dept_name>
    <credits> 4 </credits>
  </course>
Attributes are specified by name=value pairs inside the starting tag of an element
An element may have several attributes, but each attribute name can only occur once
  <course course_id="CS-101" credits="4">
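A short sketch of how a program reads an attribute versus a subelement, using Python's standard xml.etree.ElementTree on the course example. The variable names are illustrative, and the last lines rely on the parser rejecting a repeated attribute name.

```python
import xml.etree.ElementTree as ET

doc = """
<course course_id="CS-101">
  <title> Intro. to Computer Science </title>
  <dept_name> Comp. Sci. </dept_name>
  <credits> 4 </credits>
</course>
"""

course = ET.fromstring(doc)
course_id = course.get("course_id")           # attribute: read from the start tag
title = course.findtext("title").strip()      # subelement: read from child content
credits = int(course.findtext("credits"))     # all values arrive as strings

# Each attribute name may occur only once in a start tag.
try:
    ET.fromstring('<course course_id="CS-101" course_id="CS-102"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
```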

Attributes vs. Subelements

Distinction between subelement and attribute
  In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
  In the context of data representation, the difference is unclear and may be confusing
    Same information can be represented in two ways
Suggestion: use attributes for identifiers of elements, and use subelements for contents

Namespaces
XML data has to be exchanged between organizations
Same tag name may have different meaning in different organizations, causing confusion on exchanged documents
Specifying a unique string as an element name avoids confusion
Better solution: use unique-name:element-name
Avoid using long unique names all over document by using XML Namespaces
  <university xmlns:yale="http://www.yale.edu">
    <yale:course>
      <yale:course_id> CS-101 </yale:course_id>
      <yale:title> Intro. to Computer Science </yale:title>
      <yale:dept_name> Comp. Sci. </yale:dept_name>
      <yale:credits> 4 </yale:credits>
    </yale:course>
  </university>
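As a sketch of how a parser treats the prefix, Python's standard xml.etree.ElementTree expands yale:course to a name qualified by the namespace URI. The prefix "y" used for lookup below is an arbitrary local choice, which shows that only the URI identifies the namespace.

```python
import xml.etree.ElementTree as ET

doc = """
<university xmlns:yale="http://www.yale.edu">
  <yale:course>
    <yale:course_id> CS-101 </yale:course_id>
    <yale:title> Intro. to Computer Science </yale:title>
  </yale:course>
</university>
"""

root = ET.fromstring(doc)
ns = {"y": "http://www.yale.edu"}     # local prefix is arbitrary; the URI matters

course = root.find("y:course", ns)
course_id = course.findtext("y:course_id", namespaces=ns).strip()
expanded_tag = course.tag             # the parser stores the URI-qualified name
unqualified = root.find("course")     # no match: the element is in a namespace
```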

XML Document Schema

Database schemas constrain what information can be stored, and the data types of stored values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
  Otherwise, a site cannot automatically interpret data received from another site
Two mechanisms for specifying XML schema
  Document Type Definition (DTD)
    Widely used
  XML Schema
    Newer, increasing use

Document Type Definition (DTD)

The type of an XML document can be specified using a DTD
DTD constrains structure of XML data
  What elements can occur
  What attributes can/must an element have
  What subelements can/must occur inside each element, and how many times.
DTD does not constrain data types
  All values represented as strings in XML
DTD syntax
  <!ELEMENT element (subelements-specification) >
  <!ATTLIST element (attributes) >

Element Specification in DTD

Subelements can be specified as
  names of elements, or
  #PCDATA (parsed character data), i.e., character strings
  EMPTY (no subelements) or ANY (anything can be a subelement)
Example
  <!ELEMENT department (dept_name, building, budget)>
  <!ELEMENT dept_name (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
Subelement specification may have regular expressions
  <!ELEMENT university ( ( department | course | instructor | teaches )+)>
  Notation:
    | - alternatives
    + - 1 or more occurrences
    * - 0 or more occurrences

University DTD
<!DOCTYPE university [
  <!ELEMENT university ( (department|course|instructor|teaches)+)>
  <!ELEMENT department ( dept_name, building, budget)>
  <!ELEMENT course ( course_id, title, dept_name, credits)>
  <!ELEMENT instructor (IID, name, dept_name, salary)>
  <!ELEMENT teaches (IID, course_id)>
  <!ELEMENT dept_name ( #PCDATA )>
  <!ELEMENT building ( #PCDATA )>
  <!ELEMENT budget ( #PCDATA )>
  <!ELEMENT course_id ( #PCDATA )>
  <!ELEMENT title ( #PCDATA )>
  <!ELEMENT credits ( #PCDATA )>
  <!ELEMENT IID ( #PCDATA )>
  <!ELEMENT name ( #PCDATA )>
  <!ELEMENT salary ( #PCDATA )>
]>

Attribute Specification in DTD

Attribute specification: for each attribute
  Name
  Type of attribute
    CDATA
    ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
      more on this later
  Whether
    mandatory (#REQUIRED)
    has a default value (value),
    or neither (#IMPLIED)
Examples
  <!ATTLIST course course_id CDATA #REQUIRED>, or
  <!ATTLIST course
      course_id ID #REQUIRED
      dept_name IDREF #REQUIRED
      instructors IDREFS #IMPLIED >

IDs and IDREFs

An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must be distinct
  Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element in the same document
An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document
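The ID and IDREF rules can be sketched as a hand-written check. Python's standard library does not validate documents against a DTD, so the attribute-to-type tables below are hard-coded assumptions modelled on the university example, and the sample document and IID values are made up.

```python
import xml.etree.ElementTree as ET

# Assumed attribute types, keyed by element name (not read from a DTD).
ID_ATTRS = {"department": "dept_name", "course": "course_id", "instructor": "IID"}
IDREF_ATTRS = {"course": ["dept_name"], "instructor": ["dept_name"]}
IDREFS_ATTRS = {"course": ["instructors"]}

def check_references(root):
    """Return problems found: duplicate IDs and dangling IDREF/IDREFS values."""
    ids, problems = set(), []
    for elem in root.iter():                      # pass 1: collect ID values
        attr = ID_ATTRS.get(elem.tag)
        if attr and attr in elem.attrib:
            value = elem.attrib[attr]
            if value in ids:
                problems.append(f"duplicate ID {value}")
            ids.add(value)
    for elem in root.iter():                      # pass 2: resolve references
        refs = [elem.attrib[a] for a in IDREF_ATTRS.get(elem.tag, []) if a in elem.attrib]
        for a in IDREFS_ATTRS.get(elem.tag, []):
            refs.extend(elem.attrib.get(a, "").split())  # space-separated ID list
        for ref in refs:
            if ref not in ids:
                problems.append(f"dangling reference {ref}")
    return problems

doc = """
<university>
  <department dept_name="CS"/>
  <course course_id="CS-101" dept_name="CS" instructors="I10101 I83821"/>
  <instructor IID="I10101" dept_name="CS"/>
</university>
"""
problems = check_references(ET.fromstring(doc))
```

Note that this check, like a DTD, does not look at what kind of element an ID belongs to: a course could reference another course in its instructors attribute without any problem being reported.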

University DTD with Attributes

University DTD with ID and IDREF attribute types.

<!DOCTYPE university-3 [
  <!ELEMENT university ( (department|course|instructor)+)>
  <!ELEMENT department ( building, budget )>
  <!ATTLIST department
      dept_name ID #REQUIRED >
  <!ELEMENT course (title, credits )>
  <!ATTLIST course
      course_id ID #REQUIRED
      dept_name IDREF #REQUIRED
      instructors IDREFS #IMPLIED >
  <!ELEMENT instructor ( name, salary )>
  <!ATTLIST instructor
      IID ID #REQUIRED
      dept_name IDREF #REQUIRED >
  ... declarations for title, credits, building, budget, name and salary ...
]>

Limitations of DTDs
No typing of text elements and attributes
  All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of subelements
  Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)
  (A | B)* allows specification of an unordered set, but
    Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
  The instructors attribute of a course may contain a reference to another course, which is meaningless
    instructors attribute should ideally be constrained to refer to instructor elements

XML Schema
XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports
  Typing of values
    E.g. integer, string, etc
    Also, constraints on min/max values
  User-defined, complex types
  Many more features, including uniqueness and foreign key constraints, inheritance
XML Schema is itself specified in XML syntax, unlike DTDs
  More-standard representation, but verbose
XML Schema is integrated with namespaces
BUT: XML Schema is significantly more complicated than DTDs.

Decision Support Systems

Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
Examples of business decisions:
  What items to stock?
  What insurance premium to charge?
  To whom to send advertisements?
Examples of data used for making decisions
  Retail sales transaction details
  Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview

Data analysis tasks are simplified by specialized tools and SQL extensions
  Example tasks
    For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year
    As above, for each product category and each customer category
Statistical analysis packages (e.g., S++) can be interfaced with databases
  Statistical analysis is a large field, but not covered here
Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  Data may also be purchased externally
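The first example task (total sales per product category and region, compared with the same quarter of the previous year) can be sketched with a plain GROUP BY query. The sales table, its columns, the quarter labels, and the figures are all hypothetical; an in-memory SQLite database is used only so the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category TEXT, region TEXT, quarter TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("books", "east", "2013Q3", 100),
    ("books", "east", "2013Q3", 50),
    ("books", "west", "2013Q3", 70),
    ("toys",  "east", "2013Q3", 30),
    ("books", "east", "2012Q3", 120),
])

# One row per (category, region), with this year's and last year's totals side by side.
totals = conn.execute("""
    SELECT category, region,
           SUM(CASE WHEN quarter = '2013Q3' THEN amount ELSE 0 END) AS this_year,
           SUM(CASE WHEN quarter = '2012Q3' THEN amount ELSE 0 END) AS last_year
    FROM sales
    GROUP BY category, region
    ORDER BY category, region
""").fetchall()
```

Grouping by customer category instead would only change the GROUP BY columns, which is the kind of repetitive variation that specialized tools and SQL extensions aim to simplify.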

Data Warehousing
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational data, including historical data
A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  Greatly simplifies querying, permits study of historical trends
  Shifts decision support query load away from transaction processing systems

Data Warehousing


Design Issues
When and how to gather data
  Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  Destination driven architecture: warehouse periodically requests new information from data sources
  Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    Usually OK to have slightly out-of-date data at warehouse
    Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
  Schema integration

More Warehouse Design Issues

Data cleansing
  E.g., correct mistakes in addresses (misspellings, zip code errors)
  Merge address lists from different sources and purge duplicates
How to propagate updates
  Warehouse schema may be a (materialized) view of schema from data sources
What data to summarize
  Raw data may be too large to store on-line
  Aggregate values (totals/subtotals) often suffice
  Queries on raw data can often be transformed by query optimizer to use aggregate values

Why Data Mining?

The Explosive Growth of Data
  Data collection and data availability
    Automated data collection tools, database systems, Web, computerized society
  Major sources of abundant data
    Business: Web, e-commerce, transactions, stocks, ...
    Science: Remote sensing, bioinformatics, scientific simulation, ...
    Society and everyone: news, digital cameras, ...
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Why Data Mining? Potential Applications

Data analysis and decision support
  Market analysis and management
    Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and detection of unusual patterns (outliers)
Other Applications
  Text mining (news group, email, documents) and Web mining
  Stream data mining
  Bioinformatics and bio-data analysis

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: stages of the KDD process, from source to result: Databases; Data Cleaning and Data Integration; Data Warehouse; Data Selection and Data Preprocessing; Task-relevant Data; Data Mining; Pattern Evaluation]

Steps of a KDD Process

Learning the application domain:
  relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
  Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
  summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
  visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

Data Mining Functionalities

General functionality
  Descriptive data mining
  Predictive data mining
Different views lead to different classifications
  Data view: Kinds of data to be mined
  Knowledge view: Kinds of knowledge to be discovered
  Method view: Kinds of techniques utilized
  Application view: Kinds of applications adapted

Data Mining Functionalities

Multidimensional concept description: Characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association analysis
  Diaper → Beer [0.5%, 75%] (Correlation or causality?)
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  Predict some unknown or missing numerical values
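The bracketed pair on an association rule such as the diaper and beer example is conventionally read as [support, confidence]. A minimal sketch of both measures follows, assuming support is the fraction of all transactions containing both sides and confidence is the fraction of antecedent transactions that also contain the consequent. The baskets are made up and do not reproduce the 0.5% and 75% figures.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"beer"},
]

support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
```

Here 3 of 8 baskets contain both items and 3 of the 4 diaper baskets contain beer. A high confidence alone does not settle the slide's question of correlation versus causality.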

Data Mining Functionalities (2)

Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
  Outlier: Data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses

Data Cleaning
Importance
  "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  "Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration

December 5, 2013

Missing Data
Data is not always available
  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
  equipment malfunction
  data that was inconsistent with other recorded data and thus deleted
  data not entered due to misunderstanding
  certain data may not be considered important at the time of entry
  history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually: tedious and often infeasible

Fill it in automatically with

a global constant, e.g., "unknown" (a new class?)

the attribute mean

the attribute mean for all samples belonging to the same class: smarter

the most probable value: inference-based, such as a Bayesian formula or decision tree
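As a rough illustration of the "attribute mean for all samples belonging to the same class" strategy, the sketch below fills missing values per class and falls back to the overall attribute mean when a class has no known values. The function name, the record layout (a list of dicts with `None` marking a missing value), and the sample data are assumptions made for this example only.

```python
from collections import defaultdict

def fill_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing attr (None) is replaced by the
    mean of attr over rows with the same class label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        if r[attr] is not None:
            sums[r[label]] += r[attr]
            counts[r[label]] += 1

    # Fallback: the plain attribute mean over all known values.
    known = [r[attr] for r in rows if r[attr] is not None]
    overall = sum(known) / len(known) if known else None

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input is left untouched
        if r[attr] is None:
            c = r[label]
            r[attr] = sums[c] / counts[c] if counts[c] else overall
        filled.append(r)
    return filled

# Hypothetical sales tuples; income is missing for two customers.
rows = [
    {"income": 40.0, "risk": "low"},
    {"income": 60.0, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20.0, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(rows, "income", "risk")
```

Using the class mean rather than the global mean keeps the filled value closer to what is typical for that class, which is why the slide calls it the smarter option.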

Noisy Data
Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention

duplicate records

incomplete data

inconsistent data

How to Handle Noisy Data?

Binning

first sort data and partition into (equal-frequency) bins

then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression

smooth by fitting the data into regression functions

Clustering

detect and remove outliers

Combined computer and human inspection

detect suspicious values and check by human (e.g., deal with possible outliers)


Simple Discretization Methods: Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size: uniform grid

If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B − A)/N.

The most straightforward, but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky
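The two partitioning schemes can be compared with a short sketch. The function names and the sample price list are assumptions for illustration. The equal-width version uses W = (B − A)/N and places the maximum value in the last bin.

```python
def equal_width_bins(values, n):
    """Partition values into n intervals of equal width W = (B - A) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The highest value would index bin n, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted values into n bins holding about the same count."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional element each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

# Sample data assumed for this sketch.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)
by_depth = equal_depth_bins(prices, 3)
```

With this data the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 values each, which shows how equal-width partitioning can leave bins unevenly populated when the data is skewed.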


Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:

- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means (rounded to the nearest dollar):

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):

- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
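The smoothing steps in this example can be reproduced with a minimal sketch. The function names are assumptions, means are rounded to whole dollars, and sending ties to the lower boundary is a choice made here because the slide does not specify a tie rule.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.
    Ties go to the lower boundary (an assumption of this sketch)."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

# Equal-frequency bins from the price example.
price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
means = smooth_by_means(price_bins)
boundaries = smooth_by_boundaries(price_bins)
```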

Regression
[Figure: data points with a fitted regression line y = x + 1; vertical axis y]
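Smoothing by regression can be sketched as an ordinary least-squares fit of a straight line, followed by replacing each observed y with the value on the fitted line. The function names and sample points below are hypothetical. The points are chosen to lie exactly on y = x + 1 so that the fit recovers the line shown on the slide.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical points lying exactly on y = x + 1.
slope, intercept = fit_line([0, 1, 2], [1, 2, 3])
smoothed = smooth_by_regression([0, 1, 2], [1, 2, 3])
```

On noisy data the smoothed values would differ from the observations, and large residuals can flag candidate outliers for inspection.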


Cluster Analysis

