Unit-5: Database System Concepts, 6 Ed
Centralized Systems
Run on a single computer system and do not interact with other
computer systems.
General-purpose computer system: one to a few CPUs and a number
of device controllers that are connected through a common bus that provides access to shared memory.
Single-user system (e.g., personal computer or workstation): desk-top
unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs, and a
multi-user OS. Serves a large number of users who are connected to the system via terminals. Often called server systems.
Client-Server Systems
Server systems satisfy requests generated at m client systems, whose
Back-end: manages access structures, query evaluation and optimization, concurrency control and recovery.
Front-end: consists of tools such as forms, report-writers, and graphical user interface facilities.
The interface between the front-end and the back-end is through SQL or
better functionality for the cost
flexibility in locating resources and expanding facilities
better user interfaces
easier maintenance
transaction servers which are widely used in relational database systems, and data servers, used in object-oriented database systems
Transaction Servers
Also called query server systems or SQL server systems
Clients send requests to the server.
Transactions are executed at the server.
Results are shipped back to the client.
program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.
JDBC standard is similar to ODBC, for Java
These receive user queries (transactions), execute them and send results back
Processes may be multithreaded, allowing a single process to execute several user queries concurrently
Typically multiple multithreaded server processes
Checkpoint process
E.g., aborting any transactions being executed by a server process and restarting it
Cached query plans (reused if same query submitted again)
All database processes can access shared memory
To ensure that no two processes are accessing the same data structure at the same time, database systems implement mutual exclusion using either
Operating system semaphores
Atomic instructions such as test-and-set
To avoid overhead of interprocess communication for lock
request/grant, each database process operates directly on the lock table instead of sending requests to the lock manager process.
Lock manager process still used for deadlock detection
Database System Concepts - 6th Edition 22.12 Silberschatz, Korth and Sudarshan
Data Servers
Used in high-speed LANs, in cases where
The clients are comparable in processing power to the server.
The tasks to be executed are compute intensive.
Smaller unit of shipping means more messages
Worth prefetching related items along with requested item
Page shipping can be thought of as a form of prefetching
Locking
Overhead of requesting and getting locks from server is high due to message delays
Can grant locks on requested and prefetched items; with page shipping, transaction is granted lock on whole page.
Locks on a prefetched item can be called back by the server, and returned by client transaction if the prefetched item has not been used.
Locks on the page can be de-escalated to locks on items in the page when there are lock conflicts. Locks on unused items can then be returned to server.
Lock Caching
Parallel Systems
Parallel database systems consist of multiple processors and multiple
powerful processors
A massively parallel or fine grain parallel machine utilizes
throughput: the number of tasks that can be completed in a given time interval
response time: the amount of time it takes to complete a single task from the time it is submitted
Measured by: $\text{speedup} = \frac{\text{small system elapsed time}}{\text{large system elapsed time}}$
Speedup is linear if the ratio equals N.
Scaleup: increase the size of both the problem and the system
An N-times larger system is used to perform an N-times larger job
Measured by: $\text{scaleup} = \frac{\text{small system small problem elapsed time}}{\text{big system big problem elapsed time}}$
Scaleup is linear if the ratio equals 1.
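The two ratios can be sketched in a few lines of Python. This is only an illustration: the function names and the elapsed times are hypothetical, chosen to show a sublinear case for a system assumed to be N = 4 times larger. Linear speedup corresponds to a ratio of N, and linear scaleup to a ratio of 1.

```python
def speedup(small_system_time: float, large_system_time: float) -> float:
    """Same problem run on the small system and on the N-times larger system."""
    return small_system_time / large_system_time


def scaleup(small_system_small_problem_time: float,
            big_system_big_problem_time: float) -> float:
    """Small problem on small system vs. N-times larger problem on N-times larger system."""
    return small_system_small_problem_time / big_system_big_problem_time


def is_linear_speedup(value: float, n: int, tol: float = 1e-9) -> bool:
    # Linear speedup: the ratio equals N.
    return abs(value - n) <= tol


def is_linear_scaleup(value: float, tol: float = 1e-9) -> bool:
    # Linear scaleup: the ratio equals 1.
    return abs(value - 1.0) <= tol


# Hypothetical measurements in seconds for a system that is N = 4 times larger.
n = 4
measured_speedup = speedup(120.0, 40.0)    # 3.0, below the linear value of 4
measured_scaleup = scaleup(120.0, 150.0)   # 0.8, below the linear value of 1
```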
Speedup
Scaleup
Batch scaleup:
A single large job; typical of most decision support queries and scientific simulation.
Use an N-times larger computer on N-times larger problem.
Transaction scaleup:
Numerous small queries submitted by independent users to a shared database; typical transaction processing and timesharing systems.
N-times as many users submitting requests (hence, N-times as many requests) to an N-times larger database, on an N-times larger computer.
Well-suited to parallel execution.
bus, disks, or locks) compete with each other, thus spending time waiting on other processes, rather than performing useful work.
Skew: Increasing the degree of parallelism increases the variance in
service times of tasks executing in parallel. Overall execution time is determined by the slowest of the tasks executing in parallel.
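A small numerical sketch of the effect of skew, assuming four parallel tasks and hypothetical service times (none of these numbers come from the slides). The same total work is split evenly in one case and unevenly in the other; the elapsed time is the maximum task time in both.

```python
def elapsed_time(task_times):
    """Parallel tasks finish when the slowest one finishes."""
    return max(task_times)


# Hypothetical service times: 100 units of work split over 4 tasks.
even_split = [25.0, 25.0, 25.0, 25.0]
skewed_split = [40.0, 20.0, 20.0, 20.0]

sequential_time = sum(even_split)  # 100.0, the time on a single processor

speedup_even = sequential_time / elapsed_time(even_split)      # 4.0
speedup_skewed = sequential_time / elapsed_time(skewed_split)  # 2.5
```

With the skewed split, one task holds 40% of the work, so four processors yield a speedup of only 2.5.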
common disk
Hierarchical -- hybrid of the above architectures
Shared Memory
Processors and disks have access to a common memory, typically via
shared memory can be accessed by any processor without having to move it using software.
Downside: the architecture is not scalable beyond 32 or 64 processors
Shared Disk
All processors can directly access all disks via an interconnection
The memory bus is not a bottleneck
Architecture provides a degree of fault-tolerance: if a processor fails, the other processors can take over its tasks since the database is resident on disks that are accessible from all processors.
subsystem.
Shared-disk systems can scale to a somewhat larger number of
Shared Nothing
Node consists of a processor, memory, and one or more disks.
Processors at one node communicate with another processor at another node using an interconnection network. A node functions as the server for the data on the disk or disks the node owns.
Examples: Teradata, Tandem, Oracle-n CUBE
Data accessed from local disks (and local memory accesses) do not
pass through interconnection network, thereby minimizing the interference of resource sharing.
Shared-nothing multiprocessors can be scaled up to thousands of
Hierarchical
Combines characteristics of shared-memory, shared-disk, and shared-
nothing architectures.
Top level is a shared-nothing architecture: nodes are connected by an
interconnection network, and do not share disks or memory with each other.
Each node of the system could be a shared-memory system with a
few processors.
Alternatively, each node could be a shared-disk system, and each of
virtual-memory architectures
Hybrid architecture
hybrid architecture includes:
Non-Uniform Memory Architecture (NUMA), which involves the Non-
design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than nonlocal memory (memory local to another processor or memory shared between processors).
NUMA architectures logically follow in scaling from symmetric
XML
Using an example explain the distinction between attribute and a sub
relational schema.
Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))
Children = (name, Birthday)
Birthday = (day, month, year)
Skills = (type, ExamsSet setof(Exams))
Exams = (year, city)
Explain the limitations of DTD. Describe the alternative to overcome
this limitation.
Introduction
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Derived from SGML (Standard Generalized Markup Language), but
document
E.g. <title> XML </title> <slide> Introduction </slide>
Users can add new tags, and separately specify how the tag should be handled for display
repeated
Better than relational tuples as a data-exchange format
matching </tagname>
Elements must be properly nested
Proper nesting
Improper nesting
Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.
Example:
<course>
  This course is being offered for the first time in 2009.
  <course_id> BIO-399 </course_id>
  <title> Computational Biology </title>
  <dept_name> Biology </dept_name>
  <credits> 3 </credits>
</course>
Useful for document markup, but discouraged for data representation
Attributes
Elements can have attributes
<course course_id="CS-101">
  <title> Intro. to Computer Science</title>
  <dept_name> Comp. Sci. </dept_name>
  <credits> 4 </credits>
</course>
Attributes are specified by name=value pairs inside the starting tag of an
element
An element may have several attributes, but each attribute name can
In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents.
In the context of data representation, the difference is unclear and may be confusing.
Suggestion: use attributes for identifiers of elements, and use subelements for contents
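As a rough illustration of the suggestion, the sketch below parses the course element with Python's standard `xml.etree.ElementTree` module. The identifier is read from an attribute and the contents from subelements. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The course example: identifier as an attribute, contents as subelements.
doc = """
<course course_id="CS-101">
  <title>Intro. to Computer Science</title>
  <dept_name>Comp. Sci.</dept_name>
  <credits>4</credits>
</course>
"""

course = ET.fromstring(doc)

# Attributes live in the start tag and are read with .get().
course_id = course.get("course_id")

# Subelement contents are read from the child elements' text.
title = course.findtext("title")
dept_name = course.findtext("dept_name")
credit_count = int(course.findtext("credits"))  # XML text is a string; convert explicitly
```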
Namespaces
XML data has to be exchanged between organizations
Same tag name may have different meaning in different organizations,
Namespaces
<university xmlns:yale="http://www.yale.edu">
  <yale:course>
    <yale:course_id> CS-101 </yale:course_id>
    <yale:title> Intro. to Computer Science</yale:title>
    <yale:dept_name> Comp. Sci. </yale:dept_name>
    <yale:credits> 4 </yale:credits>
  </yale:course>
</university>
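A parser resolves each prefixed name to the namespace URI, so the prefix itself is only shorthand. The sketch below uses `xml.etree.ElementTree` from the Python standard library. The `ns` mapping is supplied by the reader of the document and does not need to reuse the prefix chosen by the sender.

```python
import xml.etree.ElementTree as ET

doc = """
<university xmlns:yale="http://www.yale.edu">
  <yale:course>
    <yale:course_id>CS-101</yale:course_id>
    <yale:title>Intro. to Computer Science</yale:title>
  </yale:course>
</university>
"""

root = ET.fromstring(doc)

# Prefix-to-URI mapping used only for the lookups below.
ns = {"yale": "http://www.yale.edu"}

course = root.find("yale:course", ns)
title = course.findtext("yale:title", namespaces=ns)

# ElementTree stores the expanded name: the URI in braces, then the local name.
expanded_tag = course.tag
```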
Otherwise, a site cannot automatically interpret data received from another site
Document Type Definition (DTD)
XML Schema
What elements can occur
What attributes can/must an element have
What subelements can/must occur inside each element, and how many times.
All values represented as strings in XML
DTD syntax
<!ELEMENT element (subelements-specification) >
<!ATTLIST element (attributes) >
names of elements, or
#PCDATA (parsed character data), i.e., character strings
EMPTY (no subelements) or ANY (anything can be a subelement)
Example
<!ELEMENT department (dept_name, building, budget)>
<!ELEMENT dept_name (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
Subelement specification may have regular expressions
<!ELEMENT university ( ( department | course | instructor | teaches )+)>
Notation:
| - alternatives
+ - 1 or more occurrences
* - 0 or more occurrences
University DTD
<!DOCTYPE university [
  <!ELEMENT university ( (department|course|instructor|teaches)+)>
  <!ELEMENT department ( dept_name, building, budget)>
  <!ELEMENT course ( course_id, title, dept_name, credits)>
  <!ELEMENT instructor (IID, name, dept_name, salary)>
  <!ELEMENT teaches (IID, course_id)>
  <!ELEMENT dept_name ( #PCDATA )>
  <!ELEMENT building ( #PCDATA )>
  <!ELEMENT budget ( #PCDATA )>
  <!ELEMENT course_id ( #PCDATA )>
  <!ELEMENT title ( #PCDATA )>
  <!ELEMENT credits ( #PCDATA )>
  <!ELEMENT IID ( #PCDATA )>
  <!ELEMENT name ( #PCDATA )>
  <!ELEMENT salary ( #PCDATA )>
]>
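The subelement specifications are regular expressions over child element names, and that can be made concrete with an ordinary regex. The sketch below is not a DTD validator, since Python's standard library does not validate against DTDs. Two content models are translated by hand, and each element's sequence of child tags is checked against them. The `CONTENT_MODELS` table, the `children_match` helper and the sample element values are all illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models; each child tag is encoded as "name;".
CONTENT_MODELS = {
    # (department|course|instructor|teaches)+
    "university": r"(?:department;|course;|instructor;|teaches;)+",
    # (dept_name, building, budget): a fixed order, each exactly once
    "department": r"dept_name;building;budget;",
}


def children_match(element):
    """Return True if the element's child tags fit its recorded content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # no model recorded for this tag in the sketch
    sequence = "".join(child.tag + ";" for child in element)
    return re.fullmatch(model, sequence) is not None


good = ET.fromstring(
    "<department><dept_name>Biology</dept_name>"
    "<building>Watson</building><budget>90000</budget></department>"
)
wrong_order = ET.fromstring(
    "<department><building>Watson</building>"
    "<dept_name>Biology</dept_name><budget>90000</budget></department>"
)
empty_university = ET.fromstring("<university/>")

results = [children_match(e) for e in (good, wrong_order, empty_university)]
```

The second element fails because a comma-separated specification fixes the order, and the third fails because `+` requires at least one child.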
CDATA
ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs); more on this later
Whether
mandatory (#REQUIRED)
has a default value (value),
or neither (#IMPLIED)
Examples
<!ATTLIST course course_id CDATA #REQUIRED>, or
<!ATTLIST course
    course_id ID #REQUIRED
    dept_name IDREF #REQUIRED
    instructors IDREFS #IMPLIED >
Database System Concepts - 6th Edition 22.41 Silberschatz, Korth and Sudarshan
distinct
Each ID value in an IDREF or IDREFS attribute must be the ID value of an element in the same document
<!DOCTYPE university-3 [
  <!ELEMENT university ( (department|course|instructor)+)>
  <!ELEMENT department ( building, budget )>
  <!ATTLIST department
      dept_name ID #REQUIRED >
  <!ELEMENT course (title, credits )>
  <!ATTLIST course
      course_id ID #REQUIRED
      dept_name IDREF #REQUIRED
      instructors IDREFS #IMPLIED >
  <!ELEMENT instructor ( name, salary )>
  <!ATTLIST instructor
      IID ID #REQUIRED
      dept_name IDREF #REQUIRED >
  ... declarations for title, credits, building, budget, name and salary ...
]>
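The ID and IDREF constraints can be checked by hand. The sketch below collects every ID value and then confirms that each IDREF or IDREFS value refers to one of them. The attribute tables mirror the ATTLIST declarations above. The helper name and the sample document, including the dangling reference `i99999`, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Which attribute plays which role, copied by hand from the ATTLIST declarations.
ID_ATTRS = {"department": "dept_name", "course": "course_id", "instructor": "IID"}
IDREF_ATTRS = {"course": ["dept_name"], "instructor": ["dept_name"]}
IDREFS_ATTRS = {"course": ["instructors"]}


def dangling_references(root):
    """Return (tag, attribute, value) for every reference with no matching ID."""
    ids = set()
    for element in root.iter():
        attr = ID_ATTRS.get(element.tag)
        if attr is not None and attr in element.attrib:
            ids.add(element.attrib[attr])

    missing = []
    for element in root.iter():
        for attr in IDREF_ATTRS.get(element.tag, []):
            value = element.attrib.get(attr)
            if value is not None and value not in ids:
                missing.append((element.tag, attr, value))
        for attr in IDREFS_ATTRS.get(element.tag, []):
            # IDREFS is a whitespace-separated list of ID values.
            for value in element.attrib.get(attr, "").split():
                if value not in ids:
                    missing.append((element.tag, attr, value))
    return missing


# Hypothetical document: the course refers to an instructor that does not exist.
doc = """
<university>
  <department dept_name="CompSci"><building>Taylor</building><budget>100000</budget></department>
  <course course_id="CS-101" dept_name="CompSci" instructors="i10101 i99999">
    <title>Intro. to Computer Science</title><credits>4</credits>
  </course>
  <instructor IID="i10101" dept_name="CompSci"><name>Srinivasan</name><salary>65000</salary></instructor>
</university>
"""

problems = dangling_references(ET.fromstring(doc))
```

The check treats all IDs as one pool, as a DTD does. That is why a DTD cannot stop the instructors attribute from pointing at a course.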
Limitations of DTDs
No typing of text elements and attributes
All values are strings, no integers, reals, etc.
Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)
(A | B)* allows specification of an unordered set, but
The instructors attribute of a course may contain a reference to another course, which is meaningless
XML Schema
XML Schema is a more sophisticated schema language which
Typing of values
XML Schema is integrated with namespaces
BUT: XML Schema is significantly more complicated than DTDs.
What items to stock?
What insurance premium to change?
To whom to send advertisements?
Retail sales transaction details
Customer profiles (income, age, gender, etc.)
extensions
Example tasks
For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year?
As above, for each product category and each customer category
databases
Statistical analysis is a large field, but not covered here
Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
A data warehouse archives information gathered from multiple
Important for large businesses that generate data from multiple divisions, possibly at multiple sites
Data may also be purchased externally
Data Warehousing
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational
Data Warehousing
Design Issues
When and how to gather data
Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
Destination driven architecture: warehouse periodically requests new information from data sources
Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
Usually OK to have slightly out-of-date data at warehouse
Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
Schema integration
E.g., correct mistakes in addresses (misspellings, zip code errors)
Merge address lists from different sources and purge duplicates
Warehouse schema may be a (materialized) view of schema from data sources
Raw data may be too large to store on-line
Business: Web, e-commerce, transactions, stocks
Science: Remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Other Applications
[Figure: knowledge discovery process, from Databases through Data Cleaning and Data Integration to the Data Warehouse]
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
association analysis
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and deviation: e.g., regression analysis
Periodicity analysis
Similarity-based analysis
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
December 5, 2013
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data not registered
the tasks in classification; not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or decision tree
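The first three automatic strategies can be sketched in plain Python. The rows, the attribute names `income` and `cls`, and the function names are hypothetical; `None` marks a missing value. The inference-based option is left out because it needs a trained model.

```python
from statistics import mean

# Hypothetical sales tuples; None marks a missing customer income.
rows = [
    {"income": 30.0, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 50.0, "cls": "low"},
    {"income": 90.0, "cls": "high"},
    {"income": None, "cls": "high"},
    {"income": 110.0, "cls": "high"},
]


def fill_constant(rows, attr, constant="unknown"):
    """Replace each missing value with one global constant."""
    return [r[attr] if r[attr] is not None else constant for r in rows]


def fill_mean(rows, attr):
    """Replace each missing value with the mean of the known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [r[attr] if r[attr] is not None else overall for r in rows]


def fill_class_mean(rows, attr, class_attr):
    """Replace each missing value with the mean of its own class."""
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {label: mean(values) for label, values in known.items()}
    return [r[attr] if r[attr] is not None else class_means[r[class_attr]] for r in rows]


by_constant = fill_constant(rows, "income")
by_mean = fill_mean(rows, "income")
by_class_mean = fill_class_mean(rows, "income", "cls")
```

On this data the overall mean fills both gaps with 70.0, while the class means fill them with 40.0 and 100.0, which is why the per-class variant is called smarter.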
Data Mining: Concepts and 22.61
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
duplicate records
incomplete data
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
smooth by fitting the data into regression functions
detect and remove outliers
detect suspicious values and check by human (e.g., deal with possible outliers)
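Binning as described above can be sketched in plain Python: sort, partition into equal-frequency bins, then smooth by bin means or by bin boundaries. The sample values are an assumption for illustration and were not recovered from the slides. The function names are also illustrative.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the data and cut it into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(ordered) // n_bins
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative sample values (an assumption).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Ties between the two boundaries go to the lower one here. That is a design choice of this sketch, not something the slides specify.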
Regression
Clustering
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of intervals will be: $W = (B - A)/N$.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Divides the range into N intervals, each containing approximately the same number of samples
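The two partitioning schemes can be compared on data with one outlier. This is a minimal sketch: the function names and the sample values are hypothetical, and the equal-width version assumes the highest value is strictly larger than the lowest.

```python
def equal_width_bins(values, n):
    """N intervals of equal width W = (B - A) / N over the value range."""
    a, b = min(values), max(values)
    if b == a:
        raise ValueError("equal-width partitioning needs at least two distinct values")
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        index = min(int((v - a) / width), n - 1)  # the maximum goes into the last bin
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """N intervals holding approximately the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        bins.append(ordered[start:end])
        start = end
    return bins


# Hypothetical attribute values with a single outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

by_width = equal_width_bins(data, 3)
by_depth = equal_depth_bins(data, 3)
```

With equal width, the outlier stretches the range so that eight of the nine values land in the first interval and the middle interval is empty. Equal depth keeps three values in each.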
29, 34
Regression
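[Figure: regression line y = x + 1 fitted to the data; axes x and y, with a point X1 whose observed value Y1 is replaced by the fitted value Y1']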
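Smoothing by regression can be sketched with an ordinary least-squares line fit, followed by replacing each observed value with its fitted value. The data points are hypothetical and were chosen so that the fitted line is y = x + 1, the line named in the figure. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value on the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations scattered around y = x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.5, 1.0, 4.0, 3.0, 5.5]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```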
Cluster Analysis