Chapter 21
Information Integration
Information integration is the process of taking several databases or other information sources and making the data in these sources work together as if they were a single database. The integrated database may be physical (a warehouse) or virtual (a mediator or middleware that may be queried even though it does not exist physically). The sources may be conventional databases or other types of information, such as collections of Web pages.

We begin by exploring the ways in which seemingly similar databases can actually embody conflicts that are hard to resolve correctly. The solution lies in the design of wrappers: translators between the schema and data values at a source and the schema and data values at the integrated database. Information-integration systems require special kinds of query-optimization techniques for their efficient operation. Mediator systems can be divided into two classes: global-as-view (the data at the integrated database is defined by how it is constructed from the sources) and local-as-view (the content of the sources is defined in terms of the schema that the integrated database supports). We examine capability-based optimization for global-as-view mediators. We also consider local-as-view mediation, which requires effort even to figure out how to compose the answer to a query from defined views, but which offers advantages in flexibility of operation.

In the last section, we examine another important issue in information integration, called entity resolution. Different information sources may talk about the same entities (e.g., people) but contain discrepancies such as misspelled names or out-of-date addresses. We need to make a best estimate of which data elements at the different sources actually refer to the same entity.
21.1 Introduction to Information Integration
In this section, we discuss the ways in which information integration is essential for many database applications. We then sample some of the problems that make information integration difficult.
21.1.1 Why Information Integration?
If we could start anew with an architecture and schema for all the data in the world, and we could put that data in a single database, there would be no need for information integration. However, in the real world, matters are rather different. Databases are created independently, even if they later need to work together. The use of databases evolves, so we cannot design a database to support every possible future use. To see the need for information integration, we shall consider two typical scenarios: building applications for a university and integrating employee databases. In both scenarios, a key problem is that the overall data-management system must make use of legacy data sources: databases that were created independently of any other data source. Each legacy source is used by applications that expect the structure of their database not to change, so modification of the schema or data of legacy sources is not an option.
University Databases
As databases came into common use, each university started using them for several functions that were once done by hand. Here is a typical scenario. The Registrar builds a database of courses, and uses it to record the courses each student took and their grades. Applications are built using this database, such as a transcript generator. The Bursar builds another database for recording tuition payments by students. The Human Resources Department builds a database for recording employees, including those students with teaching-assistant or research-assistant jobs. Applications include generation of payroll checks, calculation of taxes and social-security payments to the government, and many others. The Grants Office builds a database to keep track of expenditures on grants, which includes salaries to certain faculty, students, and staff. It may also include information about biohazards, use of human subjects, and many other matters related to research projects.

Pretty soon, the university realizes that all these databases are not helping nearly as much as they could, and are sometimes getting in the way. For example, suppose we want to make sure that the Registrar does not record grades for students that the Bursar says did not pay tuition. Someone has to get a list of students who paid tuition from the Bursar's database and compare that with a list of students from the Registrar's database. As another example, when Sally is appointed on grant 123 as a research assistant, someone needs to tell the Grants Office that her salary should be charged to grant 123. Someone also needs to tell Human Resources that they should pay her salary. And the salaries in the two databases had better be exactly the same.
So at some point, the university decides that it needs one database for all functions. The first thought might be: start over. Build one database that contains all the information of all the legacy databases and rewrite all the applications to use the new database. This approach has been tried, with great pain resulting. In addition to paying for a very expensive software-architecture task, the university has to run both the old and new systems in parallel for a long time to see that the new system actually works. And when they cut over to the new system, the users find that the applications do not work in the accustomed way, and turmoil results.

A better way is to build a layer of abstraction, called middleware, on top of all the legacy databases and allow the legacy databases to continue serving their current applications. The layer of abstraction could be relational views, either virtual or materialized. Then, SQL can be used to query the middleware layer. Often, this layer is defined by a collection of classes and queried in an object-oriented language. Or the middleware layer could use XML documents, which are queried using XQuery. We mentioned in Section 9.1 that this middleware may be an important component of the application tier in a 3-tier architecture, although we did not show it explicitly.

Once the middleware layer is built, new applications can be written to access this layer for data, while the legacy applications continue to run using the legacy databases. For example, we can write a new application that enters grades for students only if they have paid their tuition. Another new application could appoint a research assistant by getting their name, grant, and salary from the user. This application would then enter the name and salary into the Human-Resources database and the name, salary, and grant into the Grants-Office database.
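As a concrete illustration, here is a minimal sketch of what the grade-entry check mentioned above might look like if the middleware exposes the legacy sources as relational views. The view and attribute names (BursarPayments, RegistrarGrades, and so on) are assumptions made only for this sketch; they are not schemas given in this chapter.

    -- Hypothetical middleware views over the legacy databases:
    --   BursarPayments(studentID, term, amountPaid)        -- from the Bursar
    --   RegistrarGrades(studentID, courseID, term, grade)  -- from the Registrar
    -- Record a grade only if the Bursar shows a tuition payment for the term.
    INSERT INTO RegistrarGrades(studentID, courseID, term, grade)
    SELECT DISTINCT p.studentID, 'CS101', p.term, 'A'
    FROM BursarPayments p
    WHERE p.studentID = 123 AND p.term = 'F2008';

If no payment tuple exists, the SELECT produces no rows and nothing is inserted; the new application never has to touch the legacy schemas directly.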
Compaq bought DEC and Tandem, and then Hewlett-Packard bought Compaq. Each company had a database of employees. Because the companies were previously independent, the schemas and architecture of their databases naturally differed. Moreover, each company actually had many databases about employees, and these databases probably differed on matters as basic as who is an employee. For example, the Payroll Department would not include retirees, but might include contractors. The Benefits Department would include retirees but not contractors. The Safety Office would include not only regular employees and contractors, but the employees of the company that runs the cafeteria. For reasons we discussed in connection with the university database, it may not be practical to shut down these legacy databases and with them all the applications that run on them. However, it is possible to create a middleware layer that holds, virtually or physically, all information available for each employee.
21.1.2 The Heterogeneity Problem
When we try to connect information sources that were developed independently, we invariably find that the sources differ in many ways, even if they are intended to store the same kinds of data. Such sources are called heterogeneous, and the problem of integrating them is referred to as the heterogeneity problem. We shall introduce a running example of an automobile database and then discuss examples of the different levels at which heterogeneity can make integration difficult.

Example 21.1: The Aardvark Automobile Co. has 1000 dealers, each of which maintains a database of their cars in stock. Aardvark wants to create an integrated database containing the information of all 1000 sources.1 The integrated database will help dealers locate a particular model at another dealer, if they don't have one in stock. It also can be used by corporate analysts to predict the market and adjust production to provide the models most likely to sell. However, the dealers' databases may differ in a great number of ways. We shall enumerate below the most important ways and give some examples in terms of the Aardvark database.

1 Most real automobile companies have similar facilities in place, and the history of their development may be different from our example; e.g., the centralized database may have come first, with dealers later able to download relevant portions to their own database. However, this scenario serves as an example of what companies in many industries are attempting today.
Communication Heterogeneity
Today, it is common to allow access to your information using the HTTP protocol that drives the Web. However, some dealers may not make their databases available on the Web, but instead accept remote accesses via remote procedure calls or anonymous FTP, for instance.
Query-Language Heterogeneity
The manner in which we query or modify a dealer's database may vary. It would be nice if the database accepted SQL queries and modifications, but not all do. Of those that do, each accepts a dialect of SQL: the version supported by the vendor of the dealer's DBMS. Another dealer may not have a relational database at all. They could use an Excel spreadsheet, or an object-oriented database, or an XML database using XQuery as the language.

Schema Heterogeneity

Even assuming that all the dealers use a relational DBMS supporting SQL as the query language, we can find many sources of heterogeneity. At the highest level, the schemas can differ. For example, one dealer might store cars in a single relation that looks like:
Cars(serialNo, model, color, autoTrans, navi, ...)
with one boolean-valued attribute for every possible option. Another dealer might use a schema in which options are separated out into a second relation, such as:

Autos(serial, model, color)
Options(serial, option)
Notice that not only is the schema different, but apparently equivalent relation or attribute names have changed: Cars becomes Autos, and serialNo becomes serial.
Moreover, one dealer's schema might not record information that most of the other dealers provide. For instance, one dealer might not record colors at all. To deal with missing values, sometimes we can use NULLs or default values. However, because missing schema elements are a common problem, there is a trend toward using semistructured data such as XML as the data model for integrating middleware.
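For instance, if a hypothetical Dealer 3 kept a relation Cars3(serialNo, model, autoTrans) with no color information at all, an extractor populating an integrated relation that does include color could simply supply NULL (or a default) for that column. The relation names here are assumptions for the sketch, not part of the running example.

    INSERT INTO IntegratedCars(serialNo, model, color, autoTrans)
    SELECT serialNo, model, NULL, autoTrans   -- no color recorded at this source
    FROM Cars3;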
Datatype Differences
Serial numbers might be represented by character strings of varying length at one source and fixed length at another. The fixed lengths could differ, and some sources might use integers rather than character strings.
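A wrapper or extractor can reconcile such differences with ordinary type conversions. Here is a sketch, assuming a hypothetical source Autos4 that stores serial numbers as integers while the integrated schema expects fixed-length character strings:

    SELECT CAST(serialNo AS CHAR(10)) AS serialNo,  -- integer at this source
           model, color
    FROM Autos4;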
Value Heterogeneity
The same concept might be represented by different constants at different sources. The color black might be represented by an integer code at one source, the string BLACK at another, and the code BL at a third. The code BL might stand for blue at yet another source.
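During extraction, such codes can be mapped to the constants used by the integrated schema, for example with a CASE expression. The particular codes below are invented for illustration; each source would need its own mapping.

    SELECT serialNo, model,
           CASE color
               WHEN 'BK' THEN 'black'   -- this source's code for black
               WHEN 'BL' THEN 'blue'    -- at another source, 'BL' might mean black
               ELSE 'unknown'
           END AS color
    FROM Cars;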
Semantic Heterogeneity
Terms may be given different interpretations at different sources. One dealer might include trucks in the Cars relation, while another puts only automobile data in the Cars relation. One dealer might distinguish station wagons from minivans, while another doesn't.
21.2 Modes of Information Integration
There are several ways that databases or other distributed information sources can be made to work together. In this section, we consider the three most common approaches:

1. Federated databases. The sources are independent, but one source can call on others to supply information.
2. Warehousing. Copies of data from several sources are stored in a single database, called a (data) warehouse. Possibly, the data stored at the warehouse is first processed in some way before storage; e.g., data may be filtered, and relations may be joined or aggregated. The warehouse is updated periodically, perhaps overnight. As the data is copied from the sources, it may need to be transformed in certain ways to make all data conform to the schema at the warehouse.

3. Mediation. A mediator is a software component that supports a virtual database, which the user may query as if it were materialized (physically constructed, like a warehouse). The mediator stores no data of its own. Rather, it translates the user's query into one or more queries to its sources. The mediator then synthesizes the answer to the user's query from the responses of those sources, and returns the answer to the user.

We shall introduce each of these approaches in turn. One of the key issues for all approaches is the way that data is transformed when it is extracted from an information source. We discuss the architecture of such transformers, called wrappers, adapters, or extractors, in Section 21.3.
21.2.1 Federated Database Systems
Perhaps the simplest architecture for integrating several databases is to implement one-to-one connections between all pairs of databases that need to talk to one another. These connections allow one database system D1 to query another, D2, in terms that D2 can understand. The problem with this architecture is that if n databases each need to talk to the n - 1 other databases, then we must write n(n - 1) pieces of code to support queries between systems. The situation is suggested in Fig. 21.1. There, we see four databases in a federation. Each of the four needs three components, one to access each of the other three databases.
Figure 21.1: A federated collection of four databases needs 12 components to translate queries from one to another
Nevertheless, a federated system may be the easiest to build in some circumstances, especially when the communications between databases are limited in nature. An example will show how the translation components might work.

Example 21.2: Suppose the Aardvark Automobile dealers want to share inventory, but each dealer only needs to query the database of a few local dealers to see if they have a needed car. To be specific, consider Dealer 1, who has a relation

NeededCars(model, color, autoTrans)

whose tuples represent cars that customers have requested, by model, color, and whether or not they want an automatic transmission ('yes' or 'no' are the possible values). Dealer 2 stores inventory in the two-relation schema discussed in Example 21.1:

Autos(serial, model, color)
Options(serial, option)

Dealer 1 writes an application program that queries Dealer 2 remotely for cars that match each of the cars described in NeededCars. Figure 21.2 is a sketch of a program with embedded SQL that would find the desired cars. The intent is that the embedded SQL represents remote queries to the Dealer 2 database, with results returned to Dealer 1. We use the convention from standard SQL of prefixing a colon to variables that represent constants retrieved from a database.

These queries address the schema of Dealer 2. If Dealer 1 also wants to ask the same question of Dealer 3, who uses the first schema discussed in Example 21.1, with a single relation

Cars(serialNo, model, color, autoTrans, ...)

the query would look quite different. But each query works properly for the database to which it is addressed.
21.2.2 Data Warehouses
In the data warehouse integration architecture, data from several sources is extracted and combined into a global schema. The data is then stored at the warehouse, which looks to the user like an ordinary database. The arrangement is suggested by Fig. 21.3, although there may be many more than the two sources shown. Once the data is in the warehouse, queries may be issued by the user exactly as they would be issued to any database. There are at least three approaches to constructing the data in the warehouse:

1. The warehouse is periodically closed to queries and reconstructed from the current data in the sources. This approach is the most common, with reconstruction occurring once a night or at even longer intervals.
for (each tuple (:m, :c, :a) in NeededCars) {
    if (:a = TRUE) { /* automatic transmission wanted */
        SELECT serial
        FROM Autos, Options
        WHERE Autos.serial = Options.serial AND
              Options.option = 'autoTrans' AND
              Autos.model = :m AND
              Autos.color = :c;
    }
    else { /* automatic transmission not wanted */
        SELECT serial
        FROM Autos
        WHERE Autos.model = :m AND
              Autos.color = :c AND
              NOT EXISTS (SELECT *
                          FROM Options
                          WHERE serial = Autos.serial AND
                                option = 'autoTrans');
    }
}

Figure 21.2: Dealer 1 queries Dealer 2 for needed cars

2. The warehouse is updated periodically (e.g., each night), based on the changes that have been made to the sources since the last time the warehouse was modified. This approach can involve smaller amounts of data, which is very important if the warehouse needs to be modified in a short period of time, and the warehouse is large (multiterabyte warehouses are in common use). The disadvantage is that calculating changes to the warehouse, a process called incremental update, is complex, compared with algorithms that simply construct the warehouse from scratch.

Note that either of these approaches allows the warehouse to get out of date. However, it is generally too expensive to reflect immediately, at the warehouse, every change to the underlying databases.

Example 21.3: Suppose for simplicity that there are only two dealers in the Aardvark system, and they respectively use the schemas

Cars(serialNo, model, color, autoTrans, navi, ...)

and

Autos(serial, model, color)
Options(serial, option)

We wish to create a warehouse with the schema
Figure 21.3: A data warehouse stores integrated information in a separate database

AutosWhse(serialNo, model, color, autoTrans, dealer)

That is, the global schema is like that of the first dealer, but we record only the option of having an automatic transmission, and we include an attribute that tells which dealer has the car.

The software that extracts data from the two dealers' databases and populates the global schema can be written as SQL queries. The query for the first dealer is simple:

INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars;

The extractor for the second dealer is more complex, since we have to decide whether or not a given car has an automatic transmission. We leave this SQL code as an exercise.

In this simple example, the combiner, shown in Fig. 21.3, for the data extracted from the sources is not needed. Since the warehouse is the union of the relations extracted from each source, the data may be loaded directly into the warehouse. However, many warehouses perform operations on the relations that they extract from each source. For instance, relations extracted from two sources might be joined, and the result put at the warehouse. Or we might take the union of relations extracted from several sources and then aggregate
the data of this union. More generally, several relations may be extracted from each source, and different relations combined in different ways.
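Returning to the second construction approach described above, here is a minimal sketch of an incremental update for AutosWhse, assuming (purely for illustration) that Dealer 1's Cars relation carries a lastModified timestamp and that deleted cars are handled separately:

    INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
    SELECT serialNo, model, color, autoTrans, 'dealer1'
    FROM Cars
    WHERE lastModified > :lastLoadTime;   -- host variable: time of the previous load

In practice, the new versions of changed tuples must also replace their old versions at the warehouse, which is one reason incremental update is more complex than reconstruction from scratch.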
21.2.3 Mediators
A mediator supports a virtual view, or collection of views, that integrates several sources in much the same way that the materialized relation(s) in a warehouse integrate sources. However, since the mediator doesn't store any data, the mechanics of mediators and warehouses are rather different. Figure 21.4 shows a mediator integrating two sources; as for warehouses, there would typically be more than two sources. To begin, the user or application program issues a query to the mediator. Since the mediator has no data of its own, it must get the relevant data from its sources and use that data to form the answer to the user's query.

Thus, we see in Fig. 21.4 the mediator sending a query to each of its wrappers, which in turn send queries to their corresponding sources. The mediator may send several queries to a wrapper, and may not query all wrappers. The results come back and are combined at the mediator; we do not show an explicit combiner component as we did in the warehouse diagram, Fig. 21.3, because in the case of the mediator, the combining of results from the sources is one of the tasks performed by the mediator.
Figure 21.4: A mediator and wrappers translate queries into the terms of the sources and combine the answers

Example 21.4: Let us consider a scenario similar to that of Example 21.3, but use a mediator. That is, the mediator integrates the same two automobile sources into a view that is a single relation with schema:

AutosMed(serialNo, model, color, autoTrans, dealer)

Suppose the user asks the mediator about red cars, with the query:
SELECT serialNo, model
FROM AutosMed
WHERE color = 'red';
The mediator, in response to this user query, can forward the same query to each of the two wrappers. The way that wrappers can be designed and implemented to handle queries like this one is the subject of Section 21.3. In more complex scenarios, the mediator would first have to break the query into pieces, each of which is sent to a subset of the wrappers. However, in this case, the translation work can be done by the wrappers alone.

The wrapper for Dealer 1 translates the query into the terms of that dealer's schema, which we recall is

Cars(serialNo, model, color, autoTrans, navi, ...)

A suitable translation is:

SELECT serialNo, model
FROM Cars
WHERE color = 'red';

An answer, which is a set of serialNo-model pairs, will be returned to the mediator by the first wrapper. At the same time, the wrapper for Dealer 2 translates the same query into the schema of that dealer, which is:

Autos(serial, model, color)
Options(serial, option)

A suitable translated query for Dealer 2 is almost the same:

SELECT serial, model
FROM Autos
WHERE color = 'red';

It differs from the query at Dealer 1 only in the name of the relation queried, and in one attribute. The second wrapper returns to the mediator a set of serial-model pairs, which the mediator interprets as serialNo-model pairs. The mediator takes the union of these sets and returns the result to the user.
There are several options, not illustrated by Example 21.4, that a mediator may use to answer queries. For instance, the mediator may issue one query to one source, look at the result, and based on what is returned, decide on the next query or queries to issue. This method would be appropriate, for instance, if the user query asked whether there were any Aardvark Gobi model sport-utility vehicles available in blue. The first query could ask Dealer 1, and only if the result was an empty set of tuples would a query be sent to Dealer 2.
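For instance, the two wrapper queries for that request might look as follows, in the style of Example 21.4; the second is issued only if the first returns no tuples.

    -- Sent to the wrapper for Dealer 1 (schema Cars):
    SELECT serialNo
    FROM Cars
    WHERE model = 'Gobi' AND color = 'blue';

    -- Sent to the wrapper for Dealer 2 (schema Autos/Options) only if needed:
    SELECT serial
    FROM Autos
    WHERE model = 'Gobi' AND color = 'blue';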
21.2.4 Exercises for Section 21.2
! Exercise 21.2.1: Computer company A keeps data about the PC models it sells in the schema:

Computers(number, proc, speed, memory, hd)
Monitors(number, screen, maxResX, maxResY)

For instance, the tuple (123, Athlon64, 3.1, 512, 120) in Computers means that model 123 has an Athlon 64 processor running at 3.1 gigahertz, with 512M of memory and a 120G hard disk. The tuple (456, 19, 1600, 1050) in Monitors means that model 456 has a 19-inch screen with a maximum resolution of 1600 x 1050.

Computer company B only sells complete systems, consisting of a computer and monitor. Its schema is

Systems(id, processor, mem, disk, screenSize)

The attribute processor is the speed in gigahertz; the type of processor (e.g., Athlon 64) is not recorded. Neither is the maximum resolution of the monitor recorded. Attributes id, mem, and disk are analogous to number, memory, and hd from company A, but the disk size is measured in megabytes instead of gigabytes.

a) If company A wants to insert into its relations information about the corresponding items from B, what SQL insert statements should it use?

b) If company B wants to insert into Systems as much information about the systems that can be built from computers and monitors made by A, what SQL statements best allow this information to be obtained?

! Exercise 21.2.2: Suggest a global schema that would allow us to maintain as much information as we could about the products sold by companies A and B of Exercise 21.2.1.

Exercise 21.2.3: Write SQL queries to gather the information from the data at companies A and B and put it in a warehouse with your global schema of Exercise 21.2.2.

Exercise 21.2.4: Suppose your global schema from Exercise 21.2.2 is used at a mediator. How would the mediator process the query that asks for the maximum amount of hard disk available with any computer with a 3 gigahertz processor speed?

! Exercise 21.2.5: Suggest two other schemas that computer companies might use to hold data like that of Exercise 21.2.1. How would you integrate your schemas into your global schema from Exercise 21.2.2?
Exercise 21.2.6: In Example 21.3 we talked about a relation Cars at Dealer 1 that conveniently had an attribute autoTrans with only the values 'yes' and 'no'. Since these were the same values used for that attribute in the global schema, the construction of relation AutosWhse was especially easy. Suppose instead that the attribute Cars.autoTrans has values that are integers, with 0 meaning no automatic transmission, and i > 0 meaning that the car has an i-speed automatic transmission. Show how the translation from Cars to AutosWhse could be done by a SQL query.

Exercise 21.2.7: Write the insert-statements for the second dealer in Example 21.3. You may assume the values of autoTrans are 'yes' and 'no'.

Exercise 21.2.8: How would the mediator of Example 21.4 translate the following queries?

a) Find the serial numbers of cars with automatic transmission.

b) Find the serial numbers of cars without automatic transmission.

! c) Find the serial numbers of the blue cars from Dealer 1.

Exercise 21.2.9: Go to the Web pages of several on-line booksellers, and see what information about this book you can find. How would you combine this information into a global schema suitable for a warehouse or mediator?
21.3 Wrappers in Mediator-Based Systems
In a data warehouse system like Fig. 21.3, the source extractors consist of:

1. One or more predefined queries that are executed at the source to produce data for the warehouse.

2. Suitable communication mechanisms, so the wrapper (extractor) can:

   (a) Pass ad-hoc queries to the source,
   (b) Receive responses from the source, and
   (c) Pass information to the warehouse.

The predefined queries to the source could be SQL queries if the source is a SQL database as in our examples of Section 21.2. Queries could also be operations in whatever language was appropriate for a source that was not a database system; e.g., the wrapper could fill out an on-line form at a Web page, issue a query to an on-line bibliography service in that system's own, specialized language, or use myriad other notations to pose the queries.

However, mediator systems require more complex wrappers than do most warehouse systems. The wrapper must be able to accept a variety of queries from the mediator and translate any of them to the terms of the source. Of
course, the wrapper must then communicate the result to the mediator, just as a wrapper in a warehouse system communicates with the warehouse. In the balance of this section, we study the construction of flexible wrappers that are suitable for use with a mediator.
21.3.1 Templates for Query Patterns
A systematic way to design a wrapper that connects a mediator to a source is to classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants. The mediator can provide the constants, and the wrapper executes the query with the given constants. An example should illustrate the idea; it uses the notation T => S to express the idea that the template T is turned by the wrapper into the source query S.

Example 21.5: Suppose we want to build a wrapper for the source of Dealer 1, which has the schema

Cars(serialNo, model, color, autoTrans, navi, ...)

for use by a mediator with schema

AutosMed(serialNo, model, color, autoTrans, dealer)

Consider how the mediator could ask the wrapper for cars of a given color. If we denote the code representing that color by the parameter $c, then we can use the template shown in Fig. 21.5.

SELECT *
FROM AutosMed
WHERE color = $c;
=>
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars
WHERE color = $c;

Figure 21.5: A wrapper template describing queries for cars of a given color

Similarly, the wrapper could have another template that specified only the parameter $m representing a model, yet another template in which it was only specified whether an automatic transmission was wanted, and so on. In this case, there are eight choices, if queries are allowed to specify any of three attributes: model, color, and autoTrans. In general, there would be 2^n templates if we have the option of specifying n attributes.2

2 If the source is a database that can be queried in SQL, as in our example, you would rightly expect that one template could handle any number of attributes equated to constants, simply by making the WHERE clause a parameter. While that approach will work for SQL sources and queries that only bind attributes to constants, we could not necessarily use the same idea with an arbitrary source, such as a Web site that allowed only certain forms as an interface. In the general case, we cannot assume that the way we translate one query resembles at all the way similar queries are translated.

Other templates would
be needed to deal with queries that asked for the total number of cars of certain types, or whether there exists a car of a certain type. The number of templates could grow unreasonably large, but some simplifications are possible by adding more sophistication to the wrapper, as we shall discuss starting in Section 21.3.3.
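For example, a template for counting the cars of a given model might be written in the same T => S notation as follows. This particular template is only a sketch, not one of the templates assumed elsewhere in the chapter, and whether the source could answer it depends on its capabilities.

    SELECT COUNT(*)
    FROM AutosMed
    WHERE model = $m;
            =>
    SELECT COUNT(*)
    FROM Cars
    WHERE model = $m;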
21.3.2 Wrapper Generators
The templates defining a wrapper must be turned into code for the wrapper itself. The software that creates the wrapper is called a wrapper generator; it is similar in spirit to the parser generators (e.g., YACC) that produce components of a compiler from high-level specifications. The process, suggested in Fig. 21.6, begins when a specification, that is, a collection of templates, is given to the wrapper generator.
Figure 21.6: A wrapper generator produces tables for a driver; the driver and tables constitute the wrapper

The wrapper generator creates a table that holds the various query patterns contained in the templates, and the source queries that are associated with each. A driver is used in each wrapper; in general the driver can be the same for each generated wrapper. The task of the driver is to:

1. Accept a query from the mediator. The communication mechanism may be mediator-specific and is given to the driver as a plug-in, so the same
driver can be used in systems that communicate differently.
2. Search the table for a template that matches the query. If one is found, then the parameter values from the query are used to instantiate a source query. If there is no matching template, the wrapper responds negatively to the mediator.

3. The source query is sent to the source, again using a plug-in communication mechanism. The response is collected by the wrapper.

4. The response is processed by the wrapper, if necessary, and then returned to the mediator. The next sections discuss how wrappers can support a larger class of queries by processing results.
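Returning to the table that the generator produces, one way to picture it is as a relation pairing each query pattern with its source query. The layout and names below are an assumption made only for this sketch, populated here with the template of Fig. 21.5.

    CREATE TABLE Templates (
        templateId      INT,
        mediatorPattern VARCHAR(200),   -- pattern the driver matches against
        sourceQuery     VARCHAR(200)    -- query sent to the source, with $-parameters
    );

    INSERT INTO Templates VALUES (1,
        'SELECT * FROM AutosMed WHERE color = $c;',
        'SELECT serialNo, model, color, autoTrans, ''dealer1'' FROM Cars WHERE color = $c;');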
21.3.3 Filters
Suppose that a wrapper on a car dealer's database has the template shown in Fig. 21.5 for finding cars by color. However, the mediator is asked to find cars of a particular model and color. Perhaps the wrapper has been designed with a more complex template such as that of Fig. 21.7, which handles queries that specify both model and color. Yet, as we discussed at the end of Example 21.5, it is not always realistic to write a template for every possible form of query.

SELECT *
FROM AutosMed
WHERE model = $m AND color = $c;
        =>
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars
WHERE model = $m AND color = $c;

Figure 21.7: A wrapper template that gets cars of a given model and color

Another approach to supporting more queries is to have the wrapper filter the results of queries that it poses to the source. As long as the wrapper has a template that (after proper substitution for the parameters) returns a superset of what the query wants, then it is possible to filter the returned tuples at the wrapper and pass only the desired tuples to the mediator.

Example 21.6: Suppose the only template we have is the one in Fig. 21.5 that finds cars given a color. However, the wrapper is asked by the mediator to find blue Gobi model cars. A possible way to answer the query is to use the template of Fig. 21.5 with $c = 'blue' to find all the blue cars and store them in a temporary relation

TempAutos(serialNo, model, color, autoTrans, dealer)
The wrapper may then return to the mediator the desired set of automobiles by executing the local query:

SELECT *
FROM TempAutos
WHERE model = 'Gobi';
In practice, the tuples of TempAutos could be produced one-at-a-time and filtered one-at-a-time, in a pipelined fashion, rather than having the entire relation TempAutos materialized at the wrapper and then filtered.
21.3.4 Other Operations at the Wrapper
It is possible to transform data in other ways at the wrapper, as long as we are sure that the source-query part of the template returns to the wrapper all the data needed in the transformation. For instance, columns may be projected out of the tuples before transmission to the mediator. It is even possible to take aggregations or joins at the wrapper and transmit the result to the mediator.

Example 21.7: Suppose the mediator wants to know about blue Gobis at the various dealers, but only asks for the serial number, dealer, and whether or not there is an automatic transmission, since the values of the model and color attributes are obvious from the query. The wrapper could proceed as in Example 21.6, but at the last step, when the result is to be returned to the mediator, the wrapper performs a projection in the SELECT clause as well as the filtering for the Gobi model in the WHERE clause. The query

SELECT serialNo, autoTrans, dealer
FROM TempAutos
WHERE model = 'Gobi';

does this additional filtering, although as in Example 21.6 relation TempAutos would probably be pipelined into the projection operator, rather than materialized at the wrapper.
Example 21.8: For a more complex example, suppose the mediator is asked to find dealers and models such that the dealer has two red cars, of the same model, one with and one without an automatic transmission. Suppose also that the only useful template for Dealer 1 is the one about colors from Fig. 21.5. That is, the mediator asks the wrapper for the answer to the query of Fig. 21.8. Note that we do not have to specify a dealer for either A1 or A2, because this wrapper can only access data belonging to Dealer 1. The wrappers for all the other dealers will be asked the same query by the mediator.

SELECT A1.model, A1.dealer
FROM AutosMed A1, AutosMed A2
WHERE A1.model = A2.model AND
      A1.color = 'red' AND A2.color = 'red' AND
      A1.autoTrans = 'no' AND A2.autoTrans = 'yes';

Figure 21.8: Query from mediator to wrapper

A cleverly designed wrapper could discover that it is possible to answer the mediator's query by first obtaining from the Dealer-1 source a relation with all the red cars at that dealer:

RedAutos(serialNo, model, color, autoTrans, dealer)

To get this relation, the wrapper uses its template from Fig. 21.5, which handles queries that specify a color only. In effect, the wrapper acts as if it were given the query:

SELECT *
FROM AutosMed
WHERE color = 'red';

The wrapper can then create the relation RedAutos from Dealer 1's database by using the template of Fig. 21.5 with $c = 'red'. Next, the wrapper joins RedAutos with itself, and performs the necessary selection, to get the relation asked for by the query of Fig. 21.8. The work performed by the wrapper for this step is shown in Fig. 21.9.

SELECT DISTINCT A1.model, A1.dealer
FROM RedAutos A1, RedAutos A2
WHERE A1.model = A2.model AND
      A1.autoTrans = 'no' AND
      A2.autoTrans = 'yes';

Figure 21.9: Query performed at the wrapper (or mediator) to complete the answer to the query of Fig. 21.8
21.3.5 Exercises for Section 21.3
Exercise 21.3.1: In Fig. 21.5 we saw a simple wrapper template that translated queries from the mediator for cars of a given color into queries at the dealer with relation Cars. Suppose that the color codes used by the mediator in its schema were different from the color codes used at this dealer, and there was a relation GtoL(globalColor, localColor) that translated between the two sets of codes. Rewrite the template so the correct query would be generated.

Exercise 21.3.2: In Exercise 21.2.1 we spoke of two computer companies, A and B, that used different schemas for information about their products. Suppose we have a mediator with schema

PCMed(manf, speed, mem, disk, screen)

with the intuitive meaning that a tuple gives the manufacturer (A or B), processor speed, main-memory size, hard-disk size, and screen size for one of the systems you could buy from that company. Write wrapper templates for the following types of queries. Note that you need to write two templates for each query, one for each of the manufacturers.

a) Given a speed, find the tuples with that speed.

b) Given a screen size, find the tuples with that size.

c) Given memory and disk sizes, find the matching tuples.

Exercise 21.3.3: Suppose you had the wrapper templates described in Exercise 21.3.2 available in the wrappers at each of the two sources (computer manufacturers). How could the mediator use these capabilities of the wrappers to answer the following queries?

a) Find the manufacturer, memory size, and screen size of all systems with a 3.1 gigahertz speed and a 120 gigabyte disk.

! b) Find the maximum amount of hard disk available on a system with a 2.8 gigahertz processor.

c) Find all the systems with 512M memory and a screen size (in inches) that exceeds the disk size (in gigabytes).
21.4 Capability-Based Optimization
In Section 16.5 we introduced the idea of cost-based query optimization. A typical DBMS estimates the cost of each query plan and picks what it believes to be the best. When a mediator is given a query to answer, it often has little knowledge of how long its sources will take to answer the queries it sends them. Furthermore, many sources are not SQL databases, and often they will answer only a small subset of the kinds of queries that the mediator might like to pose. As a result, optimization of mediator queries cannot rely on cost measures alone to select a query plan. Optimization by a mediator usually follows the simpler strategy known as capability-based optimization. The central issue is not what a query plan costs, but whether the plan can be executed at all. Only among plans found to be executable (feasible) do we try to estimate costs.
21.4.1 The Problem of Limited Source Capabilities
Today, many useful sources have only Web-based interfaces, even if they are, behind the scenes, an ordinary database. Web sources usually permit querying only through a query form, which does not accept arbitrary SQL queries. Rather, we are invited to enter values for certain attributes and can receive a response that gives values for other attributes.

Example 21.9: The Amazon.com interface allows us to query about books in many different ways. We can specify an author and get all their books, or we can specify a book title and receive information about that book. We can specify keywords and get books that match the keywords. However, there is also information we can receive in answers but cannot specify. For instance, Amazon ranks books by sales, but we cannot ask "give me the top 10 sellers." Moreover, we cannot ask questions that are too general. For instance, the query

SELECT *
FROM Books;

"tell me everything you know about books," cannot be asked or answered through the Amazon Web interface, although it could be answered behind the scenes if we were able to access the Amazon database directly.

There are a number of other reasons why a source may limit the ways in which queries can be asked. Among them are:

1. Many of the earliest data sources did not use a DBMS, surely not a relational DBMS that supports SQL queries. These systems were designed to be queried in certain very specific ways only.

2. For reasons of security, a source may limit the kinds of queries that it will accept. Amazon's unwillingness to answer the query "tell me about
all your books" is a rudimentary example; it protects against a rival exploiting the Amazon database. As another instance, a medical database may answer queries about averages, but won't disclose the details of a particular patient's medical history.

3. Indexes on large databases may make certain kinds of queries feasible, while others are too expensive to execute. For instance, if a books database were relational, and one of the attributes were author, then without an index on that attribute, it would be infeasible to answer queries that specified only an author.3
21.4.2 A Notation for Describing Source Capabilities
If data is relational, or may be thought of as relational, then we can describe the legal forms of queries by adornments. These are sequences of codes that represent the requirements for the attributes of the relation, in their standard order. The codes we shall use for adornments reflect the most common capabilities of sources. They are:

1. f (free) means that the attribute can be specified or not, as we choose.

2. b (bound) means that we must specify a value for the attribute, but any value is allowed.

3. u (unspecified) means that we are not permitted to specify a value for the attribute.

4. c[S] (choice from set S) means that a value must be specified, and that value must be one of the values in the finite set S. This option corresponds, for instance, to values that are specified from a pulldown menu in a Web interface.

5. o[S] (optional, from set S) means that we either do not specify a value, or we specify one of the values in the finite set S.

In addition, we place a prime (e.g., f') on a code to indicate that the attribute is not part of the output of the query.

A capabilities specification for a source is a set of adornments. The intent is that in order to query the source successfully, the query must match one of the adornments in its capabilities specification. Note that, if an adornment has free or optional components, then queries with different sets of attributes specified may match that adornment.
3 We should be aware, however, that information like Amazon's about products is not accessed as if it were a relational database. Rather, the information about books is stored as text, with an inverted index, as we discussed in Section 14.1.8. Thus, queries about any aspect of books (authors, titles, words in titles, and perhaps words in descriptions of the book) are supported by this index.
Example 21.10: Suppose we have two sources like those of the two dealers in Example 21.4. Dealer 1 is a source of data in the form:

Cars(serialNo, model, color, autoTrans, navi)

Note that in the original, we suggested relation Cars could have additional attributes representing options, but for simplicity in this example, let us limit our thinking to automatic transmissions and navigation systems only. Here are two possible ways that Dealer 1 might allow this data to be queried:

1. The user specifies a serial number. All the information about the car with that serial number (i.e., the other four attributes) is produced as output. The adornment for this query form is b'uuuu. That is, the first attribute, serialNo, must be specified and is not part of the output. The other attributes must not be specified and are part of the output.

2. The user specifies a model and color, and perhaps whether or not automatic transmission and navigation system are wanted. All five attributes are printed for all matching cars. An appropriate adornment is

ubbo[yes, no]o[yes, no]

This adornment says we must not specify the serial number; we must specify a model and color, but are allowed to give any possible value in these fields. Also, we may, if we wish, specify whether we want automatic transmission and/or a navigation system, but must do so by using only the values 'yes' and 'no' in those fields.
21.4.3 Capability-Based Query-Plan Selection

Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query. If we imagine those queries asked and answered, then we have bindings for some more attributes, and these bindings may make some more queries at the sources possible. We repeat this process until either:

1. We have asked enough queries at the sources to resolve all the conditions of the mediator query, and therefore we may answer that query. Such a plan is called feasible.

2. We can construct no more valid forms of source queries, yet we still cannot answer the mediator query, in which case the mediator must give up; it has been given an impossible query.
The simplest form of mediator query for which we need to apply the above strategy is a join of relations, each of which is available, with certain adornments, at one or more sources. If so, then the search strategy is to try to get tuples for each relation in the join, by providing enough argument bindings that some source allows a query about that relation to be asked and answered. A simple example will illustrate the point.

Example 21.11: Let us suppose we have sources like the relations of Dealer 2 in Example 21.4:

Autos(serial, model, color)
Options(serial, option)

Suppose that ubf is the sole adornment for Autos, while Options has two adornments, bu and uc[autoTrans, navi], representing two different kinds of queries that we can ask at that source. Let the query be "find the serial numbers and colors of Gobi models with a navigation system." Here are three different query plans that the mediator must consider:

1. Specifying that the model is Gobi, query Autos and get the serial numbers and colors of all Gobis. Then, using the bu adornment for Options, for each such serial number, find the options for that car and filter to make sure it has a navigation system.

2. Specifying the navigation-system option, query Options using the uc[autoTrans, navi] adornment and get all the serial numbers for cars with a navigation system. Then query Autos as in (1), to get all the serial numbers and colors of Gobis, and intersect the two sets of serial numbers.
3. Query Options as in (2) to get the serial numbers for cars with a navigation system. Then use these serial numbers to query Autos and see which of these cars are Gobis.

Either of the first two plans is acceptable. However, the third plan is one of several plans that will not work; the system does not have the capability to execute this plan because the second part, the query to Autos, does not have a matching adornment.
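In the style of Fig. 21.2, plan (1) could be carried out roughly as follows; the loop and host-variable notation are only a sketch of the sequence of source queries.

    /* Plan (1): Autos first, using its ubf adornment (model bound). */
    SELECT serial, color
    FROM Autos
    WHERE model = 'Gobi';

    for (each tuple (:s, :c) in the result) {
        /* allowed by the bu adornment for Options: serial is bound */
        SELECT option
        FROM Options
        WHERE serial = :s;
        /* keep (:s, :c) in the answer if 'navi' appears among the options returned */
    }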
21.4.4 Adding Cost-Based Optimization
The mediator's query optimizer is not done when the capabilities of the sources are examined. Having found the feasible plans, it must choose among them. Making an intelligent, cost-based optimization requires that the mediator know a great deal about the costs of the queries involved. Since the sources are usually independent of the mediator, it is difficult to estimate the cost. For instance, a source may take less time during periods when it is lightly loaded, but when are those periods? Long-term observation by the mediator is necessary for the mediator even to guess what the response time might be.

In Example 21.11, we might simply count the number of queries to sources that must be issued. Plan (2) uses only two source queries, while plan (1) uses one plus the number of Gobis found in the Autos relation. Thus, it appears that plan (2) has lower cost. On the other hand, if the queries of Options, one with each serial number, could be combined into one query, then plan (1) might turn out to be the superior choice.
21.4.5 Exercises for Section 21.4
Exercise 21.4.1: Suppose each relation from Exercise 21.2.1:

Computers(number, proc, speed, memory, hd)
Monitors(number, screen, maxResX, maxResY)

is an information source. Using the notation from Section 21.4.2, write one or more adornments that express the following capabilities:

a) We can query for computers having a given processor, which must be one of P-IV, G5, or Athlon, a given speed, and (optionally) a given amount of memory.

b) We can query for computers having any specified hard-disk size and/or any given memory size.

c) We can query for monitors if we specify either the number of the monitor, the screen size, or the maximum resolution in both dimensions.
d) We can query for monitors if we specify the screen size, which must be either 19, 22, 24, or 30 inches. All attributes except the screen size are returned.

! e) We can query for computers if we specify any two of the processor type, processor speed, memory size, or disk size.

Exercise 21.4.2: Suppose we have the two sources of Exercise 21.4.1, but understand the attribute number of both relations to refer to the number of a complete system, some of whose attributes are found in one source and some in the other. Suppose also that the adornments describing access to the Computers relation are buuuu, ubbff, and uuubb, while the adornments for Monitors are bfff and ubbb. Tell what plans are feasible for the following queries (exclude any plans that are obviously more expensive than other plans on your list):

a) Find the systems with 512 megabytes of memory, an 80-gigabyte hard disk, and a 22-inch monitor.

b) Find the systems with a Pentium-IV processor running at 3.0 gigahertz with a 22-inch monitor and a maximum resolution of 1600-by-1050.

! c) Find all systems with a G5 processor running at 1.8 gigahertz, with 2 gigabytes of memory, a 300 gigabyte disk, and a 19-inch monitor.
21.5 Optimizing Mediator Queries
In this section, we shall give a greedy algorithm for answering queries at a mediator. This algorithm, called "chain," always finds a way to answer the query by sending a sequence of requests to its sources, provided at least one solution exists. The class of queries that can be handled is those that involve joins of relations that come from the sources, followed by an optional selection and optional projection onto output attributes. This class of queries is exactly what can be expressed as Datalog rules (Section 5.3).
21.5.1 Simplified Adornment Notation
The Chain Algorithm concerns itself with Datalog rules and with whether prior source requests have provided bindings for any of the variables in the body of the rule. Since we care only about whether we have found all possible constants for a variable, we can limit ourselves, in the query at the mediator (although not at the sources), to the b (bound) and f (free) adornments. That is, a c[S] adornment for an attribute of a source relation can be used as soon as we know all possible values of interest for that attribute (i.e., the corresponding position in the mediator query has a b adornment). Note that the source will not provide matches for the values outside S, so there is no point in asking questions about these values. The optional adornment o[S] can be treated as free, since there is
no need to have a binding for the corresponding attribute in the query at the mediator (although we could). Likewise, adornment u can be treated as free, since although we cannot then specify a value for the attribute at the source, we can have, or not have, a binding for the corresponding variable at the mediator.

Example 21.12: Let us use the same query and source relations as in Example 21.11, but with different capabilities at the sources. In what follows we shall use superscripts on the predicate or relation names to show the adornment or permitted set of adornments. In this example, the permitted adornments for the two source relations are:

Autos^{bff}(serial, model, color)
Options^{f c[autoTrans, navi]}(serial, option)

That is, we can only access Options by providing a binding autoTrans or navi for the option attribute, and we can only access Autos by providing a binding for the serial attribute.

The query "find the serial numbers and colors of Gobi models with a navigation system" is expressed in Datalog by:

Answer(s,c) ← Autos^{fbf}(s, "Gobi", c) AND Options^{fb}(s, "navi")

Here, notice the adornments on the subgoals of the body. These, at the moment, are commentaries on what arguments of each subgoal are bound to a set of constants. Initially, only the middle argument of the Autos subgoal is bound (to the set containing only the constant "Gobi") and the second argument of the Options subgoal is bound to the set containing only "navi". We shall see shortly that as we use the sources to find tuples that match one or another subgoal, we get bindings for some of the variables in the Datalog rule, and thus change some of the f's to b's in the adornments.
21.5.2 Obtaining Answers for Subgoals
We now need to formalize the comments made at the beginning of Section 21.5.1 about when a subgoal with some of its arguments bound can be answered by a source query. Suppose we have a subgoal R^{x1 x2 ... xn}(a1, a2, ..., an), where each xi is either b or f. R is a relation that can be queried at some source, and which has some set of adornments. Suppose y1 y2 ... yn is one of the adornments for R at its source. Each yi can be any of b, f, u, c[S], or o[S] for any set S. Then it is possible to obtain a relation for the subgoal provided that, for each i = 1, 2, ..., n:

• If yi is b or of the form c[S], then xi = b.

• If xi = f, then yi is not output-restricted (i.e., not primed).

Note that if yi is any of f, u, or o[S], then xi can be either b or f. We say that the adornment on the subgoal matches the adornment at the source.
Example 21.13: Suppose the subgoal in question is R^{bbff}(p, q, r, s), and the adornments for R at its source are α1 = f c[S1] u o[S2] and α2 = c[S3] b f c[S4]. Then bbff matches adornment α1, so we may use α1 to get the relation for subgoal R(p, q, r, s). That is, α1 has no b's and only one c, in the second position. Since the adornment of the subgoal has b in the second position, we know that there is a set of constants to which the variable q (the variable in the second argument of the subgoal) has been bound. For each of those constants that is a member of the set S1, we can issue a query to the source for R, using that constant as the binding for the second argument. We do not provide bindings for any other argument, even though α1 allows us to provide a binding for the first and/or fourth argument as well.

However, bbff does not match α2. The reason is that α2 has c[S4] in the fourth position, while bbff has f in that position. If we were to try to obtain R using α2, we would have to provide a binding for the fourth argument, which means that variable s in R(p, q, r, s) would have to be bound to a set of constants. But we know that is not the case, or else the adornment on the subgoal would have had b in the fourth position.
21.5.3 The Chain Algorithm
The Chain Algorithm is a greedy approach to selecting an order in which we obtain relations for each of the subgoals of a Datalog rule. It is not guaranteed to provide the most efficient solution, but it will provide a solution whenever one exists, and in practice, it is very likely to obtain the most efficient solution. The algorithm maintains two kinds of information:

• An adornment is maintained for each subgoal. Initially, the adornment for a subgoal has b if and only if the mediator query provides a constant binding for the corresponding argument of that subgoal, as for instance, the query in Example 21.12 provided bindings for the second arguments of both the Autos and Options subgoals. In all other places, the adornment has f's.

• A relation X that is (a projection of) the join of the relations for all the subgoals that have been resolved. We resolve a subgoal when the adornment for the subgoal matches one of the adornments at the source for this subgoal, and we have extracted from the source all possible tuples for that subgoal. Initially, since no subgoals have been resolved, X is a relation over no attributes, containing just the empty tuple (i.e., the tuple with zero components). Note that for empty X and any relation R, X ⋈ R = R; i.e., X is initially the identity relation for the natural-join operation. As the algorithm progresses, X will have attributes that are variables of the rule: those variables that correspond to b's in the adornments of the subgoals in which they appear.

The core of the Chain Algorithm is as follows. After initializing relation X and the adornments of the subgoals as above, we repeatedly select a subgoal
that can be resolved. Let R^α(a1, a2, ..., an) be the subgoal to be resolved. We do so by:

1. Wherever α has a b, we shall find that either the corresponding argument of R is a constant rather than a variable, or it is one of the variables in the schema of the relation X. Project X onto those of its variables that appear in subgoal R. Each tuple in the projection, together with the constants in the subgoal R, if any, provides sufficient bindings to use one of the adornments for the source relation R, namely whichever adornment α matches.

2. Issue a query to the source for each tuple t in the projection of X. We construct the query as follows, depending on the source adornment β that α matches.

(a) If a component of β is b, then the corresponding component of α is too, and we can use the corresponding component of t (or a constant in the subgoal) to provide the necessary binding for the source query.

(b) If a component of β is c[S], then again the corresponding component of α will be b, and we can obtain a constant from the subgoal or the tuple t. However, if that constant is not in S, then there is no chance the source can produce any tuples that match t, so we do not generate any source query for t.

(c) If a component of β is f, then provide a constant value for this component in the source query if we can; otherwise, do not provide a value for this component in the source query. Note that we can provide a constant exactly when the corresponding component of α is b.

(d) If a component of β is u, provide no binding for this component, even if the corresponding component of α is b.

(e) If a component of β is o[S], treat this component as if it were f in the case that the corresponding component of α is f, and as c[S] if the corresponding component of α is b.

For each tuple returned, extend the tuple so it has one component for each argument of the subgoal (i.e., n components). Note that the source will return every component of R that is not output restricted, so the only components that might not be present have b in the adornment α. Thus, the returned tuples can be padded by using either the constant from the subgoal or the constant from the tuple in the projection of X. The union of all the responses is the relation for the subgoal R(a1, a2, ..., an).

3. Every variable among a1, a2, ..., an is now bound. For each subgoal that has not yet been resolved, change its adornment so that any position holding one of these variables is now bound (b).
4. Replace X by its natural join with the relation just obtained for the subgoal.

5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal. These components can never be useful in what follows.

The complete Chain Algorithm, then, consists of the initialization described above, followed by as many subgoal-resolution steps as we can manage. If we succeed in resolving every subgoal, then relation X will be the answer to the query. If at some point there are unresolved subgoals, yet none can be resolved, then the algorithm fails. In that case, there can be no other sequence of resolution steps that answers the query.

Example 21.14: Consider the mediator query

Q: Answer(c) ← R^{bf}(1,a) AND S^{ff}(a,b) AND T^{ff}(b,c)
There are three sources that provide answers to queries about R, S, and T, respectively. The contents of these relations at the sources and the only adornments supported by these sources are shown in Fig. 21.10.

Relation   Data                    Adornment
R(w,x)     (1,2), (1,3), (1,4)     bf
S(x,y)     (2,4), (3,5)            c'[2,3,5]f
T(y,z)     (4,6), (5,7), (5,8)     bu

Figure 21.10: Data for Example 21.14

Initially, the adornments on the subgoals are as shown in the query Q, and the relation X that we construct initially contains only the empty tuple. Since subgoals S and T have ff adornments, but the adornments at the corresponding sources each have a component with b or c, neither of these subgoals can be resolved. Fortunately, the first subgoal, R(1,a), can be resolved, since the bf adornment at the corresponding source is matched by the adornment of the subgoal. Thus, we send the source for R(w,x) a query with w = 1, and the response is the set of three tuples shown in the first column of Fig. 21.10. We next project the subgoal's relation onto its second component, since only the second component of R(1,a) is a variable. That gives us the relation
a
2
3
4

This relation is joined with X, which currently has no attributes and only the empty tuple. The result is that X becomes the relation above. Since a is now bound, we change the adornment on the S subgoal from ff to bf.

At this point, the second subgoal, S^{bf}(a,b), can be resolved. We obtain bindings for the first component by projecting X onto a; the result is X itself. That is, we can go to the source for S(x,y) with bindings 2, 3, and 4 for x. We do not need bindings for y, since the second component of the adornment for the source is f. The c'[2,3,5] code for x says that we can give the source the value 2, 3, or 5 for the first argument. Since there is a prime on the c, we know that only the corresponding y-value(s) will be returned, not the value of x that we supplied in the request. We care about values 2, 3, and 4, but 4 is not a possible value at the source for S, so we never ask about it. When we ask about x = 2, we get one response: y = 4. We pad this response with the value 2 we supplied to conclude that (2,4) is a tuple in the relation for the S subgoal. Similarly, when we ask about x = 3, we get y = 5 as the only response, and we add (3,5) to the set of tuples constructed for the S subgoal. There are no more requests to ask at the source for S, so we conclude that the relation for the S subgoal is

a  b
2  4
3  5

When we join this relation with the previous value of X, the result is just the relation above. However, variable a now appears neither in the head nor in any unresolved subgoal. Thus, we project it out, so X becomes

b
4
5

Since b is now bound, we change the adornment on the T subgoal, so it becomes T^{bf}(b,c). Now this last subgoal can be resolved, which we do by sending requests to the source for T(y,z) with y = 4 and y = 5. The responses we get back give us the following relation for the T subgoal:

b  c
4  6
5  7
5  8
We join it with the relation for X above, and then project onto the c attribute to get the relation for the head. That is, the answer to the query at the mediator is {(6), (7), (8)}.
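The control flow of the Chain Algorithm can be made concrete with a short program. The following Python sketch, not from the text, handles only sources whose adornments use b and f; in particular, the c'[2,3,5] adornment for S in Fig. 21.10 is approximated here as bf, which with that data does not change the answer. Each source is modeled as a function that returns the full tuples agreeing with the supplied bindings, so the padding of step (2) is unnecessary, and the numbered comments refer to the steps listed before Example 21.14. Run on the data and query of Example 21.14, it prints {(6,), (7,), (8,)}.

# Sketch (not from the text) of the Chain Algorithm, restricted to b/f adornments.

def make_source(adorn, data):
    """A source returns all of its tuples that agree with the given bindings
    (a dict from argument position to constant)."""
    def ask(bindings):
        return {t for t in data if all(t[i] == v for i, v in bindings.items())}
    return adorn, ask

SOURCES = {
    'R': make_source('bf', {(1, 2), (1, 3), (1, 4)}),
    'S': make_source('bf', {(2, 4), (3, 5)}),          # c'[2,3,5]f approximated as bf
    'T': make_source('bf', {(4, 6), (5, 7), (5, 8)}),
}

HEAD = ('c',)                                            # Answer(c)
SUBGOALS = [('R', (1, 'a')), ('S', ('a', 'b')), ('T', ('b', 'c'))]

def is_var(a):
    return isinstance(a, str)

def chain_algorithm(head, subgoals, sources):
    x_attrs, x_tuples = [], {()}                         # X = relation over no attributes
    unresolved = set(range(len(subgoals)))
    bound = set()                                        # variables with constant bindings

    def subgoal_adorn(args):
        return ['f' if is_var(a) and a not in bound else 'b' for a in args]

    def matches(source_adorn, goal_adorn):
        return all(g == 'b' for s, g in zip(source_adorn, goal_adorn) if s == 'b')

    while unresolved:
        pick = next((i for i in sorted(unresolved)
                     if matches(sources[subgoals[i][0]][0],
                                subgoal_adorn(subgoals[i][1]))), None)
        if pick is None:
            raise RuntimeError('no subgoal can be resolved')
        name, args = subgoals[pick]
        adorn, ask = sources[name]

        # Steps 1-2: for each tuple of X, build bindings for the b positions of
        # the source adornment and query the source.
        relation = set()
        for xt in x_tuples:
            env = dict(zip(x_attrs, xt))
            bindings = {i: (a if not is_var(a) else env[a])
                        for i, a in enumerate(args) if adorn[i] == 'b'}
            relation |= ask(bindings)
        relation = {t for t in relation
                    if all(is_var(a) or t[i] == a for i, a in enumerate(args))}

        # Step 3: all variables of this subgoal become bound.
        bound |= {a for a in args if is_var(a)}

        # Step 4: replace X by its natural join with the subgoal's relation.
        sub_attrs = [a for a in args if is_var(a)]
        new_attrs = x_attrs + [a for a in sub_attrs if a not in x_attrs]
        joined = set()
        for xt in x_tuples:
            env = dict(zip(x_attrs, xt))
            for t in relation:
                tenv = {a: t[i] for i, a in enumerate(args) if is_var(a)}
                if all(env[a] == v for a, v in tenv.items() if a in env):
                    merged = {**env, **tenv}
                    joined.add(tuple(merged[a] for a in new_attrs))
        x_attrs, x_tuples = new_attrs, joined
        unresolved.remove(pick)

        # Step 5: project out variables not in the head or any unresolved subgoal.
        needed = set(head) | {a for i in unresolved
                              for a in subgoals[i][1] if is_var(a)}
        keep = [a for a in x_attrs if a in needed]
        x_tuples = {tuple(dict(zip(x_attrs, xt))[a] for a in keep)
                    for xt in x_tuples}
        x_attrs = keep

    return {tuple(dict(zip(x_attrs, xt))[a] for a in head) for xt in x_tuples}

print(chain_algorithm(HEAD, SUBGOALS, SOURCES))   # {(6,), (7,), (8,)}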
21.5.4 Incorporating Union Views at the Mediator
In our description of the Chain Algorithm, we assumed that each predicate in the Datalog query at the mediator was a view of data at one particular source. However, it is common for there to be several sources that can contribute tuples to the relation for the predicate. How we construct the relation for such a predicate depends on how we expect the sources for the predicate to interact.

The easy case is where we expect the sources for a predicate to contain replicated information. In that case, we can turn to any one of the sources to get the relation for a predicate. This case thus looks exactly like the case where there is a single source for a predicate, but there may be several adornments that allow us to query that source.

The more complex case is when the sources each contribute some tuples to the predicate that the other sources may not contribute. In that case, we should consult all the sources for the predicate. However, there is still a policy choice to be made. Either we can refuse to answer the query unless we can consult all the sources, or we can make best efforts to return all the answers to the query that we can obtain by combinations of sources.

Consult All Sources

If we must consult all sources to consider a subgoal resolved, then we can only resolve a subgoal when each source for its relation has an adornment matched by the current adornment of the subgoal. This rule is a small modification of the Chain Algorithm. However, not only does it make queries harder to answer, it makes queries impossible to answer when any source is down, even if the Chain Algorithm provides a feasible ordering in which to resolve the subgoals. Thus, as the number of sources grows, this policy becomes progressively less practical.

Best Efforts

Under this assumption, we only need one source with a matching adornment to resolve a subgoal. However, we need to modify the Chain Algorithm to revisit each subgoal when that subgoal has new bound arguments. We may find that some source that could not be matched before is now matched by the subgoal with its new adornment.

Example 21.15: Consider the mediator query

Answer(a,c) ← R^{ff}(a,b) AND S^{ff}(b,c)
Suppose also that R has two sources, one described by adornment ff and the other by fb. Likewise, S has two sources, described by ff and bf. We could start by using either source with adornment ff; suppose we start with R's source. We query this source and get some tuples for R. Now, we have some bindings, but perhaps not all, for the variable b. We can now use both sources for S to obtain tuples, and the relation for S can be set to their union. At this point, we can project the relation for S onto variable b and get some b-values. These can be used to query the second source for R, the one with adornment fb. In this manner, we can get some additional R-tuples. It is only at this point that we can join the relations for R and S, and project onto a and c to get the best-effort answer to the query.
21.5.5 Exercises for Section 21.5
with the following adornments at the sources for R, S, and T. If there is more than one adornment for a predicate, either may be used.

a) R^{fff}, S b*, T^{bff}, T W.

b) R^{ffb}, S^{fb}, T^{fbf}, T^{bff}.

c) R W, S^{fb}, S^{bf}, T^{fff}.

In each case:

i. Indicate all possible orders in which the subgoals can be resolved.

ii. Does the Chain Algorithm produce an answer to the query?

iii. Give the sequence of relational-algebra operations needed to compute the intermediate relation X at each step and the result of the query.

! Exercise 21.5.2: Suppose that for the mediator query of Exercise 21.5.1, each predicate is a view defined by the union of two sources. For each predicate, one of the sources has an all-f adornment. The other sources have the following adornments: R^{fbb}, S^{bf}, and T b^. Find a best-effort sequence of source requests that will produce all the answers to the mediator query that can be obtained from these sources.

Exercise 21.5.3: Describe all the source adornments that are matched by a subgoal with adornment R^{bf}.

!! Exercise 21.5.4: Prove that if there is any sequence of subgoal resolutions that will resolve all subgoals, then the Chain Algorithm will find one. Hint: Notice that if a subgoal can be resolved at a certain step, then if it is not selected for resolution, it can still be resolved at the next step.
21.6 Local-as-View Mediators
The mediators discussed so far are called global-as-view (GAV) mediators. The global data (i.e., the data available for querying at the mediator) is like a view; it doesn't exist physically, but pieces of it are constructed by the mediator, as needed, by asking queries of the sources. In this section, we introduce another approach to connecting sources with a mediator. In a local-as-view (LAV) mediator, we define global predicates at the mediator, but we do not define these predicates as views of the source data. Rather, we define, for each source, one or more expressions involving the global predicates that describe the tuples that the source is able to produce. Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
21.6.1 Motivation for LAV Mediators
In many applications, GAV mediators are easy to construct. You decide on the global predicates or relations that the mediator will support, and for each source, you consider which predicates it can support, and how it can be queried. That is, you determine the set of adornments for each predicate at each source. For instance, in our Aardvark Automobiles example, if we decide we want Autos and Options predicates at the mediator, we find a way to query each dealer's source for those concepts and let the Autos and Options predicates at the mediator represent the union of what the sources provide. Whenever we need one or both of those predicates to answer a mediator query, we make requests of each of the sources to obtain their data.

However, there are situations where the relationship between what we want to provide to users of the mediator and what the sources provide is more subtle. We shall look at an example where the mediator is intended to provide a single predicate Par(c,p), meaning that p is a parent of c. As with all mediators, this predicate represents an abstract concept, in this case the set of all child-parent facts that could ever exist, and the sources will provide information about whatever child-parent facts they know. Even put together, the sources probably do not know about everyone in the world, let alone everyone who ever lived.

Life would be simple if each source held some child-parent information and nothing else that was relevant to the mediator. Then, all we would have to do is determine how to query each one for whatever facts they could provide. However, suppose we have a database maintained by the Association of Grandparents that doesn't provide any child-parent facts at all, but provides child-grandparent facts. We can never use this source to help answer a query about someone's parents or children, but we can use it to help answer a mediator query that uses the Par predicate several times to ask for the grandparents of an individual, or their great-grandparents, or another complex relationship among people.
GAV mediators do not allow us to use a grandparents source at all, if our goal is to produce a Par relation. Producing both a parent and a grandparent predicate at the mediator is possible, but it might be confusing to the user and would require us to figure out how to extract grandparents from all sources, including those that only allow queries for child-parent facts. However, LAV mediators allow us to say that a certain source provides grandparent facts. Moreover, the technology associated with LAV mediators lets us discover how and when to use that source in a given query.
21.6.2 Terminology for LAV Mediation
LAV mediators are always defined using a form of logic that serves as the language for defining views. In our presentation, we shall use Datalog. Both the queries at the mediator and the queries (view definitions) that describe the sources will be single Datalog rules. A query that is a single Datalog rule is often called a conjunctive query, and we shall use that term here.

A LAV mediator has a set of global predicates, which are used as the subgoals of mediator queries. There are other conjunctive queries that define views; i.e., their heads each have a unique view predicate that is the name of a view. Each view definition has a body consisting of global predicates and is associated with a particular source, from which that view can be constructed. We assume that each view can be constructed with an all-free adornment. If capabilities are limited, we can use the Chain Algorithm to decide whether solutions using the views are feasible.

Suppose we are given a conjunctive query Q whose subgoals are predicates defined at the mediator. We need to find all solutions, that is, conjunctive queries whose bodies are composed of view predicates but that can be expanded to produce a conjunctive query involving the global predicates. Moreover, this expanded conjunctive query must produce only tuples that are also produced by Q. We say such expansions are contained in Q. An example may help with these tricky concepts, after which we shall define expansion formally.

Example 21.16: Suppose there is one global predicate Par(c,p), meaning that p is a parent of c. There is one source that produces some of the possible parent facts; its view is defined by the conjunctive query

V1(c,p) ← Par(c,p)

There is another source that produces some grandparent facts; its view is defined by the conjunctive query

V2(c,g) ← Par(c,p) AND Par(p,g)

Our query at the mediator will ask for great-grandparent facts that can be obtained from the sources. That is, the mediator query is

Q(w,z) ← Par(w,x) AND Par(x,y) AND Par(y,z)
How might we answer this query? The source view V1 contributes to the parent predicate directly, so we can use it three times in the obvious solution
Q(w,z) ← V1(w,x) AND V1(x,y) AND V1(y,z)
There are, however, other solutions that may produce additional answers, and thus must be part of the logical query plan for answering the query. In particular, we can use the view V2 to get grandparent facts, some of which may not be inferrable by using two parent facts from V1. We can use V1 to make a step of one generation, and then use V2 to make a step of two generations, as in the solution
Q(w,z) ← V1(w,x) AND V2(x,z)
It turns out these are the only solutions we need; their union is all the great-grandparent facts that we can produce from the sources V1 and V2. There is still a great deal to explain. Why are these solutions guaranteed to produce only answers to the query? How do we tell whether a solution is part of the answer to a query? How do we find all the useful solutions to a query? We shall answer each of these questions in the next sections.
21.6.3 Expanding Solutions
Given a query Q, a solution S has a body whose subgoals are views, and each view V is defined by a conjunctive query with that view as the head. We can substitute the body of V's conjunctive query for a subgoal in S that uses predicate V, as long as we are careful not to confuse variable names from one body with those of another. Once we substitute rule bodies for all the views that are in S, we have a body that consists of global predicates only. The expanded solution can then be compared with Q, to see if the results produced by the solution S are guaranteed to be answers to the query Q, in a manner we shall discuss later. However, first we must be clear about the expansion algorithm.

Suppose that there is a solution S that has a subgoal V(a1, a2, ..., an). Here the ai's can be any variables or constants, and it is possible that two or more of the ai's are actually the same variable. Let the definition of view V be of the form

V(b1, b2, ..., bn) ← B

where B represents the entire body. We may assume that the bi's are distinct variables, since there is no need to have two identical components in a view, nor is there a need for components that are constant. We can replace V(a1, a2, ..., an) in solution S by a version of body B that has all the subgoals of B, but with variables possibly altered. The rules for altering the variables of B are:
1. First, identify the local variables of B, that is, those variables that appear in the body but not in the head. Note that, within a conjunctive query, a local variable can be replaced by any other variable, as long as the replacing variable does not appear elsewhere in the conjunctive query. The idea is the same as substituting different names for local variables in a program.

2. If there are any local variables of B that appear in S or in another substituted body, replace each one by a distinct new variable that appears nowhere in the rule for V or in S.

3. In the body B, replace each bi by ai, for i = 1, 2, ..., n.

Example 21.17: Suppose we have the view definition
V(a,b,c,d) ← E(a,b,x,y) AND F(x,y,c,d)
Suppose further that some solution S has in its body a subgoal V(x,y,1,x). The local variables in the definition of V are x and y, since these do not appear in the head. We need to change them both, because they appear in the subgoal for which we are substituting. Suppose e and f are variable names that appear nowhere in S. We can rewrite the body of the rule for V as
V(a,b,c,d) ← E(a,b,e,f) AND F(e,f,c,d)
Next, we must substitute the arguments of the V subgoal for a, b, c, and d. The correspondence is that a and d become x, b becomes y, and c becomes the constant 1. We therefore substitute for V(x,y,1,x) the two subgoals E(x,y,e,f) and F(e,f,1,x).

The expansion process is essentially the substitution described above for each subgoal of the solution S. There is one extra caution of which we must be aware, however. Since we may be substituting for the local variables of several view definitions, and may in fact need to create several versions of one view definition (if S has several subgoals with the same view predicate), we must make sure that in the substitution for each subgoal of S, we use unique local variables, that is, ones that do not appear in any other substitution or in S itself. Only then can we be sure that when we do the expansion we do not use the same name for two variables that should be distinct.

Example 21.18: Let us resume the discussion we began in Example 21.16, where we had the view definitions
V1(c,p) ← Par(c,p)
V2(c,g) ← Par(c,p) AND Par(p,g)
and the proposed solution

S: Q(w,z) ← V1(w,x) AND V2(x,z)

Let us expand this solution. The first subgoal, with predicate V1, is easy to expand, because the rule for V1 has no local variables. We substitute w and x for c and p, respectively, so the body of the rule for V1 becomes Par(w,x). This subgoal will be substituted in S for V1(w,x).

We must also substitute for the V2 subgoal. Its rule has local variable p. However, since p does not appear in S, nor has it been used as a local variable in another substitution, we are free to leave p as it is. We therefore have only to substitute x and z for the variables c and g, respectively. The two subgoals in the rule for V2 become Par(x,p) and Par(p,z). When we substitute these two subgoals for V2(x,z) in S, we have constructed the complete expansion of S:
Q(w,z) ← Par(w,x) AND Par(x,p) AND Par(p,z)
Notice that this expansion is practically identical to the query in Example 21.16. The only difference is that the query uses local variable y where the expansion uses p. Since the names of local variables do not affect the result, it appears that the solution S is the answer to the query. However, that is not quite right. The query is looking for all great-grandparent facts, and all the expansion says is that the solution S provides only facts that answer the query. S might not produce all possible answers. For example, the source of V2 might even be empty, in which case nothing is produced by solution S, even though another solution might produce some answers.
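The expansion procedure is easy to mechanize. The Python sketch below, not from the text, represents an atom as a predicate name plus a tuple of arguments, renames the local variables of each view definition apart (even when, as with p above, the original name could have been kept), and then substitutes the subgoal's arguments for the view's head variables. Applied to the solution of Example 21.18 it produces the expansion shown above, up to the names chosen for local variables.

# Sketch (not from the text): expanding a solution whose subgoals are view
# predicates into a conjunctive query over the global predicates.
# An atom is (predicate, (arg, arg, ...)); variables are strings, constants ints.

import itertools

fresh = (f'v{n}' for n in itertools.count())   # supply of new variable names

def expand(solution_body, views):
    """views maps a view name to (head_args, body_atoms)."""
    expanded = []
    for pred, args in solution_body:
        head_args, body = views[pred]
        # Rules 1-2: rename the local variables (in the body but not the head) apart.
        locals_ = {a for _, bargs in body for a in bargs
                   if isinstance(a, str) and a not in head_args}
        rename = {v: next(fresh) for v in locals_}
        # Rule 3: substitute the subgoal's arguments for the view's head variables.
        rename.update(dict(zip(head_args, args)))
        expanded += [(p, tuple(rename.get(a, a) for a in bargs))
                     for p, bargs in body]
    return expanded

VIEWS = {
    'V1': (('c', 'p'), [('Par', ('c', 'p'))]),
    'V2': (('c', 'g'), [('Par', ('c', 'p')), ('Par', ('p', 'g'))]),
}

# Solution S:  Q(w,z) <- V1(w,x) AND V2(x,z)
S = [('V1', ('w', 'x')), ('V2', ('x', 'z'))]
print(expand(S, VIEWS))
# [('Par', ('w', 'x')), ('Par', ('x', 'v0')), ('Par', ('v0', 'z'))]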
21.6.4 Containment of Conjunctive Queries
In order for a conjunctive query S to be a solution to the given mediator query Q, the expansion of S, say E, must produce only answers that Q produces, regardless of what relations are represented by the predicates in the bodies of E and Q. If so, we say that E ⊆ Q. There is an algorithm to tell whether E ⊆ Q; we shall see this test after introducing the following important concept.

A containment mapping from Q to E is a function τ from the variables of Q to the variables and constants of E, such that:

1. If x is the ith argument of the head of Q, then τ(x) is the ith argument of the head of E.

2. If P(x1, x2, ..., xn) is a subgoal of Q, then P(τ(x1), τ(x2), ..., τ(xn)) is a subgoal of E. Here, we add to τ the rule that τ(c) = c for any constant c.

Example 21.19: Consider the following two conjunctive queries:
Q1: H(x,y) ← A(x,z) AND B(z,y)
Q2: H(a,b) ← A(a,c) AND B(d,b) AND A(a,d)
We claim that Q2 ⊆ Q1. In proof, we offer the following containment mapping: τ(x) = a, τ(y) = b, and τ(z) = d. Notice that when we apply this substitution, the head of Q1 becomes H(a,b), which is the head of Q2. The first subgoal of Q1 becomes A(a,d), which is the third subgoal of Q2. Likewise, the second subgoal of Q1 becomes the second subgoal of Q2. That proves there is a containment mapping from Q1 to Q2, and therefore Q2 ⊆ Q1. Notice that no subgoal of Q1 maps to the first subgoal of Q2, but the containment-mapping definition does not require that there be one.

Surprisingly, there is also a containment mapping from Q2 to Q1, so the two conjunctive queries are in fact equivalent. That is, not only is one contained in the other, but on any relations A and B, they produce exactly the same set of tuples for the relation H. The containment mapping from Q2 to Q1 is ρ(a) = x, ρ(b) = y, and ρ(c) = ρ(d) = z. Under this mapping, the head of Q2 becomes the head of Q1, the first and third subgoals of Q2 become the first subgoal of Q1, and the second subgoal of Q2 becomes the second subgoal of Q1.

While it may appear strange that two such different-looking conjunctive queries are equivalent, the following is the intuition. Think of A and B as two different colored edges in a graph. Then Q1 asks for the pairs of nodes x and y such that there is an A-edge from x to some z and a B-edge from z to y. Q2 asks for the same thing, using its second and third subgoals, although it calls x, y, and z by the names a, b, and d, respectively. In addition, Q2 seems to have the added condition, expressed by the first subgoal, that there is an edge from node a to somewhere (node c). But we already know that there is an edge from a to somewhere, namely d. That is, we are always free to use the same node for c as we did for d, because there are no other constraints on c.

Example 21.20: Here are two queries similar, but not identical, to those of Example 21.19:
P1: H(x,y) ← A(x,z) AND A(z,y)
P2: H(a,b) ← A(a,c) AND A(c,d) AND A(d,b)

Intuitively, if we think of A as representing edges in a graph, then P1 asks for paths of length 2 and P2 asks for paths of length 3. We do not expect either to be contained in the other, and indeed the containment-mapping test confirms that fact. Consider a possible containment mapping τ from P1 to P2. Because of the conditions on heads, we know τ(x) = a and τ(y) = b. To what does z map? Since we already know τ(x) = a, the first subgoal A(x,z) can only map to A(a,c) of P2. That means τ(z) must be c. However, since τ(y) = b, the subgoal A(z,y) of P1 can only become A(d,b) in P2. That means τ(z) must be d. But z can only map to one value; it cannot map to both c and d. We conclude that no containment mapping from P1 to P2 exists. A similar argument shows that there is no containment mapping from P2 to P1. We leave it as an exercise.
The importance of containment mappings is expressed by the following theorem: if Q1 and Q2 are conjunctive queries, then Q2 ⊆ Q1 if and only if there is a containment mapping from Q1 to Q2. Notice that the containment mapping goes in the opposite direction from the containment; that is, the containment mapping is from the conjunctive query that produces the larger set of answers to the one that produces the smaller, contained set.
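Because a containment mapping must send the head of Q1 to the head of Q2 and every subgoal of Q1 to some subgoal of Q2, a brute-force search over the choices of target subgoals decides whether such a mapping exists. The Python sketch below, not from the text, performs that search; it is exponential in the number of subgoals, and it confirms that the two queries of Example 21.19 are equivalent.

# Sketch (not from the text): test Q2 ⊆ Q1 by searching for a containment
# mapping from Q1 to Q2.  A query is (head_args, [(pred, args), ...]);
# variables are strings, constants are ints.

from itertools import product

def containment_mapping_exists(q1, q2):
    """True if there is a containment mapping from q1 to q2 (hence q2 ⊆ q1)."""
    (head1, body1), (head2, body2) = q1, q2

    def extend(tau, xs, ys):
        # Try to extend mapping tau so each x maps to the corresponding y.
        tau = dict(tau)
        for x, y in zip(xs, ys):
            if not isinstance(x, str):          # constants must map to themselves
                if x != y:
                    return None
            elif tau.setdefault(x, y) != y:     # a variable maps to one value only
                return None
        return tau

    start = extend({}, head1, head2)            # the condition on the heads
    if start is None or len(head1) != len(head2):
        return False
    # Choose, for every subgoal of q1, a target subgoal of q2 with the same
    # predicate, and check that one consistent mapping covers all choices.
    for targets in product(body2, repeat=len(body1)):
        tau = start
        for (p1, a1), (p2, a2) in zip(body1, targets):
            if p1 != p2 or len(a1) != len(a2):
                tau = None
                break
            tau = extend(tau, a1, a2)
            if tau is None:
                break
        if tau is not None:
            return True
    return False

# The queries of Example 21.19.
Q1 = (('x', 'y'), [('A', ('x', 'z')), ('B', ('z', 'y'))])
Q2 = (('a', 'b'), [('A', ('a', 'c')), ('B', ('d', 'b')), ('A', ('a', 'd'))])

print(containment_mapping_exists(Q1, Q2))   # True: Q2 ⊆ Q1
print(containment_mapping_exists(Q2, Q1))   # True: Q1 ⊆ Q2, so they are equivalent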
21.6.5 Why the Containment-Mapping Test Works
We need to argue two points. First, if there is a containment mapping, why must there be a containment of conjunctive queries? Second, if there is containment, why must there be a containment mapping? We shall not give formal proofs, but will sketch the arguments.

First, suppose there is a containment mapping τ from Q1 to Q2. Recall from Section 5.3.4 that when we apply Q2 to a database, we look for substitutions σ for all the variables of Q2 that make all its relational subgoals be tuples of the corresponding relations of the database. The substitution for the head becomes a tuple t that is returned by Q2. If we compose τ and then σ, we have a mapping from the variables of Q1 to constants of the database that turns each subgoal of Q1 into a tuple of the database and produces the same tuple t for the head of Q1. Thus, on any given database, everything that Q2 produces is also produced by Q1.

Conversely, suppose that Q2 ⊆ Q1. That is, on any database D, everything that Q2 produces is also produced by Q1. Construct a particular database D that has only the subgoals of Q2 as its tuples. That is, pretend the variables of Q2 are distinct constants, and for each subgoal P(a1, a2, ..., an), put the tuple (a1, a2, ..., an) in the relation for P. There are no other tuples in the relations of D. When Q2 is applied to database D, surely the tuple whose components are the arguments of the head of Q2 is produced. Since Q2 ⊆ Q1, it must be that
Q1 applied to D also produces the head of Q2. Again, we use the definition in Section 5.3.4 of how a conjunctive query is applied to a database. That definition tells us that there is a substitution of constants of D for the variables of Q1 that turns each subgoal of Q1 into a tuple in D and turns the head of Q1 into the tuple that is the head of Q2. But remember that the constants of D are the variables of Q2. Thus, this substitution is actually a containment mapping from Q1 to Q2.
21.6.6 Finding Solutions to a Mediator Query
We have one more issue to resolve. We are given a mediator query Q, and we need to find all solutions S such that the expansion E of S is contained in Q. But there could be an infinite number of S's built from the views, using any number of subgoals and variables. The following theorem limits our search.

• If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.

This theorem, often called the LMSS Theorem, gives us a finite, although exponential, task to find a sufficient set of solutions. There has been considerable work on making the test much more efficient in typical situations.

Example 21.21: Recall the query
Q1: Q(w,z) ← Par(w,x) AND Par(x,y) AND Par(y,z)
from Example 21.16. This query has three subgoals, so we don't have to look at solutions with more than three subgoals. One of the solutions we proposed was
S1: Q(w,z) ← V1(w,x) AND V2(x,z)
This solution has only two subgoals, and its expansion is contained in the query. Thus, it needs to be included among the set of solutions that we evaluate to answer the query. However, consider the following solution: S2:
Q(w,z) ← V1(w,x) AND V2(x,z) AND V1(t,u) AND V2(u,v)
It has four subgoals, so we know by the LMSS Theorem that it does not need to be considered. However, it is truly a solution, since its expansion

E2: Q(w,z) ← Par(w,x) AND Par(x,p) AND Par(p,z) AND Par(t,u) AND Par(u,q) AND Par(q,v)
is contained in the query Q1. To see why, use the containment mapping that maps w, x, and z to themselves and y to p. However, E2 is also contained in the expansion E1 of the smaller solution S1. Recall from Example 21.18 that the expansion of S1 is
E1: Q(w,z) ← Par(w,x) AND Par(x,p) AND Par(p,z)
We can see immediately that E2 ⊆ E1, using the containment mapping that sends each variable of E1 to the same variable in E2. Thus, every answer to Q1 produced by S2 is also produced by S1. Notice, incidentally, that S2 is really S1 with the two subgoals of S1 repeated with different variables.

In principle, to apply the LMSS Theorem, we must consider a number of possible solutions that is exponential in the query size. We must consider not only the choices of predicates for the subgoals, but which arguments of which subgoals hold the same variable. Note that within a conjunctive query, the names of the variables do not matter, but it matters which sets of arguments hold the same variable. Most query processing is worst-case exponential in the query size anyway, as we learned in Chapter 16. Moreover, there are some powerful techniques known for limiting the search for solutions by looking at the structure of the conjunctive queries that define the views. We shall not go into depth here, but one easy but powerful idea is the following: if the conjunctive query that defines a view V has in its body a predicate P that does not appear in the body of the mediator query, then we need not consider any solution that uses V.
21.6.7 Why the LMSS Theorem Holds
Suppose we have a query Q with n subgoals, and there is a solution S with more than n subgoals. The expansion E of S must be contained in query Q, which means that there is a containment mapping from Q to the expansion E, as suggested in Fig. 21.11. If there are n subgoals (n = 2 in Fig. 21.11) in Q, then the containment mapping turns Q's subgoals into at most n of the subgoals of the expansion E. Moreover, these subgoals of E come from at most n of the subgoals of the solution S.

Suppose we removed from S all subgoals whose expansion was not the target of one of Q's subgoals under the containment mapping. We would have a new conjunctive query S' with at most n subgoals. Now S' must also be a solution to Q, because the same containment mapping that showed E ⊆ Q in Fig. 21.11 also shows that E' ⊆ Q, where E' is the expansion of S'.

We must show one more thing: that any answer provided by S is also provided by S'. That is, S ⊆ S'. But there is an obvious containment mapping from S' to S: the identity mapping. Thus, there is no need for solution S among the solutions to query Q.
Figure 21.11: Why a query with n subgoals cannot need a solution with more than n subgoals
21.6.8 Exercises for Section 21.6
Exercise 21.6.1: Find all the containments among the following four conjunctive queries:
Q1: P(x,y) ← Q(x,a) AND Q(a,b) AND Q(b,y)
Q2: P(x,y) ← Q(x,a) AND Q(a,b) AND Q(b,c) AND Q(c,y)
Q3: P(x,y) ← Q(x,a) AND Q(b,c) AND Q(d,y) AND Q(x,b) AND Q(a,c) AND Q(c,y)
Q4: P(x,y) ← Q(x,a) AND Q(a,1) AND Q(1,b) AND Q(b,y)
! Exercise 21.6.2: For the mediator and views of Example 21.16, find all the needed solutions to the great-great-grandparent query:

Q(x,y) ← Par(x,a) AND Par(a,b) AND Par(b,c) AND Par(c,y)
! Exercise 21.6.3: Show that there is no containment mapping from P2 to P1 in Example 21.20.

! Exercise 21.6.4: Show that if conjunctive query Q2 is constructed from conjunctive query Q1 by removing one or more subgoals of Q1, then Q1 ⊆ Q2.
21.7 Entity Resolution
We shall now take up a problem that must be solved in many information-integration scenarios. We have tacitly assumed that sources agree on the representation of entities or values, or at least that it is possible to perform a translation of data as we go through a wrapper. Thus, we are not afraid of two sources that report temperatures, one in Fahrenheit and one in Centigrade. Neither are we afraid of sources that support a concept like employee but have somewhat different sets of employees.

What happens, however, if two sources not only have different sets of employees, but it is unclear whether records at the two sources represent the same
individual or not? Discrepancies can occur for many reasons, such as misspellings. In this section, we shall begin by discussing some of the reasons why entity resolution, that is, determining whether two records or tuples do or do not represent the same person, organization, place, or other entity, is a hard problem. We then look at the process of comparing records and merging those that we believe represent the same entity. Under some fairly reasonable conditions, there is an algorithm for finding a unique way to group all sets of records that represent a common entity and to perform this grouping efficiently.
21.7.1 Deciding Whether Records Represent a Common Entity
Imagine we have a collection of records that represent members of an entity set. These records may be tuples derived from several different sources, or even from one source. We only need to know that the records each have the same fields (although some records may have null in some fields). We hope to compare the values in corresponding fields to decide whether or not two records represent the same entity.

To be concrete, suppose that the entities are people, and the records have three fields: name, address, and phone. Intuitively, we want to say that two records represent the same individual if the two records have similar values for each of the three fields. It is not sufficient to insist that the values of corresponding fields be identical, for a number of reasons. Among them:

1. Misspellings. Often, data is entered by a clerk who hears something over the phone, or who copies a written form carelessly. Thus, Smythe may appear as Smith, or Jones may appear as Jomes (m and n are adjacent on the keyboard). Two phone numbers or street addresses may differ in a digit, yet really represent the same phone or house.

2. Variant Names. A person may supply their middle initial or not. They may use their complete first name or just their initial, or a nickname. Thus, Susan Williams may appear as Susan B. Williams, S. Williams, or Sue Williams in different records.

3. Misunderstanding of Names. There are many different systems of names used throughout the world. In the US, it is sometimes not understood that Asian names generally begin with the family name. Thus, Chen Li and Li Chen may or may not turn out to be the same person. The first author of this book has been referred to as Hector Garcia-Molina, Hector Garcia, and even Hector G. Molina.

4. Evolution of Values. Sometimes, two different records that represent the same entity were created at different times. A person may have moved in the interim, so the address fields in the two records are completely different. Or they may have started using a cell phone, so the phone
fields are completely different. Area codes are sometimes changed. For example, every (650) number used to be a (415) number, so an old record may have (415) 555-1212 and a newer record (650) 555-1212, and yet these numbers refer to the same phone.
5. Abbreviations. Sometimes words in an address are spelled out; other times an abbreviation may be used. Thus, Sesame St. and Sesame Street may be the same street.

Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and devise a scoring system or other test that measures the similarity of records. Ultimately, we must turn the score into a yes/no decision: do the records represent the same entity or not? We shall mention below two useful approaches to measuring the similarity of records.
Edit Distance
Values that are strings can be compared by counting the number of insertions and/or deletions of characters it takes to turn one string into another. Thus, Smythe and Smith are at distance 3 (delete the y and e, then insert the i). An alternative edit distance counts 1 for a mutation, that is, a replacement of one letter by another. In this measure, Smythe and Smith are at distance 2 (mutate y to i and delete e). This edit distance makes mistyped characters cost less, and therefore may be appropriate if typing errors are common in the data.

Finally, we may devise a specialized distance that takes into account the way the data was constructed. For instance, if we decide that changes of area codes are a major source of errors, we might charge only 1 for changing an entire area code from one to another. We might decide that the problem of misinterpreted family names is severe and allow two components of a name to be swapped at low cost, so Chen Li and Li Chen are at distance 1.

Once we have decided on the appropriate edit distance for each field, we can define a similarity measure for records. For example, we could sum the edit distances of each of the pairs of corresponding fields in the two records, or we could compute the sum of the squares of those distances. Whatever formula we use, we then say that records represent the same entity if their similarity measure is below a given threshold.
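For concreteness, here is a Python sketch, not from the text, of the second edit distance described above, in which insertions, deletions, and mutations each cost 1; it is the standard dynamic-programming computation and reports distance 2 for Smythe and Smith.

# Sketch (not from the text): edit distance with unit costs for insertion,
# deletion, and mutation (the Levenshtein distance).

def edit_distance(s, t):
    # d[i][j] = distance between the first i characters of s and first j of t.
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            mutate = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete from s
                          d[i][j - 1] + 1,        # insert into s
                          d[i - 1][j - 1] + mutate)
    return d[len(s)][len(t)]

print(edit_distance("Smythe", "Smith"))   # 2: mutate y to i, delete e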
Normalization
Before applying an edit distance, we might wish to normalize records by replacing certain substrings by others. The goal is that substrings representing the same thing will become identical. For instance, it may make sense to use a table of abbreviations and replace abbreviations by what they normally stand
for. Thus, St. would be replaced by Street in street addresses and by Saint in town names. Also, we could use a table of nicknames and variant spellings, so Sue would become Susan and Jeffery would become Geoffrey. One could even use the Soundex encoding of names, so names that sound the same are represented by the same string. This system, used by telephone information services, for example, would represent Smith and Smythe identically. Once we have normalized values in the records, we could base our similarity test on identical values only (e.g., a majority of fields have identical values in the two records), or we could further use an edit distance to measure the difference between normalized values in the fields.
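A normalization pass can be as simple as token replacement driven by a small table. The sketch below, not from the text, uses made-up illustrative tables of abbreviations and nicknames.

# Sketch (not from the text): normalize a field by replacing abbreviations and
# nicknames with canonical forms before measuring similarity.

ABBREVIATIONS = {"st.": "street", "ave.": "avenue", "rd.": "road"}   # illustrative
NICKNAMES = {"sue": "susan", "geoff": "geoffrey", "jeffery": "geoffrey"}

def normalize(value, table):
    words = value.lower().split()
    return " ".join(table.get(w, w) for w in words)

print(normalize("123 Sesame St.", ABBREVIATIONS))   # '123 sesame street'
print(normalize("Sue Williams", NICKNAMES))         # 'susan williams'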
21.7.2 Merging Similar Records
In many applications, when we find two records that are similar enough to merge, we would like to replace them by a single record that, in some sense, contains the information of both. For instance, if we want to compile a dossier on the entity represented, we might take the union of all the values in each field. Or we might somehow combine the values in corresponding fields to make a single value.

If we try to combine values, there are many rules that we might follow, with no obvious best approach. For example, we might assume that a full name should replace a nickname or initials, and a middle initial should be used in place of no middle initial. Thus, Susan Williams and S. B. Williams would be combined into Susan B. Williams. It is less clear how to deal with misspellings. For instance, how would we combine the addresses 123 Oak St. and 123 Yak St.? Perhaps we could look at the town or zip code and determine that there was an Oak St. there and no Yak St. But if both existed and had 123 in their range of addresses, there is no right answer.

Another problem that arises if we use certain combinations of a similarity test and a merging rule is that our decision to merge one pair of records may preclude our merging another pair. An example may help illustrate the risk.

     name            address        phone
(1)  Susan Williams  123 Oak St.    818-555-1234
(2)  Susan Williams  456 Maple St.  818-555-1234
(3)  Susan Williams  456 Maple St.  213-555-5678

Figure 21.12: Three name-address-phone records
Example 21.22: Suppose that we have the three name-address-phone records in Fig. 21.12, and our similarity rule is: the records must agree exactly in at least two of the three fields. Suppose also that our merge rule is: set the field in which the records disagree to the empty string.
Then records (1) and (2) are similar; so are records (2) and (3). Note that records (1) and (3) are not similar to each other, which serves to remind us that similarity is not normally a transitive relationship. If we decide to replace (1) and (2) by their merger, we are left with the two tuples:

       name            address        phone
(1-2)  Susan Williams                 818-555-1234
(3)    Susan Williams  456 Maple St.  213-555-5678

These records disagree in two fields, so they cannot be merged. Had we merged (1) and (3) first, we would again have a situation where the remaining record cannot be merged with the result.

Another choice for similarity and merge rules is:

1. Merge by taking the union of the values in each field, and

2. Declare two records similar if at least two of the three fields have a nonempty intersection.

Consider the three records in Fig. 21.12. Again, (1) is similar to (2) and (2) is similar to (3), but (1) is not similar to (3). If we choose to merge (1) and (2) first, we get:

       name            address                       phone
(1-2)  Susan Williams  {123 Oak St., 456 Maple St.}  818-555-1234
(3)    Susan Williams  456 Maple St.                 213-555-5678

Now, the remaining two tuples are similar, because 456 Maple St. is a member of both address sets and Susan Williams is a member of both name sets. The result is a single tuple:

         name            address                       phone
(1-2-3)  Susan Williams  {123 Oak St., 456 Maple St.}  {818-555-1234, 213-555-5678}
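The second pair of rules in Example 21.22 (merge by taking the union of the values in each field; similar if at least two of the three fields have a nonempty intersection) is easy to state in code. The Python sketch below, not from the text, represents each field of a record as a frozenset of values and replays the merges of the example.

# Sketch (not from the text): the union-based similarity and merge rules of
# Example 21.22, with each field of a record held as a frozenset of values.

def record(name, address, phone):
    return (frozenset([name]), frozenset([address]), frozenset([phone]))

def similar(r, s):
    # At least two of the three fields have a nonempty intersection.
    return sum(bool(a & b) for a, b in zip(r, s)) >= 2

def merge(r, s):
    # Take the union of the values in each field.
    return tuple(a | b for a, b in zip(r, s))

r1 = record("Susan Williams", "123 Oak St.", "818-555-1234")
r2 = record("Susan Williams", "456 Maple St.", "818-555-1234")
r3 = record("Susan Williams", "456 Maple St.", "213-555-5678")

print(similar(r1, r2), similar(r2, r3), similar(r1, r3))   # True True False
m12 = merge(r1, r2)
print(similar(m12, r3))    # True: representability lets the merging continue
print(merge(m12, r3))      # the fully merged record (1-2-3)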
21.7.3 Useful Properties of Similarity and Merge Functions
Any choice of similarity and merge functions allows us to test pairs of records for similarity and merge them if they are similar. As we saw in the first part of Example 21.22, the result we get when no more records can be merged may depend on which pairs of mergeable records we consider first. Whether or not different ending configurations can result depends on properties of similarity and merger. There are several properties that we would expect any merge function to satisfy. If ∧ is the operation that produces the merge of two records, it is reasonable to expect:
1. r ∧ r = r (Idempotence). That is, the merge of a record with itself should surely be that record.

2. r ∧ s = s ∧ r (Commutativity). If we merge two records, the order in which we list them should not matter.

3. (r ∧ s) ∧ t = r ∧ (s ∧ t) (Associativity). The order in which we group records for a merger should not matter.

These three properties say that the merge operation is a semilattice. Note that both merger functions in Example 21.22 have these properties. The only tricky point is that we must remember that r ∧ s need not be defined for all records r and s. We do, however, assume that:

• If r and s are similar, then r ∧ s is defined.

There are also some properties that we expect the similarity relationship to have, and ways that we expect similarity and merging to interact. We shall use r ≈ s to say that records r and s are similar.

a) r ≈ r (Idempotence for similarity). A record is always similar to itself.

b) r ≈ s if and only if s ≈ r (Commutativity of similarity). That is, in deciding whether two records are similar, it does not matter in which order we list them.

c) If r ≈ s, then r ≈ (s ∧ t) (Representability). This rule requires that if r is similar to some other record s (and thus could be merged with s), but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.

Note that representability is the property most likely to fail. For instance, it fails for the first merger rule in Example 21.22, where we merge by setting disagreeing fields to the empty string; representability fails when r is record (3) of Fig. 21.12, s is (2), and t is (1). On the other hand, the second merger rule of Example 21.22 satisfies the representability rule. If r and s have nonempty intersections in at least two fields, those shared values will still be present if we replace s by s ∧ t.

The collection of properties above are called the ICAR properties. The letters stand for Idempotence, Commutativity, Associativity, and Representability, respectively.
21.7.4 The R-Swoosh Algorithm for ICAR Records
When the similarity and merge functions satisfy the ICAR properties, there is a simple algorithm that merges all possible records. The representability property guarantees that if two records are similar, then as they are merged with other records, the resulting records are also similar and will eventually
be merged. Thus, if we repeatedly replace any pair of similar records by their merger, until no more pairs of similar records remain, then we reach a unique set of records that is independent of the order in which we merge.

A useful way to think of the merger process is to imagine a graph whose nodes are the records. There is an edge between nodes r and s if r ≈ s. Since similarity need not be transitive, it is possible that there are edges between r and s and between s and t, yet there is no edge between r and t. For instance, the records of Fig. 21.12 have the graph of Fig. 21.13.
Figure 21.13: Similarity graph from Fig. 21.12

However, representability tells us that if we merge s and t, then because r is similar to s, it will be similar to s ∧ t. Thus, we can merge all three of r, s, and t. Likewise, if we merge r and s first, representability says that because s ≈ t, we also have (r ∧ s) ≈ t, so we can merge t with r ∧ s. Associativity tells us that the resulting record will be the same, regardless of the order in which we do the merge.

The idea described above extends to any set of ICAR records (nodes) that are connected in any way. That is, regardless of the order in which we do the merges, the result is that every connected component of the graph becomes a single record. This record is the merger of all the records in that component. Commutativity and associativity are enough to tell us that the order in which we perform the mergers does not matter.

Although computing connected components of a graph is simple in principle, when we have millions of records or more, it is not feasible to construct the graph. To do so would require us to test the similarity of every pair of records. The R-Swoosh Algorithm is an implementation of this idea that organizes the comparisons so we avoid, in many cases, comparing all pairs of records. Unfortunately, if no records at all are similar, then there is no algorithm that can avoid comparing all pairs of records to determine this fact.

Algorithm 21.23: R-Swoosh.
INPUT: A set of records I, a similarity function ≈, and a merge function ∧.
We assume that ≈ and ∧ satisfy the ICAR properties. If they do not, then the algorithm will still merge some records, but the result may not be the maximum or best possible merging.
OUTPUT: A set of merged records O.

METHOD: Execute the steps of Fig. 21.14. The value of O at the end is the
output.
O := emptyset;
WHILE I is not empty DO BEGIN
    let r be any record in I;
    find, if possible, some record s in O that is similar to r;
    IF no record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;

Figure 21.14: The R-Swoosh Algorithm

Example 21.24: Suppose that I is the three records of Fig. 21.12, and that we use the ICAR similarity and merge functions from Example 21.22, where we take the union of possible values for a field to produce the corresponding field in the merged record. Initially, O is empty. We pick one of the records from I, say record (1), to be the record r in Fig. 21.14. Since O is empty, there is no possible record s, so we move record (1) from I to O.

We next pick a new record r. Suppose we pick record (3). Since record (3) is not similar to record (1), which is the only record in O, we again have no value of s, so we move record (3) from I to O.

The third choice of r must be record (2). That record is similar to both of the records in O, so we must pick one to be s; say we pick record (1). Then we merge records (1) and (2) to get the record

       name            address                       phone
(1-2)  Susan Williams  {123 Oak St., 456 Maple St.}  818-555-1234
We remove record (2) from I, remove record (1) from O, and insert the above record into I. At this point, I consists of only the record (1-2), and O consists of only the record (3).

The execution of the R-Swoosh Algorithm ends after we pick record (1-2) as r (the only choice) and pick record (3) as s (again the only choice). These records are merged, to produce

         name            address                       phone
(1-2-3)  Susan Williams  {123 Oak St., 456 Maple St.}  {818-555-1234, 213-555-5678}
and deleted from I and O, respectively. The record (1-2-3) is put in I, at which point it is the only record in I, and O is empty. At the last step, this record is moved from I to O, and we are done.
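Figure 21.14 translates almost line for line into executable code. The Python sketch below, not from the text, parameterizes R-Swoosh by the similarity and merge functions and runs it on the three records of Fig. 21.12 with the union-based rules of Example 21.22. The order in which records are picked differs from the narrative of Example 21.24, but by the ICAR properties the outcome is the same single merged record.

# Sketch (not from the text): the R-Swoosh Algorithm of Fig. 21.14.
# Each field of a record is represented as a frozenset of values.

def r_swoosh(records, similar, merge):
    I, O = list(records), []                         # mirrors I and O in Fig. 21.14
    while I:
        r = I.pop()                                  # let r be any record in I
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)                              # move r from I to O
        else:
            O.remove(s)                              # delete s from O
            I.append(merge(r, s))                    # add the merger of r and s to I
    return O

def similar(r, s):
    return sum(bool(a & b) for a, b in zip(r, s)) >= 2

def merge(r, s):
    return tuple(a | b for a, b in zip(r, s))

def record(*fields):
    return tuple(frozenset([f]) for f in fields)

records = [
    record("Susan Williams", "123 Oak St.", "818-555-1234"),
    record("Susan Williams", "456 Maple St.", "818-555-1234"),
    record("Susan Williams", "456 Maple St.", "213-555-5678"),
]
for rec in r_swoosh(records, similar, merge):
    print(rec)     # one record: the merger (1-2-3) of all three inputs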
21.7.5 Why R-Swoosh Works
Recall that for ICAR similarity and merge functions, the goal is to merge the records that form each connected component. There is a loop invariant that holds for the while-loop of Fig. 21.14:

• If a connected component C is not completely merged into one record, then there is at least one record in I that is either in C or was formed by the merger of some records from C.

To see why this invariant must hold, suppose that the selected record r in some iteration of the loop is the last record in I from its connected component C. If r is the only record anywhere that is the merger of one or more records from C, then C is completely merged into r, and r may be moved to O without violating the loop invariant. However, if there are other records that are the merger of one or more records from C, they are in O.

Let r be the merger of the set of records R ⊆ C. Note that R could be only one record, or could be many records. However, since R is not all of C, there must be an original record r1 in R that is similar to another original record r2 that is in C − R. Suppose r2 is currently merged into a record r' in O. By representability, perhaps applied several times, we can start with the known r1 ≈ r2 and deduce that r ≈ r'. Thus, r' can be s in Fig. 21.14. As a result, r will surely be merged with some record from O. The resulting merged record will be placed in I and is the merger of some or all records from C. Thus, the loop invariant continues to hold.
21.7.6 Other Approaches to Entity Resolution
There are many other algorithms known to discover and (optionally) merge similar records. We shall outline some of them briefly here.
Non-ICAR Datasets
First, suppose the ICAR properties do not hold, but we want to find all possible mergers of records, including cases where one record r1 is merged with a record r2, but later, r1 (not the merger r1 ∧ r2) is also merged with r3. If so, we need to systematically compare all records, including those we constructed by merger, with all other records, again including those constructed by merger. To help control the proliferation of records, we can define a dominance relation r ≤ s that means record s contains all the information contained in record r. If so, we can eliminate record r from further consideration. If the merge function is a semilattice, then the only reasonable choice for ≤ is: a ≤ b if and only if a ∧ b = b. This dominance relation is always a partial order, regardless of what semilattice is used. If the merge operation is not even a semilattice, then the dominance function must be constructed in an ad-hoc manner.
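For a semilattice merge, the dominance test is just a comparison of a merged record with one of its arguments. A minimal sketch, not from the text, using the union-based merge of the earlier examples:

# Sketch (not from the text): dominance for a semilattice merge, where
# r is dominated by s exactly when merging r into s adds nothing new.

def dominated(r, s, merge):
    return merge(r, s) == s

def merge_fields(r, s):          # union merge, as in the examples above
    return tuple(a | b for a, b in zip(r, s))

r = (frozenset({"Susan Williams"}), frozenset({"818-555-1234"}))
s = (frozenset({"Susan Williams", "Sue Williams"}), frozenset({"818-555-1234"}))
print(dominated(r, s, merge_fields))   # True: s already contains all of r
print(dominated(s, r, merge_fields))   # False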
Clustering

In some entity-resolution applications, we do not want to merge at all, but will instead group records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar. For example, if we are looking for similar products sold on eBay, we might want the result to be not a single record for each kind of product, but rather a list of the records that represent a common product for sale. Clustering of large-scale data involves a complex set of options. We shall discuss the matter further in Section 22.5.

Partitioning

Since any algorithm for doing a complete merger of similar records may be forced to examine each pair of records, it may be infeasible to get an exact answer to a large entity-resolution problem. One solution is to group the records, perhaps several times, into groups that are likely to contain similar records, and to look only within each group for pairs of similar records.

Example 21.25: Suppose we have millions of name-address-phone records, and our measure of similarity is that the total edit distance of the values in the three fields must be at most 5. We could partition the records into groups such that each group has the same name field. We could also partition the records according to the value in their address field, and a third time according to their phone numbers. Thus, each record appears in three groups and is compared only with the members of those groups.

This method will not notice a pair of similar records that have edit distance 2 in their phone fields, 2 in their name fields, and 1 in their address fields. However, in practice, it will catch almost all similar pairs.

The idea in Example 21.25 is actually a special case of an important idea: locality-sensitive hashing. We discuss this topic in Section 22.4.
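The partitioning idea of Example 21.25 amounts to comparing records only within groups that share a value in some field. The Python sketch below, not from the text, builds one group index per field and yields the candidate pairs; a pair that differs slightly in every field is missed, which is exactly the approximation the example describes.

# Sketch (not from the text): partition records by each field and compare
# only within groups, instead of comparing all pairs.

from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """records: list of (name, address, phone) tuples; yields index pairs."""
    seen = set()
    for field in range(3):
        groups = defaultdict(list)
        for i, rec in enumerate(records):
            groups[rec[field]].append(i)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if (i, j) not in seen:
                    seen.add((i, j))
                    yield i, j

records = [
    ("Susan Williams", "123 Oak St.", "818-555-1234"),
    ("Susan B. Williams", "123 Oak St.", "818-555-1234"),
    ("Sue Williams", "456 Maple St.", "213-555-5678"),
]
print(list(candidate_pairs(records)))   # [(0, 1)]: only records sharing a field value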
21.7.7 Exercises for Section 21.7
Exercise 21.7.1: A string s is a subsequence of a string t if s is formed from t by deleting 0 or more positions of t. For example, if t = "abcab", then subsequences of t include "aba" (delete positions 3 and 5), "bc" (delete positions 1, 4, and 5), and the empty string (delete all positions).

a) What are all the other subsequences of "abcab"?

b) What are the subsequences of "aabb"?

! c) If a string consists of n distinct characters, how many subsequences does it have?
Exercise 21.7.2: A longest common subsequence of two strings s and t is any string r that is a subsequence of both s and t and is as long as any other string that is a subsequence of both. For example, the longest common subsequences of "aba" and "bab" are "ab" and "ba". Give a longest common subsequence for each pair of the following strings: "she", "hers", "they", and "theirs".

Exercise 21.7.3: A shortest common supersequence of two strings s and t is any string r of which both s and t are subsequences, such that no string shorter than r has both s and t as subsequences. For example, some of the shortest common supersequences of "abc" and "cb" are "abcb" and "acbc".

a) What are the shortest common supersequences of each pair of strings in Exercise 21.7.2?

! b) What are all the other shortest common supersequences of "abc" and "cb"?

!! c) If two strings have no characters in common, and are of lengths m and n, respectively, how many shortest common supersequences do the two strings have?

!! Exercise 21.7.4: Suppose we merge records (whose fields are strings) by taking, for each field, the lexicographically first longest common subsequence of the strings in the corresponding fields.

a) Does this definition of merge satisfy the idempotent, commutative, and associative laws?

b) Repeat (a) if instead corresponding fields are merged by taking the lexicographically first shortest common supersequence.

! Exercise 21.7.5: Suppose we define the similarity and merge functions by:
i. Records are similar if, in all fields or in all but one field, either both records have the same value or one has NULL.

ii. Merge records by letting each field have the common value if both records agree in that field, or have value NULL if the records disagree in that field. Note that NULL disagrees with any nonnull value.
Show that these similarity and merge functions have the ICAR properties.

! Exercise 21.7.6: In Section 21.7.6 we suggested that if ∧ is a semilattice operation, then the dominance relationship defined by a ≤ b if and only if a ∧ b = b is a partial order. That is, a ≤ b and b ≤ c imply a ≤ c (transitivity), and a ≤ b and b ≤ a if and only if a = b (antisymmetry). Prove that ≤ is a partial order, using the idempotence, commutativity, and associativity properties of a semilattice.
21.8 Summary of Chapter 21
Integration of Information: When many databases or other information sources contain related information, we have the opportunity to combine these sources into one. However, heterogeneities in the schemas often exist; these incompatibilities include differing types, codes or conventions for values, interpretations of concepts, and different sets of concepts represented in different schemas.

Approaches to Information Integration: Early approaches involved federation, where each database would query the others in the terms understood by the second. A more recent approach is warehousing, where data is translated to a global schema and copied to the warehouse. An alternative is mediation, where a virtual warehouse is created to allow queries to a global schema; the queries are then translated to the terms of the data sources.

Extractors and Wrappers: Warehousing and mediation require components at each source, called extractors and wrappers, respectively. A major function of either is to translate queries and results between the global schema and the local schema at the source.

Wrapper Generators: One approach to designing wrappers is to use templates, which describe how a query of a specific form is translated from the global schema to the local schema. These templates are tabulated and interpreted by a driver that tries to match queries to templates. The driver may also have the ability to combine templates in various ways, and/or perform additional work such as filtering, to answer more complex queries.

Capability-Based Optimization: The sources for a mediator often are able or willing to answer only limited forms of queries. Thus, the mediator must select a query plan based on the capabilities of its sources, before it can even think about optimizing the cost of query plans as conventional DBMSs do.

Adornments: These provide a convenient notation in which to describe the capabilities of sources. Each adornment tells, for each attribute of a relation, whether, in queries matching that adornment, this attribute requires or permits a constant value, and whether constants must be chosen from a menu.

Conjunctive Queries: A single Datalog rule, used as a query, is a convenient representation for queries involving joins, possibly followed by selection and/or projection.

The Chain Algorithm: This algorithm is a greedy approach to answering mediator queries that are in the form of a conjunctive query. Repeatedly look for a subgoal that matches one of the adornments at a source, and
obtain the relation for that subgoal from the source. Doing so may provide a set of constant bindings for some variables of the query, so repeat the process, looking for additional subgoals that can be resolved.
Local-as-View Mediators: These mediators have a set of global, virtual predicates or relations at the mediator, and each source is described by views, which are conjunctive queries whose subgoals use the global predicates. A query at the mediator is also a conjunctive query using the global predicates.

Answering Queries Using Views: A local-as-view mediator searches for solutions to a query, which are conjunctive queries whose subgoals use the views as predicates. Each such subgoal of a proposed solution is expanded using the conjunctive query that defines the view, and it is checked that the expansion is contained in the query. If so, the proposed solution does indeed provide (some of the) answers to the query.

Containment of Conjunctive Queries: We test for containment of conjunctive queries by looking for a containment mapping from the containing query to the contained query. A containment mapping is a substitution for variables that turns the head of the first into the head of the second and turns each subgoal of the first into some subgoal of the second.

Limiting the Search for Solutions: The LMSS Theorem says that when searching for solutions to a query at a local-as-view mediator, it is sufficient to consider solutions that have no more subgoals than the query does.

Entity Resolution: The problem is to take records with a common schema, find pairs or groups of records that are likely to represent the same entity (e.g., a person), and merge these records into a single record that represents the information of the entire group.

ICAR Similarity and Merge Functions: Certain choices of similarity and merge functions satisfy the properties of idempotence, commutativity, associativity, and representability. The latter is the key to efficient algorithms for merging, since it guarantees that if two records are similar, their successors will also be similar even as they are merged into records that represent progressively larger sets of original records.

The R-Swoosh Algorithm: If similarity and merge functions have the ICAR properties, then the complete merger of similar records will group all records that are in a connected component of the graph formed from the similarity relation on the original records. The R-Swoosh algorithm is an efficient way to make all necessary mergers without determining similarity for every pair of records.
21.9 References for Chapter 21
Federated systems are surveyed in [11]. The concept of the mediator comes from [12]. Implementation of mediators and wrappers, especially the wrapper-generator approach, is covered in [4]. Capability-based optimization for mediators was explored in [10, 13]; the latter describes the Chain Algorithm. Local-as-view mediators come from [7]. The LMSS Theorem is from [6], and the idea of containment mappings to decide containment of conjunctive queries is from [2]; [8] extends the idea to sources with limited capabilities. [5] is a survey of logical information-integration techniques. Entity resolution was first studied informally by [9] and formally by [3]. The theory presented here, the R-Swoosh Algorithm, and related algorithms are from [1].

1. O. Benjelloun, H. Garcia-Molina, J. Jonas, Q. Su, S. E. Whang, and J. Widom, Swoosh: a generic approach to entity resolution. Available as http://dbpubs.stanford.edu:8090/pub/2005-5.

2. A. K. Chandra and P. M. Merlin, Optimal implementation of conjunctive queries in relational databases, Proc. Ninth Annual Symposium on Theory of Computing, pp. 77-90, 1977.

3. I. P. Fellegi and A. B. Sunter, A theory for record linkage, J. American Statistical Assn. 64, pp. 1183-1210, 1969.

4. H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, V. Vassalos, J. D. Ullman, and J. Widom, The TSIMMIS approach to mediation: data models and languages, J. Intelligent Information Systems 8:2 (1997), pp. 117-132.

5. A. Y. Levy, Logic-based techniques in data integration, Logic-Based Artificial Intelligence (J. Minker, ed.), pp. 575-595, Kluwer, Norwell, MA, 2000.
6. A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava, Answering queries using views, Proc. Fourteenth Annual Symposium on Principles of Database Systems, pp. 95-104, 1995.

7. A. Y. Levy, A. Rajaraman, and J. J. Ordille, Querying heterogeneous information sources using source descriptions, Intl. Conf. on Very Large Databases, pp. 251-262, 1996.
8. A. Y. Levy, A. Rajaraman, and J. D. Ullman, Answering queries using limited external query processors, Proc. Fifteenth Annual Symposium on Principles of Database Systems, pp. 227-237, 1996.

9. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James, Automatic linkage of vital records, Science 130, pp. 954-959, 1959.
10. Y. Papakonstantinou, A. Gupta, and L. Haas, Capabilities-based query rewriting in mediator systems, Conference on Parallel and Distributed Information Systems (1996). Available as http://dbpubs.stanford.edu/pub/1995-2.

11. A. P. Sheth and J. A. Larson, Federated databases for managing distributed, heterogeneous, and autonomous databases, Computing Surveys 22:3 (1990), pp. 183-236.

12. G. Wiederhold, Mediators in the architecture of future information systems, IEEE Computer 25:3 (1992), pp. 38-49.

13. R. Yerneni, C. Li, H. Garcia-Molina, and J. D. Ullman, Optimizing large joins in mediation systems, Proc. Seventh Intl. Conf. on Database Theory, pp. 348-364, 1999.
Chapter 22
Data Mining
Data mining is the process of examining data and finding simple rules or models that summarize the data. The rules can range from the very general, such as "50% of the people who buy hot dogs also buy mustard," to the very specific: "these three individuals' pattern of credit-card expenditures indicates that they are running a terrorist cell." Our discussion of data mining will concentrate on mining information from very large databases.

We begin by looking at market-basket data: records of the things people buy together, such as at a supermarket. This study leads to a number of efficient algorithms for finding frequent itemsets in large databases, including the A-Priori Algorithm and its extensions.

We next turn to finding similar items in a large collection. Example applications include finding documents on the Web that share a significant amount of common text or finding books that have been bought by many of the same Amazon customers. Two key techniques for this problem are minhashing and locality-sensitive hashing.

We conclude the chapter with a discussion of the problem of large-scale clustering in high dimensions. An example application is clustering Web pages by the words they use. In that case, each word might be a dimension, and a document is placed in this space by counting the number of occurrences of each word.
22.1
There is a family of problems that arise from attempts by marketers to use large databases of customer purchases to extract information about buying patterns. The fundamental problem is called frequent itemsets: what sets of items are often bought together? This information is sometimes further refined into association rules: implications that people who buy one set of items are likely to buy another particular item. The same technology has many
other uses, from discovering combinations of genes related to certain diseases to finding plagiarism among documents on the Web.
22.1.1 The Market-Basket Model
In several important applications, the data involves a set of items, perhaps all the items that a supermarket sells, and a set of baskets; each basket is a subset of the set of items, typically a small subset. The baskets each represent a set of items that someone has bought together. Here are two typical examples of where market-basket data appears.
Supermarket Checkout
A supermarket chain may sell 10,000 different items. Daily, millions of customers wheel their shopping carts (market baskets) to the checkout, and the cash register records the set of items they purchased. Each such set is one basket, in the sense used by the market-basket model. Some customers may have identified themselves, using a discount card that many supermarket chains provide, or by their credit card. However, the identity of the customer often is not necessary to get useful information from the data.

Stores analyze the data to learn what typical customers buy together. For example, if a large number of baskets contain both hot dogs and mustard, the supermarket manager can use this information in several ways.

1. Apparently, many people walk from where the hot dogs are to where the mustard is. We can put them close together, and put between them other foods that might also be bought with hot dogs and mustard, e.g., ketchup or potato chips. Doing so can generate additional impulse sales.

2. The store can run a sale on hot dogs and at the same time raise the price of mustard (without advertising that fact, of course). People will come to the store for the cheap hot dogs, and many will need mustard too. It is not worth the trouble to go to another store for cheaper mustard, so they buy that too. The store makes back on mustard what it loses on hot dogs, and also gets more customers into the store.

While the relationship between hot dogs and mustard may be obvious to those who think about the matter, even if they have no data to analyze, there are many pairs of items that are connected but may be less obvious. The most famous example is diapers and beer.1

There are some conditions on when a fact about co-occurrence of sets of items can be useful. Any useful pair (or larger set) of items must be bought by many customers. It is not even necessary that there be any connection between purchases of the items, as long as we know lots of customers buy them
1One theory: if you buy diapers, you probably have a baby at home. If so, you are not going out to a bar tonight, so you are more likely to buy beer at a supermarket.
all. Conversely, strongly linked but rarely purchased items (e.g., caviar and champagne) are not very interesting to the supermarket, because it doesn't pay to advertise things that few customers are interested in buying anyway.
On-Line Purchases
Amazon.com offers several million different items for sale, and has several tens of millions of customers. While brick-and-mortar stores such as the supermarket discussed above can only make money on combinations of items that large numbers of people buy, Amazon and other on-line sellers have the opportunity to tailor their offers to every customer. Thus, an interesting question is to find pairs of items that many customers have bought together. Then, if one customer has bought one of these items but not the other, it might be good for Amazon to advertise the second item when this customer next logs in. We can treat the purchase data as a market-basket problem, where each basket is the set of items that one particular customer ever has bought.

But there is another way Amazon can use the same data. This approach, often called collaborative filtering, has us look for customers that are similar in their purchase habits. For example, we could look for pairs, or even larger sets, of customers that have bought many of the same items. Then, if a customer logs in, Amazon might pitch an item that a similar customer bought, but this customer has not. Finding similar customers also can be couched as a market-basket problem. Here, however, the "items" are the customers and the "baskets" are the items for sale by Amazon. That is, for each item I sold by Amazon there is a basket consisting of all the customers who bought I.

It is worth noting that the meaning of "many baskets" differs in the on-line and brick-and-mortar situations. In the brick-and-mortar case, we may need thousands of baskets containing a set of items before we can exploit that information profitably. For on-line stores, we need many fewer baskets containing a set of items before we can use the information in the limited context we intend (pitching one item to one customer).

On the other hand, the brick-and-mortar store doesn't need too many examples of good sets of items to use; they can't run sales on millions of items. In contrast, the on-line store needs millions of good pairs to work with, at least one for each customer. As a result, the most effective techniques for analyzing on-line purchases may not be those of this section, which exploit the assumption that many occurrences of a pair of items are needed. Rather, we shall resume our discussion of finding correlated, but infrequent, pairs in Section 22.3.
22.1.2
Basic Definitions
Suppose we are given a set of items I and a set of baskets B. Each basket b in B is a subset of I. To talk about frequent sets of items, we need a support threshold s, which is an integer. We say a set of items J ⊆ I is frequent if there
are at least s baskets that contain all the items in J (perhaps along with other items). Optionally, we can express the support s as a percentage of |B|, the number of baskets in B.

Example 22.1: Suppose our set of items I consists of the six movies {BI, BS, BU, HP1, HP2, HP3}, standing for the Bourne Identity, Bourne Supremacy, Bourne Ultimatum, and Harry Potter I, II, and III. The table of Fig. 22.1 shows eight viewers (baskets of items) and the movies they have seen. An x indicates they saw the movie.
[Figure 22.1: Market-basket data about viewers and movies. The figure is a table with one row for each of the eight viewers V1 through V8 and one column for each of the six movies BI, BS, BU, HP1, HP2, and HP3; an x marks each movie the viewer has seen.]

Suppose that s = 3. That is, in order for a set of items to be considered a frequent itemset, it must be a subset of at least three baskets. Technically, the empty set is a subset of all baskets, so it is frequent but uninteresting.

In this example, all singleton sets except {HP3} appear in at least three baskets. For example, {BI} is contained in V1, V3, V4, V5, V6, and V8.

Now, consider which doubleton sets (pairs of items) are frequent. Since HP3 is not frequent by itself, it cannot be part of a frequent pair. However, each of the 10 pairs involving the other five movies might be frequent. For example, {BI, BS} is frequent because it appears in at least three baskets; in fact it appears in four: V1, V4, V5, and V8. Also:

{BI, HP1} is frequent, appearing in V3, V4, V5, and V8.

{BS, HP1} is frequent, appearing in V4, V5, V7, and V8.

{HP1, HP2} is frequent, appearing in V2, V4, V7, and V8.

No other pair is frequent. There is one frequent triple: {BI, BS, HP1}. This set is a subset of the baskets V4, V5, and V8. There are no frequent itemsets of size greater than three.
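To make the definitions concrete, here is a small brute-force Python sketch (not part of the text) that counts the support of every itemset up to a given size; the four baskets shown are hypothetical stand-ins for the data of Fig. 22.1.

```python
from itertools import combinations

# Hypothetical baskets in the spirit of Fig. 22.1 (viewers and movies).
baskets = [
    {"BI", "BS", "BU"},
    {"HP1", "HP2", "HP3"},
    {"BI", "HP1"},
    {"BI", "BS", "HP1", "HP2"},
]

def frequent_itemsets(baskets, s, max_size=3):
    """Brute force: count the support of every itemset up to max_size."""
    items = sorted(set().union(*baskets))
    result = {}
    for n in range(1, max_size + 1):
        for itemset in combinations(items, n):
            support = sum(1 for b in baskets if set(itemset) <= b)
            if support >= s:
                result[itemset] = support
    return result

print(frequent_itemsets(baskets, s=2))
# e.g., ('BI',) and ('BI', 'HP1') are frequent at threshold 2 in this toy data.
```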
22.1.3
A ssociation Rules
A natural query about market-basket data asks for implications among purchases that people make. That is, we want to find pairs of items such that people buying the first are likely to buy the second as well. More generally, people buying a particular set of items are also likely to buy yet another particular item. This idea is formalized by association rules.

An association rule is a statement of the form {i1, i2, ..., in} => j, where the i's and j are items. In isolation, such a statement asserts nothing. However, three properties that we might want in useful rules of this form are:

1. High Support: the support of this association rule is the support of the itemset {i1, i2, ..., in, j}.

2. High Confidence: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is above a certain threshold, e.g., 50%; that is, at least 50% of the people who buy diapers buy beer.

3. Interest: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is significantly higher or lower than the probability of finding j in a random basket. In statistical terms, j correlates with {i1, i2, ..., in}, either positively or negatively. The alleged relationship between diapers and beer is really a claim that the association rule {diapers} => beer has high interest in the positive direction.

Note that even if an association rule has high confidence or interest, it will tend not to be useful unless it also has high support. The reason is that if the support is low, then the number of instances of the rule is not large, which limits the benefit of a strategy that exploits the rule. Also, it is important not to confuse an association rule, even with high values for support, confidence, and interest, with a causal rule. For instance, the beer and diapers example mentioned in Section 22.1.1 suggests that the association rule {beer} => diapers has high confidence, but that does not mean beer causes diapers. Rather, the theory suggested there is that both are caused by a hidden variable: the baby at home.

Example 22.2: Using the data from Fig. 22.1, consider the association rule

{BI, BS} => BU

Its support is 2, since there are two baskets, V1 and V5, that contain all three Bourne movies. The confidence of the rule is 1/2, since there are four baskets that contain both BI and BS, and two of these also contain BU. The rule is slightly interesting in the positive direction. That is, BU appears in 3/8 of all baskets, but appears in 1/2 of those baskets that contain the left side of the association rule.
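The three measures are easy to compute directly from basket data. The following Python sketch is an illustration, not the book's code; it quantifies interest as the difference between the rule's confidence and the overall fraction of baskets containing the right-hand item, which is one common way to make "significantly higher or lower" concrete.

```python
def rule_measures(baskets, left, right):
    """Support, confidence, and interest of the rule left => right,
    where baskets is a list of sets, left is a set, and right is an item."""
    left = set(left)
    both = left | {right}
    n_left = sum(1 for b in baskets if left <= b)
    n_both = sum(1 for b in baskets if both <= b)
    n_right = sum(1 for b in baskets if right in b)
    support = n_both                                   # baskets with all the items
    confidence = n_both / n_left if n_left else 0.0
    interest = confidence - n_right / len(baskets)     # > 0: positive correlation
    return support, confidence, interest

# For data like Fig. 22.1, the rule {BI, BS} => BU would yield
# support 2, confidence 0.5, and interest 0.5 - 3/8 = 0.125.
```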
As long as high support is a significant requirement for a useful association rule, the search for high-confidence or high-interest association rules is really the search for high-support itemsets. Once we have these itemsets, we can consider each member of an itemset as the item on the right of the association rule. We may, as part of the process of finding frequent itemsets, already have computed the counts of baskets for the subsets of this frequent itemset, since they also must be frequent. If so, we can compute easily the confidence and interest of each potential association rule. We shall thus, in what follows, leave aside the problem of finding association rules and concentrate on efficient methods for finding frequent itemsets.
22.1.4
Since we are studying database systems, our first thought might be that the market-basket data is stored in a relation such as:

Baskets(basket, item)

consisting of pairs that are a basket ID and the ID of one of the items in that basket. In principle, we could find frequent itemsets by a SQL query. For instance, the query in Fig. 22.2 finds all frequent pairs. It joins Baskets with itself, grouping the resulting tuples by the two items found in that tuple, and throwing away groups where the number of baskets is below the support threshold s. Note that the condition I.item < J.item in the WHERE-clause is there to prevent the same pair from being considered in both orders, or for a pair consisting of the same item twice from being considered at all.

SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;

Figure 22.2: Naive way to find all high-support pairs of items

However, if the size of the Baskets relation is very large, the join of the relation with itself will be too large to construct, or at least too time-consuming to construct. No matter how efficiently we compute the join, the result relation contains one tuple for each pair of items in a basket. For instance, if there are 1,000,000 baskets, and each basket contains 20 items, then there will be 190,000,000 tuples in the join, since 20 choose 2 = 190. We shall see in Section 22.2 that it is often possible to do much better by preprocessing the Baskets relation.

But in fact, it is not common to store market-basket data as a relation. It is far more efficient to put the data in a file or files consisting of the baskets,
in some order. A basket is represented by a list of its items, and there is some punctuation between baskets.

Example 22.3: The data of Fig. 22.1 could be represented by a file that begins:

{BI,BS,BU}{HP1,HP2,HP3}{BI,HP1}{BI,BS,HP1,HP2}{...

Here, we are using brackets to surround baskets and commas to separate items within a basket.

When market-basket data is represented this way, the cost of an algorithm is relatively simple to estimate. Since we are interested only in cases where the data is too large to fit in main memory, we can count disk-I/Os as our measure of complexity. However, the matter is even simpler than disk-I/Os. All the successful algorithms for finding frequent itemsets read the data file several times, in the order given. They thus make several passes over the data, and the information preserved from one pass to the next is small enough to fit in main memory. Thus, we do not even have to count disk-I/Os; it is sufficient to count the number of passes through the data.
22.1.5 Exercises for Section 22.1
Exercise 22.1.1: Suppose we are given the eight market baskets of Fig. 22.3.

B1 = {milk, coke, beer}
B2 = {milk, pepsi, juice}
B3 = {milk, beer}
B4 = {coke, juice}
B5 = {milk, pepsi, beer}
B6 = {milk, beer, juice, pepsi}
B7 = {coke, beer, juice}
B8 = {beer, pepsi}
a) As a percentage of the baskets, what is the support of the set {beer, juice}?

b) What is the support of the itemset {coke, pepsi}?

c) What is the confidence of milk given beer (i.e., of the association rule {beer} => milk)?

d) What is the confidence of juice given milk?
e) What is the confidence of coke, given beer and juice?

f) If the support threshold is 37.5% (i.e., 3 out of the eight baskets are needed), which pairs of items are frequent?

g) If the support threshold is 50%, which pairs of items are frequent?

! h) What is the most interesting association rule with a singleton set on the left?
22.2
We now look at how many passes are needed to find frequent itemsets of a certain size. We first argue why, in practice, finding frequent pairs is often the bottleneck. Then, we present the A-Priori Algorithm, a key step in minimizing the amount of main memory needed for a multipass algorithm. Several improvements on A-Priori make better use of main memory on the first pass, in order to make it more feasible to complete the algorithm without exceeding the capacity of main memory on later passes.
22.2.1
If we pick a support threshold s = 1, then all itemsets that appear in any basket are frequent, so just producing the answer could be infeasible. However, in applications such as managing sales at a store, a small support threshold is not useful. Recall that we need many customers buying a set of items before we can exploit that itemset. Moreover, any data mining of market-basket data must produce a small number of answers, say tens or hundreds. If we get no answers, we cannot act, but if we get millions of answers, we cannot read them all, let alone act on them all.

The consequence of this reasoning is that the support threshold must be set high enough to make few itemsets frequent. Typically, a threshold around 1% of the baskets is used. Since the probability of an itemset being frequent goes down rapidly with size, most frequent itemsets will be small. However, an itemset of size one is generally not useful; we need at least two items in a frequent itemset in order to apply the marketing techniques mentioned in Section 22.1.1, for example.

Our conclusion is that in practical uses of algorithms to find frequent itemsets, we need to use a support threshold so that there will be a small number of frequent pairs, and very few frequent itemsets that are larger. Thus, our algorithms will focus on how to find frequent pairs in a few passes through the data. If larger frequent itemsets are wanted, the computing resources used to find the frequent pairs are usually sufficient to find the small number of frequent triples, quadruples, and so on.
22.2.2
Let us suppose that there is some fixed number of bytes of main memory M, perhaps a gigabyte, or 16 gigabytes, or whatever our machine has. Let there be k different items in our market-basket dataset, and assume they are numbered 0, 1, ..., k-1. Finally, as suggested in Section 22.2.1, we shall focus on the counting of pairs, assuming that is the bottleneck for memory use.

If there is enough room in main memory to count all the pairs of items as we make a single pass over the baskets, then we can solve the frequent-pairs problem in a single pass. In that pass, we read one block of the data file at a time. We shall neglect the amount of main memory needed to hold this block (or even several blocks if baskets span two or more blocks), since we may assume that the space needed to represent a basket is tiny compared with M. For each basket found on this block, we execute a double loop through its items, and for each pair of items in the basket, we add one to the count for that pair.

The essential problem we face, then, is how to store the counts of the pairs of items in M bytes of memory. There are two reasonable ways to do so, and which is better depends on whether it is common or unlikely that a given pair of items occurs in at least one basket. In what follows, we shall make the simplifying assumption that all integers, whether used for a count or to represent an item, require four bytes. Here are the two contending approaches to maintaining counts.
Triangular Matrix
If most of the possible pairs of items are expected to appear at least once in the dataset, then the most efficient use of main memory is a triangular array. That is, let a be a one-dimensional integer array occupying all available main memory. We count the pair (i, j), where 0 ≤ i < j < k, in a[n], where

n = i(2k - i - 1)/2 + (j - i - 1)

so the pairs are laid out in lexicographic order. As long as M ≥ 2k², there is enough room to store array a, with four bytes per count. Notice that this method takes only half the space that would be used by a square array, of which we would use only the upper or lower triangle to count the pairs (i, j) where i < j.
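A small Python sketch of this layout (an illustration; the lexicographic ordering above is one standard choice for a triangular array):

```python
def pair_index(i, j, k):
    """Index into the one-dimensional triangular array for the pair (i, j),
    with 0 <= i < j < k, assuming pairs are stored in lexicographic order."""
    assert 0 <= i < j < k
    return i * (2 * k - i - 1) // 2 + (j - i - 1)

# With k = 4 items, the six pairs map to positions 0..5:
# (0,1)->0, (0,2)->1, (0,3)->2, (1,2)->3, (1,3)->4, (2,3)->5
counts = [0] * (4 * 3 // 2)          # k*(k-1)/2 counts
counts[pair_index(1, 3, 4)] += 1     # increment the count for pair (1, 3)
```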
Table of Counts
If the probability of a pair of items ever occurring is small, then we can do with space less than O(k²). We instead construct a hash table of triples (i, j, c), where i < j and {i, j} is one of the itemsets that actually occurs in one or more of the baskets. Here, c is the count for that pair. We hash the pair (i, j) to find the bucket in which the count for that itemset is kept. A triple (i, j, c) requires 12 bytes, so we can maintain counts for M/12 pairs.2 Put another way, if p pairs ever occur in the data, we need main memory at least M ≥ 12p.

Notice that there are approximately k²/2 possible pairs if there are k different items. If the number of pairs p = k²/2, then the table of counts requires three times as much main memory as the triangular matrix. However, if only 1/3 of all possible pairs occur, then the two methods have the same memory requirements, and if the probability that a given pair occurs is less than 1/3, then the table of counts is preferable.
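In Python terms, the table of counts is naturally a dictionary keyed by pairs; a minimal sketch (illustration only):

```python
from collections import defaultdict

def count_pairs(baskets):
    """Count only those pairs of items that actually occur in some basket."""
    counts = defaultdict(int)          # plays the role of the hash table of triples
    for basket in baskets:
        items = sorted(basket)
        for x in range(len(items)):
            for y in range(x + 1, len(items)):
                counts[(items[x], items[y])] += 1
    return counts
```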
Additional Comments About the Naive Algorithm
In summary, we can use the naive, one-pass algorithm to find all frequent pairs if the number of bytes of main memory M exceeds either 2k² or 12p, where k is the number of different items and p is the number of pairs of items that occur in at least one basket of the dataset. The same approach can be used to count triples, provided that there is enough memory to count either all possible triples or all triples that actually occur in the data. Likewise, we can count quadruples or itemsets of any size, although the likelihood that we have enough memory goes down as the size goes up. We leave the formulas for how much memory is needed as an exercise.
22.2.3 The A-Priori Algorithm
The A-Priori Algorithm is a method for finding frequent itemsets of size n, for any n, in n passes. It normally uses much less main memory than the naive algorithm, and it is certain to use less memory if the support threshold is sufficiently high that some singleton sets are not frequent. The important
2Whatever kind of hash table we use, there will be some additional overhead, which we shall neglect. For example, if we use open addressing, then it is generally necessary to leave a small fraction of the buckets unfilled, to limit the average search for a triple.
insight that makes the algorithm work is monotonicity of the property of being frequent. That is:

If an itemset S is frequent, so is each of its subsets.

The truth of the above statement is easy to see. If S is a subset of at least s baskets, where s is the support threshold, and T ⊆ S, then T is also a subset of the same baskets that contain S, and perhaps T is a subset of other baskets as well. The use of monotonicity is actually in its contrapositive form:

If S is not a frequent itemset, then no superset of S is frequent.

On the first pass, the A-Priori Algorithm counts only the singleton sets of items. If some of those sets are not frequent by themselves, then their items cannot be part of any frequent pair. Thus, the nonfrequent items can be ignored on a second pass through the data, and only the pairs consisting of two frequent items need be counted. For example, if only half the items are frequent, then we need to count only 1/4 of the number of pairs, so we can use 1/4 as much main memory. Or put another way, with a fixed amount of main memory, we can deal with a dataset that has twice as many items.

We can continue to construct the frequent triples on another pass, the frequent quadruples on the fourth pass, and so on, as high as we like, as long as frequent itemsets of those sizes exist. The generalization is that for the nth pass we begin with a candidate set Cn of itemsets, and we produce a subset Fn of Cn consisting of the frequent itemsets of size n. That is, C1 is the set of all singletons, and F1 is those singletons that are frequent. C2 is the set of pairs of items, both of which are in F1, and F2 is those pairs that are frequent. The candidate set for the third pass, C3, is those triples {i, j, k} such that each doubleton subset, {i, j}, {i, k}, and {j, k}, is in F2. The following gives the algorithm formally.

Algorithm 22.4: The A-Priori Algorithm.
INPUT: A file D consisting of baskets of items, a support threshold s, and a limit q on the size of the itemsets of interest.
Example 22.5: Let us execute the A-Priori Algorithm on the data of Fig. 22.1 with support s = 4. Initially, C1 is the set of all six movies. In the first pass, we count the singleton sets, and we find that BI, BS, HP1, and HP2 occur at least four times; the other two movies do not. Thus, F1 = {BI, BS, HP1, HP2}, and C2 is the set of six pairs that can be formed by choosing two of these four movies.
1)  LET C1 = all items that appear in file D;
2)  FOR n := 1 TO q DO BEGIN
3)      Fn := those sets in Cn that occur at least s times in D;
4)      IF n = q BREAK;
5)      LET Cn+1 = all itemsets S of size n+1 such that
            every subset of S of size n is in Fn;
    END

Figure 22.4: The A-Priori Algorithm
On the second pass, we count only these six pairs, and we find that F2 = {{BI, BS}, {HP1, HP2}, {BI, HP1}, {BS, HP1}}; the other two pairs are not frequent. Assuming q > 2, we try to find frequent triples. C3 consists of only the triple {BI, BS, HP1}, because that is the only set of three movies, all pairs of which are in F2. However, these three movies appear together in only three rows: V4, V5, and V8. Thus, F3 is empty, and there are no more frequent itemsets, no matter how large q is. The algorithm returns F1 ∪ F2.
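The loop of Fig. 22.4 translates directly into Python for in-memory data. The sketch below is an illustration, not a disk-based implementation; it represents the file as a list of sets and returns every frequent itemset with its count.

```python
from itertools import combinations

def a_priori(baskets, s, q):
    """Frequent itemsets of size at most q, with their support counts."""
    frequent = {}
    # C1: all items that appear in the data, as singleton candidates.
    candidates = {frozenset([item]) for b in baskets for item in b}
    for n in range(1, q + 1):
        counts = {c: 0 for c in candidates}
        for b in baskets:                      # one "pass" over the data
            for c in candidates:
                if c <= b:
                    counts[c] += 1
        f_n = {c for c, cnt in counts.items() if cnt >= s}
        frequent.update({c: counts[c] for c in f_n})
        if n == q or not f_n:
            break
        # C_{n+1}: itemsets of size n+1 all of whose size-n subsets are in F_n.
        items = sorted({i for c in f_n for i in c})
        candidates = {frozenset(t) for t in combinations(items, n + 1)
                      if all(frozenset(sub) in f_n for sub in combinations(t, n))}
    return frequent
```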
22.2.4 Implementation of the A-Priori Algorithm
Figure 22.4 is just an outline of the algorithm. We must consider carefully how the steps are implemented. The heart of the algorithm is line (3), which we shall implement, each time through, by a single pass through the input data. The let-statements of lines (1) and (5) are just definitions of what Cn is, rather than assignments to be executed. That is, as we run through the baskets in line (3), the definition of Cn tells us which sets of size n need to be counted in main memory, and which need not be counted.

The algorithm should be used only if there is enough main memory to satisfy the requirements to count all the candidate sets on each pass. If there is not enough memory, then either a more space-efficient algorithm must be used, or several passes must be used for a single value of n. Otherwise, the system will thrash, with pages being moved in and out of main memory during a pass, thus greatly increasing the running time.

We can use either method discussed in Section 22.2.2 to organize the main-memory counts during a pass. It may not be obvious that the triangular-matrix method can be used with A-Priori on the second pass, since the frequent items are not likely to have numbers 0, 1, ..., up to as many frequent items as there are. However, after finding the frequent items on pass 1, we can construct a small main-memory table, no larger than the set of items itself, that translates the original items' numbers into consecutive numbers for just the frequent items.
22.2.5
We expect that the memory bottleneck comes on the second pass of Algorithm 22.4, that is, at the execution of line (3) of Fig. 22.4 with n = 2. That is, we assume counting candidate pairs takes more space than counting candidate triples, quadruples, and so on. Thus, let us concentrate on how we could reduce the number of candidate pairs for the second pass. To begin, the typical use of main memory on the first two passes of the A-Priori Algorithm is suggested by Fig. 22.5.
[Figure 22.5 (diagram): main-memory use on the first two A-Priori passes. Pass 1 holds the item counts; Pass 2 holds the table of frequent items plus, in the remaining space, counts of the candidate pairs.]
Figure 22.5: Main-memory use by the A-Priori Algorithm

On the first pass (n = 1), all we need is space to count all the items, which is typically very small compared with the amount of memory needed to count pairs. On the second pass (n = 2), the counts are replaced by a list of the frequent items, which is expected to take even less space than the counts took on the first pass. All the available memory is devoted, as needed, to counts of the candidate pairs.

Could we do anything with the unused memory on the first pass, in order to reduce the number of candidate pairs on the second pass? If so, data sets with larger numbers of frequent pairs could be handled on a machine with a fixed amount of main memory. The PCY Algorithm3 exploits the unused memory by filling it entirely with an unusual sort of hash table. The buckets of this table do not hold pairs or other elements. Rather, each bucket is a single integer count, and thus occupies only four bytes. We could even use two-byte buckets if the support threshold were less than 2^16, since once a count gets above the threshold, we do not need to see how large it gets.

During the first pass, as we examine each basket, we not only add one to the count for each item in the basket, but we also hash each pair of items to its bucket in the hash table and add one to the count in that bucket. What we
3 For the authors, J. S. Park, M.-S. Chen, and P. S. Yu.
hope for is that some buckets will wind up with a count less than s, the support threshold. If so, we know that no pair {i, j} that hashes to that bucket can be frequent, even if both i and j are frequent as singletons.
[Figure 22.6 (diagram): main-memory use on the first two PCY passes. Pass 1 holds the item counts and the hash table of bucket counts; Pass 2 holds the frequent items, the bitmap derived from the buckets, and counts of the candidate pairs.]
Figure 22.6: Main-memory use by the PCY Algorithm

Between the first and second passes, we replace the buckets by a bitmap with one bit per bucket. The bit is 1 if the corresponding bucket is a frequent bucket, that is, its count is at least the support threshold s; otherwise the bit is 0. A bucket, occupying 32 bits (4 bytes), is replaced by a single bit, so the bitmap occupies roughly 1/32 of main memory on the second pass. There is thus almost as much space available for counts on the second pass of the PCY Algorithm as there is for the A-Priori Algorithm. Figure 22.6 illustrates memory use during the first two passes of PCY.

On the second pass, {i, j} is a candidate pair if and only if the following conditions are satisfied:

1. Both i and j are frequent items.

2. {i, j} hashes to a bucket that the bitmap tells us is a frequent bucket.

Then, on the second pass, we can count only this set of candidate pairs, rather than all the pairs that meet the first condition, as in the A-Priori Algorithm.
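A compact Python sketch of the two PCY passes (illustrative only; the built-in hash and the number of buckets stand in for whatever hash function and memory budget a real implementation would use):

```python
from collections import defaultdict
from itertools import combinations

def pcy_frequent_pairs(baskets, s, num_buckets):
    # Pass 1: count singletons, and hash every pair to a bucket count.
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for b in baskets:
        for i in b:
            item_counts[i] += 1
        for pair in combinations(sorted(b), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]      # frequent-bucket bitmap

    # Pass 2: count a pair only if both items are frequent and it
    # hashed to a frequent bucket on the first pass.
    pair_counts = defaultdict(int)
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            if (pair[0] in frequent_items and pair[1] in frequent_items
                    and bitmap[hash(pair) % num_buckets]):
                pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}
```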
22.2.6
In the PCY Algorithm, the set of candidate pairs is sufficiently irregular that we cannot use the triangular-matrix method for organizing counts; we must use a table of counts. Thus, it does not make sense to use PCY unless the number of candidate pairs is reduced to at most 1/3 of all possible pairs. Passes of the PCY Algorithm after the second can proceed just as in the A-Priori Algorithm, if they are needed.
Further, in order for PCY to be an improvement over A-Priori, a good fraction of the buckets on the first pass must not be frequent. For if most buckets are frequent, condition (2) above does not eliminate many pairs. Any bucket to which even one frequent pair hashes will itself be frequent. However, buckets to which no frequent pair hashes could still be frequent if the sum of the counts of the pairs that do hash there exceeds the threshold s. To a first approximation, if the average count of a bucket is less than s, we can expect at least half the buckets not to be frequent, which suggests some benefit from the PCY approach. However, if the average bucket has a count above s, then most buckets will be frequent.

Suppose the total number of occurrences of pairs of items among all the baskets in the dataset is P. Since most of the main memory M can be devoted to buckets, the number of buckets will be approximately M/4. The average count of a bucket will then be 4P/M. In order that there be many buckets that are not frequent, we need 4P/M < s, or M > 4P/s. The exercises allow you to explore some more concrete examples.
22.2.7 The Multistage Algorithm
Instead of counting pairs on the second pass, as we do in A-Priori or PCY, we could use the same bucketing technique (with a different hash function) on the second pass. To make the average counts even smaller on the second pass, we do not even have to consider a pair on the second pass unless it would be counted on the second pass of PCY; that is, the pair consists of two frequent items and also hashed to a frequent bucket on the first pass.
[Figure 22.7 (diagram): main-memory use in the three passes of the Multistage Algorithm. Pass 1 holds item counts and a hash table of buckets; Pass 2 holds the frequent items, the first bitmap, and a second hash table of buckets; Pass 3 holds the frequent items, both bitmaps, and counts of the candidate pairs.]
Figure 22.7: Main-memory use in the three-pass version of the Multistage Algorithm

This idea leads to the three-pass version of the Multistage Algorithm for finding frequent pairs. The algorithm is sketched in Fig. 22.7. Pass 1 is just
like Pass 1 of PCY, and between Passes 1 and 2 we collapse the buckets to bits and select the frequent items, also as in PCY. However, on Pass 2, we again use all available memory to hash pairs into as many buckets as will fit. Because there is a bitmap to store in main memory on the second pass, and this bitmap compresses a 4-byte (32-bit) integer into one bit, there will be approximately 31/32 as many buckets on the second pass as on the first. On the second pass, we use a different hash function from the one used on Pass 1. We hash a pair {i, j} to a bucket and add one to the count there if and only if:

1. Both i and j are frequent items.

2. {i, j} hashed to a frequent bucket on the first pass. This decision is made by consulting the bitmap.

That is, we hash only those pairs we would count on the second pass of the PCY Algorithm. Between the second and third passes, we condense the buckets of the second pass into another bitmap, which must be stored in main memory along with the first bitmap and the set of frequent items. On the third pass, we finally count the candidate pairs. In order to be a candidate, the pair {i, j} must satisfy all of:

1. Both i and j are frequent items.

2. {i, j} hashed to a frequent bucket on the first pass. This decision is made by consulting the first bitmap.

3. {i, j} hashed to a frequent bucket on the second pass. This decision is made by consulting the second bitmap.

As with PCY, subsequent passes can construct frequent triples or larger itemsets, if desired, using the same method as A-Priori.

The third condition often eliminates many pairs that the first two conditions let through. One reason is that on the second pass, not every pair is hashed, so the counts of buckets tend to be smaller than on the first pass, resulting in many more infrequent buckets. Moreover, since the hash functions on the first two passes are different, infrequent pairs that happened to hash to a frequent bucket on the first pass have a good chance of hashing to an infrequent bucket on the second pass.

The Multistage Algorithm is not limited to three passes for computation of frequent pairs. We can have a large number of bucket-filling passes, each using a different hash function. As long as the first pass eliminates some of the pairs because they belong to a nonfrequent bucket, then subsequent passes will eliminate a rapidly growing fraction of the pairs, until it is very unlikely that any candidate pair will turn out not to be frequent. However, there is a point of diminishing returns, since each bitmap requires about 1/32 of the memory.
If we use too many passes, not only will the algorithm take more time, but we can find ourselves with available main memory that is too small to count all the frequent pairs.
22.2.8 Exercises for Section 22.2
Exercise 22.2.1: Simulate the A-Priori Algorithm on the data of Fig. 22.3, with s = 3.

! Exercise 22.2.2: Suppose we want to count all itemsets of size n using one pass through the data.

a) What is the generalization of the triangular-matrix method for n > 2? Give the formula for locating the array element that counts a given set of n elements {i1, i2, ..., in}.

b) How much main memory does the generalized triangular-matrix method take if there are k items?

c) What is the generalization of the table-of-counts method for n > 2?

d) How much main memory does the generalized table-of-counts method take if there are p itemsets of size n that appear in the data?

Exercise 22.2.3: Imagine that there are 1100 items, of which 100 are big and 1000 are little. A basket is formed by adding each big item with probability 1/10, and each little item with probability 1/100. Assume the number of baskets is large enough that each itemset appears in a fraction of the baskets that equals its probability of being in any given basket. For example, every pair consisting of a big item and a little item appears in 1/1000 of the baskets. Let s be the support threshold, but expressed as a fraction of the total number of baskets rather than as an absolute number. Give, as a function of s ranging from 0 to 1, the number of frequent items on Pass 1 of the A-Priori Algorithm. Also, give the number of candidate pairs on the second pass.

! Exercise 22.2.4: Consider running the PCY Algorithm on the data of Exercise 22.2.3, with 100,000 buckets on the first pass. Assume that the hash function used distributes the pairs to buckets in a conveniently random fashion. Specifically, the 499,500 little-little pairs are divided as evenly as possible (approximately 5 to a bucket). One of the 100,000 big-little pairs is in each bucket, and the 4950 big-big pairs each go into a different bucket.

a) As a function of s, the ratio of the support threshold to the total number of baskets (as in Exercise 22.2.3), how many frequent buckets are there on the first pass?

b) As a function of s, how many pairs must be counted on the second pass?
Exercise 22.2.5: Using the assumptions of Exercise 22.2.4, suppose we run a three-pass Multistage Algorithm on the dataset. Assuming that on the second pass there are again 100,000 buckets, and the hash function distributes pairs randomly among the buckets, answer the following questions, all in terms of s, the ratio of the support threshold to the number of baskets.

a) Approximately how many frequent buckets will there be on the second pass?

b) Approximately how many pairs are counted on the third pass?

Exercise 22.2.6: Suppose baskets are in a file that is distributed over many processors. Show how you would use the map-reduce framework of Section 20.2 to:

a) Find the counts of all items.

! b) Find the counts of all pairs of items.
22.3 Finding Similar Items
We now turn to the version of the frequent-itemsets problem that supports marketing activities for on-line merchants and a number of other interesting applications, such as finding similar documents on the Web. We may start with the market-basket model of data, but now we search for pairs of items that appear together a large fraction of the times that either appears, even if neither item appears in very many baskets. Such items are said to be similar. The key technique is to create a short signature for each item, such that the difference between signatures tells us the difference between the items themselves.
22.3.1
Our starting point is to define exactly what we mean by similar items. Since we are interested in finding items that tend to appear together in the same baskets, the natural viewpoint is that each item is a set: the set of baskets in which it appears. Thus, we need a definition for how similar two sets are.

The Jaccard similarity (or just similarity, if this similarity measure is understood) of sets S and T is |S ∩ T| / |S ∪ T|, that is, the ratio of the sizes of their intersection and union. Thus, disjoint sets have a similarity of 0, and the similarity of a set with itself is 1. As another example, the similarity of sets {1,2,3} and {1,3,4,5} is 2/5, since there are two elements in the intersection and five elements in the union.
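In code the measure is a one-liner; the following Python sketch (illustration only) checks the {1,2,3} and {1,3,4,5} example.

```python
def jaccard(s, t):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if (s or t) else 1.0

print(jaccard({1, 2, 3}, {1, 3, 4, 5}))   # 0.4, i.e., 2/5
```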
22.3.2
A number of important data-mining problems can be expressed as finding sets with high Jaccard similarity. We shall discuss two of them in detail here.
Suppose we are given data about customers' on-line purchases. One way to tell what items to pitch to a customer is to find pairs of customers that bought similar sets of items. When a customer logs in, they can be pitched an item that a similar customer bought, but that they did not buy. To compare customers, represent a customer by the set of items they bought, and compute the Jaccard similarity for each pair of customers.

There is a dual view of the same data. We might want to know which pairs of items are similar, based on their having been bought by similar sets of customers. We can frame this problem in the same terms as finding similar customers. Now, the items are represented by the set of customers that bought them, and we need to find pairs of items that have similar sets of customers.

Notice, incidentally, that the same data can be viewed as market-basket data in two different ways. The products can be the items and the customers the baskets, or vice-versa. You should not be surprised. Any many-many relationship can be seen as market-basket data in two ways. In Section 22.1 we viewed the data in only one way, because when the baskets are really shopping carts at a store's checkout stand, there is no real interest in finding similar shopping carts or carts that contain many items in common.
Similar Documents
There are many reasons we would like to find pairs of textually similar documents. If we are crawling the Web, documents that are very similar might be mirrors of one another, perhaps differing only in links to other documents at the local site. A search engine would not want to offer both sites in response to a search query. Other similar pairs might represent an instance of plagiarism. Note that one document d1 might contain an excerpt from another document d2, yet d1 and d2 are identical in only 10% of each; that could still be an instance of plagiarism.

Telling whether documents are character-for-character identical is easy; just compare characters until you find a mismatch or reach the ends of the documents. Finding whether a sentence or short piece of text appears character-for-character in a document is not much harder. Then you have to consider all places in the document where the sentence or fragment might start, but most of those places will have a mismatch very quickly. What is harder is to find documents that are similar, but are not exact copies in long stretches. For instance, a draft document and its edited version might have small changes in almost every sentence.

A technique that is almost invulnerable to large numbers of small changes is to represent a document by its set of k-grams, that is, by the set of its substrings of length k. ("k-shingle" is another word for k-gram.) For example, the set of 3-grams that we find in the first sentence of Section 22.3.2 ("A number of...") contains "A n", " nu", "num", and so on. If we pick k large enough so that the probability of a randomly chosen k-gram appearing in a document is small,
Compressed Shingles
In order that a document be characterized by its set of k-shingles, we have to pick k sufficiently large that it is rare for a given shingle to appear in a document. k = 5 is about the smallest we can choose, and it is not unusual to have k around 10. However, then there are so many possible shingles, and the shingles are so long, that certain algorithms take more time than necessary. Therefore, it is common to hash the shingles to integers of 32 bits or less. These hash-values are still numerous enough that they differentiate between documents, but they can be compared and processed quickly.
then a high Jaccard similarity of the sets of k-grams representing a pair of documents is a strong indication that the documents themselves are similar.
22.3.3
Minhashing
Computing the Jaccard similarity of two large sets is time-consuming. Moreover, even if we can compute similarities efficiently, a large dataset has far too many pairs of sets for us to compute the similarity of every pair. Thus, there are two tricks we need to learn to extract only the similar pairs from a large dataset. Both are a form of hashing, although the techniques are completely different uses of hashing.

1. Minhashing is a technique that lets us form a short signature for each set. We can compute the Jaccard similarity of the sets by measuring the similarity of the signatures. As we shall see, the similarity for signatures is simple to compute, but it is not the Jaccard similarity. We take up minhashing in this section.

2. Locality-Sensitive Hashing is a technique that lets us focus on pairs of signatures whose underlying sets are likely to be similar, without examining all pairs of signatures. We take up locality-sensitive hashing in Section 22.4.

To introduce minhashing, suppose that the elements of each set are chosen from a universal set of n elements e0, e1, ..., en-1. Pick a random permutation of the n elements. Then the minhash value of a set S is the first element, in the permuted order, that is a member of S.
Example 22.6: Suppose the universal set of elements is {1, 2, 3, 4, 5} and the permuted order we choose is (3, 5, 4, 2, 1). Then the minhash value of any set that contains 3, such as {2, 3, 5}, is 3. A set that contains 5 but not 3, such as {1, 2, 5}, hashes to 5. For another example, {1, 2} hashes to 2, because 2 appears before 1 in the permuted order.
Suppose we have a collection of sets. For example, we might be given a collection of documents and think of each document as represented by its set of 10-grams. We compute signatures for the sets by picking a list of m permutations of all the possible elements (e.g., all possible character strings of length 10, if the elements are 10-grams). Typically, m would be about 100. The signature of a set S is the list of the minhash values of S, for each of the m permutations, in order.

Example 22.7: Suppose the universal set of elements is again {1, 2, 3, 4, 5}, and choose m = 3, that is, signatures of three minhash values. Let the permutations be π1 = (1,2,3,4,5), π2 = (5,4,3,2,1), and π3 = (3,5,1,4,2). The signature of S = {2, 3, 4} is (2, 4, 3). To see why, first notice that in the order π1, 2 appears before 3 and 4, so 2 is the first minhash value. In π2, 4 appears before 2 and 3, so 4 is the second minhash value. In π3, 3 appears before 2 and 4, so 3 is the third minhash value.
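Example 22.7 can be checked directly with a few lines of Python (an illustrative sketch using explicit permutations, which is only practical for tiny universal sets):

```python
def minhash(s, permutation):
    """The first element of the permuted order that belongs to set s."""
    for e in permutation:
        if e in s:
            return e
    return None  # s is empty

permutations = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 5, 1, 4, 2)]
S = {2, 3, 4}
print([minhash(S, p) for p in permutations])   # [2, 4, 3], as in Example 22.7
```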
22.3.4
There is a surprising relationship between the minhash values and the Jaccard similarity: if we choose a permutation at random, the probability that it will produce the same minhash value for two sets is the same as the Jaccard similarity of those sets. Thus, if we have the signatures of two sets S and T, we can estimate the Jaccard similarity of S and T by the fraction of corresponding minhash values for the two sets that agree.

Example 22.8: Let the permutations be as in Example 22.7, and consider another set, T = {1, 2, 3}. The signature for T is (1, 3, 3). If we compare this signature with (2, 4, 3), the signature of the set S = {2, 3, 4}, we see that the signatures agree in only the last of the three components. We therefore estimate the Jaccard similarity of S and T to be 1/3. Notice that the true Jaccard similarity of S and T is 1/2.

In order that the signatures be very likely to estimate the similarity closely, we need to pick considerably more than three permutations. We suggest that 100 permutations may be enough for the law of large numbers to hold. However, the exact number of minhash values needed depends on how closely we need to estimate the similarity.
22.3.5
W hy Minhashing Works
To see why the Jaccard similarity is the probability that two sets have the same minhash value according to a randomly chosen permutation of elements, let S
and T be two sets. Imagine going down the list of elements in the permuted order, until you find an element e that appears in at least one of S and T. There are two cases:

1. If e appears in both S and T, then both sets have the same minhash value, namely e.

2. But if e appears in one of S and T but not the other, then one set gets minhash value e and the other definitely gets some other minhash value.

We do not meet e until the first time we find, in the permuted order, an element that is in S ∪ T. The probability of Case 1 occurring is the fraction of members of S ∪ T that are in S ∩ T. That fraction is exactly the Jaccard similarity of S and T. But Case 1 is also exactly when S and T have the same minhash value, which proves the relationship.
22.3.6
Implementing Minhashing
While we have spoken of choosing a random permutation of all possible elements, it is not feasible to do so. It would take far too long, and we might have to deal with elements that appear in none of our sets. Rather, we simulate the choice of a random permutation by instead picking a random hash function h from elements to some large sequence of integers 0, 1, . . . , B-1 (i.e., bucket numbers). We pretend that the permutation that h represents places element e in the position h(e). Of course, several elements might thus wind up in the same position, but as long as B is large, we can break ties as we like, and the simulated permutations will be sufficiently random that the relationship between signatures and similarity still holds.

Suppose our dataset is presented one set at a time. To compute the minhash value for a set S = {a1, a2, . . . , an} using a hash function h, we can execute:

    V := ∞;
    FOR i := 1 TO n DO
        IF h(ai) < V THEN V := h(ai);

As a result, V will be set to the hash value of the element of S that has the smallest hash value. This hash value may not identify a unique element, because several elements in the universe of possible elements may hash to this value, but as long as h hashes to a large number of possible values, the chance of a coincidence is small, and we may continue to assume that a common minhash value suggests two sets have an element in common.

If we want to compute not just one minhash value but the minhash values for set S according to m hash functions h1, h2, . . . , hm, then we can compute the m minhash values in parallel, as we process each member of S. The code is suggested in Fig. 22.8.
    FOR j := 1 TO m DO
        Vj := ∞;
    FOR i := 1 TO n DO
        FOR j := 1 TO m DO
            IF hj(ai) < Vj THEN Vj := hj(ai);

Figure 22.8: Computing m minhash values at once
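The loop of Fig. 22.8 could be realized along the following lines. This is only a sketch; the particular family of hash functions (a*x + b) mod B is an assumption, chosen because it is easy to generate, not something prescribed by the text.

    import random

    B = 2**31 - 1   # a large number of buckets

    def make_hash_functions(m):
        # m random functions of the form h(x) = (a*x + b) mod B.
        return [(random.randrange(1, B), random.randrange(B)) for _ in range(m)]

    def minhash_signature(S, funcs):
        V = [float('inf')] * len(funcs)
        for x in S:                          # one pass over the members of S
            for j, (a, b) in enumerate(funcs):
                h = (a * x + b) % B
                if h < V[j]:
                    V[j] = h
        return V

    funcs = make_hash_functions(100)
    print(minhash_signature({3, 17, 42}, funcs)[:5])   # first five minhash values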
It is somewhat harder to compute signatures if the data is presented basket-by-basket as in Section 22.1. That is, suppose we want to compute the signatures of items, but our data is in a file consisting of baskets. Similarity of items is the Jaccard similarity of the sets of baskets in which these items appear.

Suppose there are k items, and we want to construct their minhash signatures using m different hash functions h1, h2, . . . , hm. Then we need to maintain km values, each of which will wind up being the minhash value for one of the items according to one of the hash functions. Let Vij be the value for item i and hash function hj. Initially, set all Vij's to infinity. When we read a basket b, we compute hj(b) for all j = 1, 2, . . . , m. However, we adjust values only for those items i that are in b. The algorithm is sketched in Fig. 22.9. At the end, Vij holds the jth minhash value for item i.

    FOR i := 1 TO k DO
        FOR j := 1 TO m DO
            Vij := ∞;
    FOR EACH basket b DO BEGIN
        FOR j := 1 TO m DO
            compute hj(b);
        FOR EACH item i in b DO
            FOR j := 1 TO m DO
                IF hj(b) < Vij THEN Vij := hj(b);
    END

Figure 22.9: Computing minhash values for all items and hash functions
22.3.7
Exercises for Section 22.3
Exercise 22.3.1: Compute the Jaccard similarity of each pair of the following sets: {1,2,3,4,5}, {1,6,7}, {2,4,6,8}.

Exercise 22.3.2: What are all the 4-grams of the following string: "abc def ghi"
Do not count the quotation marks as part of the string, but remember that blanks do count.

Exercise 22.3.3: Suppose that the universal set is {1, 2, . . . , 10}, and signatures for sets are constructed using the following list of permutations:

1. (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

2. (10, 8, 6, 4, 2, 9, 7, 5, 3, 1)

3. (4, 7, 2, 9, 1, 5, 3, 10, 6, 8)

Construct minhash signatures for the following sets:

a) {3,6,9}

b) {2,4,6,8}

c) {2,3,4}

How does the estimate of the Jaccard similarity for each pair, derived from the signatures, compare with the true Jaccard similarity?

Exercise 22.3.4: Suppose that instead of using particular permutations to construct signatures for the three sets of Exercise 22.3.3, we use hash functions to construct the signatures. The three hash functions we use are:

    f(x) = x mod 10
    g(x) = (2x + 1) mod 10
    h(x) = (3x + 2) mod 10

Compute the signatures for the three sets, and compare the resulting estimate of the Jaccard similarity of each pair with the true Jaccard similarity.

! Exercise 22.3.5: Suppose data is in a file that is distributed over many processors. Show how you would use the map-reduce framework of Section 20.2 to compute a minhash value, using a single hash function, assuming:

a) The file must be partitioned by rows.

b) The file must be partitioned by columns.
22.4
Locality-Sensitive Hashing
Now, we take up the problem that was not really solved by taking minhash signatures. It is true that these signatures may make it much faster to estimate the similarity of any pair of sets, but there may still be far too many pairs of sets to find all pairs that meet a given similarity threshold. The technique called locality-sensitive hashing, or LSH, may appear to be magic; it allows us, in
a sense, to hash sets or other elements to buckets so that similar elements are assigned to the same bucket. There are tradeoffs, of course. There is a (typically small) probability that we shall miss a pair of similar elements, and the lower we want that probability to be, the more work we must do. After some examples, we shall take up the general theory.
22.4.1
Recall our discussion of entity resolution in Section 21.7. There, we had a large collection of records, and we needed to find similar pairs. The notion of similarity was not Jaccard similarity, and in fact we left open what similarity meant. Whatever definition we use for similarity of records, there may be far too many pairs to measure them all. For example, if there are a million records (not a very large number), then there are about 500 billion pairs of records. An algorithm like R-Swoosh may allow merging with fewer than that number of comparisons, provided there are many large sets of similar records, but if no records are similar to other records, then there is no way we can discover that fact without doing all possible comparisons.

It would be wonderful to have a way to hash records so that similar records fell into the same bucket, and nonsimilar pairs never did, or rarely did. Then, we could restrict our examination of pairs to those that were in the same bucket. If, say, there were 1000 buckets, and records distributed evenly, then we would only have to compare 1/1000 of the pairs. We cannot do exactly what is described above, but we can come surprisingly close.

Example 22.9: Suppose for concreteness that records are as in the running example of Section 21.7: name-address-phone triples, where each of the three fields is a character string. Suppose also that we define records to be similar if the sum of the edit distances of their three corresponding pairs of fields is no greater than 5.

Let us use a hash function h that hashes the name field of a record to one of a million buckets. How h works is unimportant, except that it must be a good hash function, i.e., one that distributes names roughly uniformly among the buckets. But we do not stop here. We also hash the records to another set of a million buckets, this time using the address, and a suitable hash function on addresses. If h operates on any strings, we can even use h. Then, we hash records a third time to a million buckets, using the phone number.

Finally, we examine each bucket in each of the three hash tables, a total of 3,000,000 buckets. For each bucket, we compare each pair of records in the bucket, and we report any pair that has total edit distance 5 or less. Suppose there are n records. Assuming even distribution of records in each hash table, there are n/10^6 records in each bucket. The number of pairs of records in each bucket is approximately n^2/(2 × 10^12). Since there are 3 × 10^6 buckets, the total number of comparisons is about 1.5n^2/10^6. And since there are about n^2/2 pairs of records, we have managed to look at only a fraction 3 × 10^-6 of the pairs, a big improvement.
In fact, since the number of buckets was chosen arbitrarily, it seems we can reduce the number of comparisons to whatever degree we wish. There are limitations, of course. If we choose too large a number of buckets, we run out of main-memory space, and regardless of how many buckets we use, we cannot avoid comparing the pairs of records that are really similar.

Have we given up anything? Yes, we have; we shall miss some similar pairs of records that meet the similarity threshold, because they differ by a few characters in each of the three fields, yet no more than five characters in total. What fraction of the truly similar pairs we lose depends on the distribution of discrepancies among the fields of records that truly represent the same entity. However, if the threshold for total edit distance is 5, we do not expect to miss too many truly similar pairs.

But what if the threshold on edit distance in Example 22.9 were not 5, but 20? There might be many pairs of similar records that had no one field identical. To deal with this problem, we need to:

1. Increase the number of hash functions and hash tables.

2. Base each hash function on a small part of a field.

Example 22.10: We could break the name into first, middle, and last names, and hash each to buckets. We could break the address into house number, street name, city name, state, and zip code. The phone number could be broken into area code, exchange, and the last four digits. Since phones are numbers, we could even choose any subset of the ten digits in a phone number, and hash on those. Unfortunately, since we are now hashing short subfields, we are limited in the number of buckets that we can use. If we pick too many buckets, most will be empty.

After hashing records many times, we again look in each bucket of each of the hash tables, and we compare each pair of records that fall into the same bucket at least once. However, the total running time is much higher than for our first example, for two reasons. First, the number of record occurrences among all the buckets is proportional to the number of hash functions we use. Second, hash functions based on small pieces of data cannot divide the records into as many buckets as in Example 22.9.
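A rough sketch of the bucketing scheme of Example 22.9 follows. The record layout (name, address, phone) and the use of Python's built-in hash() as the hash function are assumptions made for illustration only; any good string hash would serve.

    from collections import defaultdict
    from itertools import combinations

    NUM_BUCKETS = 1_000_000

    def candidate_pairs(records):
        # records: list of (name, address, phone) triples.
        # Yields each pair of record indices that share a bucket in at least
        # one of the three hash tables (one table per field).
        reported = set()
        for field in range(3):
            buckets = defaultdict(list)
            for idx, rec in enumerate(records):
                buckets[hash(rec[field]) % NUM_BUCKETS].append(idx)
            for members in buckets.values():
                for pair in combinations(members, 2):
                    if pair not in reported:
                        reported.add(pair)
                        yield pair

Only the pairs produced by this generator would then be examined to see whether their total edit distance is 5 or less.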
22.4.2
The use of locality-sensitive hashing in Example 22.10 is relatively straightforward. For a more subtle application of the general idea, let us return to the problem introduced in Section 22.3, where we saw the advantage of replacing sets by their signatures. When we need to find similar pairs of sets that are represented by signatures, there is a way to build hash functions for a locality-sensitive hashing, for any desired similarity threshold. Think of the signatures of the various sets as a matrix, with a column for each set's signature and a row
for each hash function. Divide the matrix into b bands of r rows each, where br is the length of a signature. The arrangement is suggested by Fig. 22.10.
Figure 22.10: Dividing signatures into bands and hashing based on the values in a band

For each band we choose a hash function that maps the portion of a signature in that band to some large number of buckets, B. That is, the hash function applies to sequences of r integers and produces one integer in the range 0 to B-1. In Fig. 22.10, B = 4. If two signatures agree in all rows of any one band, then they surely will wind up in the same bucket. There is a small chance that they will be in the same bucket even if they do not agree, but by using a very large number of buckets B, we can make sure there are very few false positives. Every bucket of each hash function has its members compared for similarity, so a pair of signatures that agree in even one band will be compared. Signatures that do not agree in any band probably will not be compared, although as we mentioned, there is a small probability they will hash to the same bucket anyway, and would therefore be compared.

Let us compute the probability that a pair of minhash signatures will be compared, as a function of the Jaccard similarity s of their underlying sets, the
number of bands b, and the number of rows r in a band. For simplicity, we shall assume that the number of buckets is so large that there are no coincidences; signatures hash to the same bucket if and only if they have the same values in the entire band on which the hash function is based.

First, the probability that the signatures agree on one row is s, as we saw in Section 22.3.5. The probability that they agree on all r rows of a given band is s^r. The probability that they do not agree on all rows of a band is 1 - s^r, and the probability that for none of the b bands do they agree in all rows of that band is (1 - s^r)^b. Finally, the probability that the signatures will agree in all rows of at least one band is 1 - (1 - s^r)^b. This function is the probability that the signatures will be compared for similarity.

Example 22.11: Suppose r = 5 and b = 20; that is, we have signatures of 100 integers, divided into 20 bands of five rows each. The formula for the probability that two signatures of similarity s will be compared becomes 1 - (1 - s^5)^20. Suppose s = 0.8; i.e., the underlying sets have Jaccard similarity 80%. Then s^5 is about 0.328. That is, the chance that the two signatures agree in a given band is small, only about 1/3. However, we have 20 chances to win, and (1 - 0.328)^20 is tiny, only about 0.00035. Thus, the chance that we do find this pair of signatures together in at least one bucket is 1 - 0.00035, or 0.99965.

On the other hand, suppose s = 0.4. Then 1 - (1 - (0.4)^5)^20 = 1 - (0.99)^20, or approximately 20%. If s is much smaller than 0.4, the probability that the signatures will be compared drops below 20% very rapidly. We conclude that the choice b = 20 and r = 5 is a good one if we are looking for pairs with a very high similarity, say 80% or more, although it would not be a good choice if the similarity threshold were as small as 40%.
Figure 22.11: The probability that a pair of signatures will appear together in at least one bucket, plotted against the similarity s

The function 1 - (1 - s^r)^b always looks like Fig. 22.11, but the point of rapid
transition from a very small value to a value close to 1 varies, depending on b and r. Roughly, the breakpoint is at similarity s = (1/b)^(1/r).
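As a quick check on these numbers, the following small sketch (not from the text) evaluates the probability 1 - (1 - s^r)^b and the approximate breakpoint for the choice r = 5, b = 20 used in Example 22.11.

    def prob_compared(s, r=5, b=20):
        # Probability that two signatures from sets of Jaccard similarity s
        # agree in all rows of at least one band.
        return 1 - (1 - s**r) ** b

    for s in (0.2, 0.4, 0.6, 0.8, 0.9):
        print(f"s = {s:.1f}: {prob_compared(s):.5f}")
    # s = 0.4 gives about 0.18 and s = 0.8 about 0.99965, matching Example 22.11.

    print((1 / 20) ** (1 / 5))   # the approximate breakpoint, about 0.55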
22.4.3
The two ideas, minhashing and LSH, must be combined properly to solve the sort of problems we discussed in Section 22.3.2. Suppose, for example, that we have a large repository of documents, which we have already represented by their sets of shingles of some length. We want to find those documents whose shingle sets have a Jaccard similarity of at least s.

1. Start by computing a minhash signature for each document; how many hash functions to use depends on the desired accuracy, but several hundred should be enough for most purposes.

2. Perform a locality-sensitive hashing to get candidate pairs of signatures that hash to the same bucket for at least one band. How many bands and how many rows per band depend on the similarity threshold s, as discussed in Section 22.4.2.

3. For each candidate pair, compute the estimate of their Jaccard similarity by counting the number of components in which their signatures agree.

4. Optionally, for each pair whose signatures are sufficiently similar, compute their true Jaccard similarity by examining the sets themselves.

Of course, this method introduces false positives: candidate pairs that get eliminated in step (3) or (4). However, the second and third steps also allow some false negatives: pairs with a sufficiently high Jaccard similarity that are not candidates or are eliminated from the candidate pool.

a) At step (2), a pair could have very similar signatures, yet there happens to be no band in which the signatures agree in all rows of the band.

b) In step (3), a pair could have Jaccard similarity at least s, but their signatures do not agree in fraction s of the components.

One way to reduce the number of false negatives is to lower the similarity threshold at the initial stages. At step (2), choose a smaller number of rows r or a larger number of bands b than would be indicated by the target similarity s. At step (3), choose a smaller fraction than s of corresponding signature components that allows a pair to move on to step (4). Unfortunately, these changes each increase the number of false positives, so you must consider carefully how small you can afford to make your thresholds.

Another possible way to avoid false negatives is to skip step (3) and go directly to step (4) for each candidate pair. That is, we compute the true
Jaccard similarity of every candidate pair. The disadvantage of doing so is that the minhash signatures were devised to make it easier to compare the underlying sets. For example, if the objects being compared are actually large documents, comparing complete sets of k-shingles is far more time consuming than matching several hundred components of signatures.

In some applications, false negatives are not a problem, so we can tune our LSH to allow a significant fraction of false negatives, in order to reduce false positives and thus to speed up the entire process. For instance, if an on-line retailer is looking for pairs of similar customers, in order to select an item to pitch to each customer, it is not necessary to find every single pair of similar customers. It is sufficient to find a few very similar customers for each customer.
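Here is one way step (2) of the pipeline above could look in code; the per-band hash function is simply Python's hash of the band's tuple of values, which is an assumption rather than anything prescribed by the text.

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidates(signatures, b, r):
        # signatures: dict mapping a document id to its minhash signature,
        # a list of length b*r. Returns the pairs of ids that agree in at
        # least one band, and so become candidate pairs.
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for doc_id, sig in signatures.items():
                chunk = tuple(sig[band * r:(band + 1) * r])
                buckets[hash(chunk)].append(doc_id)
            for members in buckets.values():
                candidates.update(combinations(sorted(members), 2))
        return candidates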
22.4.4
Exercises for Section 22.4
Exercise 22.4.1: This exercise is based on the entity-resolution problem of Example 22.9. For concreteness, suppose that the only pairs of records that could possibly be at total edit distance 5 or less from each other consist of a true copy of a record and another, corrupted version of the record. In the corrupted version, each of the three fields is changed independently. 50% of the time, a field has no change. 20% of the time, there is a change resulting in edit distance 1 for that field. There is a 20% chance of edit distance 2 and a 10% chance of edit distance 10. Suppose there are one million pairs of this kind in the dataset.

a) How many of the million pairs are within total edit distance 5 of each other?

b) If we hash each field to a large number of buckets, as suggested by Example 22.9, how many of these one million pairs will hash to the same bucket for at least one of the three hashings?

c) How many false negatives will there be; that is, how many of the one million pairs are within total edit distance 5, but will not hash to the same bucket for any of the three hashings?

Exercise 22.4.2: The function p = 1 - (1 - s^r)^b gives the probability p that two minhash signatures that come from sets with Jaccard similarity s will hash to the same bucket at least once, if we use an LSH scheme with b bands of r rows each. For a given similarity threshold s, we want to choose b and r so that p = 1/2 at s. We suggested that approximately s = (1/b)^(1/r) is where p = 1/2, but that is only an approximation. Suppose signatures have length 24. We can pick any integers b and r whose product is 24. That is, the choices for r are 1, 2, 3, 4, 6, 8, 12, or 24, and b must then be 24/r.

a) If s = 1/2, determine the value of p for each choice of b and r. Which would you choose, if 1/2 were the similarity threshold?

! b) For each choice of b and r, determine the value of s that makes p = 1/2.
22.5
Clustering is the problem of taking a dataset consisting of points and grouping the points into some number of clusters. Points within a cluster must be near to each other in some sense, while points in different clusters are far from each other. We begin with a study of distance measures, since only if we have a notion of distance can we talk about whether points are near or far. An important kind of distance is Euclidean, a distance based on the location of points within a space. Curiously, not all distances are Euclidean, and an important problem in clustering is dealing with sets of points that do not live anywhere in a space, yet have a notion of distance.

We next consider the two major approaches to clustering. One, called agglomerative, is to start with points each in their own cluster, and repeatedly merge nearby clusters. The second, point assignment, initializes the clusters in some way and then assigns each point to its best cluster.
22.5.1
Applications of Clustering
Many discussions of clustering begin with a small example, in which a small number of points are given in a two-dimensional space, such as Fig. 22.12. Algorithms to cluster such data are relatively simple, and we shall mention the techniques only in passing. The problem becomes hard when the dataset is large. It becomes even harder when the number of dimensions of the data is large, or when the data doesn't even belong to a space that has dimensions. Let us begin by examining some examples of interesting uses of clustering algorithms on large-scale data.
[Figure 22.12: a small collection of points in a two-dimensional space.]
In Section 22.3.2 we discussed the problem of finding similar products or similar customers by looking at the set of items each customer bought. The output of analysis using minhashing and locality-sensitive hashing could be a set of pairs of similar products (those bought by many of the same customers). Alternatively,
we could look for pairs of similar customers (those buying many of the same products). It may be possible to get a better picture of relationships if we cluster products (points) into groups of similar products. These might represent a natural class of products, e.g., classical-music CDs. Likewise, we might find it useful to cluster customers with similar tastes; e.g., one cluster might be people who like classical music. For clustering to make sense, we must view the distance between points representing customers or items as low if the similarity is high. For example, we shall see in Section 22.5.2 how one minus the Jaccard similarity can serve as a suitable notion of distance.
Clustering Documents by Topic
We could use the technique described above for products and customers to cluster documents based on their Jaccard similarity. However, another application of document clustering is to group documents into clusters based on their topics (e.g., topics such as sports or medicine), even if documents on the same topic are not very similar character-by-character. A simple approach is to imagine a very high-dimensional space, where there is one dimension for each word that might appear in the document. Place the document at point (x1, x2, . . .), where xi = 1 if the ith word appears in the document and xi = 0 if not. Distance can be taken to be the ordinary Euclidean distance, although as we shall see, this distance measure is not as useful as it might appear at first.
Clustering DNA Sequences
DNA is a sequence of base-pairs, represented by the letters C, G, A, and T. Because these strands sometimes change by substitution of one letter for another or by insertion or deletion of letters, there is a natural edit-distance between DNA sequences. Clustering sequences based on their edit distance allows us to group similar sequences.
Entity Resolution
In Section 21.7.4, we discussed an algorithm for merging records that, in effect, created clusters of records, where each cluster was one connected component of the graph formed by connecting records that met the similarity condition.
SkyCat
In this project, approximately two billion sky objects such as stars and galaxies were plotted in a 7-dimensional space, where each dimension represented the radiation of the object in one of seven different bands of the electromagnetic spectrum. By clustering these objects into groups of similar radiation patterns, the project was able to identify approximately 20 different kinds of objects.
Euclidean Spaces
Without going into the theory, for our purposes we may think of a Euclidean space as one with some number of dimensions n. The points in the space are all n-tuples of real numbers (x1, x2, . . . , xn). The common Euclidean distance is but one of many plausible distance measures in a Euclidean space.
22.5.2
Distance Measures
A distance measure on a set of points is a function d(x,y) that satisfies:

1. d(x,y) ≥ 0 for all points x and y.

2. d(x,y) = 0 if and only if x = y.

3. d(x,y) = d(y,x) (symmetry).

4. d(x,y) ≤ d(x,z) + d(z,y) for any points x, y, and z (the triangle inequality).

That is, the distance from a point to itself is 0, and the distance between any two different points is positive. The distance between points does not depend on which way you travel (symmetry), and it never reduces the distance if you force yourself to go through a particular third point (the triangle inequality).

The most common distance measure is the Euclidean distance between points in an n-dimensional Euclidean space. In such a space, points can be represented by n coordinates x = (x1, . . . , xn) and y = (y1, y2, . . . , yn). The distance d(x,y) is √(Σ_{i=1}^{n} (xi - yi)^2), that is, the square root of the sum of the squares of the differences in each dimension. However, there are many other ways to define distance; we shall examine some below.
Distances Based on Norms
In a Euclidean space, the conventional distance mentioned above is only one possible choice. More generally, we can define the distance
    d(x,y) = (Σ_{i=1}^{n} |xi - yi|^r)^(1/r)
for any r. This distance is said to be derived from the Lr-norm. The conventional Euclidean distance is the case r = 2, and is often called the L2-norm. Another common choice is the L1-norm, that is, the sum of the distances along the coordinates of the space. This distance is often called the Manhattan distance, because it is the distance one has to travel along a rectangular grid of streets found in many cities such as Manhattan.
Yet another interesting choice is the L∞-norm, which is the maximum of the distances in any one coordinate. That is, as r approaches infinity, the value (Σ_{i=1}^{n} |xi - yi|^r)^(1/r) approaches the maximum over all i of |xi - yi|.
Example 22.12: Let x = (1,2,3) and y = (2,4,1). Then the L2 distance d(x,y) is √(|1-2|^2 + |2-4|^2 + |3-1|^2) = √(1 + 4 + 4) = 3. Note that this distance is the conventional Euclidean distance. The Manhattan distance between x and y is |1-2| + |2-4| + |3-1| = 5. The L∞-norm gives a distance between x and y of max(|1-2|, |2-4|, |3-1|) = 2.
Jaccard Distance
The Jaccard distance between points that are sets is one minus the Jaccard similarity of those sets. That is, if x and y are sets, then

    d(x,y) = 1 - (|x ∩ y| / |x ∪ y|)

For example, if the two points represent sets {1,2,3} and {2,3,4,5}, then the Jaccard similarity is 2/5, so the Jaccard distance is 3/5.

One might naturally ask whether the Jaccard distance satisfies the axioms of a distance measure. It is easy to see that d(x,x) = 0, because

    1 - (|x ∩ x| / |x ∪ x|) = 1 - (1/1) = 0

It is also easy to see that the Jaccard distance cannot be negative, since the intersection of sets cannot be bigger than their union. Symmetry of the Jaccard distance is likewise straightforward, since both union and intersection are commutative.

The hard part is showing the triangle inequality. Coming to our rescue is the theorem from Section 22.3.4 that says the Jaccard similarity of two sets is the probability that a random permutation will result in the same minhash value for those sets. Thus, the Jaccard distance is the probability that the sets will not have the same minhash value. Let z be any third set, and suppose x and y have different minhash values according to a permutation π. Then at least one of the pairs {x,z} and {z,y} must have different minhash values; possibly both do. Thus, the probability that x and y have different minhash values is no greater than the sum of the probability that x and z have different minhash values plus the probability that z and y have different minhash values. These probabilities are the Jaccard distances mentioned in the triangle inequality. That is, we have shown that the Jaccard distance from x to y is no greater than the sum of the Jaccard distances from x to z and from z to y.
Cosine Distance
Suppose our points are in a Euclidean space. We can think of these points as vectors from the origin of the space. The cosine distance between two points is the angle between the vectors.
Example 22.13: Suppose documents are characterized by the presence or absence of five words, so points (documents) are vectors of five 0's or 1's. Let (0,0,1,1,1) and (1,0,0,1,1) be the two points. The cosine of the angle between them is computed by taking the dot product of the vectors, and dividing by the product of the lengths of the vectors. In this case, the dot product is 0×1 + 0×0 + 1×0 + 1×1 + 1×1 = 0 + 0 + 0 + 1 + 1 = 2. Both vectors have length √3. Thus, the cosine of the angle between the vectors is 2/(√3 × √3) = 2/3. The angle is about 48 degrees.
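A short sketch of the computation in Example 22.13:

    import math

    def cosine_distance_degrees(x, y):
        # Angle between vectors x and y, in degrees.
        dot = sum(a * b for a, b in zip(x, y))
        length_x = math.sqrt(sum(a * a for a in x))
        length_y = math.sqrt(sum(b * b for b in y))
        return math.degrees(math.acos(dot / (length_x * length_y)))

    print(cosine_distance_degrees((0, 0, 1, 1, 1), (1, 0, 0, 1, 1)))   # about 48.19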
Cosine distance satisfies the axioms of a distance measure, as long as points are treated as directions, so two vectors, one of which is a multiple of the other, are treated as the same. Angles cannot be negative, and if the angle is 0 then the vectors must be in the same direction. Symmetry holds because the angle between x and y is the same as the angle between y and x. The triangle inequality holds because the angle between two vectors is never greater than the sum of the angles between those vectors and a third vector.
Edit Distance

Various forms of edit distance satisfy the axioms of a distance measure. Let us focus on the edit distance that allows only insertions and deletions. If strings x and y are at distance 0 (i.e., no edits are needed) then they surely must be the same. Symmetry follows because insertions and deletions can be reversed. The triangle inequality follows because one way to turn x into y is to first turn x into z and then turn z into y. Thus, the sum of the edit distances from x to z and from z to y is the number of edits needed for one possible way to turn x into y. This number of edits cannot be less than the edit distance from x to y, which is the minimum over all possible ways to get from x to y.
22.5.3
We shall now begin our study of algorithms for computing clusters. The first approach is, at the highest level, straightforward. Start with every point in its own cluster. Until some stopping condition is met, repeatedly find the closest pair of clusters to merge, and merge them. This methodology is called agglomerative or hierarchical clustering. The term hierarchical comes from the fact that we not only produce clusters, but a cluster itself has a hierarchical substructure that reflects the sequence of mergers that formed the cluster. The devil, as always, is in the details, so we need to answer two questions:

1. How do we measure the closeness of clusters?

2. How do we decide when to stop merging?
Defining Closeness
There are many ways we could define the closeness of two clusters C and D. Here are two popular ones:

a) Find the minimum distance between any pair of points, one from C and one from D.

b) Average the distance between any pair of points, one from C and one from D.

These measures of closeness work for any distance measure. If the points are in a Euclidean space, then we have additional options. Since real numbers can be averaged, any set of points in a Euclidean space has a centroid, the point that is the average, in each coordinate, of the points in the set. For example, the centroid of the set {(1,2,3), (4,5,6), (2,2,2)} is (2.33, 3, 3.67) to two decimal places. For Euclidean spaces, another good choice of closeness measure is:

c) The distance between the centroids of clusters C and D.
Stopping the Merger
One common stopping criterion is to pick a number of clusters k, and keep merging until you are down to k clusters. This approach is good if you have an intuition about how many clusters there should be. For instance, if you have a set of documents that cover three different topics, you could merge until you have three clusters, and hope that these clusters correspond closely to the three topics.

Other stopping criteria involve a notion of cohesion, the degree to which the merged cluster consists of points that are all close. Using a cohesion-based stopping policy, we decline to merge two clusters whose combination fails to meet the cohesion condition that we have chosen. At each merger round, we may merge two clusters that are not closest of all pairs of clusters, but are closer
than any other pair that meets the cohesion condition. We even could define closeness to be the cohesion score, thus combining the merger selection with the stopping criterion. Here are some ways that we could define a cohesion score for a cluster:

i. Let the cohesion of a cluster be the average distance of each point to the centroid. Note that this definition only makes sense in a Euclidean space.

ii. Let the cohesion be the diameter, the largest distance between any pair of points in the cluster.

iii. Let the cohesion be the average distance between pairs of points in the cluster.
[Figure 22.13: six points, labeled A through F, in a two-dimensional space; the coordinates shown include (1,5), (3,4), (1,2), (6,2), and (5,1).]
Example 22.14: Consider the six points in Fig. 22.13. Assume the normal Euclidean distance as our distance measure. We shall choose as the distance between clusters the minimum distance between any pair of points, one from each cluster. Initially, each point is in a cluster by itself, so the distances between clusters are just the distances between the points. These distances, to two decimal places, are given in Fig. 22.14.

    B  3.00
    C  2.83  2.24
    D  4.12  5.66  3.61
    E  5.39  5.10  3.00  3.16
    F  4.00  5.83  3.61  1.41  2.00
          A     B     C     D     E

Figure 22.14: Distances between the six points of Fig. 22.13
The closest two points are D and F, so these get merged into one cluster. We must compute the distance between the cluster DF and each of the other points. By the closeness rule we chose, this distance is the minimum of the distances from a node to D or F. The table of distances becomes:

    B   3.00
    C   2.83  2.24
    DF  4.00  5.66  3.61
    E   5.39  5.10  3.00  2.00
           A     B     C    DF
The shortest distance above is between E and DF, so we merge these two clusters into a single cluster DEF. The distance to this cluster from each of the other points is the minimum of the distance to any of D, E, and F. This table of distances is:

    B    3.00
    C    2.83  2.24
    DEF  4.00  5.10  3.00
            A     B     C
Next, we merge the two closest clusters, which are B and C. The new table of distances is:

    BC   2.83
    DEF  4.00  3.00
            A    BC
The last possible merge is A with BC. The result is two clusters, ABC and DEF.

However, we may wish to stop the merging earlier. As an example stopping criterion, let us reject any merger that results in a cluster with an average distance between points over 2.5. Then we can merge D, E, and F; the cohesion (average of the three distances between pairs of these points) is 2.19 (see Fig. 22.14 to check). At the point where the clusters are A, BC, and DEF, we cannot merge A with BC, even though these are the closest clusters. The reason is that the average distance among the points in ABC is 2.69, which is too high. We might consider merging DEF with BC, which is the second-closest pair of clusters at that time, but the cohesion for the cluster BCDEF is 3.56, also too high. The third option would be to merge A with DEF, but the cohesion of ADEF is 3.35, again too high.
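The mergers of Example 22.14 can be reproduced by a short agglomerative-clustering sketch. The coordinates below are inferred from the examples (the figure itself is not fully legible in the source), so treat them as assumptions.

    from math import dist   # Python 3.8+: Euclidean distance between two points

    points = {'A': (1, 2), 'B': (1, 5), 'C': (3, 4),
              'D': (5, 1), 'E': (6, 4), 'F': (6, 2)}

    def cluster_distance(c1, c2):
        # Minimum distance between any pair of points, one from each cluster.
        return min(dist(points[p], points[q]) for p in c1 for q in c2)

    clusters = [frozenset([name]) for name in points]
    while len(clusters) > 2:                  # stop at two clusters, as in the text
        i, j = min(((i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        print(sorted(''.join(sorted(c)) for c in clusters))

Run with these coordinates, the mergers occur in the order DF, DEF, BC, and finally ABC, matching the narrative above.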
22.5.4
k-Means Algorithms
The second broad approach to clustering is called point-assignment. A popular version, which is typical of the approach, is called k-means. This approach is really a family of algorithms, just as agglomerative clustering is. The outline of a k-means algorithm is:
1. Start by choosing k initial clusters in some way. These clusters might be single points, or small sets of points.

2. For each unassigned point, place it in the nearest cluster.

3. Optionally, after all points are assigned to clusters, fix the centroid of each cluster (assuming the points are in a Euclidean space, since non-Euclidean spaces do not have a notion of centroid). Then reassign all points to these k clusters. Occasionally, some of the earliest points to be assigned will thus wind up in another cluster.

One way to initialize a k-means clustering is to pick the first point at random. Then pick a second point as far from the first point as possible. Pick a third point whose minimum distance to either of the other two points is as great as possible. Proceed in this manner, until k points are selected, each with the maximum possible minimum distance to the previously selected points. These points become the initial k clusters.

Example 22.15: Suppose our points are those in Fig. 22.13, k = 3, and we choose A as the seed of the first cluster. The point furthest from A is E, so E becomes the seed of the second cluster. For the third point, the minimum distances to A or E are as follows: B: 3.00, C: 2.83, D: 3.16, F: 2.00. The winner is D, with the largest minimum distance of 3.16. Thus, D becomes the third seed.

Having picked the seeds for the k clusters, we visit each of the remaining points and assign it to a cluster. A simple way is to assign each point to the closest seed. However, if we are in a Euclidean space, we may wish to maintain the centroid for each cluster, and as we assign each point, put it in the cluster with the nearest centroid.

Example 22.16: Let us continue with Example 22.15. We have initialized each of the three clusters A, D, and E, so their centroids are the points themselves. Suppose we assign B to a cluster. The nearest centroid is A, at distance 3.00. Thus, the first cluster becomes AB, and its centroid is (1,3.5). Suppose we assign C next. Clearly C is closer to the centroid of AB than it is to either D or E, so C is assigned to AB, which becomes ABC with centroid (1.67,3.67). Last, we assign F; it is closer to D than to E or to the centroid of ABC. Thus, the three clusters are ABC, DF, and E, with centroids (1.67,3.67), (5.5,1.5), and (6,4), respectively. We could reassign all points to the nearest of these three centroids, but the resulting clusters would not change.
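A sketch of the farthest-point initialization of Example 22.15, using the same assumed coordinates as the agglomerative sketch above:

    from math import dist

    points = {'A': (1, 2), 'B': (1, 5), 'C': (3, 4),
              'D': (5, 1), 'E': (6, 4), 'F': (6, 2)}

    def pick_seeds(first, k):
        seeds = [first]
        while len(seeds) < k:
            # Choose the point whose minimum distance to the seeds so far is largest.
            best = max((p for p in points if p not in seeds),
                       key=lambda p: min(dist(points[p], points[s]) for s in seeds))
            seeds.append(best)
        return seeds

    print(pick_seeds('A', 3))   # ['A', 'E', 'D'], as in Example 22.15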
22.5.5
We shall now examine an extension of k-means that is designed to deal with sets of points that are so large they cannot fit in main memory. The goal is not to assign every point to a cluster, but to determine where the centroids of the clusters are. If we really wanted to know the cluster of every point, we would have to make another pass through the data, assigning each point to its nearest centroid and writing out the cluster number with the point.

This algorithm, called the BFR Algorithm,4 assumes an n-dimensional Euclidean space. It may therefore represent clusters, as they are forming, by their centroids. The BFR Algorithm also assumes that the cohesion of a cluster can be measured by the variance of the points within a cluster; the variance of a cluster is the average square of the distance of a point in the cluster from the centroid of the cluster. However, for convenience, it does not record the centroid and variance, but rather the following 2n + 1 summary statistics:

1. N, the number of points in the cluster.

2. For each dimension i, the sum of the ith coordinates of the points in the cluster, denoted SUMi.

3. For each dimension i, the sum of the squares of the ith coordinates of the points in the cluster, denoted SUMSQi.

The reason to use these parameters is that they are easy to compute when we merge clusters: just add the corresponding values from the two clusters. However, we can compute the centroid and variance from these values. The rules are:

The ith coordinate of the centroid is SUMi/N.

The variance in the ith dimension is SUMSQi/N - (SUMi/N)^2.

Also remember that σi, the standard deviation in the ith dimension, is the square root of the variance in that dimension.

The BFR Algorithm reads the data one main-memory-full at a time, leaving space in memory for the summary statistics for the clusters and some other data that we shall discuss shortly. It can initialize by picking k points from the first memory-load, using the approach of Example 22.15. It could also do any sort of clustering on the first memory load to obtain k clusters from that data.

During the running of the algorithm, points are divided into three classes:

1. The discard set: points that have been assigned to a cluster. These points do not appear in main memory. They are represented only by the summary statistics for their cluster.
4 For the authors, P. S. Bradley, U. M. Fayyad, and C. Reina.
2. The compressed set: There can be many groups of points that are sufficiently close to each other that we believe they belong in the same cluster, but they are not close to any cluster's current centroid, so we do not know to which cluster they belong. Each such group is represented by its summary statistics, just like the clusters are, and the points themselves do not appear in main memory.

3. The retained set: These points are not close to any other points; they are outliers. They will eventually be assigned to the nearest cluster, but for the moment we retain each such point in main memory.

These sets change as we process successive memory-loads of the data. Figure 22.15 suggests the state of the data after some number of memory-loads have been processed by the BFR Algorithm.
Figure 22.15: A cluster, several compressed sets, and several points of the retained set
22.5.6
We shall now describe how one memory load of points is processed. We assume that main memory currently contains the summary statistics for the k clusters and also for zero or more groups of points that are in the compressed set. Main memory also holds the current set of points in the retained set. We do the following steps:
1. For all points (x1, x2, . . . , xn) that are sufficiently close (a term we shall define shortly) to the centroid of a cluster, add the point to this cluster. The point itself goes into the discard set. We add 1 to N in the summary statistics for that cluster. We also add xi to SUMi and add xi^2 to SUMSQi for that cluster.

2. If this memory load is the last, then merge each group from the compressed set and each point of the retained set into its nearest cluster. Remember that it is easy to merge clusters and groups using their summary statistics. Just add the counts N, and add corresponding components of the SUM and SUMSQ vectors. The algorithm ends at this point.

3. Otherwise (the memory load is not the last), use any main-memory clustering algorithm to cluster the remaining points from this memory load, along with all points in the current retained set. Set a threshold on the cohesiveness of a cluster, so we do not merge points unless they are reasonably close.

4. Those points that remain in clusters of size 1 (i.e., they are not near any other point) become the new retained set. Clusters of more than one point become groups in the compressed set and are replaced by their summary statistics.

5. Consider merging groups in the compressed set. Use some cohesiveness threshold to decide whether groups are close enough; we shall discuss how to make this decision shortly. If they can be merged, then it is easy to combine their summary statistics, as in (2) above.
Deciding Whether a Point is Close Enough to a Cluster
Intuitively, each cluster has a size in each dimension that indicates how far out in that dimension typical points extend. Since we have only the summary statistics to work with, the appropriate statistic is the standard deviation in that dimension. Recall from Section 22.5.5 that we can compute the standard deviations from the summary statistics, and in particular, the standard deviation is the square root of the variance. However, clusters may be cigar-shaped, so the standard deviations could vary widely. We want to include a point if its distance from the cluster centroid is not too many standard deviations in any dimension.

Thus, the first thing to do with a point p = (x1, x2, . . . , xn) that we are considering for inclusion in a cluster is to normalize p relative to the centroid and the standard deviations of the cluster. That is, we transform the point into p' = (y1, y2, . . . , yn), where yi = (xi - ci)/σi; here ci is the coordinate of the centroid in the ith dimension and σi is the standard deviation of the cluster in that dimension. The normalized distance of p from the centroid is the absolute distance of p' from the origin, that is, √(Σ_{i=1}^{n} yi^2). This distance is sometimes
called the Mahalanobis distance, although it is actually a simplified version of the concept.

Example 22.17: Suppose p is the point (5,10,15), and we are considering whether to include p in a cluster with centroid (10,20,5). Also, let the standard deviations of the cluster in the three dimensions be 1, 2, and 10, respectively. Then the Mahalanobis distance of p is

    √(((5 - 10)/1)^2 + ((10 - 20)/2)^2 + ((15 - 5)/10)^2) = √(25 + 25 + 1) = 7.14
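The bookkeeping just described is easy to express in code. This sketch recovers the centroid and standard deviations from the (N, SUM, SUMSQ) statistics and computes the simplified Mahalanobis distance of Example 22.17; in the last line the cluster is given directly by its centroid and standard deviations, as in that example.

    from math import sqrt

    def centroid_and_stddev(N, SUM, SUMSQ):
        # Centroid and per-dimension standard deviation from BFR summary statistics.
        centroid = [s / N for s in SUM]
        stddev = [sqrt(sq / N - (s / N) ** 2) for s, sq in zip(SUM, SUMSQ)]
        return centroid, stddev

    def mahalanobis(p, centroid, stddev):
        return sqrt(sum(((x - c) / sd) ** 2 for x, c, sd in zip(p, centroid, stddev)))

    print(mahalanobis((5, 10, 15), (10, 20, 5), (1, 2, 10)))   # about 7.14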
Having computed the Mahalanobis distance of point p, we can apply a threshold to decide whether or not to include p in the cluster. For instance, suppose we use 3 as the threshold; that is, we shall include the point if and only if its Mahalanobis distance from the centroid is not greater than 3. If values are normally distributed, then few of them will be more than 3 standard deviations from the mean (for a normal distribution, roughly one value in 370 is that far from the mean). Thus, we would reject only a small fraction of the points that belong in the cluster. There is a good chance that, at the end, the rejected points would wind up in the cluster anyway, since there may be no closer cluster.

Deciding Whether to Merge Groups of the Compressed Set

We discussed methods of computing the cohesion of a prospective cluster in Section 22.5.3. However, for the BFR Algorithm, these ideas must be modified so we can make a decision using only the summary statistics for the two groups. Here are some options:

1. Choose an upper bound on the sum of the variances of the combined group in each dimension. Recall that we compute the summary statistics for the combined group by adding corresponding components, and compute the variance in each dimension using the formula in Section 22.5.5. This approach has the effect of limiting the region of space in which the points of a group exist. Groups in which the distances between typical pairs of points are too large will exceed the upper bound on variance, no matter how many points are in the group and how dense the points are within the region of space the group occupies.

2. Put an upper limit on the diameter in any dimension. Since we do not know the locations of the points exactly, we cannot compute the exact diameter. However, we could estimate the diameter in the ith dimension as the distance between the centroids of the two groups in dimension i plus the standard deviation of each group in dimension i. This approach also limits the size of the region of space occupied by a group.
3. Use one of the first two approaches, but divide the figure of merit (sum of variances or maximum diameter) by a quantity such as N or √N that grows with the number of points in the group. That way, groups can occupy more space, as long as they remain dense within that space.
22.5.7
Exercises for Section 22.5
the points in Fig. 22.13, using minimum distance between points as the measure of closeness of clusters. Repeat the example using each of the following ways of measuring the distance between clusters.

a) The distance between the centroids of the clusters.

b) The maximum distance between points, one from each cluster.

c) The average distance between points, one from each cluster.
Exercise 22.5.4: We could also modify Example 22.14 by using a different distance measure. Suppose we use the L∞-norm as the distance measure. Note that this distance is the maximum of the distances along any axis, but when comparing distances you can break ties according to the next largest dimension. Show the sequence of mergers of the points in Fig. 22.13 that result from the use of this distance measure.

Exercise 22.5.5: Suppose we want to select three nodes in Fig. 22.13 to start three clusters, and we want them to be as far from each other as possible, as in Example 22.15. What points are selected if we start with (a) point B? (b) point C?
Exercise 22.5.6: The BFR Algorithm represents clusters by summary statistics, as described in Section 22.5.5. Suppose the current members of a cluster are {(1,2), (3,4), (2,1), (0,5)}. What are the summary statistics for this cluster?

Exercise 22.5.7: For the cluster described in Example 22.17, compute the
(b) (10,25,25).
22.6
Summary of Chapter 22
Data Mining: This term refers to the discovery of simple summaries of data.

The Market-Basket Model of Data: A common way to represent a many-many relation is as a collection of baskets, each of which contains a set of items. Often, this data is presented not as a relation but as a file of baskets. Algorithms typically make passes through this file, and the cost of an algorithm is the number of passes it makes.

Frequent Itemsets: An important summary of some market-basket data is the collection of frequent itemsets: sets of items that occur in at least some fixed number of baskets. The minimum number of baskets that make an itemset frequent is called the support threshold.

Association Rules: These are statements that say if a certain set of items appears in a basket, then there is at least some minimum probability that another particular item is also in that basket. The probability is called the confidence of the rule.

The A-Priori Algorithm: This algorithm finds frequent itemsets by exploiting the fact that if a set of items occurs at least s times, then so does each of its subsets. For each size of itemset, we start with the candidate itemsets, which are all those whose every immediate subset (the set minus one element) is known to be frequent. We then count the occurrences of the candidates in a single pass, to determine which are truly frequent.

The PCY Algorithm: This algorithm makes better use of main memory than A-Priori does, while counting the singleton items. PCY additionally hashes all pairs to buckets and counts the total number of baskets that contain a pair hashing to each bucket. To be a candidate on the second pass, a pair has to consist of items that not only are frequent as singletons, but also hash to a bucket whose count exceeded the support threshold.

The Multistage Algorithm: This algorithm improves on PCY by using several passes in which pairs are hashed to buckets using different hash functions. On the final pass, a pair can only be a candidate if it consists of frequent items and also hashed each time to a bucket that had a count at least equal to the support threshold.

Similar Sets and Jaccard Similarity: Another important use of market-basket data is to find similar baskets, that is, pairs of baskets with many elements in common. A useful measure is Jaccard similarity, which is the ratio of the sizes of the intersection and union of the two sets.
Shingling Documents: We can find similar documents if we convert each document into its set of k-shingles, that is, all substrings of k consecutive characters in the document. In this manner, the problem of finding similar documents can be solved by any technique for finding similar sets.

Minhash Signatures: We can represent sets by short signatures that enable us to estimate the Jaccard similarity of any two represented sets. The technique known as minhashing chooses a sequence of random permutations, implemented by hash functions. Each permutation maps a set to the first, in the permuted order, of the members of that set, and the signature of the set is the list of elements that results by applying each permutation in this way.

Minhash Signatures and Jaccard Similarity: The reason minhash signatures serve to represent sets is that the Jaccard similarity of sets is also the probability that two sets will agree on their minhash values. Thus, we can estimate the Jaccard similarity of sets by counting the number of components on which their minhash signatures agree.

Locality-Sensitive Hashing: To avoid having to compare all pairs of signatures, locality-sensitive hashing divides the signatures into bands, and compares two signatures only if they agree exactly in at least one band. By tuning the number of bands and the number of components per band, we can focus attention on only the pairs that are likely to meet a given similarity threshold.

Clustering: The problem is to find groups (clusters) of similar items (points) in a space with a distance measure. One approach, called agglomerative, is to build bigger and bigger clusters by merging nearby clusters. A second approach is to estimate the clusters initially and assign points to the nearest cluster.

Distance Measures: A distance on a set of points is a function that assigns a nonnegative number to any pair of points. The function is 0 only if the points are the same, and the function is commutative. It must also satisfy the triangle inequality.

Commonly Used Distance Measures: If points occupy a Euclidean space, essentially a space with some number of dimensions and a coordinate system, we can use the ordinary Euclidean distance, or modifications such as the Manhattan distance (sum of the distances along the coordinates). In non-Euclidean spaces, we can use distance measures such as the Jaccard distance between sets (one minus Jaccard similarity) or the edit distance between strings.

BFR Algorithm: This algorithm is a variant of k-means, where points are assigned to k clusters. Since the BFR Algorithm is intended for data sets that are too large to fit in main memory, it compresses most points into
sets that are represented only by their count and, for each dimension, the sum of their coordinates and the sum of the squares of their coordinates.
22.7
References for Chapter 22
Two useful books on data mining are [7] and [10]. The A-Priori Algorithm comes from [1] and [2]. The PCY Algorithm is from [9], and the multistage algorithm is from [6]. The use of shingling and minhashing to discover similar documents is from [4], and the theory of minhashing is in [5]. Locality-sensitive hashing is from [8]. Clustering of non-main-memory data sets was first considered in [11]. The BFR Algorithm is from [3].

1. R. Agrawal, T. Imielinski, and A. Swami, "Mining associations between sets of items in massive databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 207-216, 1993.

2. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Intl. Conf. on Very Large Databases, pp. 487-499, 1994.

3. P. S. Bradley, U. M. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," Proc. Knowledge Discovery and Data Mining, pp. 9-15, 1998.

4. A. Z. Broder, "On the resemblance and containment of documents," Proc. Compression and Complexity of Sequences, pp. 21-29, Positano, Italy, 1997.

5. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-wise independent permutations," J. Computer and System Sciences 60:3 (2000), pp. 630-659.

6. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, "Computing iceberg queries efficiently," Intl. Conf. on Very Large Databases, pp. 299-310, 1998.

7. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.

8. P. Indyk and R. Motwani, "Approximate nearest neighbors: toward removing the curse of dimensionality," ACM Symp. on Theory of Computing, pp. 604-613, 1998.

9. J. S. Park, M.-S. Chen, and P. S. Yu, "An effective hash-based algorithm for mining association rules," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 175-186, 1995.
1140
10. P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining , Addison-Wesley, Boston MA, 2006. 11. T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: an efficient data clustering method for very large databases, Proc. AC M SIGMOD Intl. Conf. on Management of Data, pp. 103-114,1996.
Chapter 23
23.1
The search engine has become one of the most important tools of the 21st century. The repositories managed by the major search engines are among the largest databases on the planet, and surely no other database is accessed so frequently and by so many users. In this section, we shall examine the key components of a search engine, which are suggested schematically in Fig. 23.1.
23.1.1
There are two main functions that a search engine must perform.

1. The Web must be crawled. That is, copies of many of the pages on the Web must be brought to the search engine and processed.

2. Queries must be answered, based on the material gathered from the Web. Usually, the query is in the form of a word or words that the desired Web pages should contain, and the answer to a query is a ranked list of the pages that contain all those words, or at least some of them.

Thus, in Fig. 23.1, we see the crawler interacting with the Web and with the page repository, a database of pages that the crawler has found. We shall discuss crawling in more detail in Section 23.1.2. The pages in the page repository are indexed. Typically, these indexes are inverted indexes, of the type discussed in Section 14.1.8. That is, for each word, there is a list of the pages that contain that word. Additional information in the index for the word may include its location(s) within the page or its role, e.g., whether the word is in the header.

We also see in Fig. 23.1 a user issuing a query that consists of one or more words. A query engine takes those words and interacts with the indexes, to determine which pages satisfy the query. These pages are then ordered by a ranker, and presented to the user, typically 10 at a time, in ranked order. We shall have more to say about the query process in Section 23.1.3.
23.1.2 Web Crawlers
A crawler can be a single machine that is started with a set S, containing the URLs of one or more Web pages to crawl. There is a repository R of pages, with the URLs that have already been crawled; initially R is empty.

Algorithm 23.1: A Simple Web Crawler.
INPUT: An initial set of URLs S.

OUTPUT: A repository R of Web pages.

METHOD: Repeatedly, the crawler does the following steps.
1. If S is empty, end.

2. Select a page p from the set S to crawl and delete p from S.

3. Obtain a copy of p, using its URL. If p is already in repository R, return to step (1) to select another page.

4. If p is not already in R:

   (a) Add p to R.

   (b) Examine p for links to other pages. Insert into S the URL of each page q that p links to, but that is not already in R or S.

5. Go to step (1).
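The following Python sketch illustrates Algorithm 23.1. It is only a toy: the helpers fetch_page and extract_links are assumed to be supplied by the caller (real code would use an HTTP client and an HTML parser), no politeness or termination limits are imposed, and the duplicate check is done by URL for simplicity.

    # A sketch of Algorithm 23.1. fetch_page and extract_links are assumed
    # helpers that download a URL and return the page text and its linked URLs.
    def crawl(seed_urls, fetch_page, extract_links):
        S = set(seed_urls)          # URLs still to be crawled
        R = {}                      # repository: URL -> page contents
        while S:                    # step (1): stop when S is empty
            url = S.pop()           # step (2): select and delete a page p from S
            page = fetch_page(url)  # step (3): obtain a copy of p
            if page is None or url in R:
                continue            # could not fetch, or already crawled
            R[url] = page           # step (4a): add p to R
            for q in extract_links(page):   # step (4b): insert new URLs into S
                if q not in R and q not in S:
                    S.add(q)
        return R                    # step (5) is the loop back to step (1)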
Algorithm 23.1 raises several questions.

a) How do we terminate the search if we do not want to search the entire Web?

b) How do we check efficiently whether a page is already in repository R?

c) How do we select a page p from S to search next?

d) How do we speed up the search, e.g., by exploiting parallelism?

Terminating Search

Even if we wanted to search the entire Web, we must limit the search somehow. The reason is that some pages are generated dynamically, so when the crawler asks a site for a URL, the site itself constructs the page. Worse, that page may have URLs that also refer to dynamically constructed pages, and this process could go on forever. As a consequence, it is generally necessary to cut off the search at some point.

For example, we could put a limit on the number of pages to crawl, and
stop when that limit is reached. The limit could be either on each site or on the total number of pages. Alternatively, we could limit the depth of the crawl. That is, say that the pages initially in set S have depth 1. If the page p selected for crawling at step (2) of Algorithm 23.1 has depth i, then any page q that we add to S at step (4b) is given depth i + 1. However, if p has depth equal to the limit, then we do not examine links out of p at all. Rather, we simply add p to R, if it is not already there.
Managing the Repository
There are two points where we must avoid duplication of effort. First, when we add a new URL for a page q to the set S, we should check that it is not already there or among the URLs of pages in R. There may be billions of URLs in R and/or S, so this job requires an efficient index structure, such as those in Chapter 14.

Second, when we decide to add a new page p to R at step (4a) of Algorithm 23.1, we should be sure the page is not already there. How could it be, since we make sure to search each URL only once? Unfortunately, the same page can have several different URLs, so our crawler may indeed encounter the same page via different routes. Moreover, the Web contains mirror sites, where large collections of pages are duplicated, or nearly duplicated (e.g., each may have different internal links within the site, and each may refer to the other mirror sites).

Comparing a page p with all the pages in R can be much too time-consuming. However, we can make this comparison efficient as follows:

1. If we only want to detect exact duplicates, hash each Web page to a signature of, say, 64 bits. The signatures themselves are stored in a hash table T; i.e., they are further hashed into a smaller number of buckets, say one million buckets. If we are considering inserting p into R, compute the 64-bit signature h(p), and see whether h(p) is already in the hash table T. If so, do not store p; otherwise, store p in R. Note that we could get some false positives; it could be that h(p) is in T, yet some page other than p produced the same signature. However, by making signatures sufficiently long, we can reduce the probability of a false positive essentially to zero. (A sketch of this scheme appears below.)

2. If we want to detect near duplicates of p, then we can store minhash signatures (see Section 22.3) in place of the simple hash-signatures mentioned in (1). Further, we need to use locality-sensitive hashing (see Section 22.4) in place of the simple hash table T of option (1).
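Here is a minimal Python sketch of option (1), exact-duplicate detection with 64-bit signatures; the particular hash function and the bucket count are illustrative choices, not ones prescribed by the text.

    import hashlib

    def signature(page_text):
        # 64-bit signature of the page contents (first 8 bytes of SHA-256).
        return int.from_bytes(hashlib.sha256(page_text.encode()).digest()[:8], "big")

    class ExactDuplicateFilter:
        def __init__(self, num_buckets=1_000_000):
            # T is a hash table of buckets; each bucket holds the signatures
            # that hash to it.
            self.buckets = [set() for _ in range(num_buckets)]

        def seen_before(self, page_text):
            # Return True if an identical page was already recorded
            # (possibly a false positive when two pages share a signature).
            sig = signature(page_text)
            bucket = self.buckets[sig % len(self.buckets)]
            if sig in bucket:
                return True
            bucket.add(sig)
            return False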
Selecting the Next Page
We could use a completely random choice of next page. A better strategy is to manage S as a queue, and thus do a breadth-first search of the Web from the starting point or points with which we initialized S. Since we presumably start the search from places in the Web that have important pages, we thus are
assured of visiting preferentially those portions of the Web that the authors of these important pages thought were also important.

An alternative is to try to estimate the importance of pages in the set S, and to favor those pages we estimate to be most important. We shall take up in Section 23.2 the idea of PageRank as a measure of the importance that the Web attributes to certain pages. It is impossible to compute PageRank exactly while the crawl is in progress. However, a simple approximation is to count the number of known in-links for each page in set S. That is, each time we examine a link to a page q at step (4b) of Algorithm 23.1, we add one to the count of in-links for q. Then, when selecting the next page p to crawl at step (2), we always pick one of the pages with the highest number of in-links.
Speeding Up the Crawl
We do not need to limit ourselves to one crawling machine, and we do not need to limit ourselves to one process per machine. Each process that acts on the set of available URLs (what we called S in Algorithm 23.1) must lock the set, so we do not find two processes obtaining the same URL to crawl, or two processes writing the same URL into the set at the same time. If there are so many processes that the lock on S becomes a bottleneck, there are several options.

We can assign processes to entire hosts or sites to be crawled, rather than to individual URLs. If so, a process does not have to access the set of URLs S so often, since it knows no other process will be accessing the same site while it does. There is a disadvantage to this approach. A crawler gathering pages at a site can issue page requests at a very rapid rate. This behavior is essentially a denial-of-service attack, where the site can do no useful work while it strives to answer all the crawler's requests. Thus, a responsible crawler does not issue frequent requests to a single site; it might limit itself to one every several seconds. If a crawling process is visiting a single site, then it must slow down its rate of requests to the point that it is often idle. That in itself is not a problem, since we can run many crawling processes at a single machine. However, operating-system software has limits on how many processes can be alive at any time.

An alternative way to avoid bottlenecks is to partition the set S, say by hashing URLs into several buckets. Each process is assigned to select new URLs to crawl from a particular one of the buckets. When a process follows a link to find a new URL, it hashes that URL to determine which bucket it belongs in. That bucket is the only one that needs to be examined to see if the new URL is already there, and if it is not, that is the bucket into which the new URL is placed.

The same bottleneck issues that arise for the set S of active URLs also come up in managing the page repository R and its set of URLs. The same two techniques, assigning processes to sites or partitioning the set of URLs by hashing, serve to avoid bottlenecks in the accessing of R as well.
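As a small illustration of the partitioning idea, the sketch below hashes URLs into a fixed number of buckets so that each crawling process owns one bucket; the hash function and the bucket count are arbitrary choices made for the example.

    import zlib

    NUM_BUCKETS = 16   # illustrative; say one bucket per crawling process

    def bucket_of(url):
        # Deterministic hash of the URL; every process computes the same bucket.
        return zlib.crc32(url.encode()) % NUM_BUCKETS

    # Each process p examines and inserts URLs only in frontier[p], so no lock
    # on a single global set S is needed.
    frontier = [set() for _ in range(NUM_BUCKETS)]

    def add_url(url):
        frontier[bucket_of(url)].add(url)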
23.1.3
Search engine queries are not like SQL queries. Rather, they are typically a set of words, for which the search engine must find and rank all pages containing all, or perhaps a subset of, those words. In some cases, the query can be a boolean combination of words, e.g., all pages that contain the word data or the word base. Possibly, the query may require that two words appear consecutively, or appear near each other, say within 5 words.

Answering queries such as these requires the use of inverted indexes. Recall from our discussion of Fig. 23.1 that once the crawl is complete, the indexer constructs an inverted index for all the words on the Web. Note that there will be hundreds of millions of words, since any sequence of letters and digits surrounded by punctuation or whitespace is an indexable word. Thus, words on the Web include not only the words in any of the world's natural languages, but all misspellings of these words, error codes for all sorts of systems, acronyms, names, and jargon of many kinds.

The first step of query processing is to use the inverted index to determine those pages that contain the words in the query. To offer the user acceptable response time, this step must involve few, if any, disk accesses. Search engines today give responses in fractions of a second, an amount of time so small that it amounts to only a few disk-access times. On the other hand, the vectors that represent occurrences of a single word have components for each of the pages indexed by the search engine, perhaps tens of billions of pages. Very rare words might be represented by listing their occurrences, but for common, or even reasonably rare words, it is more efficient to represent by a bit vector the pages in which they occur. The AND of bit vectors gives the pages containing both words, and the OR of bit vectors gives the pages containing one or both. To speed up the selection of pages, it is essential to keep as many vectors as possible in main memory, since we cannot afford disk accesses. Teams of machines may partition the job, say each managing the portion of bit vectors corresponding to a subset of the Web pages.
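A tiny sketch of the bit-vector representation follows; Python integers serve as arbitrary-length bit vectors, with bit i set when page i contains the word. The three sample pages are invented, and the point is only to show how AND and OR answer conjunctive and disjunctive queries.

    # pages[i] is the text of page i; build one bit vector per word.
    pages = ["big data systems", "data base systems", "cooking at home"]

    vectors = {}
    for i, text in enumerate(pages):
        for word in text.split():
            vectors[word] = vectors.get(word, 0) | (1 << i)

    both = vectors["data"] & vectors["systems"]    # pages containing both words
    either = vectors["data"] | vectors["base"]     # pages containing one or both

    # Recover page numbers from a bit vector.
    hits = [i for i in range(len(pages)) if both & (1 << i)]
    print(hits)   # [0, 1]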
23.1.4 Ranking Pages
Once the set of pages that match the query is determined, these pages are ranked, and only the highest-ranked pages are shown to the user. The exact way that pages are ranked is a secret formula, as closely guarded by search engines as the formula for Coca-Cola. One important component is the PageRank, a measure of how important the Web itself believes the page to be. This measure is based on links to the page in question, but is significantly more complex than that. We discuss PageRank in detail in Section 23.2.

Some of the other measures of how likely a page is to be a relevant response to the query are fairly easy to reason out. The following is a list of typical components of a relevance measure for pages.

1. The presence of all the query words. While search engines will return
pages with only a proper subset of the query words, these pages are generally ranked lower than pages having all the words.

2. The presence of query words in important positions in the page. For example, we would expect that a query word appearing in a title of the page would indicate more strongly that the page was relevant to that word than its mere occurrence in the middle of a paragraph. Likewise, appearance of the word in a header cell of a table would be a more favorable indication than its appearance in a data cell of the same table.

3. Presence of several query words near each other would be a more favorable indication than if the words appeared in the page, but widely separated. For example, if the query consists of the words sally and jones, we are probably looking for pages that mention a certain person. Many pages have lists of names in them. If sally and jones appear adjacent, or perhaps separated by a middle initial, then there is a better chance the page is about the person we want than if sally appeared, but nowhere near jones. In that case, there are probably two different people, one with first name Sally, and the other with last name Jones.

4. Presence of the query words in or near the anchor text in links leading to the page in question. This insight was one of the two key ideas that made the Google search engine the standard for the field (the other is PageRank, to be discussed next). A page may lie about itself, by using words designed to make it appear to be a good answer to a query, but it is hard to make other people confirm your lie in their own pages.
23.2
One of the key technological advances in search is the PageRank1 algorithm for identifying the importance of Web pages. In this section, we shall explain how the algorithm works, and show how to compute PageRank for very large collections of Web pages.
23.2.1
The insight that makes Google and other search engines able to return the important pages on a topic is that the Web itself points out the important pages. When you create a page, you tend to link that page to others that you think are important or valuable, rather than pages you think are useless. Of course others may differ in their opinions, but on balance, the more ways one can get to a page by following links, the more likely the page is to be important.

We can formalize this intuition by imagining a random walker on the Web. At each step, the random walker is at one particular page p and randomly
1. After Larry Page, who first proposed the algorithm.
picks one of the pages that p links to. At the next step, the walker is at the chosen successor of p. The structure of the Web links determines the long-run probability that the walker is at each individual page. This probability is termed the PageRank of the page.

Intuitively, pages that a lot of other pages point to are more likely to be the location of the walker than pages with few in-links. But all in-links are not equal. It is better for a page to have a few links from pages that themselves are likely places for the walker to be than to have many links from pages that the walker visits infrequently or not at all. Thus, it is not sufficient to count the in-links to compute the PageRank. Rather, we must solve a recursive equation that formalizes the idea: A Web page is important if many important pages link to it.
23.2.2
To describe how the random walker moves, we can use the transition matrix of the Web. Number the pages 1, 2, ..., n. The matrix M, the transition matrix of the Web, has element m_ij in row i and column j, where:

1. m_ij = 1/r if page j has a link to page i, and there are a total of r >= 1 pages that j links to.

2. m_ij = 0 otherwise.

If every page has at least one link out, then the transition matrix will be (left) stochastic: elements are nonnegative, and its columns each sum to exactly 1. If there are pages with no links out, then the column for such a page will be all 0's, and the transition matrix is said to be substochastic (all columns sum to at most 1).
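As a small illustration of the definition, here is a sketch that builds the transition matrix from a dictionary mapping each page to the list of pages it links to; the 0-based page numbering and the three-page example graph are choices made for the sketch, not part of the text.

    def transition_matrix(links, n):
        # links: dict page j -> list of pages that j links to; pages numbered 0..n-1.
        # Returns M as a list of rows, with M[i][j] = 1/r if j links to i
        # and j has r out-links.
        M = [[0.0] * n for _ in range(n)]
        for j, outs in links.items():
            r = len(outs)
            for i in outs:
                M[i][j] = 1.0 / r
        return M

    # A three-page example: page 0 links to itself and page 1, page 1 links
    # to pages 0 and 2, and page 2 links only to page 1.
    M = transition_matrix({0: [0, 1], 1: [0, 2], 2: [1]}, 3)
    # M == [[0.5, 0.5, 0.0], [0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]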
Example 23.2: As we all know, the Web has been growing exponentially, so if you extrapolate back to 1839, you find that the Web consisted of only three pages. Figure 23.2 shows what the Web looked like in 1839. We have numbered the pages 1, 2, and 3, so the transition matrix for this graph is:
            [ 1/2  1/2   0 ]
        M = [ 1/2   0    1 ]
            [  0   1/2   0 ]
For example, node 3, the page for Microsoft, links only to node 2, the page for Amazon. Thus, in column 3, only row 2 is nonzero, and its value is 1 divided by the number of out-links of node 3, which is 1. As another example, node 1, Yahoo!, links to itself and to Amazon (node 2). Thus, in column 1, row 3 is 0, and rows 1 and 2 are each 1 divided by the number of out-links from node 1, i.e., 1/2.
Suppose y, a, and m represent the fractions of the time the random walker spends at the three pages of Fig. 23.2. Then multiplying the column-vector of these three values by M will not change their values. The reason is that, after a large number of moves, the walker's distribution of possible locations is the same at each step, regardless of where the walker started. That is, the unknowns y, a, and m must satisfy:

        [ y ]   [ 1/2  1/2   0 ] [ y ]
        [ a ] = [ 1/2   0    1 ] [ a ]
        [ m ]   [  0   1/2   0 ] [ m ]
Although there are three equations in three unknowns, you cannot solve these equations for more than the ratios of y, a, and m. That is, if [y, a, m] is a solution to the equations, then [cy, ca, cm] is also a solution, for any constant c. However, since y, a, and m form a probability distribution, we also know y + a + m = 1.

While we could solve the resulting equations without too much trouble, solving large numbers of simultaneous linear equations takes time O(n^3), where n is the number of variables or equations. If n is in the billions, as it would be for the Web of today, it is utterly infeasible to solve for the distribution of the walker's location by Gaussian elimination or another direct solution method. However, we can get a good approximation by the method of relaxation, where we start with some estimate of the solution and repeatedly multiply the estimate by the matrix M. As long as the columns of M each add up to 1, then the sum of the values of the variables will not change, and eventually they converge to the distribution of the walker's location. In practice, 50 to 100 iterations of this process suffice to get very close to the exact solution.

Figure 23.2: The Web in 1839

Example 23.3: Suppose we start with [y, a, m] = [1/3, 1/3, 1/3]. Multiply this vector by M to get:

        [ 2/6 ]   [ 1/2  1/2   0 ] [ 1/3 ]
        [ 3/6 ] = [ 1/2   0    1 ] [ 1/3 ]
        [ 1/6 ]   [  0   1/2   0 ] [ 1/3 ]

At the next iteration, we multiply the new estimate [2/6, 3/6, 1/6] by M, as:

        [ 5/12 ]   [ 1/2  1/2   0 ] [ 2/6 ]
        [ 4/12 ] = [ 1/2   0    1 ] [ 3/6 ]
        [ 3/12 ]   [  0   1/2   0 ] [ 1/6 ]

If we repeat this process, we get the following sequence of vectors:

        [ 9/24  ]   [ 20/48 ]           [ 2/5 ]
        [ 11/24 ],  [ 17/48 ],  ... ,   [ 2/5 ]
        [ 4/24  ]   [ 11/48 ]           [ 1/5 ]

That is, asymptotically, the walker is equally likely to be at Yahoo! or Amazon, and only half as likely to be at Microsoft as either one of the other pages.
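The relaxation computation of Example 23.3 can be sketched in a few lines of Python; the matrix and starting vector are those of the example, and 50 iterations is an arbitrary but typical choice.

    def relax(M, v, iterations=50):
        # Repeatedly replace the estimate v by M * v.
        n = len(v)
        for _ in range(iterations):
            v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        return v

    M = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 1.0],
         [0.0, 0.5, 0.0]]              # the Web of Fig. 23.2
    print(relax(M, [1/3, 1/3, 1/3]))   # approaches [2/5, 2/5, 1/5]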
23.2.3
The graph of Fig. 23.2 is atypical of the Web, not only because of its size, but for two structural reasons:
1. Some Web pages (called dead ends) have no out-links. If the random walker arrives at such a page, there is no place to go next, and the walk ends.

2. There are sets of Web pages (called spider traps) with the property that if you enter that set of pages, you can never leave, because there are no links from any page in the set to any page outside the set.

Any dead end is, by itself, a spider trap. However, one also finds on the Web spider traps all of whose pages have out-links. For example, any page that links only to itself is a spider trap. If a spider trap can be reached from outside, then the random walker may wind up there eventually, and never leave. Put another way, applying relaxation to the matrix of the Web with spider traps can result in a limiting distribution where all probabilities outside a spider trap are 0.
Figure 23.3: The Web, if Microsoft becomes a spider trap

Example 23.4: Suppose Microsoft decides to link only to itself, rather than to Amazon, resulting in the Web of Fig. 23.3. Then the set of pages consisting of Microsoft alone is a spider trap, and that trap can be reached from either of the other pages. The matrix M for this Web graph is:

            [ 1/2  1/2   0 ]
        M = [ 1/2   0    0 ]
            [  0   1/2   1 ]
Here is the sequence of approximate distributions that is obtained if we start, as we did in Example 23.3, with [y, a, m] = [1/3, 1/3, 1/3] and repeatedly multiply by the matrix M for Fig. 23.3:

        [ 1/3 ]   [ 2/6 ]   [ 3/12 ]   [ 5/24  ]   [ 8/48  ]           [ 0 ]
        [ 1/3 ],  [ 1/6 ],  [ 2/12 ],  [ 3/24  ],  [ 5/48  ],  ... ,   [ 0 ]
        [ 1/3 ]   [ 3/6 ]   [ 7/12 ]   [ 16/24 ]   [ 35/48 ]           [ 1 ]
That is, with probability 1, the walker will eventually wind up at the Microsoft page and stay there. If we interpret these PageRank probabilities as importance of pages, then the Microsoft page has gathered all importance to itself simply by choosing not to link outside. That situation intuitively violates the principle that other pages, not you yourself, should determine your importance on the Web. The other problem we mentioned, dead ends, also causes the PageRank not to reflect the importance of pages, as we shall see in the next example.
Example 23.5: Suppose that instead of linking to itself, Microsoft links nowhere, as suggested in Fig. 23.4. The matrix M for this Web graph is:

            [ 1/2  1/2   0 ]
        M = [ 1/2   0    0 ]
            [  0   1/2   0 ]
Notice that this matrix is not stochastic, because its columns do not all add up to 1. If we try to apply the method of relaxation to this matrix, with initial vector [1/3, 1/3, 1/3], we get the sequence:

        [ 1/3 ]           [ 0 ]
        [ 1/3 ],  ... ,   [ 0 ]
        [ 1/3 ]           [ 0 ]

That is, the walker will eventually arrive at Microsoft, and at the next step has nowhere to go. Eventually, the walker disappears.
23.2.4
The solution to both spider traps and dead ends is to limit the time the random walker is allowed to wander at random. We pick a constant β < 1, typically in the range 0.8 to 0.9, and at each step, we let the walker follow a random out-link, if there is one, with probability β. With probability 1 - β (called the taxation rate), we remove that walker and deposit a new walker at a randomly chosen Web page. This modification solves both problems. If the walker gets stuck in a spider trap, it doesn't matter, because after a few time steps, that walker will disappear and be replaced by a new walker. If the walker reaches a dead end and disappears, a new walker will take over shortly.

Example 23.6: Let us use β = 0.8 and reformulate the calculation of PageRank for the Web of Fig. 23.3. If p_new and p_old are the new and old distributions of the location of the walker after one iteration, the relationship between these two can be expressed as:

                      [ 1/2  1/2   0 ]                 [ 1/3 ]
        p_new = 0.8 * [ 1/2   0    0 ] * p_old + 0.2 * [ 1/3 ]
                      [  0   1/2   1 ]                 [ 1/3 ]
That is, with probability 0.8, we multiply p_old by the matrix of the Web to get the new location of the walker, and with probability 0.2 we start with a new walker at a random place. If we start with p_old = [1/3, 1/3, 1/3] and repeatedly compute p_new and then replace p_old by p_new, we get the following sequence of approximations to the asymptotic distribution of the walker:

        [ .333 ]   [ .333 ]   [ .280 ]   [ .259 ]           [ 7/33  ]
        [ .333 ],  [ .200 ],  [ .200 ],  [ .179 ],  ... ,   [ 5/33  ]
        [ .333 ]   [ .467 ]   [ .520 ]   [ .563 ]           [ 21/33 ]
Notice that Microsoft, because it is a spider trap, gets a large share of the importance. However, the effect of the spider trap has been mitigated considerably by the policy of redistributing the walker with probability 0.2.

The same idea fixes dead ends as well as spider traps. The resulting matrix that describes transitions is substochastic, since a column will sum to 0 if there are no out-links. Thus, there will be a small probability that the walker is nowhere at any given time. That is, the sums of the probabilities of the walker being at each of the pages will be less than one. However, the relative sizes of the probabilities will still be a good measure of the importance of the page.
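A sketch of the taxed iteration of Example 23.6 appears below; it reuses the earlier relaxation idea but mixes in the teleport term with β = 0.8, and the number of iterations is again an arbitrary choice.

    def taxed_pagerank(M, beta=0.8, iterations=100):
        n = len(M)
        v = [1.0 / n] * n
        for _ in range(iterations):
            v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta) / n
                 for i in range(n)]
        return v

    M_trap = [[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]]        # Fig. 23.3, with Microsoft a spider trap
    print(taxed_pagerank(M_trap))     # approaches [7/33, 5/33, 21/33]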
Teleportation of Walkers
Another view of the random-walking process is that there are no new walkers, but rather the walker teleports to a random page with probability 1 - β. For this view to make sense, we have to assume that if the walker is at a dead end, then the probability of teleport is 100%. Equivalently, we can scale up the probabilities to sum to one at each step of the iteration. Doing so does not affect the ratios of the probabilities, and therefore the relative PageRank of pages remains the same. For instance, in Example 23.7, the final PageRank vector would be [35/81, 25/81, 21/81].
Example 23.7: Let us reconsider Example 23.5, using β = 0.8. The formula for iteration is now:

                      [ 1/2  1/2   0 ]                 [ 1/3 ]
        p_new = 0.8 * [ 1/2   0    0 ] * p_old + 0.2 * [ 1/3 ]
                      [  0   1/2   0 ]                 [ 1/3 ]

Starting with p_old = [1/3, 1/3, 1/3], we get the following sequence of approximations to the asymptotic distribution of the walker:

        [ .333 ]   [ .333 ]   [ .280 ]   [ .259 ]           [ 35/165 ]
        [ .333 ],  [ .200 ],  [ .200 ],  [ .179 ],  ... ,   [ 25/165 ]
        [ .333 ]   [ .200 ]   [ .147 ]   [ .147 ]           [ 21/165 ]
Notice that these probabilities do not sum to one, and there is slightly more than 50% probability that the walker is lost at any given time. However, the ratio of the importances of Yahoo! and Amazon is the same as in Example 23.6. That makes sense, because in neither Fig. 23.3 nor Fig. 23.4 are there links from the Microsoft page to influence the importance of Yahoo! or Amazon.
23.2.5
Exercise 23.2.1: Compute the PageRank of the four nodes in Fig. 23.5, assuming no taxation.

Exercise 23.2.2: Compute the PageRank of the four nodes in Fig. 23.5, assuming a taxation rate of: (a) 10% (b) 20%.

Exercise 23.2.3: Repeat Exercise 23.2.2 for the Web graph of

i. Fig. 23.6.

ii. Fig. 23.7.
! Exercise 23.2.4: Suppose that we want to use the map-reduce framework of Section 20.2 to compute one iteration of the PageRank computation. That is, we are given data that represents the transition matrix of the Web and the current estimate of the PageRank for each page, and we want to compute the next estimate by multiplying the old estimate by the matrix of the Web. Suppose it is possible to break the data into chunks that correspond to sets of pages, that is, the PageRank estimates for those pages and the columns of the matrix for the same pages. Design map and reduce functions that implement the iteration, so that the computation can be partitioned onto any number of processors.
23.3
The calculation of PageRank is unbiased as to the content of pages. However, there are several reasons why we might want to bias the calculation to favor certain pages. For example, suppose we are interested in answering queries only about sports. We would want to give a higher PageRank to a page that discusses some sport than we would to another page that had similar links from the Web, but did not discuss sports. Or, we might want to detect and eliminate spam pages: those that were placed on the Web only to increase the PageRank of some other pages, or which were the beneficiaries of such planned attempts to increase PageRank illegitimately. In this section, we shall show how to modify the PageRank computation to favor pages of a certain type. We then show how the technique yields solutions to the two problems mentioned above.
23.3.1 Teleport Sets
In Section 23.2.4, we taxed each page 1 - β of its estimated PageRank and distributed the tax equally among all pages. Equivalently, we allowed random walkers on the graph of the Web to choose, with probability 1 - β, to teleport to a randomly chosen page. We are forced to have some taxation scheme in any calculation of PageRank, because of the presence of dead ends and spider traps on the Web. However, we are not obliged to distribute the tax (or random walkers) equally. We could, instead, distribute the tax or walkers only among a selected set of nodes, called the teleport set. Doing so has the effect not only of increasing the PageRank of nodes in the teleport set, but of increasing the PageRank of the nodes they link to, and, with diminishing effect, the nodes reachable from the teleport set by paths of lengths two, three, and so on.

Example 23.8: Let us reconsider the original Web graph of Fig. 23.2, which we reproduce here as Fig. 23.8. Assume we are interested only in retail sales, so we choose a teleport set that consists of Amazon alone. We shall use β = 0.8, i.e., a taxation rate of 20%. If y, a, and m are variables representing the PageRanks
of Yahoo!, Amazon, and Microsoft, respectively, then the equations we need to solve are:

        [ y ]         [ 1/2  1/2   0 ] [ y ]         [ 0 ]
        [ a ] = 0.8 * [ 1/2   0    1 ] [ a ] + 0.2 * [ 1 ]
        [ m ]         [  0   1/2   0 ] [ m ]         [ 0 ]
The vector [0, 1, 0] added at the end represents the fact that all the tax is distributed equally among the members of the teleport set. In this case, there is only one member of the teleport set, so the vector has 1 for that member (Amazon) and 0's elsewhere. We can solve the equations by relaxation, as we have done before. However, the example is small enough to apply Gaussian elimination and get the exact solution; it is y = 10/31, a = 15/31, and m = 6/31. The expected thing has happened; the PageRank of Amazon is elevated, because it is a member of the teleport set.

The general rule for setting up the equations in a topic-specific PageRank problem is as follows. Suppose there are k pages in the teleport set. Let t be a column-vector that has 1/k in the positions corresponding to members of the teleport set and 0 elsewhere. Let 1 - β be the taxation rate, and let M be the transition matrix of the Web. Then we must solve by relaxation the following iterative rule:

        p_new = β * M * p_old + (1 - β) * t
Example 23.8 was an illustration of this process, although we set both p_new and p_old to [y, a, m] and solved for the fixed point of the equations, rather than iterating to converge to the solution.
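The iterative rule above can be sketched by generalizing the earlier taxation code to an arbitrary teleport vector t; the teleport set consisting of Amazon alone and β = 0.8 reproduce Example 23.8, with pages numbered 0 (Yahoo!), 1 (Amazon), 2 (Microsoft) for the sketch.

    def topic_pagerank(M, teleport_set, beta=0.8, iterations=100):
        n = len(M)
        t = [1.0 / len(teleport_set) if i in teleport_set else 0.0
             for i in range(n)]
        v = t[:]                      # any starting distribution works
        for _ in range(iterations):
            v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta) * t[i]
                 for i in range(n)]
        return v

    M = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 1.0],
         [0.0, 0.5, 0.0]]             # the Web of Fig. 23.2 / Fig. 23.8
    print(topic_pagerank(M, {1}))     # approaches [10/31, 15/31, 6/31]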
23.3.2
Suppose we had a set of pages that we were certain were about a particular topic, say sports. We make these pages the teleport set, which has the effect of increasing their PageRank. However, it also increases the PageRank of pages linked to by pages in the teleport set, the pages linked to by those pages, and so on. We hope that many of these pages are also about sports, even if they are not in the teleport set. For example, the page mlb.com, the home page for major-league baseball, would probably be in the teleport set for the sports topic. That page links to many other pages on the same site: pages that sell baseball-related products, offer baseball statistics, and so on. It also links to news stories about baseball. All these pages are, in some sense, about sports.

Suppose we issue a search query batter. If the PageRank that the search engine uses to rank the importance of pages were the general PageRank (i.e., the version where all pages are in the teleport set), then we would expect to find pages about baseball batters, but also cupcake recipes. If we used the PageRank that is specific to sports, i.e., one where only sports pages are in the teleport set, then we would expect to find, among the top-ranked pages, nothing about cupcakes, but only pages about baseball or cricket.

It is not hard to reason that the home page for a major-league sport will be a good page to use in the teleport set for sports. However, we might want to be sure we got a good sample of pages that were about sports into our teleport set, including pages we might not think of, even if we were an expert on the subject. For example, starting at major-league baseball might not get us to pages for the Springfield Little League, even though parents in Springfield would want that page in response to a search involving the words baseball and Springfield. To get a larger and wider selection of pages on sports to serve as our teleport set, some approaches are:

1. Start with a curated selection of pages. For example, the Open Directory (www.dmoz.org) has human-selected pages on sixteen topics, including sports, as well as many subtopics.

2. Learn the keywords that appear, with unusually high frequency, in a small set of pages on a topic. For instance, if the topic were sports, we would expect words like ball, player, and goal to be among the selected keywords. Then, examine the entire Web, or a larger subset thereof, to identify other pages that also have unusually high concentrations of some of these keywords.

The next problem we have to solve, in order to use a topic-specific PageRank effectively, is determining which topic the user is interested in. Several possibilities exist.

a) The easiest way is to ask the user to select a topic.

b) If we have keywords associated with different topics, as described in (2) above, we can try to discover the likely topic on the user's mind. We can
examine pages that we think are important to the user, and find, in these pages, the frequency of keywords that are associated with each of the topics. Topics whose keywords occur frequently in the pages of interest are assumed to be the preference(s) of the user. To find these pages of interest, we might:

i. Look at the pages the user has bookmarked.

ii. Look at the pages the user has recently searched.
23.3.3 Link Spam
Another application of topic-specific PageRank is in combating link spam. Because it is known that many search engines use PageRank as part of the formula to rank pages by importance, it has become financially advantageous to invest in mechanisms to increase the PageRank of your pages. This observation spawned an industry: spam farming. Unscrupulous individuals create networks of millions of Web pages, whose sole purpose is to accumulate and concentrate PageRank on a few pages.
Figure 23.9: A spam farm concentrates PageRank in page T

A simple structure that accumulates PageRank in a target page T is shown in Fig. 23.9. Suppose that, in a PageRank calculation with taxation 1 - β, the pages shown in the bottom row of Fig. 23.9 get, from the outside, a total PageRank of r, and let the total PageRank of these pages be x. Also, let the PageRank of page T be t. Then, in the limit, t = βx, because T gets all the PageRank of the other pages, except for the tax. Also, x = r + βt, because the other pages collectively get r from the outside and a total of βt from T. If we solve these equations for t, we get t = βr/(1 - β²). For instance, if β = 0.85, then we have amplified the external PageRank by a factor of 0.85/(1 - (0.85)²) = 3.06. Moreover, we have concentrated this PageRank in a single page, T.

Of course, if r = 0 then T still gets no PageRank at all. In fact, it is cut off from the rest of the Web and would be invisible to search engines. However, it is not hard for spam farmers to get a reasonable value for r. As one example, they create links to the spam farm from publicly accessible blogs, with messages like "I agree with you. See xl23456.mySpamFarm.com." Moreover, if the number
of pages in the bottom row is large, and the tax is distributed among all pages, then r will include the share of the tax that is given to these pages. That is why spam farmers use many pages in their structure, rather than just one or two.
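The amplification factor β/(1 - β²) derived above is easy to tabulate; the sketch below simply evaluates it for a few illustrative values of β.

    # Amplification t/r = beta / (1 - beta**2) achieved by the spam farm
    # structure of Fig. 23.9.
    for beta in (0.80, 0.85, 0.90):
        print(beta, beta / (1 - beta**2))
    # beta = 0.85 gives roughly 3.06, as in the text.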
23.3.4
A search engine needs to detect pages that are on the Web for the purpose of creating link spam. A useful tool is to compute the TrustRank of pages. Although the original definition is somewhat different, we may take the TrustRank to be the topic-specific PageRank computed with a teleport set consisting of only trusted pages. Two possible methods for selecting the set of trusted pages are:

1. Examine pages by hand and do an evaluation of their role on the Web. It is hard to automate this process, because spam farmers often copy the text of perfectly legitimate pages and populate their spam farm with pages containing that text plus the necessary links.

2. Start with a teleport set that is likely to contain relatively little spam. For example, it is generally believed that the set of university home pages forms a good choice for a widely distributed set of trusted pages. In fact, it is likely that modern search engines routinely compute PageRank using a teleport set similar to this one.

Either of these approaches tends to assign lower PageRank to spam pages, because it is rare that a trusted page would link to a spam page. Since TrustRank, like normal PageRank, is computed with a positive taxation factor 1 - β, the trust imparted by a trusted page attenuates the further we get from that trusted page. The TrustRank of pages may substitute for PageRank, when the search engine chooses pages in response to a query. So doing reduces the likelihood that spam pages will be offered to the queryer.

Another approach to detecting link-spam pages is to compute the spam mass of pages as follows:

a) Compute the ordinary PageRank, that is, using all pages as the teleport set.

b) Compute the TrustRank of all pages, using some reasonable set of trusted pages.

c) Compute the difference between the PageRank and TrustRank for each page. This difference is the negative TrustRank.

d) The spam mass of a page is the ratio of its negative TrustRank to its ordinary PageRank, that is, the fraction of its PageRank that appears to come from spam farms.
While TrustRank alone can bias the PageRank to minimize the effect of link spam, computing the spam mass also allows us to see where the link spam is coming from. Sites that have many pages with high spam mass may be owned by spam farmers, and a search engine can eliminate from its database all pages from such sites.
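A sketch of the spam-mass computation follows, assuming we already have the two rank vectors (for example, from the topic_pagerank sketch above run with different teleport sets); pages with no PageRank at all are assigned spam mass 0 as an arbitrary convention.

    def spam_mass(pagerank, trustrank):
        # Steps (a)-(d): negative TrustRank is PageRank minus TrustRank, and
        # the spam mass is its fraction of the ordinary PageRank.
        return [(pr - tr) / pr if pr > 0 else 0.0
                for pr, tr in zip(pagerank, trustrank)]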
23.3.5
Exercise 23.3.1: Compute the topic-specific PageRank for Fig. 23.5, assuming

a) Only A is in the teleport set.

b) The teleport set is {A, B}.

Assume a taxation rate of 20%.

Exercise 23.3.2: Repeat Exercise 23.3.1 for the graph of Fig. 23.6.

Exercise 23.3.3: Repeat Exercise 23.3.1 for the graph of Fig. 23.7.

!! Exercise 23.3.4: Suppose we fix the taxation rate and compute the topic-specific PageRank for a graph G, using only node a as the teleport set. We then do the same using only another node b as the teleport set. Prove that the average of these PageRanks is the same as what we get if we repeated the calculation with {a, b} as the teleport set.

!! Exercise 23.3.5: What is the generalization of Exercise 23.3.4 to a situation where there are two disjoint teleport sets S1 and S2, perhaps with different numbers of elements? That is, suppose we compute the PageRanks with just S1 and then just S2 as the teleport set. How could we use these results to compute the PageRank with S1 ∪ S2 as the teleport set?
23.4 Data Streams
We now turn to an extension of the ideas contained in the traditional DBMS to deal with data streams. As the Internet has made communication among machines routine, a class of applications has developed that stresses the traditional model of a database system. Recall that a typical database system is primarily a repository of data. Input of data is done as part of the query language or a special data-load utility, and is assumed to occur at a rate controlled by the DBMS. However, in some applications, the inputs arrive at a rate the DBMS cannot control. For example, Yahoo! may wish to record every click, that is, every page request made by any user anywhere. The sequence of URLs representing these requests arrives at a very high rate that is determined only by the desires of Yahoo!'s customers.
23.4.1
If we are to allow queries on such streams of data, we need some new mechanisms. While we may be able to store the data on high-rate streams, we cannot do so in a way that allows instantaneous queries using a language like SQL. Further, it is not even clear what some queries mean; for instance, how can we take the join of two streams, when we never can see the completed streams? The rough structure of a data-stream-management system (DSMS) is shown in Fig. 23.10.
The system accepts data streams as input, and also accepts queries. These queries may be of two kinds:

1. Conventional ad-hoc queries.

2. Standing queries that are stored by the system and run on the input stream(s) at all times.

Example 23.9: Whether ad-hoc or standing, queries in a DSMS need to be expressed so they can be answered using limited portions of the streams. As an example, suppose we are receiving streams of radiation levels from sensors around the world. While the DSMS cannot store and query streams from arbitrarily far back in time, it can store a sliding window of each input stream. It might be able to keep on disk, in the working storage referred to in Fig. 23.10, all readings from all sensors for the past 24 hours. Data from further back in time could be dropped, could be summarized (e.g., replaced by the daily average), or copied in its entirety to the permanent store (archive).
An ad-hoc query might ask for the average radiation level over the past hour for all locations in North Korea. We can answer this query, because we have all data from all streams over the past 24 hours in our working store. A standing query might ask for a notification if any reading on any stream exceeds a certain limit. As each data element of each stream enters the system, it is compared with the threshold, and an output is made if the entering value exceeds the threshold. This sort of query can be answered from the streams themselves, although we would need to examine the working store if, say, we asked to be alerted if the average over the past 5 minutes for any one stream exceeded the threshold.
23.4.2 Stream Applications
Before addressing the mechanics of data-stream-management systems, let us look at some of the applications where the data is in the form of a stream or streams.

1. Click Streams. As we mentioned, a common source of streams is the clicks by users of a large Web site. A Web site might wish to analyze the clicks it receives for a number of reasons; an increase in clicks on a link may indicate that the link is broken, or that it has become of much more interest recently. A search engine may want to analyze clicks on the links to ads that it shows, to determine which ads are most attractive.

2. Packet Streams. We may wish to analyze the sources and destinations of IP packets that pass through a switch. An unusual increase in packets for a destination may warn of a denial-of-service attack. Examination of the recent history of destinations may allow us to predict congestion in the network and to reroute packets accordingly.

3. Sensor Data. We also mentioned a hypothetical example of a network of radiation sensors. There are many kinds of sensors whose outputs need to be read and considered collectively, e.g., tsunami-warning sensors that record ocean levels at subsecond frequencies, or the signals that come from seismometers around the world, recording the shaking of the earth. Cities that have networks of security cameras can have the video from these cameras read and analyzed for threats.

4. Satellite Data. Satellites send back to earth incredible streams of data, often petabytes per day. Because scientists are reluctant to throw any of this data away, it is often stored in raw form in archival memory systems. These are half-jokingly referred to as write-only memory. Useful products are extracted from the streams as they arrive and stored in more accessible storage places or distributed to scientists who have made standing requests for certain kinds of data.
5. Financial Data. Trades of stocks, commodities, and other financial instruments are reported as a stream of tuples, each representing one financial transaction. These streams are analyzed by software that looks for events or patterns that trigger actions by traders. The most successful traders have access to the largest amount of data and process it most quickly, because opportunities involving stock trades often last for only fractions of a second.
23.4.3
We shall now offer a data model useful for discussing algorithms on data streams. First, we shall assume the following about the streams themselves:

Each stream consists of a sequence of tuples. The tuples have a fixed relation schema (list of attributes), just as the tuples of relations do. However, unlike relations, the sequence of tuples in a stream may be unbounded.

Each tuple has an associated arrival time, at which time it becomes available to the data-stream-management system for processing. The DSMS has the option of placing it in the working storage or in the permanent storage, or of dropping the tuple from memory altogether. The tuple may also be processed in simple ways before storing it.

For any stream, we can define a sliding window (or just window), which is a set consisting of the most recent tuples to arrive. A window can be time-based with a constant τ, in which case it consists of the tuples whose arrival time is between the current time t and t - τ. Or, a window can be tuple-based, in which case it consists of the most recent n tuples to arrive, for some fixed n.

We shall describe windows on a stream S by the notation S[W], where W is the window description, either:

1. Rows n, meaning the most recent n tuples of the stream, or

2. Range τ, meaning all tuples that arrived within the previous amount of time τ.

Example 23.10: Let Sensors(sensID, temp, time) be a stream, each of whose tuples represents a temperature reading of temp at a certain time by the sensor named sensID. It might be more common for each sensor to produce its own stream, but all readings could also be merged into one stream if the data were accumulated outside the data-stream-management system.

The expression Sensors [Rows 1000] describes a window on the Sensors stream consisting of the most recent 1000 tuples. The expression Sensors [Range 10 Seconds]
describes a window on the same stream consisting of all tuples that arrived in the past 10 seconds.
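To make the two kinds of windows concrete, here is a small Python sketch of tuple-based and time-based sliding windows over an in-memory stream of (sensID, temp, time) tuples. It is only a toy model of what a DSMS maintains internally, and it uses each tuple's time attribute as its arrival time.

    from collections import deque

    class RowsWindow:
        # Tuple-based window: keeps the most recent n tuples.
        def __init__(self, n):
            self.tuples = deque(maxlen=n)   # old tuples fall off automatically
        def add(self, t):
            self.tuples.append(t)

    class RangeWindow:
        # Time-based window: keeps tuples whose time is within tau of the newest.
        def __init__(self, tau):
            self.tau = tau
            self.tuples = deque()
        def add(self, t):                    # t = (sensID, temp, time)
            self.tuples.append(t)
            now = t[2]
            while self.tuples and self.tuples[0][2] < now - self.tau:
                self.tuples.popleft()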
23.4.4
Windows allow us to convert streams into relations. That is, the window expressions as in Example 23.10 describe a relation at any time. The contents of the relation typically change rapidly. For example, consider the expression Sensors [Rows 1000]. Each time a new tuple of Sensors arrives, it is inserted into the described relation, and the oldest of the tuples is deleted. For the expression Sensors [Range 10 Seconds], we must insert tuples of the stream when they arrive and delete tuples 10 seconds after they arrive.

Window expressions can be used like relations in an extended SQL for streams. The following example suggests what such an extended SQL looks like.

Example 23.11: Suppose we would like to know, for each sensor, the highest recorded temperature to arrive at the DSMS in the past hour. We form the appropriate time-based window and query it as if it were an ordinary relation. The query looks like:

    SELECT sensID, MAX(temp)
    FROM Sensors [Range 1 Hour]
    GROUP BY sensID;

This query can be issued as an ad-hoc query, in which case it is executed once, based on the window that exists at the instant the query is issued. Of course the DSMS must have made available to the query processor a window on Sensors of at least one hour's length.2 The same query could be a standing query, in which case the current result relation should be maintained as if it were a materialized view that changes from time to time. In Section 23.4.5 we shall consider an alternative way to represent the result of this query as a standing query.

Window relations can be combined with other window relations, or with ordinary relations, those that do not come from streams. An example will suggest what is possible.

Example 23.12: Suppose that our DSMS has the stream Sensors as an input stream and also maintains in its working storage an ordinary relation

    Calibrate(sensID, mult, add)
2. Strictly speaking, the DSMS only needs to have retained enough information to answer the query. For example, it could still answer the query at any time if it threw away every tuple for which there was a later reading from the same sensor with a higher temperature.
which gives a multiplicative factor and an additive term that are used to correct the reading from each sensor. The query

    SELECT MAX(mult*temp + add)
    FROM Sensors [Range 1 Hour], Calibrate
    WHERE Sensors.sensID = Calibrate.sensID;

finds the highest, properly calibrated temperature reported by any sensor in the past hour. Here, we have joined a window relation from Sensors with the ordinary relation Calibrate.

We can also compute joins of window relations. The following query illustrates a self-join by means of a subquery, but all the SQL tools for expressing joins are available.

Example 23.13: Suppose we wanted to give, for each sensor, its maximum temperature over the past hour (as in Example 23.11), but we also wanted the resulting tuples to give the most recent time at which that maximum temperature was recorded. Figure 23.11 is one way to write the query using window relations.

    SELECT s.sensID, s.temp, s.time
    FROM Sensors [Range 1 Hour] s
    WHERE NOT EXISTS (
        SELECT *
        FROM Sensors [Range 1 Hour]
        WHERE sensID = s.sensID AND
              ( temp > s.temp OR
                (temp = s.temp AND time > s.time) )
    );

Figure 23.11: Including time with the maximum temperature readings of sensors

That is, the subquery checks whether there is not another tuple in the window relation Sensors [Range 1 Hour] that refers to the same sensor as the tuple s, and has either a higher temperature or has the same temperature but a more recent time. If no such tuple exists, then the tuple s is part of the result.
23.4.5
When we issue queries such as that of Example 23.11 as standing queries, the resulting relations change frequently. Maintaining these relations as materialized views may result in a lot of effort making insertions and deletions that no one ever looks at. An alternative is to convert the relation that is the result of the query back into streams, which may be processed like any other streams.
For example, we can issue an ad-hoc query to construct the query result at a particular time when we are interested in its value.

If R is a relation, define Istream(R) to be the stream consisting of each tuple that is inserted into R. This tuple appears in the stream at the time the insertion occurs. Similarly, define Dstream(R) to be the stream of tuples deleted from R; each tuple appears in this stream at the moment it is deleted. An update to a tuple can be represented by an insertion and deletion at the same time.

Example 23.14: Let R be the relation constructed by the query of Example 23.13, that is, the relation that has, for each sensor, the maximum temperature it recorded in any tuple that arrived in the past hour, and the time at which that temperature was most recently recorded. Then Istream(R) has a tuple for every event in which a new tuple is added to R. Note that there are two events that add tuples to R:

1. A Sensors tuple arrives with a temperature that is at least as high as that of any tuple currently in R with the same sensor ID. This tuple is inserted into R and becomes an element of Istream(R) at that time.

2. The current maximum temperature for a sensor i was recorded an hour ago, and there has been at least one tuple for sensor i in the Sensors stream in the past hour. In that case, the new tuple for R and for Istream(R) is the Sensors tuple for sensor i that arrived in the past hour such that no other tuple for i that also arrived in the past hour has:

   (a) A higher temperature, or

   (b) The same temperature and a more recent time.

The same two events may generate tuples for the stream Dstream(R) as well. In (1) above, if there was any other tuple in R for the same sensor, then that tuple is deleted from R and becomes an element of Dstream(R). In (2), the hour-old tuple of R for sensor i is deleted from R and becomes an element of Dstream(R).

If we compute the Istream and Dstream for a relation like that constructed by the query of Fig. 23.11, then we do not have to maintain that relation as a materialized view. Rather, we can query its Istream and Dstream to answer queries about the relation when we wish.

Example 23.15: Suppose we form the Istream I and the Dstream D for the relation R of Fig. 23.11. When we wish, we can issue an ad-hoc query to these streams. For instance, suppose we want to find the maximum temperature recorded by sensor 100 that arrived over the past hour. That will be the temperature in the tuple in I for sensor 100 that:

1. Has a time in the past hour.
2. Was not deleted from R (i.e., is not in D restricted to the past hour).

This query can be written as shown in Fig. 23.12. The keyword Now represents the current time. Note that we must check both that a tuple of I arrived in the past hour and that it has a timestamp within the past hour. To see why these conditions are not the same, consider the case of a tuple of I that arrived in the past hour, because it became the maximum temperature t for sensor 100 thirty minutes ago. However, that temperature itself has an associated time that is eighty minutes ago. The reason is that a temperature higher than t was recorded by sensor 100 ninety minutes ago. It wasn't until 30 minutes ago that t became the highest temperature for sensor 100 in the preceding sixty minutes.

    (SELECT *
     FROM I [Range 1 Hour]
     WHERE sensID = 100 AND time >= [Now - 1 Hour])
    EXCEPT
    (SELECT *
     FROM D [Range 1 Hour]
     WHERE sensID = 100);

Figure 23.12: Querying an Istream and a Dstream
23.4.6
Exercise 23.4.1: Using the Sensors stream from Example 23.11, write the following queries:

a) Find the oldest tuple (lowest time) among the last 1000 tuples to arrive.

b) Find those sensors for which at least two readings have arrived in the past minute.

! c) Find those sensors for which more readings arrived in the past minute than arrived between one and two minutes ago.

Exercise 23.4.2: Following the example of sensor data from this section, suppose that the following temperature-time readings are generated by sensor 100, and each arrives at the DSMS at the time generated: (80,0), (70,50), (60,70), (65,100). Times are in minutes. If R is the query of Fig. 23.11, what are the tuples of Istream(R) and Dstream(R), and at what time is each of these tuples generated?

Exercise 23.4.3: Suppose our stream consists of baskets of items, as in the market-basket model of Section 22.1.1. Since we assume elements of streams are tuples, the contents of a basket must be represented by several consecutive tuples with the schema Baskets(basket, item). Write the following queries:
a) Find those items that have appeared in at least 1% of the baskets that arrived over the past hour.3

b) Find those pairs of items that have appeared in at least twice as many baskets in the previous half hour as in the half hour before that.

c) Find the most frequent pair(s) of items over the past hour.
23.5
When processing streams, there are a number of problems that become quite hard, even though the analogous problems for relations are easy. In this section, we shall concentrate on representing the contents of windows more succinctly than by listing the current set of tuples in the window. Surely, we are not then able to answer all possible queries about the window, but if we know what kinds of queries we are expected to support, we might be able to compress the window and answer those queries. Another possibility is that we cannot compress the window and answer our selected queries exactly, but we can guarantee to be able to answer them within a fixed error bound.

We shall consider two fundamental problems of this type. First, we consider binary streams (streams of 0's and 1's), and ask whether we can answer queries about the number of 1's in any time range contained within the window. Obviously, if we keep the exact sequence of bits and their timestamps, we can manage to answer those questions exactly. However, it is possible to compress the data significantly and still answer this family of queries within a fixed error bound. Second, we address the problem of counting the number of different values within a sliding window. Here is another family of problems that cannot be answered exactly without keeping the data in the window exactly. However, we shall see that a good approximation is possible using much less space than the size of the window.
23.5.1
Motivation
Suppose we wish to have a stream with a window of a billion integers. Such a window could fit in a large main memory of four gigabytes, and it would have no trouble fitting on disk. Surely, if we are only interested in recent data from the stream, a billion tuples should suffice. But what if there are a million such streams? For example, we might be trying to integrate the data from a million sensors placed around a city. Or we might be given a stream of market baskets, and try to compute the frequency, over any time range, of all sets of items contained in
³Technically, some but not all of a basket could arrive within the past hour. Ignore this edge effect, and assume that either all or none of a basket's tuples appear in any given window.
those baskets. In that case, we need a window for each set, with bits indicating whether or not that set was contained in each of the baskets. In situations such as these, the amount of space needed to store all the windows exceeds what is available using disk storage. Moreover, for efficient response, we might want to keep all windows in main memory. Then, a few windows of length a billion, or a few thousand windows of length a million, exceed what even a large main memory can hold. We are thus led to consider compressing the data in windows. Unfortunately, even some very simple queries cannot be answered if we compress the window, as the next example suggests.

Example 23.16: Suppose we have a sliding window that stores stream elements that are integers, and we have a standing query that asks for an alert any time the sum of the integers in the window exceeds a certain threshold t. We thus only need to maintain the sum of the integers in the window in order to answer this query. When a new integer comes in, we can add it to the sum. However, at certain times, integers leave the window and must be subtracted from the sum. If the window is tuple-based, then we must subtract the oldest integer from the sum each time a new integer arrives. If the window is time-based, then when the time of an integer in the window expires, it must be subtracted from the sum.

Unfortunately, if we don't know exactly what integers are in the window, or we don't know their order of arrival (for tuple-based windows) or their time of arrival (for time-based windows), then we cannot maintain the sum properly. To see why we cannot compress, observe the following. If there is any compression at all, then two different window-contents, W1 and W2, must have the same compressed value. Since W1 ≠ W2, there is some time s at which the integers for time s are different in W1 and W2. Consider what happens when s is the oldest time in the window, and another integer arrives. We must do different subtractions from the sum to maintain the sums for W1 and W2. But since the compressed representation does not tell us which of W1 and W2 is the true contents of the window, we cannot maintain the proper sum in both cases.
Example 23.16 tells us that we cannot compress a sliding window of integers and still get exact answers for the sum at all times. However, suppose we are willing to accept an approximate sum. Then there are many options, and we shall look at a very simple one here. We can group the stream elements into groups of 100; say the first hundred elements of the stream ever to arrive, then the next hundred, and so on. Each group is represented by the sum of the elements in that group. Thus, we have a compression factor of 100; i.e., the window is represented by 1/100th of the number of integers that are theoretically in the window. Suppose for simplicity that we have a tuple-based window, and the number of tuples in the window is a multiple of 100. When the number of stream elements that have arrived is also a multiple of 100, then we can get the sum of the elements in the window exactly, just by summing the sums of the groups.
Suppose another integer arrives. That integer starts another group, so we keep it as the sum of that group. Now, we can only estimate the sum of all the integers in the window. The reason is that the oldest group has only 99 of its 100 members still in the window, and we don't know the value of the integer, from that group, that is no longer in the window. The best estimate for the departed integer is 1% of the sum of the oldest group. That is, we estimate the sum of all the integers in the window by taking 0.99 times the recorded sum of the oldest group, plus the recorded sums of all the other groups. Forty-nine arrivals later, there are fifty integers in the group formed from the most recent arrivals, and the window includes exactly half of the oldest group. Our best estimate of the sum of the fifty integers of the oldest group that remain in the window is half that group's sum. After another fifty arrivals, the most recent group is complete, and the oldest group has left the window entirely. We can therefore drop the recorded sum of the oldest group and prepare to start another group with the next arrival.

Intuitively, this method gives a good approximation to the sum. If the integers are nonnegative, and there is not too much variance in their values, then assuming that the missing integers are average for their group gives a close estimate. Unfortunately, if the variance is high, or integers can be both positive and negative, there is no worst-case bound on how bad the estimate of the sum can be. Consider what happens if integers can range from minus infinity to plus infinity, and the oldest group consists of fifty large negative numbers followed by fifty large positive numbers, such that the sum for the group is 0. Then the estimated contribution of that group, when only half of it is in the window, is zero, but in fact its true contribution is very large, perhaps much larger than the sum of all the integers that followed it in the stream.

One can modify this compression approach in various ways. For example, we can increase the size of the groups to reduce the amount of space taken by the representation. Doing so increases the error in the estimate, however. In the next section, we shall see how to get a bounded error rate, while still getting significant compression, for the binary version of this problem, where stream elements are either 0 or 1. The same method extends to streams of positive integers with an upper bound, if we treat each position in the binary representation of the integers as a bit stream (see Exercise 23.5.3).
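To make the bookkeeping concrete, here is a minimal Python sketch of the grouped-sum approximation just described, assuming a tuple-based window whose length is a multiple of the group size. The class and its names are invented for illustration; they are not part of the text.

    from collections import deque

    class GroupedSumWindow:
        """Approximate the sum of a tuple-based sliding window of length
        window_len by storing only one sum per group of group_size
        consecutive arrivals (illustrative sketch of the scheme above)."""

        def __init__(self, window_len, group_size=100):
            assert window_len % group_size == 0
            self.window_len = window_len
            self.group_size = group_size
            self.k = window_len // group_size   # groups that exactly tile the window
            self.groups = deque()               # group sums, oldest first
            self.arrived = 0                    # total elements seen so far

        def add(self, x):
            if self.arrived % self.group_size == 0:
                self.groups.append(0)           # start a new group
            self.groups[-1] += x
            self.arrived += 1
            r = self.arrived % self.group_size  # elements in the newest group
            # Keep k groups when the newest is complete, k+1 when it is partial.
            limit = self.k + (1 if r else 0)
            while len(self.groups) > limit:
                self.groups.popleft()           # oldest group fell out entirely

        def estimate(self):
            if self.arrived <= self.window_len:
                return sum(self.groups)         # window not yet full: exact
            r = self.arrived % self.group_size
            if r == 0:
                return sum(self.groups)         # group boundaries align: exact
            # The oldest stored group is only partially in the window; assume
            # its r expired elements were average for that group.
            frac = (self.group_size - r) / self.group_size
            return frac * self.groups[0] + sum(list(self.groups)[1:])

As the text observes, the quality of the estimate rests entirely on the assumption that the expired elements of the oldest group are about average for that group; nothing in this sketch bounds the error.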
23.5.2
Counting Bits
In this section, we shall examine the following problem. Assume that the length of the sliding window is N, and the stream consists of bits, 0 or 1. We assume that the stream began at some time in the past, and we associate a time with each arriving bit that is its position in the stream; i.e., the first bit to arrive is at time 1, the next at time 2, and so on. Our queries, which may be asked at any time, are of the form "how many 1's are there in the most recent k bits?" where k is any integer between 1 and
N. Obviously, if we stored the window with no compression, we could answer any such query exactly, although we would have to sum the last k bits to do so. Since k could be very large, the time needed to answer queries could itself be large. Suppose, however, that along with the bits themselves we stored the sums of certain groups of consecutive bits, say groups of size 2, 4, 8, and so on. We could then decrease the time needed to answer the queries exactly to O(log N). However, storing these group sums in addition to the bits would require even more space than we use for the window elements themselves.

An attractive alternative is to keep an amount of information about the window that is logarithmic in N, and yet be able to answer any query of the type described above with a fractional error that is as low as we like. Formally, for any ε > 0, we can produce an estimate that is in the range 1 - ε to 1 + ε times the true result. We shall give the method for ε = 1/2, and we leave the generalization to any ε > 0 as an exercise with hints (see Exercise 23.5.4).

Buckets

To describe the algorithm for approximate counting of 1's, we need to define a bucket of size m; it is a section of the window that contains exactly m 1's. The window will be partitioned completely into such buckets, except possibly for some 0's that are not part of any bucket. Thus, we can represent any such bucket by (m, t), where m is the size of the bucket and t is the time of the most recent 1 belonging to that bucket. There are a number of rules that we shall follow in determining the buckets that represent the current window:

1. The size of every bucket is a power of 2.

2. As we look back in time, the sizes of the buckets never decrease.

3. For m = 1, 2, 4, 8, ... up to some largest-size bucket, there are one or two buckets of each size, never zero and never more than two.

4. Each bucket begins somewhere within the current window, although the last (largest) bucket may be partially outside the window.

Figure 23.13 suggests what a window partitioned into buckets might look like.

Figure 23.13: Bucketizing a sliding window. (The original figure shows a window of length N divided into buckets of sizes 16, 8, 8, 4, 4, 2, 1, 1, oldest first, with the size-16 bucket partially outside the window and two buckets of length 1 at the most recent end.)

Representing Buckets

We shall see that under these assumptions, a bucket can be represented by O(log N) bits. Further, there are at most O(log N) buckets that must be represented. Thus, a window of length N can be represented in space O(log² N), rather than O(N), bits. To see why only O(log² N) bits are needed, observe the following:

A bucket (m, t) can be represented in O(log N) bits. First, m, the size of a bucket, can never get above N. Moreover, m is always a power of 2, so
we don't have to represent m itself; rather we can represent log2 m. That requires O(log log N) bits. However, we also need to represent t, the time of the most recent 1 in the bucket. In principle, t can be an arbitrarily large integer, but it is sufficient to represent t modulo N, since we know t has to be within the window of length N. Thus, O(log N) bits suffice to represent both m and t. So that we can know the time of newly arriving 1's, we also maintain the current time, represented modulo N as well, so O(log N) bits suffice for this count.

There can be only O(log N) buckets. The sum of the sizes of the buckets is at most N, and there can be at most two of any size. If there are more than 2 + 2 log2 N buckets, then the largest one is of size at least 2 × 2^(log2 N), which is 2N. There must be a smaller bucket, of half that size, so the supposed largest bucket is certainly completely outside the window.

Answering Queries Approximately, Using Buckets

Notice that we can answer a query to count the 1's in the most recent k bits approximately, as follows. Find the least recent bucket B whose most recent bit arrived within the last k time units. All later buckets are entirely within the range of k time units. We know exactly how many 1's are in each of these buckets; it is their size. The bucket B is partially in the query's range and partially outside it. We cannot tell how much is in and how much is out, so we choose half its size as the best guess.

Example 23.17: Suppose k = N and the window is represented by the buckets of Fig. 23.13. We see two buckets of size 1 and one of size 2, which implies four 1's. Then, there are two buckets of size 4, giving another eight 1's, and two buckets of size 8, implying another sixteen 1's. Finally, the last bucket, of size 16, is partially in the window, so we add another 8 to the estimate. The approximate answer is thus 2×1 + 1×2 + 2×4 + 2×8 + 8 = 36.
Maintaining Buckets

There are two reasons the buckets change as new bits arrive. The first is easy to handle: if a new bit arrives, and the most recent bit of the last (oldest) bucket is now more than N time units earlier than the time of the arriving bit, then we can drop that bucket from the representation. Such a bucket can never be part of the answer to any query.

Now, suppose a new bit arrives. If the bit is a 0, there are no changes, except possibly the deletion of the last bucket as mentioned above. Suppose the new bit is a 1. We create a new bucket of size 1 representing just that bit. However, we may now have three buckets of size 1, which violates the rule that there can be only one or two buckets of each size. Thus, we enter a recursive combining-buckets phase. Suppose we have three consecutive buckets of size m, say (m, t1), (m, t2), and (m, t3), where t1 < t2 < t3. We combine the two least recent of the buckets, (m, t1) and (m, t2), into one bucket of size 2m. The time of the most recent bit for the combined bucket is the time of the most recent bit of the more recent of the two combined buckets. That is, (m, t1) and (m, t2) are replaced by a bucket (2m, t2). This combination may cause there to be three consecutive buckets of size 2m, if there were two of that size previously. Thus, we apply the combination algorithm recursively, with the size now 2m. It can take no more than O(log N) time to do all the necessary combinations.

Example 23.18: Suppose we have the list of bucket sizes implied by Fig. 23.13, that is, 16, 8, 8, 4, 4, 2, 1, 1. If a 1 arrives, we have three buckets of size 1, so we combine the two earlier 1's to get the list 16, 8, 8, 4, 4, 2, 2, 1. As this combination gives us only two buckets of size 2, no recursive combining is needed. If another 1 arrives, no combining at all is needed, and we get the sequence of bucket sizes 16, 8, 8, 4, 4, 2, 2, 1, 1. When the next 1 arrives, we must combine buckets of size 1, leaving 16, 8, 8, 4, 4, 2, 2, 2, 1. Now we have three 2's, so we recursively combine the two least recent of them, leaving 16, 8, 8, 4, 4, 4, 2, 1. Now there are three 4's, and the two least recent of them are combined to give 16, 8, 8, 8, 4, 2, 1. Again, we must combine the two least recent of the three 8's, giving us the final list of bucket sizes 16, 16, 8, 4, 2, 1.

A Bound on the Error

Suppose that in answer to a query, the last bucket whose represented 1's are in the range of the query has size m. Since we estimate m/2 for its contribution to the count, we cannot be off by more than m/2. The correct answer is at least the sum of the sizes of all the smaller buckets, and there is at least one bucket of each size m/2, m/4, m/8, ..., 1. This sum is m - 1. Thus, the fractional error is at most (m/2)/(m - 1), or approximately 50%. In fact, if we look more carefully, 50% is an exact upper bound. The reason is that when we underestimate (i.e., all m 1's from the last bucket are in the query range), the error is no more than 1/3.
When we overestimate, we can really only overestimate by (m/2) - 1, not m/2, since we know that at least one 1 from the last bucket contributes to the query. Since (m/2) - 1 is less than half of m - 1, the error is truly upper-bounded by 50%.
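The following Python sketch pulls the pieces of this section together: maintaining buckets as bits arrive and estimating the number of 1's among the last k bits. It is only an illustration of the scheme described above; the class and method names are invented, and timestamps are kept as absolute positions rather than modulo N for simplicity.

    class BitCounter:
        """Approximate count of 1's in the last k bits of a bit stream,
        keeping at most two buckets of each power-of-2 size
        (illustrative sketch of the method in this section)."""

        def __init__(self, N):
            self.N = N               # window length
            self.t = 0               # current time (position of the last arrival)
            self.buckets = []        # (size, time of most recent 1), oldest first

        def add(self, bit):
            self.t += 1
            # Drop the oldest bucket once its most recent 1 leaves the window.
            while self.buckets and self.buckets[0][1] <= self.t - self.N:
                self.buckets.pop(0)
            if bit == 0:
                return
            self.buckets.append((1, self.t))     # new bucket of size 1
            # Recursively combine whenever three buckets share a size.
            size = 1
            while True:
                same = [i for i, (m, _) in enumerate(self.buckets) if m == size]
                if len(same) < 3:
                    break
                i, j = same[0], same[1]          # two least recent, adjacent in the list
                merged = (2 * size, self.buckets[j][1])
                self.buckets[i:j + 1] = [merged]
                size *= 2

        def estimate(self, k):
            """Estimate the number of 1's among the last k bits (k <= N)."""
            total, partial_seen = 0, False
            for m, time in self.buckets:         # oldest first
                if time > self.t - k:            # most recent 1 is within range
                    if not partial_seen:
                        partial_seen = True
                        total += m // 2          # partially covered bucket: guess half
                    else:
                        total += m               # fully covered bucket
            return total

Because there are at most two buckets of each power-of-2 size, the list holds O(log N) buckets, in line with the space analysis given earlier.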
23.5.3
Counting Distinct Elements
We now turn to another important problem: counting the distinct elements in a stream, or in a window on a stream. The problem has a number of applications, such as the following:

1. The popularity of a Web site is often measured by "unique visitors per month" or similar statistics. Think of the logins at a site like Yahoo! as a stream. Using a window of size one month, we want to know how many different logins there are.

2. Suppose a crawler is examining sites. We can think of the words encountered on the pages as forming a stream. If a site is legitimate, the number of distinct words will fall in a range that is neither too high (few repetitions of words) nor too low (excessive repetition of words). Falling outside that range suggests that the site could be artificial, e.g., a spam site.

To get an exact answer to the question, we could store the entire window and apply the duplicate-elimination operator δ to it, in order to find the distinct elements. However, we don't want to see the distinct elements; we just want to know how many there are. Even getting this count exactly requires that we maintain the window in its entirety, but we can get an approximation to the count by several different methods.

The following technique actually computes the number of distinct elements in the entire stream, rather than in a finite window. However, we can, if we like, restart the process periodically, e.g., once a month to count unique visitors, or each time we visit a new site (to count distinct words). The necessary tools are a number N that is certain to be at least as large as the number of distinct values in the stream, and a hash function h that maps values to log2 N bits. We maintain a number R that is initially 0. As each stream value v arrives, do the following:

1. Compute h(v).

2. Let r be the number of trailing 0's in h(v).

3. If r > R, set R to be r.

Then, the estimate of the number of distinct values seen so far is 2^R. To see why this estimate makes sense, note the following.

a) The probability that h(v) ends in at least i 0's is 2^(-i).

b) If there are m distinct elements in the stream so far, the probability that R ≥ i is 1 - (1 - 2^(-i))^m.
c) If i is much less than log2 m, then this probability is close to 1, and if i is much greater than log2 m, then this probability is close to 0.

d) Thus, R will frequently be near log2 m, and 2^R, our estimate, will frequently be near m.

While the above reasoning is comforting, it is actually inaccurate, to say the least. The reason is that the expected value of 2^R is infinite, or at least it is as large as possible given that N is finite. The intuitive reason is that, for large R, when R increases by 1, the probability of R being that large halves, but the value of 2^R doubles, so each possible value of R contributes the same to the expected value. It is therefore necessary to get around the fact that there will occasionally be a value of R that is so large it biases the estimate of m upwards. While we shall not go into the exact justification, we can avoid this bias as follows:

1. Take many estimates of R, using different hash functions.

2. Group these estimates into small groups and take the median of each group. Doing so eliminates the effect of occasional large R's.

3. Take the average of the medians of the groups.
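Below is a small Python sketch of the whole procedure: several hash functions, the largest tail of 0's seen for each, and the median-within-groups-then-average combination just described (applied to the estimates 2^R, which is one natural reading of steps 1-3). The hash-function family, the number of hash functions, and the group size are illustrative assumptions, not prescribed by the text.

    import random

    def trailing_zeros(x, bits):
        # Number of trailing 0's in the bits-bit binary representation of x.
        if x == 0:
            return bits
        count = 0
        while x % 2 == 0:
            x //= 2
            count += 1
        return count

    def distinct_estimate(stream, num_hashes=30, group_size=5, bits=32):
        """Estimate the number of distinct values in stream; a sketch of the
        method in this section with illustrative parameters."""
        rng = random.Random(42)
        M = 1 << bits
        # Illustrative hash family: h(v) = (a*v + b) mod 2^bits, with a odd.
        hashes = [(rng.randrange(1, M, 2), rng.randrange(M))
                  for _ in range(num_hashes)]
        R = [0] * num_hashes                     # largest tail length per hash
        for v in stream:
            for i, (a, b) in enumerate(hashes):
                r = trailing_zeros((a * hash(v) + b) % M, bits)
                if r > R[i]:
                    R[i] = r
        estimates = [2 ** r for r in R]
        # Median within each small group, then the average of the medians.
        medians = []
        for g in range(0, num_hashes, group_size):
            group = sorted(estimates[g:g + group_size])
            medians.append(group[len(group) // 2])
        return sum(medians) / len(medians)

    # Example: the stream below has 1000 distinct values.
    print(distinct_estimate(i % 1000 for i in range(10000)))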
23.5.4
Exercises for Section 23.5
Exercise 23.5.1: Starting with the window of Fig. 23.13, suppose that the next ten bits to arrive are all 1's. What will be the sequence of buckets at that time?

Exercise 23.5.2: What buckets are used in Fig. 23.13 to answer queries of the form "how many 1's are in the most recent k bits?" if k is (a) 10 (b) 15 (c) 20? What are the estimates for each of these queries? How close are the estimates?

! Exercise 23.5.3: Suppose that we have a stream of integers in the range 0 to 1023. How can you adapt the method of Section 23.5.2 to estimate the sum of the integers in a window of size N, keeping the error to 50%? Hint: treat each of the ten bits that represent an integer as a separate stream.

! Exercise 23.5.4: We can modify the algorithm of Section 23.5.2 to use buckets whose sizes are powers of 2, but where there are between p and p + 1 buckets of each size, for a chosen integer p > 1. As before, sizes do not decrease as we go further back in time.

a) Give the recursive rule for combining buckets when there are too many buckets of a given size.

b) Show that the fractional error of this scheme is at most 1/(2p).
Exercise 23.5.5: Suppose that we wish to estimate the number of distinct values in a stream of integers. The integers are in the range 0 to 1023. We'll use the following hash functions, each of which hashes to a 9-bit integer:

a) h1(v) = v modulo 512.

b) h2(v) = (v + 159) modulo 512.

c) h3(v) = (v + 341) modulo 512.

Compute the estimate of the number of distinct values in the following stream, using each of these hash functions:

24, 45, 102, 24, 78, 222, 45, 24, 670, 78, 999, 576, 222, 24

Exercise 23.5.6: In Example 23.11 we observed that if all we wanted was the maximum of N temperature readings in a sliding window of time-temperature tuples, then when a reading of t arrives, we can immediately delete any earlier reading that is smaller than t.

! a) Does this rule always compress the data in the window?

!! b) Suppose temperatures are real numbers chosen uniformly and at random from some fixed range of values. On average, how many tuples will be retained, as a function of N?
23.6
Summary of Chapter 23
Search Engines: A search engine requires a crawler to gather information about pages and a query engine to answer search queries.

Crawlers: A crawler consists of one or more processes that visit Web pages and follow links found in those pages. The crawler must maintain a repository of pages already visited, so it does not revisit the same page too frequently. Shingling and minhashing can be used to detect duplicate pages with different URLs.

Limiting the Crawl: Crawlers normally limit the depth to which they will search, declining to follow links from pages that are too far from their root page or pages. They also can prioritize the search to visit preferentially pages that are estimated to be popular.

Preparing Crawled Pages to Be Searched: The search engine creates an inverted index on the words of the crawled pages. The index may also include information about the role of the word (e.g., is it part of a header?), and the index for each word may be represented by a bit-vector indicating on which pages the word appears.
Answering Search Queries: A search query normally consists of a set of words. The query engine uses the inverted index to find the Web pages containing all these words. The pages are then ranked, using a formula that is determined by each search engine but that typically favors pages with close occurrences of the words, use of the words in important places (e.g., headers), and important pages as measured by, for example, PageRank.

The Transition Matrix of the Web: This matrix is an important analytic tool for estimating the importance of Web pages. There is a row and column for each page, and the column for page j has 1/r in the ith row if page i is one of the r pages with links from page j, and 0 otherwise.

PageRank: The PageRank of Web pages is the principal eigenvector of the transition matrix of the Web. If there are n pages, we can compute the PageRank vector by starting with a vector of length n and repeatedly multiplying the current vector by the transition matrix of the Web.

Taxation of PageRank: Because of Web artifacts such as dead ends (pages without out-links) and spider traps (sections of the Web that cannot be exited), it is normal to introduce a small tax, say 15%, and redistribute that fraction of each page's PageRank equally among all pages after each matrix-vector multiplication.

Teleport Sets: Instead of redistributing the tax equally among all pages during an iteration of the PageRank computation, we can distribute the tax only among a subset of the pages, called the teleport set. Then, the computation of PageRank simulates a walker on the graph of the Web who normally follows a randomly chosen out-link from their current page, but who with a small probability instead jumps to a random member of the teleport set.

Topic-Specific PageRank: One application of the teleport-set idea is to pick a teleport set consisting of pages known to be about a certain topic. Then, PageRank measures not only the importance of a page in general, but also the extent to which it is relevant to the selected topic.

Link Spam: Spam farmers create large collections of Web pages whose sole purpose is to increase the PageRank of certain target pages, and thus make them more likely to be displayed by a search engine. One way to combat such spam farms is to compute PageRank using a teleport set consisting of known, trusted pages, those that are unlikely to be spam.

Data Streams: A data stream is a sequence of tuples arriving at a fixed place, typically at a rate so fast as to make processing and storage in its entirety difficult. Examples include streams of data from satellites and click streams of requests at a Web site.
Data-Stream Management Systems: A DSMS accepts data in the form of streams. It maintains working storage and permanent (archival) storage. Working storage is limited, although it may involve disks. The DSMS accepts both ad-hoc and standing queries about the streams.

Sliding Windows: To query a stream, it helps to be able to talk about portions of the stream as a relation. A sliding window is the most recent portion of the stream. A window can be time-based, in which case it consists of all tuples arriving over some fixed time interval, or tuple-based, in which case it is a fixed number of the most recently arrived tuples.

Compressing Windows: If the DSMS must maintain large windows on many streams, it can run out of main memory, or even disk space. Depending on the family of queries that will be asked about the window, it may be possible to compress the window so it uses significantly less space. However, in many cases, we can compress a window only if we are willing to accept approximate answers to queries.

Counting Bits: A fundamental problem that allows a space/accuracy trade-off is that of counting the number of 1's in a window of a bit stream. We partition the window into buckets representing exponentially increasing numbers of 1's. The last bucket may be partially outside the window, leading to inaccuracy in the count of 1's, but the error is limited to a fixed fraction ε of the count, where ε can be any chosen value greater than 0.

Counting Distinct Elements: Another important stream problem is counting the number of distinct elements in the stream without keeping a table of all the distinct elements ever seen. An estimate of this number can be made by picking a hash function, hashing elements to bit strings, and estimating the number of distinct elements to be 2 raised to the power that is the largest number of consecutive 0's ever seen at the end of the hash value of any stream element.
23.7
References for Chapter 23
References [3] and [8] summarize issues in crawling, based on the Stanford WebBase system. An analysis of the degree to which crawlers reach the entire Web was given in [15]. PageRank and the Google search engine are described in [6] and [16]. An alternative formulation of Web structure, often referred to as "hubs and authorities," is in [14]. Topic-specific PageRank, as described here, is from [12]. TrustRank and combating link spam are discussed in [11]. Two on-line histories of search engines are [17] and [18].

The study of data streams as a data model can be said to begin with the chronicle data model of [13]. References [7] and [2] describe the architecture
of early data-stream management systems. Reference [5] surveys data-stream systems. The algorithm described here for approximate counting of 1's in a sliding window is from [9]. The problem of estimating the number of distinct elements in a stream originated with [10] and [4]. The method described here is from [1], which also generalizes the technique to estimate higher moments of the data, e.g., the sum of the squares of the numbers of occurrences of the elements.

1. N. Alon, Y. Matias, and M. Szegedy, "The space complexity of approximating frequency moments," Twenty-Eighth ACM Symp. on Theory of Computing (1996), pp. 20-29.

2. A. Arasu, S. Babu, and J. Widom, "The CQL continuous query language: semantic foundations and query execution," http://dbpubs.Stanford.edu/pub/2003-67, Dept. of Computer Science, Stanford Univ., Stanford CA, 2003.

3. A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the Web," ACM Trans. on Internet Technologies 1:1 (2001), pp. 2-43.

4. M. M. Astrahan, M. Schkolnick, and K.-Y. Whang, "Approximating the number of unique values of an attribute without sorting," Information Systems 12:1 (1987), pp. 11-15.

5. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems," Twenty-First ACM Symp. on Principles of Database Systems (2002), pp. 261-272.

6. S. Brin and L. Page, "Anatomy of a large-scale hypertextual Web search engine," Proc. Seventh Intl. World-Wide Web Conference (1998).

7. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik, "Monitoring streams: a new class of data management applications," Proc. Intl. Conf. on Very Large Database Systems (2002), pp. 215-226.

8. J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley, "Stanford WebBase components and applications," ACM Trans. on Internet Technologies 6:2 (2006), pp. 153-186.

9. M. Datar, A. Gionis, P. Indyk, and R. Motwani, "Maintaining stream statistics over sliding windows," SIAM J. Computing 31 (2002), pp. 1794-1813.

10. P. Flajolet and G. N. Martin, "Probabilistic counting for database applications," J. Computer and System Sciences 31:2 (1985), pp. 182-209.
11. Z. Gyongyi, H. Garcia-Molina, and J. Pedersen, "Combating Web spam with TrustRank," Proc. Intl. Conf. on Very Large Database Systems (2004), pp. 576-587.

12. T. Haveliwala, "Topic-sensitive PageRank," Proc. Eleventh Intl. World-Wide Web Conference (2002).

13. H. V. Jagadish, I. S. Mumick, and A. Silberschatz, "View maintenance issues for the chronicle data model," Fourteenth ACM Symp. on Principles of Database Systems (1995), pp. 113-124.

14. J. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM 46:5 (1999), pp. 604-632.

15. S. Lawrence and C. L. Giles, "Searching the World-Wide Web," Science 280(5360):98 (1998).

16. L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: bringing order to the Web," unpublished manuscript, Dept. of CS, Stanford Univ., Stanford CA, 1998.

17. L. Underwood, "A brief history of search engines," www.webreference.com/authoring/search.history

18. A. Wall, "Search engine history," www.searchenginehistory.com
Index
A
Abiteboul, S. 12, 515 Abort 852 See also Rollback Abstract query plan See Logical query plan Achilles, A.-C. 12 ACID properties 9 See also Atomicity, Consistency, Durability, Isolation ACR schedule 957-958 See also Cascading rollback Action 332-333, 335, 889 Acyclic hypergraph 1003-1007 ADA 378 AD D 33, 326 Addition rule 84 Address See Database address, Forward ing address, Logical address, Memory address, Physical address, Structured address, Virtual memory Adornment 1057, 1059, 1061-1062 After-trigger 334 Agent See SQL agent Agglomerative clustering 1123,11281130 Aggregation 172,177-178,181, 213215, 283-285, 287-288,540, 714, 726, 733-734, 777-779, 802, 990 See also Average, Count, Data cube, GROUP BY, Maximum, 1183
Minimum, Sum Agrawal, R. 1139 Agrawal, S. 367 Aho, A. V. 122 Algebra 38 See also Relational algebra Algebraic law 768 See also Associative law, Com mutative law, Idempotence, Representability Alias See AS ALL 271,282-283 Alon, N. 1180 ALTER TABLE 33, 326 Ancestor 522 And 254-255 Anomaly 67 See also Deletion anomaly, Re dundancy, Update anomaly ANSI 243 Antisemijoin 58 ANY 271 Application server 370-372 a p p ly -te m p la te s 547 A-Priori Algorithm 1102-1104 Arasu, A. 1180 Archive 844, 875-879 Arithmetic atom 223 Armstrong, W. W. 122 Armstrongs axioms 81 See also Augmentation rule, Reflexivity rule, Transitive rule Array 188, 196, 418 AS 247 Assertion 328-331
1184 Assignment statement 393-394 Association 172-175, 179 Association class 172, 175 Association rule 1093, 1097-1098 Associative array 418 Associative law 212, 768-769, 790791, 1083 Astrahan, M. M. 13, 309, 841, 1180 Atom 223, 760 Atomicity 2, 9, 298-299, 847, 10081009 Attribute 22-23, 126-127, 134, 144, 172,184-185,194 198,260, 343, 445, 490-492,499-502, 506-507, 518, 521 See also Input attribute, Out put attribute Attribute-based check 320-321,323, 331 Augmentation rule 81, 83-84 Authorization 425-436 Authorization ID 425 Autoadmin 367-368 Automatic swizzling 598-599 Average 214, 284 Avoid cascading rollback See Cascading rollback Axford, S. J. 1091 Axis 517, 521-522
B
INDEX Battleships database 37, 55-57, 528529 Bayer, R. 698 BCNF 88-92, 111, 113 Beekmann, N. 698 Beeri, C. 122-123 Before-trigger 334 BEGIN 394 Benjelloun, O. 1091 Bentley, J. L. 698 699 Berenson, H. 309 Bernstein, P. A. 123, 309, 881, 951, 1034 BFR Algorithm 1132-1136 Binary large object See BLOB Binary number 691 Binary operation 711, 830-834,991992 Binary relationship 129-130, 134135, 172 Binding parameters 411 Biskup, J. 123 Bit See Commit bit, Counting bits, Parity bit Bit string 30, 250 Bitmap 1106 Bitmap index 688-695 Blasgen, M. W. 757, 881 BLOB 608-609 Block See Disk block Block address See Database address Block header 592, 595, 614 Block-based nested-loop join 719722 Body 224 Boolean 30, 188, 533 See also Condition Bottom-up enumeration 810-811 Bound adornment See Adornment
Babcock, B. 1180 Babu, S. 1180 Baeza-Yates, R. 698 Bag 188-189,196,205-212,228-230, 770 Balakrishnan, H. 1034 Balanced tree 634 Bancilhon, F. 202, 241 Band 1119 Barghouti, N. S. 983 Basis 80 Batini, Carlo 202 Batini, Carol 202
INDEX Boyce-Codd normal form See BCNF Bradley, P. S. 1132, 1139 Bradstock, D. 423 Branch-and-bound 811-812 Branching 540-541, 551 See also ELSE, ELSEIF, IF Brin, S. 1180-1181 Broder, A. Z. 1139 Bruce, J. 423 B-tree 633-647, 661, 927-928, 963 Bucket 626-627,630,666, 668,11721174 See also Frequent bucket, Indi rect bucket Buffer 573, 705, 712, 723-724, 746751, 848-849, 855 Buffer manager 7, 746-751,818, 852, 883 Build relation 815 Buneman, P. 515 Burkhard, W. A. 698 Bushy tree 816 C C 378 Cache 558 Call statement 393, 402 Call-level interface 369, 379, 404405 See also CLI Candidate key 72 Candidate set 1103, 1121 Capabilities specification 1057-1058 Capability-based plan selection 1056 1060 Carney, D. 1180 Cartesian product See Product Cascade 314 315 Cascade policy 433-436 Cascading rollback 955-957 Case sensitivity 248, 530 Catalog 373-375 Cattell, R. G. G. 202, 618
CDATA
1185 499 Celko, J. 309 Centralized locking 1015 Ceri, S. 202, 340, 700, 983 Cetintemel, U. 1180 Chain algorithm 1061-1068 Chamberlin, D. D. 309, 554, 841 Chandra, A. K. 1091 Chang, P. Y. 841 Chang, Y.-M. 423 Character set 375 Character string See String Charikar, M. 1139 Chase 96-100, 115-119 Chaudhuri, S. 367, 757
CHECK
See Assertion, Attribute-based check, Tuple-based check Check-in-check-out 976 Checkpoint 857-861, 866-868, 872873 Checksum 576-577 Chen, M.-S. 1105,1139 Chen, P. M. 618 Chen, P. P. 202 Cherniack, M. 1180 Child 521 Cho, J. 1180 Choice 505-506 Chord circle 1022-1031 Chou, H.-T. 757 Class 172, 179, 184, 188, 193-194, 451 CLI 369, 405-412 Click stream 1163 Client 375-376, 593 Clock algorithm 748-749 CLOSE 385, 707 Closed set of attributes 84 Closing tag 488 Closure, of attributes 75-79 Closure, of FD sets 80-81, 115 Cluster 374 Clustered file 625-626
1186 Clustering 715, 739-741,1087,11231136 Cobol 378 Cochrane, R. J. 340 CODASYL 3 Codd, E. F. 3, 65, 123, 241 Code See Error-correcting code Cohesion 1128-1129 Collaborative filtering 1095, 1111, 1123-1124 Collation 375 Collection type 189 See also Array, Bag, Dictionary, List, Set Column store 609-610 Combining rule 73-74 Comer, D. 698 Commit 300, 852 See also Group commit, Twophase commit Commit bit 934 Communication heterogeneity 1040 Commutative law 212, 768-769,790791, 1083 Comparison 461-463, 523-524,537538 See also Lexicographic order Compatibility matrix 907 Compensating transaction 979-981 Complementation rule 109-110 Complete subclasses 176, 180 Complex type 503-506 Composition 172, 178, 181 Compressed bitmap 691-693 Compressed set 1133 Compression See Data compression Concurrency See Locking, Scheduler, Serial izability, Timestamp, Trans action, Validation Concurrency control 7-8, 883, 978 See also Optimistic concurrency control
INDEX Condition 332-334, 523-525 See also Boolean, Selection, Thetajoin, WHERE Confidence 1097 Conflict 890-892 Conflict-serializability 890-895 Conjunctive query 1070 Connecting entity set 135, 145 Connection 376-377, 405, 412-413, 419, 427-428 Consistency 9 , 898, 906 Consistent state 847 Constant 38-39 Constraint 18-19, 58-62, 148, 151, 311-331 See also CHECK, Dependency, Do main constraint, Key, Trig ger Constraint modification 325-327 Containment 59 Containment mapping 1073-1074 Containment, of conjunctive queries 1070, 1073-1077 Containment, of value sets 797-798 Convey, C. 1180 Copyright 1021 Correctness principle 847-848 Corrolated subquery 273-274 Cosine distance 1126 Cost-based plan selection 803-812, 1060 Count 214, 284-285, 287 Counting bits 1171-1174 Counting distinct elements 1174-1176 Crash See Media failure Crawler See Web crawler CREATE 328-329, 333, 341, 351, 451, 462 CREATE TABLE 30-36, 313, 391, 454 CROSS JOIN 275-276 Cross product See Product Current instance 24
INDEX Curse of dimensionality 1127 Cursor 383-387, 396, 415, 419-420 Cylinder 562, 568-570
D
1187 Davidson, S. 515 Dayal, U. 123, 340 DBMS 1-10 DDL See Data-definition language Dead end 1150-1153 Deadlock 9 , 903, 966-974, 1018 Dean, J. 1034 Decision-support query 464 Declaration 393 See also CREATE TABLE DECLARE 381, 397 Decomposition 86-87 Default value 34 Deferrable constraint 316 317 Deferred checking 315-317 Deletion 292-294,426, 614, 631, 642645, 647, 650-651, 694 Deletion anomaly 86 Delobel, C. 123, 202 Dense index 621-622, 624, 637 Dependency See Constraint, Functional de pendency, Multivalued de pendency Dependency preservation 93, 100101, 113 DERIVED 455-456 Descendant 522 Description record 405 Design 140-145, 169 See also Model, Normalization DeWitt, D. J. 757-758 Diaz, O. 340 Dicing 469 472 Dictionary 188, 196 Difference 39-40, 5 0, 207-208, 231, 265-266,268, 282-283,716717, 722, 727, 731, 734, 737, 771, 801, 990 Digital library 1021 Dimension See Curse of dimensionality, Eu clidean space Dimension table 467-469
Dangling tuple 219-220, 315, 1001 Darwen, H. 309 Data compression 610-611 D ata cube 425, 466-467, 473-477 Data disk 579 D ata file 620 D ata mining 1093-1136, 1169-1176 Data model 17-18 See also Model Data region 683 Data replication See Replication Data source See Source Data stream 1161-1176 Data type 1041 See also UDT Data warehouse 5, 465 See also Warehouse Database 1 Database address 594, 601 Database administrator 5 Database management system See DBMS Database schema 373-375 See also Relational database schema Database server 370, 372 Database state See State Data-definition language 1, 5, 29 See also ODL, Schema Datalog 205, 222-238, 439, 10611062 See also Conjunctive query Data-manipulation language 2, 29 Datar, M. 1180 Data-stream management system 1161 1163 Date 31, 251-252 Date, C. J. 309
1188 Dirty data 302 304, 935-937, 954 955 Discard set 1132 DISCONNECT 377 Disjoint subclasses 176, 180 Disk 562-589 See also Floppy disk, Shared disk Disk access 564-566 Disk assembly 562 Disk block 7, 352-353, 560, 592594, 634, 649, 706, 847 See also Database address, Over flow block, Pinned block Disk controller 564, 570 Disk crash See Media failure Disk head 563 See also Head assembly Disk I/O 568-569, 645-646, 1098 1099 Disk scheduling 571-573 Disk striping See RAID, Striping Distance measure 1125-1127 See also Cosine distance, Edit distance, Jaccard distance DISTINCT See Duplicate elimination Distinct elements See Counting distinct elements Distributed commit See Two-phase commit Distributed database 997-1019 See also Peer-to-peer network Distributed hashing 1021-1031 Distributed locking 1014-1019 Distributed transaction 998-999 Distributive law 212-213 DML See Data-manipulation language Document 488, 499, 502-503, 518519, 1111, 1124 Document retrieval 628 631
INDEX Document type definition See DTD DOM 515 Domain 23, 375 Domain constraint 61 Domain relational calculus See Relational calculus Dominance relation 1086 Double buffering 573 Drill-down 471 Driver 412 DROP 33, 326, 330, 345 DROP TABLE 33 DSMS See Data-stream management system DTD 489, 495-502 Duplicate elimination 213-214,281284,538-539,712-713,722, 725, 731-733, 737, 777, 789790, 802, 990 See also DISTINCT Durability 2, 7, 9 DVD See Optical disk Dynamic hash table 651-652 Dynamic hashing See Extensible hashing, Linear hashing Dynamic programming 811-812,819824 Dynamic SQL 388-389 E Ear 1004 Edit distance 1080, 1127 Element 488,490,496-497, 503-504, 518, 846, 849 See also Node Elevator algorithm 571-573 Ellis, J. 423 ELSE 394 ELSEIF 394 Embedded SQL 378-389 Empty element 496
INDEX Empty set 59 Empty string 533 Encryption 611 END-394, 396 Entity 126 Entity resolution 1078-1087, 1117 1118 Entity set 126-127, 144, 157, 172 See also Connecting entity set, Supporting entity set, Weak entity set Entity/relationship model See E /R model Enumeration 184-185,188, 508-509 See also Bottom-up enumera tion, Top-down enumera tion Environment 372-374, 405 Equal-height histogram 804 Equal-width histogram 804 Equijoin 790 Equivalence, of FD s 73 E /R diagram 127-128 E /R model 125-171 Error-correcting code 589 See also Hamming code Escape character 252 Eswaran, K. P. 757, 951 Euclidean space 1125 Even parity 576 Event 332-334 Event-condition-action rule See Trigger EXCEPT See Difference Exception 400-402 Exclusive lock 905-907 EXEC SQL 380 Execute (a SQL statement) 389, 407408, 413, 419-421, 426 Execution engine 7 EXISTS 270 Expanding solutions 1071-1073 Expression 38, 51 Expression tree 47-48, 236-237
1189 Extended projection 213, 217-219 Extensible hashing 652-655 Extensible markup language See XML Extensible modeling language See XML Extensible stylesheet language See XSLT Extractor See Wrapper F Fact table 466-467 Fagin, R. 123, 480, 699 Failure See Intermittent failure, Mean time to failure, Media fail ure, Write failure Faithfulness 140-141 Faloutsos, C. 698-700 Fang,M . 1139 Fayyad, U. M. 1132, 1139 FD See Functional dependency FD promotion rule 109 Feasible plan 1058 Federated databases 1041-1042 Fellegi, I. P. 1091 Fetch statement 384, 408-410 Field 509, 590 See also Repeating field, Tagged field FIFO See First-in-first-out File See Clustered file, D ata file, Grid file, Index file, Sequential file File system 2 Filter 811, 827, 1052-1053 Finger table 1024 Finkel, R. A. 699 Finkelstein, S. J. 480 First normal form 103 First-come-first-served 920
1190 First-in-first-out 748 Fisher, M. 423 Flagolet, P. 1180 Floating-point number 31, 188 See also Real number FLWR expression 530-534 For-all 539-540 For-clause 530-533 Foreign key 312-317, 510-512 For-loop 398-400, 549 Fortran 378 Forwarding address 596, 613 4NF 110-113 Free adornment See Adornment Frequent bucket 1106, 1108 Frequent itemset 1093-1109 Friedman, J. H. 698 Frieze, A. M. 1139 FROM 244-246, 259, 274-275 Full outerjoin See Outerjoin Full reducer 1003, 1005-1007 Function 391-392, 402 Functional dependency 67-83 Functional language 530
G
INDEX Global lock 1017-1019 Global-as-view mediator 1069 Goodman, N. 881, 951, 1034 Google 1147 Gotlieb, L. R. 758 Graefe, G. 758, 841 Graham, M. H. 1034 Grammar 761-762 Grant diagram 431-432 Grant statement 375 Granting privileges 430-431 Graph See Hypergraph, Precedence graph, Similarity graph, Waits-for graph Gray, J. N. 309, 618, 881, 951, 983 Greedy algorithm 824-825 Grid computing 1020 Grid file 665-671, 673 Griffiths, P. P. 480 See also Selinger, P. G. GROUP BY 285-289 Group commit 959-960 Group mode 918, 925 Grouping 213, 215-217, 461, 714, 722, 726, 731, 733-734, 737, 777-779, 802, 990 See also GROUP BY Guassian elimination 1150 Gulutzan, P. 309 Gunther, O. 699 Gupta, A. 241, 367, 1092 Guttman, A. 699 Gyongi, Z. 1180
H
Gaede, V. 699 Gallaire, H. 241 Gap 562-563 Garcia-Molina, H. 65, 515, 618,983, 1034,1091-1092,1139,1180 GAV See Global-as-view mediator Generator 460-461 Generic interface 245, 378 Geographic information system 661 662 GetNext 707 Ghemawat, S. 1034 Gibson, G. A. 618 Giles, C. L. 1181 Gionis, A. 1180 Glaser, T. 698
Haas, L. 1092 Haderle, D. J. 881, 983 Hadoop 1034 Hadzilacos, V. 881, 951 Haerder, T. 881 Hall, P. A. V. 841 Hamming code 584, 589 Hamming distance 589 Handle 405-407
INDEX Harinarayan, V. 241, 367 Hash function 650, 989 See also Partitioned hash func tion Hash join 734-735 See also Hybrid hash join Hash key 732 Hash table 648-659, 665, 732-738, 754-755 See also Dynamic hashing, Localitysensitive hashing, Minhash ing, PCY Algorithm Haveliwala, T. 1180 HAVING 288-289 Head 224 Head assembly 562 Head crash See Media failure Header See Block header, Record header Held, G. 13 Hellerstein, J. M. 13 Heterogeneity 1040-1041 Hierarchical clustering See Agglomerative clustering Hierarchical model 3, 21 Hill climbing 812 Hinterberger, H. 699 HiPAC 340 Histogram 804-807 Holt, R. C. 983 Host language 245, 369, 378 Howard, J. H. 123 Hsu, M. 881 HTML 488, 493, 545, 630 Hull, R 12 Hybrid hash join 735-737 Hypergraph 1003 See also Acyclic hypergraph
1191 Idempotence 1083 IDREF 500-502 IF 394 Imielinski, T. 1139 Impedance mismatch 380 IMPLIED 499 Importance, of pages 1144-1147 See also PageRank IN 270-272 Incomplete transaction 856, 864 Increment lock 911-913 Index 7-8, 350-358, 619-695, 739745, 829 See also Bitmap index, B-tree, Clustering index, Dense in dex, Inverted index, Mul tidimensional index, Mul tilevel index, Primary in dex, Secondary index, Sparse index Index file 620 Index scan 704, 740-742 Indirection 626-627 Indyk, P. 1139, 1180 Information integration 4-5,486,10371087 See also Federated databases, Mediator, Warehouse Information retrieval 632 See also Document retrieval Information source See Source INGRES 12 Inheritance See Isa relationship, Subclass INPUT 848 Input attribute 774 Insensitive cursor 388 Insert 461 Insertion 291-293,426, 612,631, 640 642,649-650,653-655,657659,667-669,679, 684-686, 694-695, 925-926 Instance 24, 68, 73, 128-129 Instead-of-trigger 334, 347-349
I
ICAR records 1083-1086 ID 500-502 See also Object-ID, Tuple iden tifier
1192 Integer 30, 188 Intention lock 923-925 Interest 1097 Interior node 485 Interior region 683 Intermittent failure 575-576 Interpretation of text 417-418, 535536 Intersection 39-40, 50, 207-208, 212213, 231, 265, 268, 282283, 716, 722, 727,731, 734, 737, 769, 771, 801, 990 Inverse relationship 186 Inverted index 629-631, 996 Isa relationship 136, 172 See also Subclass Isolation 2, 9 Isolation level 304 See also Read committed, Read uncommitted, Repeatable read Item 518 Iteration See Loop Iterator 707-709, 719, 818-819 See also Pipelining
J K
INDEX
Jaccard distance 1126 Jaccard similarity 1110-1114 Jagadish, H. V. 1181 James, A. P. 1091 JDBC 369, 412-416 Join 39, 43, 50, 210-212, 235-236, 259-260,536-537,829-830, 1000-1007 See also Antisemijoin, CROSS JOIN, Equijoin, Lossless join, Nat ural join, Nested-loop join, Outerjoin, Semijoin, Thetajoin, Zig-zag join Join ordering 814-825 Join selectivity 825 Join tree 815-819 Jonas, J. 1091
Kaashoek, M. 1034 Kaiser, G. E. 983 Kanellakis, P. C. 951 Karger, D. 1034 Katz, R. H. 618, 758 kd -tree 677-681 Kedem, Z. 951 Kennedy, J. M. 1091 Key 25, 34-36, 60-61, 70, 72, 148150, 154, 160, 173, 191192,311,353, 509-510,620, 634 See also Foreign key, Hash key, Primary key, Search key, Sort key, UNIQUE Kim, W. 202 Kitsuregawa, M. 758 Kleinberg, J. 1181 fc-means algorithm 1130-1131 See also BFR Algorithm Knowledge discovery in databases See Data mining Knuth, D. E. 618, 699 Ko, H.-P. 951 Korth, H. F. 951 Kossman, D. 758 Kreps, P. 13 Kriegel, H.-P. 698 Kumar, V. 882, 1140 Kung, H.-T. 951
L
Label 485 Lam, W. 1180 Lampson, B. 618, 1034 Larson, J. A. 1092 Latency 565 See also Rotational latency, Schedul ing latency LAV See Local-as-view mediator Lawrence, S. 1181
INDEX LCS See Longest common subsequence Leaf 484, 634 Least-recently used 748 Lee, S. 1180 Left outerjoin 221, 277 Left-deep join tree 816-819 Legacy database 486, 1038 Legality, of schedules 898, 906 Lerdorf, R. 423 Let-clause 530-531 Levy, A. Y. 1076, 1091 Lewis, P. M. II 984 Lexicographic order 250 Ley, M. 12 Li, C. 1092 Lightstone, S. S. 367 LIKE 250-251 Lindsay, B. G. 882, 983 Linear hashing 655-659 Linear recursion 440 Link spam 1159-1160 List 188-189, 196 Litwin, W. 699 Liu, M. 241 Livny, M. 1140 LMSS Theorem 1076-1078 Local variable 1072 Local-as-view mediator 1069-1078 Locality-sensitive hashing 1112,1116
1122
1193 Logging 7-8, 851-873,876, 878-879, 953-954, 959 See also Logical logging, Redo logging, Undo logging, Undo/ redo logging Logic See Datalog, Relational calcu lus, Three-valued logic Logical address 594-595 Logical logging 960-965 Logical query plan 702, 781-791,808 See also Plan selection Lohman, G. 367 Lomet, D. 367, 618 Long-duration transaction 975-981 Longest common subsequence 1088 Lookup 639,666-667,670,679,1024 1026 Loop 396-400, 549 Lorie, R. A. 841, 951 Lossless join 94-99 Lowell Report 12 Lozano, T. 699 LRU See Least-recently used M MacIntyre, P. 423 Mahalanobis distance 1135 Main memory 558, 561, 705, 747, 845, 1105 Majority locking 1019 Many-many relationship 130-131,186 Many-one relationship 129-131,145, 160, 187 Map 994-995 Map table 594 Map-reduce framework 993-996 Market basket 993, 1094-1096 Martin, G. N. 1180 Materialization 830-831 Materialized view 359-365 Matias, Y. 1180 Mattos, N. 340, 480 Maximum 214, 284
Lock See Global lock, Upgrading locks Lock granularity 921-926 Lock table 918-921 Locking 897-932, 941, 946-948,957959 See also Distributed locking, Ex clusive lock, Increment lock, Intention lock, Shared lock, Strict locking, Update lock Log file 851 Log manager 851 Log record 851-852
1194 m ax ln clu sive 508 McCarthy, D. R. 340 McCreight, E. M. 698 McHugh, J. 515 McJones, P. R. 881 Mean time to failure 579 Media failure 563, 575, 578-579,844, 875 Mediator 1042,1046-1047,1049-1050 See also Global-as-view media tor, Local-as-view media tor Megatron 747 (imaginary disk) 564 Melkanoff, M. A. 123 Melton, J. 309, 423 Memory address 594 Memory hierarchy 557-561 Mendelzon, A. O. 1076, 1091 Merge sort See Two-phase multiway merge sort Merging records See Entity resolution Merlin, P. M. 1091 M etadata 8 See also Schema Method 184, 445, 449, 452-453 See also Generator, M utator Middleware 5 Minhashing 1112-1115, 1121-1122 Minimal basis 80 Minimum 214, 284 m in ln c lu siv e 508 Minker, J. 241 Mirror disk 571, 579-580 Mitzenmacher, M. 1139 Model See D ata stream, E /R model, Hierarchical model, Nested relation, Network model, Ob ject-oriented model, Objectrelational model, ODL, Phys ical data model, Relational model, Semistructured data, UML, XML
IND EX Modification 18, 33, 386-387 See also Constraint modifica tion, Deletion, Insertion, Up datable view, Update Module 378 See also PSM Modulo-2 sum See Parity bit Mohan, C. 882, 983 MOLAP 467 Monotone operator 57 Monotonicity 441-443, 1103 Moores law 561 Morris, R. 1034 Moto-oka, T. 758 Motwani, R. 1139, 1180-1181 Movie database 26-27 Multidimensional index 661-686 See also Grid file, kd-tree, Multi ple-key index, Partitioned hash function, Quad tree, R-tree Multidimensional OLAP See MOLAP Multilevel index 623 See also B-tree Multipass algorithm 752-755 Multiple-key index 675-677 Multiset See Bag Multistage Algorithm 1107-1109 Multivalued dependency 67,105-120 Multi version timetamp 939-941 Multiway merge-sort See Two-phase, multiway mergesort Multiway relationship 130-131,134135, 145 Mumick, I. S. 367, 480, 1181 Mumps 378 M utator 460-461 Mutual recursion 440 MVD See Multivalued dependency
INDEX N Nadeau, T. 367 Namespace 493, 533, 544 NaN 533 Narasaya, V. R. 367 Natural join 43-45, 96, 212, 276277, 717, 722,728-731,734737, 742-745,768, 771-772, 775-777, 790-791, 797-801, 990-991 See also Lossless join Navathe, S. B. 202 Nearest-neighbor query 662, 664, 671, 677 Negation 254-255 Nested relation 446-448 Nested-loop join 718-722 Network model 3, 21 Newcombe, H. B. 1091 Nicolas, J.-M. 65 Nievergelt, J. 698-699 Node 484, 518-519 See also Element Nonquiescent archive 875-878 See also Archive Nonquiescent checkpoint 858-861 See also Checkpoint Nontrivial FD See Trivial FD Nontrivial MVD See Trivial MVD Nonvolatile storage See Volatile storage Norm See Distance measure, Euclidean space Normalization 67, 85-92 Not-null constraint 319-320 Null value 33-35,168, 252-254, 287288, 475, 605 See also Not-null constraint, Setnull policy Numeric array 418 O
1195
Object 126, 167-168, 449 Object description language See ODL Object-ID 449, 455-456 See also Tuple identifier Object-oriented model 21, 449-450 See also Object-relational model, ODL Object-relational model 20, 445-463 ODBC See CLI Odd parity 576 577 ODL 126, 183-198 Offset table 595, 612-613 OID See Object-ID OLAP 425, 464-477, 610 Olken, F. 758 OLTP 465 ONeil, E. 309 ONeil, P. 309, 699 One-one relationship 129 131, 172, 187 One-pass algorithm 709-717, 829 On-line analytic processing See OLAP On-line transaction processing See OLTP OPEN 384, 707 Opening tag 488 Operand 38 Operator 38 See also Monotone operator Optical disk 559 Optimistic concurrency control 933 See also Timestamp, Validation Optimization See Plan selection, Query opti mization Or 254-255 ORDER BY 255-256, 461 See also Ordering, Sorting Ordering 461-463, 541-543
1196 See also Join ordering, Sorting Ordille, J. J. 1091 Outerjoin 214, 219-222, 277-278 OUTPUT 849 O utput attribute 774 Overflow block 613 Overlapping subclasses 176, 180 Ozsoyoglu, M. Z. 1034 Ozsu, M. T. 984
P
IND EX PCY Algorithm 1105-1107 PEAR 419 Pedersen, J. 1180 Peer-to-peer network 4, 1020-1021 Pelagatti, G. 983 Pelzer, P. 309 Percentiles See Equal-height histogram Persistent stored modules See PSM Peterson, W. W. 699 Phantom 925-926 PHP 369, 416-421 Physical address 594 Physical data model 17 Physical query plan 702-703, 750751, 810-812, 826-838 Piatetsky-Shapiro, G. 1139 Pinned block 600-601 Pipelining 830-834 Pippenger, N. 699 Pirahesh, H. 340, 480, 882, 983 P L/1 378 Plagiarism 1111 Plan selection See Bottom-up enumeration, Cap ability-based plan selection, Cost-based plan selection, Dynamic programming, Greedy algorithm, Join ordering, Phys ical query plan, Selingerstyle enumeration, Top-down enumeration PL/SQL 423 Point assignment 1123, 1130 Pointer swizzling See Swizzling Precedence graph 892-895 Predicate 223 Prefetching 573 See also Double-buffering Prepare (a SQL statement) 389,407, 413, 421 Prepared statement 413-414 Preprocessor 764-767
Packet stream 1163 Paepcke, A. 1180 Page See Disk block Page, L. 1147, 1180-1181 PageRank 1147-1160 Palermo, F. P. 841 Papadimitriou, C. H. 951 Papakonstantinou, Y. 65, 515,10911092 Parallel computing 986-992, 1145 See also Map-reduce framework Param eter 391, 410-412, 416 Parent 522 Parity bit 576, 582 Parity block 580 Park, J. S. 1105, 1139 Parse tree 760, 781-782 Parsed character data See PCDATA Parser 760-764 Parsing 701 Partial subclasses 176 Partial-m atch query 662, 670, 676677, 680, 689 Partitioned hash function 671-673 Partitioning 1087 Pascal 378 P ath expression 519-526 Paton, N. W. 340 P attern matching See LIKE Patterson, D. A. 618 PCDATA 496
INDEX Preservation of dependencies See Dependency preservation Preservation, of value sets 797-798 Price, T. G. 841 Primary index 620 See also Dense index, Sparse index Primary key 34-36, 70, 311, 637 Primary-copy locking 1017 Prime attribute 102 Privilege 425-436 Probe relation 815 Procedure 391-392, 402 Product 39, 43, 50, 210, 235, 259260, 717, 722,731, 737, 768, 771-772, 775-777, 791 Product database 36, 52-54, 526527 Projection 39, 41, 50, 206, 208-209, 232, 246-248, 711-712, 722, 774-776, 794 See also Extended projection, Lossless join, Pushing pro jections Projection, of FD s 81-83 Projection, of MVDs 119-120 Prolog 241 Proper ancestor 522 Proper descendant 522 Pseudotransitivity rule 84 PSM 391 402 See also PL/SQL, SQL PL, Trans act-SQL Pushing projections 789 Pushing selections 789, 808 Putzolo, F. 618, 951
1197 See also Decision-support query, Lookup, Nearest-neighbor query, OLAP, Partial-match query, Physical query plan, Range query, Search query, Standing query, Where-amI query Query compiler 7,10, 701-703,759838 Query execution 701-755 Query language See CLI, Datalog, Data-manipulation language, JDBC, PHP, PSM, Relational algebra, SQL, XPath, XQuery, XSLT Query optimization 10, 18, 49, 702 See also Plan selection Query plan See Logical query plan, Physi cal query plan, Plan selec tion Query processing 5, 7, 9-10, 10001007 See also Execution engine, Query compiler Query rewriting 363-364, 701-702 See also Algebraic law Query-language heterogeneity 1040
R
Q
Quad tree 681-683 Quantifier See ALL, ANY, EXISTS, For-all, There-exists Quass. D. 241, 515, 699, 1091 Query 18, 225, 343, 413-414
Raghavan, S. 1180 RAID 578-588, 844 Rajaraman, A. 367, 1091 Ramakrishnan, R. 241, 1140 Random walker 1147, 1154 Range query 639-640,662-664,670671, 677, 680-681, 690 Raw-data cube See D ata cube, Fact table READ 849 Read committed 304-305 Read lock See Shared lock Read uncommitted 304 Read-only transaction 300-302
1198 Real number See Floating-point number Record 590-592, 1079 See also ICAR records, Log record, Sliding records, Spanned record, Tagged field, Variable-format record, Variable-length record Record address See Database address Record fragment 608 Record header 590, 604 Record structure See Structure Recoverable schedule 956, 958 Recovery 7, 855-857, 864-868,870872,878-879,953-965,10111013 Recovery manager 855 Recovery of information 93 See also Lossless join Recursion 238, 437-443, 546 Redo logging 853, 863-868 Reduce 995-996 See also Map-reduce framework Redundancy 86, 106, 113, 141 Redundant arrays of independent disks See RAID Redundant disk 579-580 Reference 446, 449, 454-455, 457458 REFERENCES 426 See also Foreign key Referential integrity 59-60,150-151, 154, 172, 313-315 See also Foreign key Reflexivity rule 81 Reina, C. 1132, 1139 Relation 18, 205, 342, 1165-1168 See also Build relation, Dimen sion table, Fact table, Probe relation, Table, View Relation instance See Instance Relation schema 22, 24, 29-36
Relational algebra 19, 38-52, 59, 205-221, 230-238, 249, 782-783
Relational atom 223
Relational calculus 241
Relational database schema 22
Relational database system 3
Relational model 3, 17-19, 21-26, 157-169, 179-183, 193-198, 493-494
  See also Functional dependency, Multivalued dependency, Nested relation, Normalization, Object-relational model
Relational OLAP
  See ROLAP
Relationship 127, 134, 137, 142-144, 158-160, 185-188, 198
  See also Binary relationship, Isa relationship, Many-many relationship, Many-one relationship, Multiway relationship, One-one relationship, Supporting relationship
Relationship set 129
Relative path expression 521
Relaxation 1150
Renaming 39, 49-50
Repeatable read 304-306
Repeating field 603, 605-607
Repeat-loop 399
Replication 999, 1016-1019
Representability 1083
REQUIRED 499
Resilience 843
RESTRICT 433-436
Retained set 1133
Return statement 393
Return-clause 530, 533-534
Reuter, A. 881, 951
Revoking privileges 433-436
Right outerjoin 221, 277
Right-deep join tree 816-819
Rivest, R. L. 699
Robinson, J. T. 699, 951
ROLAP 467
Role 131-133, 175
Rollback 300-301, 955-959
  See also Abort, Cascading rollback
Roll-up 471, 476
Root 485, 489, 495, 519
Rosenkrantz, D. J. 984
Rotational latency 565
  See also Latency
Rothnie, J. B. Jr. 699, 951
Roussopoulos, N. 700
Row 22
  See also Tuple
Row-level trigger 332, 334
R-swoosh algorithm 1083-1086
R-tree 683-686
Rule 224-225
  See also Safe rule
Run-length encoding 691-693

S
Safe rule 226
Saga 978-981
Sagiv, Y. 1076, 1091
Salem, K. 618, 983
Salton, G. 699
Satisfaction, of an FD 68, 72-73
SAX 515
Scan
  See Index scan, Table scan
Schedule 884-889
  See also ACR schedule, Legality, of schedules, Recoverable schedule, Serial schedule, Serializable schedule, Strict schedule
Scheduler 883, 900-903, 915-921
Scheduling latency 568
Schema 483-484, 590
  See also Database schema, Global schema, Relation schema, Relational database schema, Star schema
Schema heterogeneity 1040-1041
Schkolnick, M. 1180
Schneider, R. 698
Schwarz, P. 882, 983
Search engine 1141-1160
Search key 619-620, 637
Search query 620
Second normal form 103
Secondary index 620, 624-628
  See also Inverted index
Secondary storage 558-559
  See also Disk
Second-chance algorithm
  See Clock algorithm
Sector 562-563
Seeger, B. 698
Seek time 564
Seidman, G. 1180
SELECT 244-246, 426
  See also Single-row select
Selection 39, 42, 50, 209, 232-234, 248-250, 711-712, 722, 740-742, 770, 772-774, 777, 783, 790, 794-797, 827-829, 835, 989
  See also Filter, Pushing selections, Two-argument selection
Selection, of indexes 352-358
Selectivity
  See Join selectivity
Selector 509
Self 522
Selinger, P. G. 841
  See also Griffiths, P. P.
Selinger-style enumeration 811-812
Sellis, T. K. 700
Semantic analysis
  See Preprocessor
Semijoin 58, 1001, 1005-1007
Semilattice 1083, 1088
Semistructured data 18-20, 483-487
  See also XML
Sensor 1163
Sequence 505-506, 518, 535
Sequential file 621, 661
Serial schedule 885-886, 958
Serializability 296-298, 387-388, 884, 953-965
  See also Conflict-serializability
Serializable schedule 886-887, 901-903, 958
Server 375, 593
  See also Application server, Database server, Web server
Session 377
Set 188-189, 195-196, 209, 294, 301, 304, 377, 445, 770
Set difference
  See Difference
Set-null policy 314-315
Sevcik, K. 699
Shapiro, L. D. 758
Shared disk 988
Shared lock 905-907, 920
Shared memory 986-987
Shared variable 381-383
Shared-nothing machine 988-989
Shaw, D. E. 1034
Sheth, A. P. 1092
Shingle 1111-1112
Shivakumar, N. 1139
Shortest common subsequence 1088
Sibling 522
Signature 1113
  See also Locality-sensitive hashing, Minhashing
Silberschatz, A. 951, 1181
Similarity graph 1084
Similarity, of records 1079-1087
Similarity, of sets 1110-1115
Simon, A. R. 309
Simple type 503, 507-509
Simplicity 142
Single-row select 383, 395-396
Single-value constraint
  See Functional dependency, Many-one relationship
Skeen, D. 1034
Skelley, A. 367
Slicing 469-472
Sliding window 1164, 1169-1171
SMART 367
Smith, J. M. 841
Smyth, P. 1139
Snodgrass, R. T. 700
Solution 1070, 1076-1077
Sort key 726
Sorted file
  See Sequential file
Sorted index 743-745
Sorting 214, 219, 704, 723-731, 738, 752-754, 829, 835
  See also ORDER BY, Ordering, Two-phase multiway merge sort
Source capabilities 1056-1057
Spam 1148
  See also Link spam
Spam farm 1159
Spam mass 1160
Spanned record 608-609
Sparse index 622-623, 637
Spider trap 1150-1153
Spindle 562
Splitting rule 73-74, 109
SQL 3, 29-36, 243-444, 451-463, 475-477, 530
SQL agent 378
SQL PL 423
SQL state 381, 385
Srikant, R. 1139
Srivastava, D. 1076, 1091
Stable storage 577-578
Standing query 1162
Star schema 467-469
State 845, 979
  See also Consistent state
Statement 405, 413-415
Statement-level trigger 332
Static hash table 651
Statistics 8, 705-706, 807
  See also Histogram
Stearns, R. E. 984
Steinbach, M. 1140
Stemming 632
Stoica, I. 1034
Stonebraker, M. 13, 618, 758, 1034, 1180
Stop word 632
Storage manager 7-8
Stored procedure 375
  See also PSM
Strict locking 957-958
Strict schedule 958
String 30, 188, 417
  See also Bit string
Stripe 665
Striping 570
Strong, H. R. 699
Structure 185, 189, 194-195, 445
Structured address 595-596
Sturgis, H. 618, 1034
Stylesheet 544
Su, Q. 1091
Subclass 135-138, 165-170, 172, 176, 180-181
  See also Isa relationship
Subgoal 224, 1062
Subquery 268-275, 395, 783-788
  See also Correlated subquery
Subrahmanian, V. S. 700
Subsequence 1087
Suciu, D. 515
Sum 214, 284, 1170
Sunter, A. B. 1091
Superkey 71, 88, 102
Support 1095-1096, 1100
Supporting entity set 154
Supporting relationship 154-155
Swami, A. 1139
Swizzling 596-600
Syntactic category 760, 762
Syntax analysis
  See Parser
Synthesis algorithm for 3NF 103-104
System failure 845
SYSTEM GENERATED 455-456
System R 12, 308, 841
Szegedy, M. 1180
T
Table 18, 29, 342
  See also Relation
Table scan 703-704, 706-708
Tableau 97
Tag 488, 493
Tagged field 607
Tan, P.-N. 1140
Tanaka, H. 758
Tatbul, N. 1180
Tatroe, K. 423
Taxation rate 1153, 1156
  See also Teleportation
Teleport set 1156-1157
Teleportation 1154
Template 544-548, 1050
Temporal database 24
Temporary table 30
Teorey, T. 367
Tertiary storage 559
Thalheim, B. 202
There-exists 539-540
Theta-join 45-47, 769, 777, 790-791
Thomas, R. H. 1034
Thomas write rule 936
Thomasian, A. 951
3NF 102-104, 113
Three-tier architecture 369-372
Three-valued logic 253-255
Thuraisingham, B. 951
Time 31, 251-252
Timeout 967
Timestamp 252, 590, 933-941, 946-948, 970-974
  See also Multiversion timestamp
Tombstone 596, 614, 694
Top-down enumeration 810-811
Topic-specific PageRank 1156-1160
TPMMS
  See Two-phase multiway merge sort
Track 562
Traiger, I. L. 951
Transaction 7, 296-306, 845-851, 887-889
  See also Consistency, Incomplete transaction, Long-duration transaction
Transaction manager 883
Transaction processing
  See Concurrency, Deadlock, Locking, Logging, Scheduling
Transact-SQL 423
Transfer time 565
Transition matrix of the Web 993, 1148-1149
Transitive rule 73, 79-81, 108
Translation table 597
Tree
  See Balanced tree, B-tree, Bushy tree, Expression tree, Join tree, kd-tree, Left-deep join tree, Parse tree, Quad tree, Right-deep join tree, R-tree
Tree protocol 927-932
Triangle inequality 1125
Triangular matrix 1101-1102
Trigger 332-337, 426
Trivial FD 74-75, 88
Trivial MVD 108
TrustRank 1160
Truth value 253-255
Tuning 357-358, 364-365
Tuple 22-23, 449, 458-459, 706, 1164
  See also Dangling tuple
Tuple identifier 445-446
  See also Object-ID
Tuple relational calculus
  See Relational calculus
Tuple variable 261-262
Tuple-based check 321-323, 331
Tuple-based nested-loop join 719
Two-argument selection 783-785
Two-pass algorithm 723-738
Two-phase commit 1009-1013
Two-phase locking 900-902, 906
Two-phase multiway merge sort 723-725
Type
  See Collection type, Complex type, Data type, Simple type, User-defined type
Type constructor 188, 449

U
UDT 451-463
Ullman, J. D. 13, 122-123, 241, 367, 480, 1091-1092, 1139
UML 125, 171-183
Unary operation 711, 830, 991
UNDER 426
Underwood, L. 1181
Undo logging 851-862
Undo/redo logging 853, 869-873
Unicode transformation format
  See UTF
Unified modeling language
  See UML
Union 39-40, 206-207, 212-213, 231, 265-266, 268, 282-283, 715-716, 722, 726-727, 731, 734, 737, 768, 771, 775, 801, 990, 1067-1068
UNIQUE 34-35, 312
UNKNOWN 253-255
Updatable view 345-348
Update 294, 413-414, 426, 615, 695
Update anomaly 86
Update lock 909-910
Upgrading locks 908-909, 921
USAGE 426
User-defined type
  See UDT
UTF 489
Uthurusamy, R. 1139

V
Valduriez, P. 984
Valentin, G. 367
Valid XML 489
  See also DTD
Validation 942-948
Value count 706, 793
value-of 545-546
Variable 38-39, 223, 232, 417, 534-535
  See also Local variable, Tuple variable
Variable-format record 607
Variable-length record 603-608
Variable-length string 30
Vassalos, V. 1091
Vianu, V. 12
View 29, 341-349, 765-767, 1070
  See also Materialized view
View maintenance 360-362
Virtual memory 560-561, 593, 747
Virtual view
  See View
Vitter, J. S. 618
Volatile storage 560, 845
W
Wade, B. W. 480
Wait-die 971-974
Waits-for graph 967-969
Walker
  See Random walker
Wall, A. 1181
Warehouse 1042-1046, 1049
Warning lock 922-926
Warning protocol 922-926
Weak entity set 152-156, 161-163, 181-183
Web crawler 1142-1145
Web server 370
Weiner, J. L. 515
Well-formed XML 489-490
Wesley, G. 1180
Whang, K.-Y. 1180
Whang, S. E. 1091
WHERE 244-246, 461
Where-am-I query 662-663, 684
Where-clause 530, 533
While-loop 399
Widom, J. 65, 340, 367, 515, 1091, 1180
Wiederhold, G. 618, 1092
Window
  See Sliding window
Winograd, T. 1181
WITH 437
Wong, E. 13, 841
Wood, D. 758
Workflow 976
  See also Long-duration transaction
World-Wide-Web Consortium 65, 515, 554
Wound-wait 971-974
Wrapper 1049-1054
Wrapper generator 1051-1052
WRITE 849
Write failure 575
Write lock
  See Exclusive lock
Write-ahead logging rule
  See Redo logging
W3Schools 515, 554

X
XML 3-4, 19-20, 488-551, 630
XML Schema 502-512, 523, 533
XPath 510, 517-526, 530, 545
XQuery 517, 528, 530-543
XSLT 517, 544-551
Y
Z
Zaniolo, C. 123, 700
Zdonik, S. 1180
Zhang, T. 1140
Zicari, R. 700
Zig-zag join 743-745
Zilio, S. 367
Zipfian distribution 795
Zuliani, M. 367