Copy Detection Mechanisms For Digital Documents
classes of OOTs; see chunking strategy D of Section 4.2 and Section 6 (consider the seed for the random number generator as a parameter).

Finally, notice that the security measures we have presented here do not address "authorization" issues. For example, when a user registers a document, how does the system ensure the user is who he claims to be and that he actually "owns" the document? When a user checks for violations, can we show him the matching documents, or do we just inform him that there were violations? Should the owner of a document be notified that someone was checking a document that violates his? Should the owner be given the identity of the person submitting the test document? These are important administrative questions that we do not attempt to address in this paper.

[2] This assumes a decision function which doesn't flag a violation if there are no matches (a reasonable condition). For instance, if o1(d, r) is always true, no matter whether there are matches or not, then our statement does not hold.

(A) One chunk equals one unit. Here every unit (e.g., every sentence) is a chunk. This yields the smallest chunks. As with units, small chunks tend to make the freq of an OOT smaller. The major weakness is the high storage cost: |r| hash table entries are required for a document. However, it is the most secure scheme: SEC(o, r) is bounded by |r|. That is, depending on the decision function, it may be necessary to alter up to |r| characters (one per chunk) to subvert the OOT.

(B) One chunk equals k nonoverlapping units. In this strategy, we break the document up into sequences of k consecutive units and use these sequences as our chunks. It uses (1/k)th the space of Strategy A but is very insecure, since altering a document by adding a single unit at the start will cause it to have no matches with the original. We call this effect "phase

[3] For our discussion we assume that documents do not have significant numbers of repeating units.
Strategy   Summary                 Example on ABCDEF (k = 3)   Space   # units   SEC
A          1 unit                  A,B,C,D,E,F                 |r|     1         |r|
B          k units, 0 overlap      ABC,DEF                     |r|/k   k         1
C          k units, k-1 overlap    ABC,BCD,CDE,DEF             |r|     k         |r|/k
D          hashed breakpoints      AB,CDEF                     |r|/k   k         |r|
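To make these strategies concrete, here is a minimal Python sketch of chunking strategies A through D over a list of units (single letters here, so the table's ABCDEF example can be reproduced). The breakpoint predicate used for strategy D is a hypothetical stand-in for a hash-based rule, chosen only so that the output matches the table's example.

```python
from typing import Callable, List

def chunks_a(units: List[str]) -> List[str]:
    """Strategy A: one chunk per unit (e.g., per sentence)."""
    return list(units)

def chunks_b(units: List[str], k: int) -> List[str]:
    """Strategy B: k consecutive, non-overlapping units per chunk."""
    return ["".join(units[i:i + k]) for i in range(0, len(units), k)]

def chunks_c(units: List[str], k: int) -> List[str]:
    """Strategy C: k consecutive units with k-1 overlap (sliding window)."""
    return ["".join(units[i:i + k]) for i in range(len(units) - k + 1)]

def chunks_d(units: List[str], is_breakpoint: Callable[[str], bool]) -> List[str]:
    """Strategy D: hashed breakpoints -- a chunk ends at every unit the
    predicate selects (expected chunk length is about k when the predicate
    fires with probability 1/k)."""
    out, current = [], []
    for u in units:
        current.append(u)
        if is_breakpoint(u):
            out.append("".join(current))
            current = []
    if current:                      # trailing units form a final chunk
        out.append("".join(current))
    return out

units = list("ABCDEF")
print(chunks_a(units))               # ['A', 'B', 'C', 'D', 'E', 'F']
print(chunks_b(units, 3))            # ['ABC', 'DEF']
print(chunks_c(units, 3))            # ['ABC', 'BCD', 'CDE', 'DEF']
# Hypothetical breakpoint rule reproducing the table's AB,CDEF example:
print(chunks_d(units, lambda u: u in ("B", "F")))   # ['AB', 'CDEF']
```

In practice each unit would be, e.g., a sentence, and only a hash of each chunk would be stored in the hash table.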
In this section we address the efficiency and scalability of OOTs. For copy detection to scale well, an OOT must be able to operate with very large collections of registered documents and to quickly test many new documents. One effective way to achieve scalability is to use sampling.
To illustrate, say we have an OOT with a DECIDE function that tests whether more than 15 percent of the chunks of a document d match. Instead of checking all chunks in d, we could simply take, say, 20 random chunks and check whether more than 3 of them match (15% of the 20 samples). We would expect this new OOT based on sampling to approximate the original OOT. If the average test document contains 1000 chunks, we will have reduced our evaluation time by a factor of 50. The cost, of course, is the lost accuracy, and that is analyzed in Section 6.1.
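As a rough sketch of this idea (the formal definitions of o2 appear below), the following hypothetical check samples 20 chunks of the test document and flags a registered document if more than 3 of them (15% of the 20 samples) match; the seed is exposed as a parameter, echoing the earlier remark about treating the random seed as an OOT parameter.

```python
import random

def sampled_violation(test_chunks, registered_chunks, n_samples=20,
                      threshold=0.15, seed=None):
    """Approximate "more than 15% of the chunks match" by examining only
    n_samples randomly chosen chunks of the test document.
    registered_chunks: set of chunks (or chunk hashes) of one registered document."""
    rng = random.Random(seed)
    sample = rng.sample(test_chunks, min(n_samples, len(test_chunks)))
    matches = sum(chunk in registered_chunks for chunk in sample)
    # e.g. with n_samples = 20: flag when more than 0.15 * 20 = 3 samples match
    return matches > threshold * len(sample)
```

For a 1000-chunk test document this examines 20 chunks instead of 1000, which is the factor-of-50 saving mentioned above.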
Another sampling option is to sample registered documents. The idea here is to insert into our hash table only a random sample of the chunks of each registered document. For example, say that only 10% of the chunks are hashed. Next, suppose that we are checking all 100 chunks of a new document and find 2 matches with a registered document. Since the registered document was sampled, these 2 matches should be equivalent to 20 under the original OOT. Since 20/100 exceeds the 15% threshold, the document would be flagged as a violation. In this case, the savings would be in storage space: the hash table will hold only 10% of the registered chunks. A smaller hash table also makes it possible to distribute it to other sites, so that copy detection can be done in a distributed fashion. Again, the cost is a loss of accuracy.
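A small sketch of this second option, under the example's assumptions (10% of registered chunks kept, 15% match threshold): observed matches are scaled back up by 1/p before the threshold is applied, so the 2 raw matches in the example count as roughly 20. The function names here are illustrative, not the paper's.

```python
import random

def register(chunks, hash_table, doc_id, p=0.10, seed=None):
    """Store only a random fraction p of a registered document's chunks."""
    rng = random.Random(seed)
    for chunk in chunks:
        if rng.random() < p:
            hash_table.setdefault(chunk, set()).add(doc_id)

def check(test_chunks, hash_table, p=0.10, threshold=0.15):
    """Count raw matches per registered document, then scale by 1/p to
    estimate how many matches the unsampled OOT would have seen."""
    raw = {}
    for chunk in test_chunks:
        for doc_id in hash_table.get(chunk, ()):
            raw[doc_id] = raw.get(doc_id, 0) + 1
    # e.g. 2 raw matches out of 100 test chunks ~ 20 estimated matches -> 20% > 15%
    return {doc_id: count / p > threshold * len(test_chunks)
            for doc_id, count in raw.items()}
```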
A third option is to combine these two techniques, without sacrificing accuracy (any more than either one alone), by sampling based on the hash numbers of the chunks [10]. For example, if in our test document we sample exactly those chunks whose hash number is 0 mod 10, then there is no need to store the hash values that are not 0 mod 10, since there could never be a collision oth-
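A minimal sketch of this combined, hash-based sampling, assuming a modulus of 10 as in the example (CRC32 stands in for whatever chunk hash is actually used): both registration and testing keep only chunk hashes that are 0 mod 10, so the two sides sample consistently and the discarded hash values never need to be stored.

```python
import zlib

def chunk_hash(chunk):
    # any deterministic hash works; CRC32 is used here only for illustration
    return zlib.crc32(chunk.encode("utf-8"))

def sampled_hashes(chunks, modulus=10):
    """Keep only hashes that are 0 mod `modulus`. Registration and testing
    apply the same filter, so a discarded hash can never be needed."""
    return [h for h in map(chunk_hash, chunks) if h % modulus == 0]

def violation(test_chunks, registered_sample, modulus=10, threshold=0.15):
    """registered_sample: set of sampled hashes of one registered document."""
    sample = sampled_hashes(test_chunks, modulus)
    matches = sum(h in registered_sample for h in sample)
    # compare the matched fraction of the *sampled* chunks to the threshold
    return bool(sample) and matches > threshold * len(sample)
```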
where RANDOM-SELECT picks N chunks at random. The chunking function for insertions is not changed, i.e., INS-CHUNKS2 = INS-CHUNKS1.

The DECIDE1 function of o1 selects documents r for which the number of matching chunks COUNT(r, MATCH) is greater than ψ · SIZE, where ψ is the threshold fraction of matching chunks (15% in the example above). For o2, only N chunks are tested (not SIZE), so the threshold number of chunks is ψ · N. Thus, DECIDE2 selects documents r for which the number of matching chunks COUNT(r, MATCH) is greater than ψ · N.
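Schematically, and using the paper's names for the component functions, the pieces of o2 can be sketched as follows; this is an illustration rather than the paper's exact definitions, with the threshold fraction ψ written explicitly.

```python
import random

def test_chunks2(d, n, test_chunks1, seed=None):
    """TEST-CHUNKS2(d): RANDOM-SELECT applied to TEST-CHUNKS1(d)."""
    chunks = test_chunks1(d)
    return random.Random(seed).sample(chunks, min(n, len(chunks)))

# INS-CHUNKS2 = INS-CHUNKS1: registered documents are chunked exactly as for o1.

def decide1(match_count, size, psi=0.15):
    """o1: flag r when COUNT(r, MATCH) > psi * SIZE."""
    return match_count > psi * size

def decide2(match_count, n, psi=0.15):
    """o2: only n chunks are tested, so the threshold becomes psi * n."""
    return match_count > psi * n
```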
6.1 Accuracy of Randomized OOTs

Now we wish to determine how different o2 is from o1. As in Section 3.2, let D be our distribution of input documents and let R be the distribution of registered documents. Let X be a random document that follows D and Y be a random document that follows R. Let m(X, Y) be the proportion of chunks (according to o1's chunking function) in X that match chunks in Y. Then let W(x) be the probability density function of m(X, Y), i.e., P(x_1 \le m(X, Y) \le x_2) = \int_{x_1}^{x_2} W(x)\,dx. Using this we can compute Alpha(o1, o2), Beta(o1, o2), and Error(o1, o2). The details of the computation are in Appendix A; the results are as follows:
Alpha(o1, o2) = \frac{\int_{\psi}^{1} W(x)\,Q(x)\,dx}{\int_{\psi}^{1} W(x)\,dx}

Beta(o1, o2) = \frac{\int_{0}^{\psi} W(x)\,(1 - Q(x))\,dx}{\int_{0}^{\psi} W(x)\,dx}

Error(o1, o2) = \int_{\psi}^{1} W(x)\,Q(x)\,dx + \int_{0}^{\psi} W(x)\,(1 - Q(x))\,dx

where Q(x) = \sum_{j=0}^{\lfloor \psi N \rfloor} \binom{N}{j}\, x^{j} (1 - x)^{N - j}.
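To make these expressions concrete, here is a small numeric sketch that evaluates Q(x) and the three quantities for the running example (N = 20, ψ = 0.15). The density W is not specified in this excerpt, so a uniform W(x) = 1 on [0, 1] is assumed purely for illustration; with a real chunk-match distribution the integrals would be taken against that W instead.

```python
import math

def Q(x, n=20, psi=0.15):
    """P(a Binomial(n, x) count stays at or below psi * n), i.e. the chance
    the sampled OOT o2 does NOT flag a true match fraction x."""
    limit = math.floor(psi * n)
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(limit + 1))

def integrate(f, lo, hi, steps=10_000):
    """Simple midpoint-rule integration, adequate for this illustration."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

W = lambda x: 1.0          # assumed uniform density, for illustration only
psi = 0.15

num_alpha = integrate(lambda x: W(x) * Q(x), psi, 1.0)
num_beta  = integrate(lambda x: W(x) * (1 - Q(x)), 0.0, psi)

alpha = num_alpha / integrate(W, psi, 1.0)   # chance o2 misses, given o1 flags
beta  = num_beta  / integrate(W, 0.0, psi)   # chance o2 flags, given o1 does not
error = num_alpha + num_beta                 # overall disagreement between o1 and o2

print(f"Alpha={alpha:.3f}  Beta={beta:.3f}  Error={error:.3f}")
```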