0% found this document useful (0 votes)
7 views

Copy Detection Mechanisms For Digital Documents

Copy Detection Mechanisms for Digital Documents

Uploaded by

Daniel Rebelo
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

Copy Detection Mechanisms For Digital Documents

Copy Detection Mechanisms for Digital Documents

Uploaded by

Daniel Rebelo
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Copy Detection Mechanisms for Digital Documents 

Sergey Brin, James Davis, Hector Garcia-Molina


Department of Computer Science
Stanford University
Stanford, CA 94305-2140
e-mail: [email protected]
Abstract letin board. The danger of illegal copies is not new, of
In a digital library system, documents are available in course however, it is much more time consuming to re-
digital form and therefore are more easily copied and produce and distribute paper, CDs or videotape copies
their copyrights are more easily violated. This is a very than on-line documents.
serious problem, as it discourages owners of valuable in- Current technology does not strike a good balance be-
formation from sharing it with authorized users. There tween protecting the owners of intellectual property and
are two main philosophies for addressing this problem: giving access to those who need the information. At one
prevention and detection. The former actually makes extreme are the open sources on the Internet, where ev-
unauthorized use of documents dicult or impossible erything is free, but valuable information is frequently
while the latter makes it easier to discover such activ- unavailable because of the dangers of unauthorized dis-
ity. tribution. 1 At the other extreme are closed systems,
In this paper we propose a system for registering doc- such as the one that the IEEE currently uses to dis-
uments and then detecting copies, either complete copies tribute is papers in CD-ROM. This a completely stand-
or partial copies. We describe algorithms for such detec- alone system where users can look for speci c articles,
tion, and metrics required for evaluating detection mech- view them, and print them, but cannot move any data
anisms (covering accuracy, eciency, and security). We in electronic form out of the system, and cannot add any
also describe a working prototype, called COPS, describe of his or her data.
implementation issues, and present experimental results Clearly, one would like to have an infrastructure that
that suggest the proper settings for copy detection pa- gives users access to a wide variety of digital libraries
rameters. and information sources, but that at the same time gives
information providers good economic incentives for o er-
1 Introduction ing their information. In many ways, we believe this is
the central issue for future digital informationand library
Digital libraries are a concrete possibility today because systems.
of many technological advances in areas such as storage In this paper we present one component of the infor-
and processor technology, networks, database systems, mation infrastructure that addresses this issue. The key
scanning systems, and user interfaces. In many aspects, idea is quite simple: provide a copy detection service
building a digital library today is just a matter of \doing where original documents can be registered, and copies
it." However, there is a real danger that such a digital can be detected. The service will detect not just ex-
library will either have relatively few documents of in- act copies, but also documents that overlap is signi cant
terest, or will be a patchwork of isolated systems that ways. The service can be used (see Section 2) in a va-
provide very restricted access. riety of ways by information providers and communica-
The reason for this danger is that the electronic tions agents to detect violations of intellectual property
medium makes it much easier to illegally copy and dis- laws. Although the copy detection idea is simple, there
tribute information. If an information provider gives a are several challenging issues we address here involving
document to a customer, the customer can easily dis- performance, storage capacity, and accuracy that need to
tribute it on a large mailing list or can post it on a bul- be resolved. Furthermore, copy detection is relevant to

This research was sponsored by the Advanced Research the \database community" since its central component
Projects Agency (ARPA) of the Department of Defense un- is a large database of registered documents.
der Grant No. MDA972-92-J-1029 with the Corporation for We stress that copy detection is not the complete so-
National Research Initiatives (CNRI). lution by any means it is simply a helpful tool. There
are a number of other important \tools" that will also
assist in safeguarding intellectual property. For exam-
1
As just one example, Knight-Ridder Tribune recently
(June 23, 1994) ceased publishing on ClariNet the Dave
Barry and the Mike Royko columns because subscribers re-
distributed the articles on large mailing lists.
ple, good encryption and authorization mechanisms are Unlike the case with watermarks, it is not easy for a
needed in some cases. It is also important to have mech- user to automatically subvert the system, i.e., to make
anisms for charging for access to information. The arti- an undetectable copy. For example, if the decomposi-
cles in 5, 7, 9] discuss a variety of other topics related to tion units are sentences, a user would have to change a
intellectual property. These other tools and topics will large number of sentences in the document. This involves
not be covered in this paper. more than just adding a blank space between words (as-
suming that the hashing scheme ignores spaces). Of
2 Safeguarding Intellectual Property course, a determined user could change all sentences,
but our goal is to make it hard to copy documents, not
How can we ensure that a document is only seen and to make it impossible. This makes it hard to rapidly
used by a person who is authorized (e.g., has paid) to see distribute copies of documents.
it? One possibility lies in preventing violations from oc- This basic scheme has much in common with sif, a tool
curring. Some schemes along these lines have been sug- for nding similar les in a le system, created by Udi
gested, such as secure printers with cryptographically se- Manber 10]. However, there are a number of di erences
cure communications paths 12] and active documents 6] in nding similar les versus nding similar sections of
where users may interact with documents only through text which COPS addresses. First, since we are deal-
a special program. ing with text, we operate on a syntactic level and hash
The problems with all such techniques is that they are syntactic units as opposed to xed length strings. We
cumbersome (requiring special software or hardware), re- also consider the security of copy detection (Section 3.3)
strictive (limiting access means to the document), and and attempt to maximize its exibility by dealing with
not completely secure (from someone with an OCR pro- violations of varying granularities (Section 4). One of
gram for example). The alternative is to use detection the most important di erences is that it is much more
techniques. That is, we assume most users are honest, dicult to test a system like COPS since there are no
allow them access to the documents, and focus on de- databases of actual copyright violations (Section 5).
tecting those that violate the rules. Many software ven- The copy detection server can be used in a variety
dors have found this approach to be superior (protection of ways. For example, a publisher is legally liable for
mechanisms get in the way of honest users, and sales publishing materials the author does not have copyright
may actually decrease). on thus, it may wish to check if a soon-to-be-published
One possible direction is \watermark" schemes where document is actually an original document. Similarly,
a publisher incorporates a unique subtle signature (iden- bulletin-board software may automatically check new
tifying the user) in a document when it is given to the postings in this fashion. An electronic mail gateway may
user so that when an unauthorized copy is found, the also check the messages that go through (checking for
source will be known 13, 3, 4, 2]. One problem is that \transportation of stolen goods"). Program committee
these schemes can easily be defeated by users who de- members may check if a submission overlaps too much
stroy the watermarks. For example, slightly varied pixels with an author's previous paper. Lawyers may want to
in an image would be lost if it is passed through a lossy check subpoenaed documents to prove illegal behavior.
JPEG encoder. (Copy detection can also be used for computer programs
A second approach, and one that we advocate in this 11], but we only focus on text in this paper.) There are
paper (for text documents), is that of a copy detection also applications that do not involve detection of unde-
server 1, 11]. The basic idea is as follows: When an sirable behavior. For example, a user that is retrieving
author creates a new work, he or she registers it at the documents from an information retrieval system or who
server. The server could also be the repository for a is reading electronic mail, may want to ag duplicate
copyright recordation and registration system, as sug- items (with a given overlap threshold). Here the \regis-
gested in 8]. As documents are registered, they are tered" documents are those that have been seen already
broken into small units, for now say sentences. Each the \copies" represent messages that are retransmitted
sentence is hashed and a pointer to it is stored in a large or forwarded many times, di erent editions or versions
hash table. of the same work, and so on. Of course, potential dupli-
Documents can be compared to existing documents in cates should not be deleted automatically it is up to the
the repository, to check for plagiarism or other types of user to decide if he wants to view possible duplicates.
signi cant overlap. When a document is to be checked, In summary, we think that detecting copies of text
it is also broken into sentences. For each sentence, we documents is a fundamental problem for distributed in-
probe the hash table to see if that particular sentence has formation or database systems. And there are many
been seen before. If the document and a previously regis- issues that need to be addressed. For instance, should
tered document share more than some threshold number the decomposition units be paragraphs or something else
of sentences, then a violation is agged. The threshold instead of sentences? Should we take into account order
can be set depending on the desired checks, smaller if we of the units (paragraphs or sentences), e.g., by hashing
are looking for copied paragraphs, larger if we only want sequences of units? Is it feasible to only hash a fraction
to check if documents share large portions. A human of the sentences of registered documents? This would
would then have to examine both documents to see if it make the hash table smaller, hopefully still making it
was truly a violation. very likely that we will catch major violations. If the
hash table is relatively small, it can be cloned. Our mail (with the same notation as violation tests), that approx-
gateway above could then perform its checks locally, in- imate the desired violation tests. For instance, consider
stead of having to contact a remote copy detection server the operating test t1 (d r) that holds if 90% of the sen-
for each message. There are also implementation issues tences in d are contained in r. This test may be con-
that need to be addressed. For example, how are sen- sidered an approximation to the Subset test described
tences extracted from say Latex or Word documents? above. If the system ags t1 violations, then a human
Can one extract them from Postscript documents, or can check if they are indeed Subset violations.
from bit maps via OCR?
These and other questions will be addressed in the rest 3.1 Ordinary Operational Tests
of this paper. We start in Sections 3 and 4 by de ning the In the rest of this paper we will focus on a speci c class
basic terms, evaluation metrics, and options for copy de- of operational tests, ordinary operational tests (OOTs),
tection. Then in Section 5 we describe our working pro- that can be implemented eciently. We believe they can
totype, COPS, and report on some initial experiments. accurately approximate many violation tests of interest,
A sampling technique that can reduce the storage space such as Subset, Overlap, and Plagiarism.
of registered documents or can speed up checking time Before we describe OOTs we need to de ne some prim-
is presented and analyzed in Section 6. itives for specifying the level of detail at which we look at
the documents. As mentioned in Section 3, documents
3 General Concepts contain some structural information. In particular, doc-
uments can be divided into well de ned parts, consistent
In this section we de ne some of the basic concepts for with the underlying structure such sections, paragraphs,
copy detection and for evaluating mechanisms that im- sentences, words, or characters. We call each of these
plement it. (As far as we know, text copy detection types of divisions a unit type and particular instances of
has not been formally studied, so we start from basics.) these unit types are called units.
The starting point is the concept of a document, a body We de ne a chunk as a sequence of consecutive
of text from which some structural information (such units in a document of a given unit type. A doc-
as word and sentence boundaries) can be extracted. In ument may be divided into chunks in a number of
an initial phase, formatting information and non-textual ways since chunks can overlap, may be of di er-
components are removed from documents (see Section ent sizes, and need not completely cover the docu-
5). The resulting canonical form document consists of ment. For example, let us assume we have a docu-
a string of ascii characters with whitespace separating ment ABCDEFG where the letters represent sentences
words, punctuation separating sentences and possibly or some other units. Then it can be organized into
a standard method of marking the beginning of para- chunks as follows : A,B,C,D,E,F,G or AB,CD,EF,G
graphs. or AB,BC,CD,DE,EF,FG or ABC,CD,EFG or A,D,G.
A violation occurs when a document infringes upon A method of selecting chunks from a document divided
another document in some way (e.g., by duplicating por- into units is a chunking strategy. It is important to note
tions of text). There are a number of violation types that unlike units, chunks have no structural signi cance
which can occur including plagiarism of a few sentences, to the document and so chunking strategies cannot use
exact replication of the entire document, and many steps structural information about the document.
in between. The notion of checking for a particular type An OOT, o, uses hashing to detect matching chunks
of violation between two documents is captured by a vi- and is implemented by the set of procedures in Figure 1.
olation test. If t is a violation test and t(d r) holds, The code is intended to convey key concepts, not an ef-
then document d violates document r according to the cient or complete implementation. (Section 5 describes
particular test. For example, Plagiarism(d r) is true if our actual prototype system.) First there is the pre-
document d has plagiarized from document r. We also processing operation, PREPROCESS(R, H), that takes as
extend this notation to include checking against a set of input a set of registered documents R and creates the
documents: t(d R) is true if and only if t(d r) holds for hash table, H. Second, there are procedures for on-the-y
some document r 2 R. adding documents to H (registering new documents) and
Most of the violation tests we are interested in are not for removing them from H (un-registering documents).
well de ned and require a decision by a human being. Third is the function EVALUATE(d, H) that computes
For example, plagiarism is particularly dicult to test o(d R).
for. For instance, the sentence \The proof is as follows" To insert documents in the hash table, procedure
may occur in many scienti c papers and would not be INSERT uses a function INS-CHUNKS(r) to break up a
considered plagiarism if it occurred in two documents, document r into its chunks. The function returns a set
while this sentence most certainly would. If we consider of tuples. Each <t,l> tuple represents one chunk in r,
a test Subset that detects if a document is essentially a where t is the text in the chunk, and l is the location of
subset of another one, we again need to consider if the the chunk, measured in some unit. An entry is stored in
smaller document makes any signi cant contributions. the hash table for every <t,l> chunk in the document.
This again, requires human evaluation. Procedure EVALUATE(d, H) tests a given document d
The goal of a copy detection system is to implement for violations. The procedure uses EVAL-CHUNKS func-
well de ned algorithmic tests, termed operating tests tion to break up d. The reason why we use a di erent
PREPROCESS(R,H) r. The copy detection system may also store separately
CREATETABLE(H) the registered documents. (Our COPS prototype does
for each r in R INSERT(r,H) this.) This can be useful for showing a user the match-
ing documents and highlighting the matching chunks.
INSERT(r,H)
C = INS-CHUNKS(r) /* OOT dependent */ 3.2 Measuring Accuracy
for each <t, l> in C
As described earlier, OOTs (and operational tests in gen-
h = HASH(t)
INSERTCHUNK(<h,r,l>, H)
eral) are intended for approximating violation tests such
as Plagiarism and Subset. It is therefore important to
DELETE(r,H) evaluate how well an OOT approximates some other test.
C = INS-CHUNKS(r) It is also important to evaluate the security of OOTs, i.e.,
for each <t, l> in C how hard it is to subvert the copy detection, as well as
h = HASH(c) their eciency, i.e., what computational resources they
DELETECHUNK(<h,r,l>, H) require. Accuracy and security are discussed in the rest
EVALUATE(d,H) of this section eciency is addressed in Section 6.
C = EVAL-CHUNKS(d) Assume a random registered document Y chosen from
SIZE = |C|
a distribution of registered documents R. That is, the
MATCHES = {} /* empty set */
for each <t, ld> in C
probability that Y is a particular document r1 out of a
h = HASH(t) population of registered documents is R(r1). Similarly,
assume a random test document X is selected from a
/* get all <r, lr> with matching h */ distribution of test documents D. We can then de ne the
SS = LOOKUP(h, H) following accuracy metrics, each implicitly parametrized
for each <r, lr> in SS by R and D.
MATCHES += <|t|, ld, r, lr>
Denition 3.1 For a test t, we dene freq(t) =
return DECIDE(MATCHES,SIZE) /* OOT dependent */ P (t(X Y )). (\P" stands for \probability.")
Intuitively, freq measures how frequently a test is true.
Figure 1: Pseudo-code for OOT If an operating test approximates a violation test well,
then their freq's should be close but the converse is not
chunking function at evaluation time will become appar- true since they can accept on disjoint sets. If the freq of
ent in Section 6. For now, we can assume that both the operating test is small compared to the violation test
INS-CHUNKS and EVAL-CHUNKS are identical and we use it is approximating, then it is being too conservative. If
CHUNKS to refer to them. it is too large then the operating test is too liberal.
After chunking, procedure EVALUATE then looks up Suppose we have an operating test t2 and a violation
the chunks in the hash table H, producing a set of tu- test t1 . Then we de ne the following measures for accu-
ples MATCH. Each <s,ld,r,lr> in MATCH represents a racy. (Note that these can also be applied between two
match: a chunk of size s at location ld in document d operating tests and in general between any two tests).
matches (has same hash key) as a chunk at location lr Denition 3.2 The Alpha metric corresponds to a
in registered document r. The MATCH set is then given measure of false negatives, i.e., Alpha(t1 t2) =
to function DECIDE(MATCH, SIZE) (where SIZE is the P (:t2(X Y ) j t1(X Y )). Note Alpha is not symmet-
number of chunks in d) that returns the set of matching ric. A high Alpha(t1  t2) value indicates that operating
registered documents. If the set is non-empty, then there test t2 is missing too many violations of t1.
was a violation, i.e., o(d R) holds. Denition 3.3 The Beta metric is analogous to Alpha
Note that an instance of an OOT is speci ed sim- except that it measures false positives, i.e., Beta(t1  t2) =
ply by its INS-CHUNKS, EVAL-CHUNKS and DECIDE func- P (t2(X Y ) j :t1(X Y )). Beta is not symmetric either.
tions. That is, this is the only way in which OOTs di er. A high Beta(t1  t2) value indicates that t2 is nding too
In particular, in Section 5 we will start by considering many violations not in t1 .
an OOT where both its CHUNKS functions extract sen-
tences, and its DECIDE function selects registered docu- Denition 3.4 The Error metric is the combination
ments that exceed some threshold fraction  of matching of Alpha and Beta in that it takes into account both
chunks. That is, let COUNT(r, MATCH) be the number of false positives and false negatives and is dened as
tuples of the form <-,-,r,-> in MATCH. Then document Error(t1 t2) = P(t1 (X Y ) 6= t2 (X Y )). It is symmet-
r will be selected if COUNT(r, MATCH) is greater than ric. A high Error value indicates that the two tests are
SIZE. For example, if  = 0:4 and the document to dissimilar.
check has 100 sentences, then registered documents with
41 or more matching sentences will be selected. We call 3.3 Security
this DECIDE function the match ratio function. So far we have assumed that the author of a test doc-
In the code of Figure 1 we only store the ids of regis- ument does not know how our copy detection system
tered documents in H, not the full documents. That is, works and does not intend to sabotage it. However, an-
for a tuple <h,r,l> in H, r is simply the name or id of other important measure for an OOT is how hard it is
for a malicious user to break it. We measure this no- 4 Taxonomy of OOTs
tion of security in terms of how many changes need to
be made to a registered document so that it will not be The units selected, the chunking strategy, and the deci-
identi ed by the OOT as a copy. sion function can a ect the accuracy and the security of
Denition 3.5 The security of an OOT o (also ap- an OOT. In this section we consider some of the options
and the tradeo s involved.
plicable to any operating test) on a given document r,
SEC (o r), is the minimum number of characters that 4.1 Units
must be inserted, deleted, or modied in r to produce a
new document r such that o(r  r) is false. The higher a To
0 0 determine how documents are to be divided into
chunks we must rst choose the units. One key factor
SEC (o r) value is, the more secure o is. to consider is the number of characters in a unit. Larger
We can use this notion to evaluate and compare OOTs. units (all else being equal) will tend to generate fewer
For example, consider an OOT o1 that considers the en- matches and hence will have a smaller freq and be more
tire document as a single chunk. Then SEC (o1  r) = 1 selective. This, of course, can be compensated by chang-
for all r, because changing a single character makes the ing the chunk selection strategy or decision function.
document not detectable as a copy. 2 Another important factor in the choice of a unit type
As another example consider OOT o2 that uses sen- is the ease of detecting the unit separators. For exam-
tences as chunks and a match ratio decision function. ple, words that are separated by spaces and punctuation
Then SEC (o2  r) = (1 ; )SIZE where SIZE is the num- are easier to detect than paragraphs which can be dis-
ber of sentences in r. For instance, if  = 0:6 and our tinguished in many ways.
document has 100 sentences, we need to change at least Perhaps the most important factor in unit selection
40 of them. As a third example, consider an OOT o3 is the violation test of interest. For instance, if it
that uses pairs of overlapping sentences as chunks. For is more meaningful that sequences of sentences were
instance, if the document has sentences A, B, C, ... , copied rather than sequences of words (e.g., sentence
o3 considers chunks AB, BC, CD, ... . Here we need fragments), then sentences and not words should be used
to modify half as many sentences as before (roughly), as units.
since each modi cation can a ect two chunks. Thus,
SEC (o3  r) is approximately equal to SEC (o2  r)=2, i.e., 4.2 Chunks
o3 is approximately half as secure as o2 .
Note that our security de nition is weak because it as- There are a number of strategies for selecting chunks.
sumes the adversary knows all about our OOT. However, involved, thethem
To contrast we can consider the number of units
by keeping certain information about our OOT secret we quired for a document,ofand
number hash table entries that are re-
an upper bound for the se-
can enhance security. We can model this by having a curity SEC(o r). 3 See Table 1 for a summary of the
large class of OOTs, O, that vary only by some param- four strategies we consider. (There are also many vari-
eters, and then secretly choosing one OOT from O. We ations not covered here.) In the table,
assume that the adversary does not know which OOT number of units in the document r beingjrjchunked, refers to the
and
we have chosen and thus needs to subvert all of them. k is a parameter of the strategies. The \space" column
For this model we de ne SEC (O r) as the number of
characters that must be inserted, deleted, or modi ed to gives
\#
the number of hash table entries need for r, while
units" gives the chunk size.
make o(r  r) false for all o 2 O. For examples of using
0

classes of OOT's see chunking strategy D of Section 4.2 (A) One chunk equals one unit. Here every unit (e.g. ev-
and Section 6 (consider the seed for the random number ery sentence) is a chunk. This yields the smallest
generator as a parameter). chunks. As with units, small chunks tend to make
Finally, notice that the security measures we have pre- the freq of an OOT smaller. The major weakness
sented here do not address \authorization" issues. For is the high storage cost jrj hash table entries are
example, when a user registers a document, how does the required for a document. However, it is the most se-
system ensure the user is who he claims to be and that he cure scheme SEC(o r) is bounded by jrj. That is,
actually \owns" the document? When a user checks for depending on the decision function, it may be neces-
violations, can we show him the matching documents, sary to alter up to jrj characters (one per chunk) to
or do we just inform him that there were violations? subvert the OOT.
Should the owner of a document be noti ed that some- (B) One chunk equals k nonoverlapping units. In this
one was checking a document that violates his? Should strategy, we break the document up into sequences
the owner be given the identity of the person submitting of k consecutive units and use these sequences as our
the test document? These are important administrative chunks. It uses (1=k)th the space of Strategy A but
questions that we do not attempt to address in this pa- is very insecure since altering a document by adding
per. a single unit at the start will cause it to have no
2
This assumes a decision function which doesn't ag a matches with the original. We call this e ect \phase
violation if there are no matches (a reasonable condition).
For instance, if o1 (d r) is always true, no matter if there are 3
For our discussion we assume that documents do not have
matches or not, then our statement does not hold. signicant numbers of repeating units.
strat summary example on ABCDEF (k = 3) space # units SEC 
A 1 unit A,B,C,D,E,F jrj 1 jr j
B k units, 0 over ABC,DEF jrj=k k 1
C k units, k-1 over ABC,BCD,CDE,DEF jrj k jrj=k
D hashed breakpoints AB,CDEF jrj=k k jr j

Table 1: Properties of Chunking Strategies


dependence". This e ect also leads to high Alpha the test and the registered document is above a certain
errors. value k. This would be useful for detecting violations
(C) One chunk equals k units, overlapping on k ; 1 units. such as Plagiarism. One might also consider using or-
Here, we take every sequence of k consecutive units dered matches which tests whether there are more than
in our document as our chunks. Therefore we do not a certain number of matches occurring the same order
su er from phase dependence as in Strategy B but in both documents. This would be useful if unordered
unfortunately the space cost is equivalent to Strat- matches are likely to be coincidental.
egy A. Comparing an OOT oA that uses Strategy
A, and an OOT oC that is the same except for its 5 Prototype and Preliminary Results
use of Strategy C, one can see that Alpha(o oC )  We have built a working OOT prototype to test our
Alpha(o oA ) and Beta(o oC )  Beta(o oA ) for any ideas and to understand how to select good CHUNKS and
violation test o. This is because oC (d r) being true DECIDE functions. The prototype is called COPS (COpy
implies that oA (d r) is true. Thus Strategy C is Protection System) and Figure 2 shows its major mod-
prone to higher Alpha errors (but lower Beta errors). ules. Documents can be submitted via email in TEX (in-
Also, Strategy C is relatively insecure (though more cluding LaTEX), DVI, tro and ASCII formats. New doc-
secure than B) in that modifying every kth unit of a uments can be either registered in the system or tested
registered document is sucient to fool the system. against the existing set of registered documents. If a
(D) Use nonoverlapping units, determining break points new document is tested, a summary is returned listing
by hashing units. We start by hashing the rst unit the registered documents that it violates.
in the document. If the hash value is equal to some
constant x modulo k, then the rst chunk is simply
the rst unit. If not, we consider the second unit. If Tex to ASCII DVI to ASCII troff to ASCII
its hash value equals x modulo k, the the rst chunk
is the rst two units. If not we consider the third
unit, and so on until we nd some unit that hashs Sentence Identification and Hashing
to x modulo k, and this identi es the end of the rst
chunk. We then repeat the procedure to identify the
following nonoverlapping chunks. Document Processing Query Processing
It can be shown that the expected number of units
in each chunk will be k. Thus, Strategy D is similar
to B in its hash table requirements. However, unlike
B, it is not a ected by phase dependence since sim- Database
ilar text will have the same break points. Strategy
D, like C, has higher Alpha (and lower Beta) errors
as compared to A. Furthermore, all else being the Figure 2: Modules in COPS implementation.
same, freq should be only slightly less than that of
C because signi cant portions of duplicated text will COPS allows modules to be easily replaced, permit-
be caught just as in C. ting experimentation with di erent strategies (e.g., dif-
The key advantage of Strategy D is that it is very ferent INS-CHUNKS, EVAL-CHUNKS and DECIDE functions).
secure. (It is really a family of strategies with a secret We will begin our explanation with the simplest case
parameter see Section 3.3.) Without knowing the (sentence chunking for both insertion and evaluation,
hash function, one must change every unit of a test and a match ratio decision function) and later discuss
document to be sure it will get through the system possible improvements. A document that has been sub-
without warnings. mitted to the system is given a unique document ID.
This ID is used to index a table of document informa-
4.3 Decision Functions tion such as title and author. To register the document,
There are many options for choosing decision functions. rst it must be converted into the canonical form, i.e.,
The match ratio function (Section 3.1) can be useful for plain ASCII text. The process by which this occurs is
approximating Subset and Overlap violation tests. An- dependent upon the document format. A TEX document
other simple decision function is matches (with parame- can be piped through the Unix utility detex, while a doc-
ter k) that simply tests if the number of matches between ument with tro formatting commands can be converted
with nro . Similarly DVI and other document formats column and then the right one.
have lters to handle their conversion to plain ASCII An input format COPS can not handle in general is
text. After producing plain ASCII we are ready to de- Postscript. Since Postscript is actually a programming
termine and hash the document's individual sentences. language, it is very dicult to convert its layout com-
Using periods, exclamation points, and question marks mands to plain ASCII text. Some Postscript genera-
as sentence delimiters, we hash each sentence into a nu- tors such as dvips, enscript, and Microsoft Word pro-
meric key. The current document's unique ID is then duce relatively simple Postscript from which text can
stored in a permanent hash table, once for each sentence. be extracted. However, others such as Interleaf produce
When we wish to check a new document against the Postscript code which would require the generation of
existing set of registered documents, we use a very sim- page bit maps. These could be scanned with OCR (op-
ilar procedure. We generate the plain ASCII, determine tical character recognition) to analyze and reconstruct
sentences, and generate a list of hash keys, and look the text. This process is dicult and error prone.
them up in the hash table (see Section 3.1). If more In summary, the approach we have taken with the
than SIZE sentences match with any given registered COPS converters is to do a reasonable job converting
document we report a possible violation. to ASCII, but not necessarily perfect. Most matching
sentences that are not translated identically, will still be
5.1 Conversion to ASCII found by the system, since enhancements discussed later
The procedure described above is the ideal case. In prac- attempt to negate the e ects of common translation mis-
tice a number of interesting diculties arise. Let us rst interpretations. Even if some matching sentences are
consider some of the challenges associated with the con- missed, there should be enough other matches in over-
version to ASCII text. The most important is that no ex- lapping documents so that COPS can still ag the vi-
act objective method of reducing a formatted document olations. Later, we present experimental results that
to ASCII exists. Documents are formatted using TEX con rm this.
or tro precisely because there is some \value added"
over plain text. This extra formatting cannot be rep- 5.2 Sentence Identication and Hashing
resented in ASCII, and so will be lost. For example, Dicult problems also arise in the sentence identi ca-
embedded graphs have no ASCII equivalent. We can re- tion and hashing module. In particular, even if we are
tain any text items or labels associated with the graph, given correctly translated plain ASCII, it is not always
but the primary structure is not translatable. Equations clear how to extract sentences. As a rst approxima-
and tables are diculties as well. In our implementation tion, we can identify a sentence by merely taking all
we discard graphs, equations, tables, pictures, and all words up to a period or question mark. However, sen-
other pieces of information that cannot be represented tences that contain \e.g." or other abbreviations will
naturally in ASCII. We also choose to discard all text be broken into multiple parts because of the embedded
formatting commands that e ect the presentation, but periods. An extension to our simple model explicitly
not the content, of the document. For example, com- watches for and eliminates common abbreviations such
mand sequences to produce italic type and change font as \e.g." and \i.e." so that sentences will not be bro-
are removed and ignored. ken in this way. Nevertheless, unexpected abbreviations
The conversion process is not perfect. If the document will still cause diculties. For example, given the ac-
input format is DVI, then it is sometimes impossible to tual sentences, \I am a U.S. citizen." and \The U.S.
distinguish \equations" from \plain text". Consider the is large." our system will identify the following set of
sentence, \Let X+Y equal the answer." This sentence sentences. \I am a U.", \S.", \citizen.", \The U.", \S.",
will be translated to ASCII exactly as it is shown. How- and \is large." Notice that the sentence \S." is identi-
ever, if we begin with TEX, then the equation will be ed twice. The system will ag this as a match, even
discarded, leaving the sentence \Let equal the answer." though the actual sentences are not the same. To reduce
Since the conversion to plain ASCII produced di erent this sort of error we can disregard sentences composed
sentences, our system would be unable to recognize that of a single word however, other similar errors may still
a sentence match occurred. Later in this section we will occur. For example, title and author names at the head
discuss some system enhancements that allows us to de- of a document are also dicult to extract as sentences,
tect matching sentences, despite imperfect translations. since they rarely end with punctuation. We discuss later
Another complication with DVI is that it gives direc- some further improvements to the simple algorithm we
tions for placing text on a page but it does not specify have described here. Note that paragraph detection, if
what text is part of the main body, and what is part it were needed, would involve similar issues. COPS cur-
of subsidiary structures like footnotes, page headers and rently does not detect paragraphs.
bibliographies. Our DVI converter does not attempt to The units used by COPS' OOT are words and sen-
rearrange text it simply considers the text in the order tences (see Section 3.1). COPS rst converts each word
it appears on the page. However, one case it does handle in the text to a hash key. The result is a sequence of hash
is that of two column format. Instead of reading char- keys with interspersed end-of-sentence markers. The
acters left to right, top to bottom (which would corrupt chunking of this sequence is done by calling a procedure
most sentences in a two column format), the converter COMBINE(N-UNITS, STEP, UNIT-TYPE), where N-UNITS
detects the inter-column gap and reads down the left is the number of units to be combined into the next
chunk, STEP is the number of units to advance for Group Self Related Unrelated
the next chunk, and UNIT-TYPE indicates what should (Anity) (Noise)
be considered a unit. For example, repreatedly call- A 100% 71.9% 0.6%
ing COMBINE(1, 1, WORD) creates a chunk for each B 100% N/A 0.9%
word in the input sequence. Calling COMBINE(1, 1, C 100% 38.6% 0.9%
SENTENCE) creates a chunk for each sentence. Us- D 100% 42.9% 0.3%
ing COMBINE(3, 3, WORD) takes every three words as E 100% 38.4% 0.2%
a chunk, while COMBINE(3, 1, WORD) produces over- F 100% 63.0% 0.8%
lapping three word chunks. COMBINE(2, 1, SENTENCE) G 100% 66.0% 0.4%
would produce overlapping two sentence chunks. Thus, H 100% 38.4% 0.4%
we can see that this scheme gives us great exibility for I 100% 93.3% 1.3%
experimenting with di erent CHUNKS functions. How- TotalAvg 100% 52.9% 25.16% 0.6% 2.1%
ever, it should be noted that once a CHUNKS function is
chosen, it must be used consistently for all documents. Table 2: Average number of matching sentences.
That is, the exibility just described is useful only in an
experimental setting. 100COUNT(r, MATCH)/SIZE for all other documents r
in the group, and average the results. We refer to values
5.3 Exploratory Tests in the second column as \anity" values since they rep-
To evaluate the accuracy of the system, we conducted resent how close documents are. For the third column,
some exploratory experiments using a set of ninety two we compare each d in a group against all r in others
Latex, ASCII, and DVI technical documents (i.e., pa- groups. We refer to number in this column as \noise"
pers like this one). These experiments are not intended since they represent undesired matches. The numbers
to be comprehensive our goal is simply to understand reported at the bottom of Table 2 are the averages over
how many matching chunks real documents might be all document comparisons performed for that column.
expected to have, and how well our converters work. We also report the standard deviation between individ-
The documents average approximately 7300 words and ual tests to illustrate the spread of values.
450 sentences in length. Approximately half of these Ideally, one wants anity values that are as high as
documents are grouped into nine topical sets (labeled possible, and noise values that are as low as possible.
A, B, ..., I in the tables). The two or three documents This makes it possible for a threshold value that is be-
within each group are closely related, usually multiple tween the anity and noise levels to distinguish between
revisions of a conference or journal paper describing the related and unrelated documents. Table 2 reports that
same work. The documents in separate topical groups related documents have on average 53% matching sen-
are unrelated except for the author's aliation with our tences, while unrelated ones have 0.6%. The reason why
research group at Stanford. The remaining half of the anity is relatively low is that the notion of \Related"
documents not in any topical group are drawn from out- documents we have used here is very broad. For exam-
side Stanford and not related to any document in our ple, often the journal version and the conference version
collection. of the same work are quite di erent.
All of these documents were registered in COPS, and The noise level of 0.6%, equivalent to 2 or 3 sentences,
then each was queried against the complete set. Our is larger than what we expected. The discrepancy is
goal is to see if COPS can determine the closely related caused by several things. A few sentences, such as, \This
documents. Using the terminology of Section 3, we are work partially supported by the NSF" are quite com-
considering a violation test Related(d, r) that evaluates mon in journal articles, so that even unrelated docu-
to true if d and r are in the same group. This will be ments might both contain it. Other sentences may also
approximated by an OOT that computes the percentage be exact replicas by coincidence. Hash collisions may
of matching sentences in d and r. If the number if high, be another factor, especially when there are large num-
the documents will be assumed to be related. bers of registered documents, but are not an issue in
Table 2 shows results from our exploration. Instead our experiments. Also note the relatively large variance
of reporting the number of violations that a particu- reported in the table. In particular, some unrelated doc-
lar match ratio would yield, we show the percentage of uments had on the order of 20 matching sentences.
matching sentences in each case. This gives us more in- The process by which a document is translated to
formation regarding the closeness of documents. ASCII also has some e ect on the noise level. For ex-
The rst result column in Table 2 gives the precent ample, the translation we use to convert TEX documents
matches of each document against itself. That is, for produces somewhat less noise than does our translation
each document d in a group, we compute 100COUNT(d, from DVI. This is caused by di erences in the inclusion
MATCH)/SIZE (see Section 3.1), average the values and of references. Many unrelated documents cite the same
report it in the row for that group. The fact that all references, possibly generating matching sentences. Our
values in the rst column are 100% simply con rms that TEX lter does not include references in its output (they
COPS is working properly. are in separate \bib" les), so noise is reduced. The
The numbers in the second column are computed as di erences in noise generated by ASCII translation be-
follows. For each document d in a group, we compute come less signi cant when the enhancements discussed
later are added to our system. the enhancements of Table 3). The solid line shows the
The larger the noise level, the harder it is to detect average noise as a function of the number of overlapping
plagiarism of small passages (e.g., a paragraph or two). sentences in a chunk. As we see, the noise decreases dra-
If we set the threshold  at say 5/SIZE sentences, the matically as the number of overlapping sentences grows.
OOT would have a high Beta error rate (too many unre- This is bene cial since it decreases the minimum amount
lated documents agged as Plagiarism violations), while of plagiarism detectable. Figure 3 shows an \e ective
if we set it higher, say 10/SIZE, we would miss actual noise" curve that is the average noise plus three stan-
violations (high Alpha error). Thus, it is important to dard deviations. If we assume that noise is a normally
reduce the noise level as much as possible. distributed variable, we can interpret the e ective noise
curve as a lower bound for the threshold in order to elimi-
5.4 Enhancements nate 99% of the false positives due to noise. For example,
However, we need to decrease the noise without sacri c- if we use three sentence chunks and set our threshold at
ing anity. If anity is too low, it makes it hard to ap-  = 0:01, then the Beta error will be less than 1%.
proximate the Related target test (again leading to high However, as described in Section 4.2, the Alpha error
Alpha or Beta errors). With this goal in mind, we have will increase as we combine sentences in chunks. This
considered a series of enhancements to the basic COPS mean that, for instance, we will be unable to detect pla-
algorithms. The results are summarized in Table 3. The giarism of multiple, non-contiguous sentences. Also, the
rst line represent the base case each additional line of security of the system is reduced (Section 4.2): it takes
the table represents an independent enhancement. The fewer changes to a document to make it pass as a new
reported values are averages over all document groups one.
(i.e., equivalent to the last row of Table 2).
The e ect of chunk size on document noise
Self Related Unrelated 7
(Anity) (Noise) 6 Average noise
Simple Method 100% 53.0% 0.61% 2.08 5 E ective noise
No Common Chunks 100% 53.4% 0.06% 0.30
Drop Numbers 100% 54.1% 0.47% 1.34 4
No Short Sentences 100% 51.8% 0.04% 0.21 3
No Short Words 100% 54.4% 0.36% 0.90
All Enhancements 100% 53.6% 0.03% 0.20 2
1
Table 3: COPS Enhancements.
0
In the \no common chunks" enhancement, chunks oc- 1 2 3 4 5
curring in our hash table more than ten times are elimi- Number of sentences per chunk
nated by the LOOKUP function (see Figure 1). This keeps
legitimate common phrases and passages from causing Figure 3: Noise as a function of number of overlapping
a document violation. For example, the sentence \This sentences.
work supported by the NSF," which is present in many
documents, will not be reported as a match. The last
three enhancements remove the indicated occurrence 5.5 Eect of Converters
from the input stream. For \drop numbers," any word A nal issue we investigate is the impact of di erent
with a numeric digit is dropped \short sentences" are input converters. For example, say a Latex document
arbitrarily de ned to have three or fewer words \short is initially registered in COPS. Later, the DVI version
words" are de ned to have three or fewer characters. of the same document (produced by running the origi-
These enhancements were motivated by our discovery nal through the Latex processor) is submitted for test-
that numbers, short sentences, and short words were ing. We would like to nd that (a) the DVI copy clearly
sometimes involved in incorrect matches. (Recall the matches the registered Latex original, and (b) the DVI
problem with abbreviations like \U.S." described in Sec- copy has a similar number of matches with other docu-
tion 5.2.) ments as the original would have had.
The last row of Table 3 shows the e ect of using all Table 4 explores this issue. The rst row is for the
enhancements at once. One can see that the combined basic COPS algorithm the second row is for the ver-
enhancements are quite e ective at reducing the noise sion that includes all the enhancements of Table 3. The
while keeping the anity at roughly the same levels. rst, third, and fth columns are as before and are only
We note that the parameter values we used for the en- included for reference. The \Altered Self" column re-
hancements (e.g., the number of occurrences that makes ports the average precent of matching sentences when
a chunk \common") worked well for our collection, but a DVI document is compared against its Latex original.
probably have to be adjusted for larger collections. The \Altered Related" column gives the average percent
In Figure 3 we study the e ect of increasing the num- matching sentences when a DVI document is compared
ber of overlapping sentences per chunk (without any of to all of the related Latex documents. Although the re-
sults are far from perfect, there seem to remain enough sider the rst option, sampling for testing. However,
matches so that the DVI can be agged as related to its note that the analysis for the sampling at registration
original and to documents its original was related to. time and at both is very similar to what we will present
We believe that the results presented in this section, here, and the results are analogous.
although not de nitive, provide some insight into the We start by giving a more precise de nition of the sam-
selection of a good threshold value for COPS, at least pling at testing strategy. We are given an OOT o1 with
for the Related target test. A threshold value of say  = any chunking functions INS-CHUNKS1 = EVAL-CHUNKS1,
0:05 (25 out of 500 sentences) seems to identify the vast and the match ratio DECIDE1 function with threshold 
majority of related documents, while not triggering false (Section 3.1). We de ne a second OOT, o2 , intended
violations due to noise. We also conclude that detecting to approximate o1. Its chunking function for evaluation,
plagiarism of about 10 or less sentences (roughly 2% of EVAL-CHUNKS2 is simply
documents) will be quite hard, without either high Alpha
or Beta errors. EVAL-CHUNKS2(r)
C = EVAL-CHUNKS1(r)
6 Approximating OOTs return RANDOM-SELECT(N, C)

In this section we address the eciency and scalability of where RANDOM-SELECT picks N chunks at random. 4
OOTs. For copy detection to scale well, we require that The chunking function for insertions is not changed, i.e.,
it can operate with very large collections of registered INS-CHUNKS2 = INS-CHUNKS1.
documents, as well as the ability to quickly test many The DECIDE1 function of o1 selects documents r where
new documents. One e ective way to achieve scalability the number of matching chunks COUNT(r, MATCH) is
is to use sampling. greater than SIZE. For o2, only N chunks are tested
To illustrate, say we have an OOT with a DECIDE (not SIZE), so the threshold number of chunks is N.
function that tests whether more than 15 percent of the Thus, DECIDE2 selects documents r where the number of
chunks of a document d match. Instead of checking all matching chunks COUNT(r, MATCH) is greater than N.
chunks in d, we could simply take say 20 random chunks
and check whether more than 3 of them matched (15% of 6.1 Accuracy of Randomized OOTs
the 20 samples). We would expect that this new OOT Now we wish to determine how di erent o2 is from o1.
based on sampling approximates the original OOT. If As in Section 3.2, let D be our distribution of input doc-
the average test document contains 1000 chunks, we will uments and let R be the distribution of registered docu-
have reduced our evaluation time by a factor of 50. The ments. Let X be a random document that follows D and
cost, of course, is in the lost accuracy and that is ana- Y be a random document that follows R. Let m(X Y )
lyzed in Section 6.1. be the proportion of chunks (according to o1 's chunking
Another sampling option is to sample registered docu- function) in X which match chunks in Y . Then let W(x)
ments. The idea here is to only insert in our hash table a be the probability density functionR that m(X Y ) = x,
random sample of chunks for each registered document. i.e., P (x1  m(X Y )  x2) = xx12 W(x)dx. Using
For example, say that only 10% of the chunks are hashed.
Next, suppose that we are checking all 100 chunks of this we can compute Alpha(o1  o2 ), Beta(o1  o2), and
a new document and nd 2 matches with a registered Error(o1 o2). The details of the computation are in
document. Since the registered document was sampled, Appendix A the results are as follows:
these 2 matches should be equivalent to 20 under the
original OOT. Since 20/100 exceed the 15% threshold, R1
the document would be agged as a violation. In this
Alpha(o1  o2) =  W(x)Q(x)dx
case, the savings would be storage space: the hash table R1
will have only 10% of the registered chunks. A smaller  W (x)dx
hash table also makes it possible to distribute it to other R
W(x)(1 ; Q(x))dx
sites, so that copy detection can be done in a distributed Beta(o1  o2) = 0 R 
fashion. Again, the cost is a loss of accuracy. o W(x)dx
A third option is to combine the two of these tech- Z 1
niques without sacri cing accuracy (any more than ei- Error(o1  o2) = W(x)Q(x)dx
ther one alone) by sampling based on the hash numbers 
of the chunks 10]. For example, if in our test document, Z
we sample exactly those chunks whose hash number is 0 + W (x)(1 ; Q(x))dx
mod 10, then there is no need to store the hash values N  
0

of any registered documents' chunks whose hash value is X b


N j
c

not 0 mod 10 since there could never be a collision oth- where Q(x) = j x (1 ; x)N j ;

erwise. However, this scheme has the drawback that one j =0


must always sample a xed fraction of the documents'
chunks rather than, say, a xed number of them. 4
This is not the most ecient way to sample. The code is
Due to space limitations, in this paper we only con- just for explanation purposes.
Match Self Altered Self Related Group Altered Rel. Unrelated
Simple 100% 60.9% 52.9% 36.0% 0.50%
Enhanced 100% 76.5% 53.6% 46.2% 0.03%
Table 4: Results for mechanically altered documents.
pa = 0:8 a = 0:1 pb = 0:2 b = 0:3 b = 0:8 pa = 0:95 a = 0:02 pb = 0:05 b = 0:3 b = 0:8  = 0:4
6 0.12
5 W(x) 0.1 Alpha
Beta
4 0.08 Error
3 0.06
2 0.04
1 0.02
0 0
0 0.2 0.4 0.6 0.8 1 0 5 10 15 20 25 30
x N
Figure 4: An Exaggerated W Figure 5: The E ect of the Number of Sample Points on
Accuracy
6.2 Results
Before we can evaluate our expressions, we need to know pa = 0:95 pb = 0:05 b = 0:3 b = 0:8  = 0:4 N = 20
the W(x) distribution. Recall that W(x) tells us how 0.1
likely it is to have a proportion of x matches between a 0.09 Error
test and a registered document. One option would be to 0.08
measure W(x) for a given body of documents, but then 0.07
our results would be speci c to that particular body. In- 0.06
stead, we use a parametrized function that lets us con- 0.05
sider a variety of scenarios. 0.04
Using the observations of Section 5, we arrive at the 0.03
following W(x) function. With a very high probability 0.02
pa , the test document will be unrelated to the registered 0.01
one. In this case, there can still be noise matches, which 0
we model as normally distributed with mean 0 and stan- 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
a
dard deviation a (which will probably be very small).
With probability pb = 1 ; pa the test document is unre- Figure 6: The E ect of a on Error.
lated to the registered one. In this case we assume that
the number of matching chunks is normally distributed Alpha(o1 o2 ), Beta(o1  o2), and Error(o1  o2) values as
with mean b and standard deviation b . We would a function of N, for  = 0:4. Recall that the  value of 0.4
expect b to be large since, as we have seen, related doc- means that o1 is looking for registered documents whose
uments tend to have widely varying numbers of matches. chunks that match 40% of the chunks of the test docu-
Thus, our W(x) function is the weighted sum of two nor- ment. This value may have been picked, say, because we
mal (truncated
R at 0 and 1) distributions, normalized to are interested in a Subset target test. The parameters
make 01 W (x) = 1. for W (x) are given in the gure.
Figure 4 shows a sample W(x) function with exagger- Note that the values in Figure 5 are not simply mono-
ated parameters to make its form more apparent. The tonically decreasing. For example, the Alpha and Error
area under the curve in the range 0  x  0:2 represents values increase as N goes from 9 to 10. Rounding error
the likelihood of noise matches, while the rest of the is the cause for this. For example, for N = 9, o2 selects
range represents mainly matches of related documents. documents with COUNT (the number of matching chunks)
In practice, of course, we would expect pa to be much greater than 3.6 (=N), i.e., with 4 or more matches. For
closer to 1 (most comparisons will be between unrelated N = 10, documents with COUNT greater than 4 (i.e., 5 or
documents) and a to be much smaller. more) are selected. Consider now a test document that
Given a parametrized W(x), we can present results matches with say 40% to 50% of the chunks of a reg-
that show how good an approximation o2 is to o1 . An istered document (hence is selected by o1 ). It is more
important rst issue to study is the number of sam- likely that o2 with N = 9 will select it since it only has
ples N required for accurate results. Figure 5 shows the to get 4 hits. With N = 10, o2 is less likely to select it
because with only one extra sample, it has to get 5 hits. ter their documents: Without a substantial body of doc-
This e ect leads to the higher Alpha error for N = 10. uments, the service will not be very useful. We believe
In spite of the non-monotonicity, it is important to they can, especially if one starts with the documents of a
note how overall the Error decreases very rapidly as N particular community (e.g., netnews users, or SIGMOD
increases. For N > 10, the Error stays well below 0.01. authors). But regardless of the success of COPS and
This shows that o2 can approximate o1 well with a rela- copy detection, we believe it is essential to explore and
tively small number of sampled chunks. understand solutions for safeguarding intellectual prop-
Note, however, that the Alpha error does not decrease erty in digital libraries. Their success hinges on nding
as rapidly, but this is not as serious. The Alpha error at least one approach that works.
for N beyond say 20 is mainly caused by test documents
whose match ratio is slightly higher than  = 0:4. (The References
area under the W (x) curve in the vicinity to the right 1] C. Anderson. Robocops: Stewart and Feder's mech-
of 0.4 gives the probability of getting one of these doc- anized misconduct search. Nature, 350(6318):454{455,
uments.) In these cases, the sampling OOT may not April 1991.
muster enough hits to trigger a detection. However, in 2] J. Brassil, S. Low, N. Maxemchuk, and L.O'Gorman.
this case the original OOT o1 may not very good at ap- Document marking and identication using both
proximating the violation test of interest either. In other line and word shifting. Technical report, AT&T
words, in the percent of matches is close to 40%, it may Bell Labratories, 1994. May be obtained from
not be clear if the documents are related on not. Thus, ftp://ftp.research.att.com/dist/brassil/docmark2.ps.
the fact that o1 detects a violation but o2 does not is not 3] J. Brassil, S. Low, N. Maxemchuk, and L.O'Gorman.
as serious, we believe. Electronic marking and identication techniques to dis-
Our results are sensitive to the W(x) parameters used. courage document copying. Technical report, AT&T
For example, in Figure 6 we demonstrate the e ect of a . Bell Labratories, 1994.
We can see from Figure 6 that the Error stays very low 4] A. Choudhury, N. Maxemchuk,
as long as a is not near  = 0:4. If a is close to , we S. Paul, and H. Schulzrinne. Copyright protection for
get more documents in the region where o2 has trouble electronic publishing over computer networks. Techni-
identifying documents selected by o1. Similarly, we nd cal report, AT&T Bell Labratories, 1994. Submitted to
that error keeps very low in the high pa range, which is IEEE Network Magazine June 1994.
where we expect it to be in practice. 5] J. R. Garrett and J. S. Alen. Toward a copyright man-
In summary, using sampling in OOTs seems to work agement system for digital libraries. Technical report,
very well under good conditions (when  is far from the Copyright Clearance Center, 1991.
bulk of the match ratios). There is a large gain in ef- 6] G. N. Griswold. A method for protecting copyright on
ciency with only a small loss of accuracy. As stated networks. In Joint Harvard MIT Workshop on Technol-
earlier, the sample at registration OOT can be analyzed ogy Strategies for Protecting Intellectual Property in the
almost identically to what we have done here, and can Networked Multimedia Environment, April 1993.
be shown to substantially reduce the storage costs. 7] M. B. Jensen. Making copyright work in electronic pub-
lishing models. Serials Review, 18(1{2):62{66, 1992.
7 Conclusions 8] R. E. Kahn. Deposit, registration and recordation in
In this paper we have proposed a copy detection ser- an electronic copyright management system. Technical
vice that can identify partial or complete overlap of doc- report, Corporation for National Research Initiatives,
Reston, Virginia, August 1992.
uments. We described a prototype implementation of 9] P. A. Lyons. Knowledge-based systems and copyright.
this service, COPS, and presented experimental results Serials Review, 18(1{2):88{91, 1992.
that suggest the service can indeed detect violations of
interest. We also analyzed several important variations, 10] U. Manber. Finding similar les in a large le system. In
including ones for breaking up a document into chunks, USENIX, pages 1{10, San Francisco, CA, January 1994.
and for sampling chunks for detecting overlap. 11] A. Parker and J. O. Hamblen. Computer algorithms for
It is important to note that while we have described plagiarism detection. IEEE Trasnactions on Education,
copy detection as a centralized function, there are many 32(2):94{99, May 1989.
ways to distribute it. For example, copies of the regis- 12] G.J. Popek and C.S. Kline. Encryption and secure com-
tered document hash table can be distributed to permit puter networks. ACM Computing Surveys, 11(4):331{
checking for duplicates at remote sites. If the table con- 356, December 1979.
tains only samples (Section 6) it can be relatively small 13] D. Wheeler. Computer networks are said to oer new
and distributable more easily. Also, document registra- opportunities for plagarists. The Chronicle of Higher
tion can also be performed at a set of distributed reg- Education, pages 17, 19, June 1993.
istration services. These services could periodically ex-
change information on new registered documents they
have seen.
Perhaps the most important question regarding copy
detection is whether authors can be convinced to regis-

You might also like