CAPTCHA
The term "CAPTCHA" (based upon the word capture) was coined in 2000 by Luis von
Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford (all of Carnegie Mellon
University). It is a contrived acronym for "Completely Automated Public Turing test
to tell Computers and Humans Apart." Carnegie Mellon University attempted to
trademark the term, but the trademark application was abandoned on 21 April 2008.
Characteristics
A check box in a form that reads "check this box please" is the simplest (and perhaps
least effective) form of a CAPTCHA. CAPTCHAs do not have to rely on difficult
problems in artificial intelligence, although they can.
When a CAPTCHA does rely on a hard AI problem, there are two benefits. In the short
term, it distinguishes humans from computers. In the long term, it creates an
incentive to advance the state of AI.
Applications
CAPTCHAs are used to prevent automated software from performing actions which
degrade the quality of service of a given system, whether due to abuse or resource
expenditure. CAPTCHAs can be deployed to protect systems vulnerable to e-mail
spam, such as the webmail services of Gmail, Hotmail, and Yahoo! Mail.
CAPTCHAs are used to stop automated posting to blogs, forums and wikis, whether as
a result of commercial promotion, or harassment and vandalism. CAPTCHAs also
serve an important function in rate limiting. Automated usage of a service might be
desirable until such usage is done to excess and to the detriment of human users. In
such cases, administrators can use CAPTCHA to enforce automated usage policies
based on given thresholds. The article rating systems used by many news web sites
are another example of an online facility vulnerable to manipulation by automated
software.
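As a rough illustration of threshold-based enforcement, the following Python sketch
asks for a CAPTCHA only after a client exceeds a request limit within a sliding time
window. The class name CaptchaThrottle, its parameters, and the default limits are
hypothetical choices for this example and are not taken from any particular system.

```python
import time
from collections import defaultdict, deque


class CaptchaThrottle:
    """Require a CAPTCHA once a client exceeds max_requests per window."""

    def __init__(self, max_requests=5, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the policy can be tested without waiting
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def needs_captcha(self, client_id):
        """Record one request and report whether the client is over the threshold."""
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests
```

Under this policy, automated clients that stay below the threshold are never
challenged, while heavier usage is interrupted by a CAPTCHA.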
As of 2010, most CAPTCHAs display distorted text that is difficult for character
recognition software to read. Alternative implementations may include various tests,
such as identifying an object that does not belong in a particular set of objects,
locating the center of a distorted image, or identifying distorted shapes.
Accessibility
Because CAPTCHAs rely on visual perception, users unable to view a CAPTCHA (for
example, due to a disability or because it is difficult to read) will be unable to perform
the task protected by a CAPTCHA. Therefore, sites implementing CAPTCHAs may
provide an audio version of the CAPTCHA in addition to the visual method. The official
CAPTCHA site recommends providing an audio CAPTCHA for accessibility reasons, but
even the combination of audio and visual tests cannot be used by deafblind people or
by users of text web browsers. Nor is the combination universally adopted: most
websites (including Wikipedia) offer only the visual CAPTCHA, with or without the
option of generating a new image if one is too difficult to read.
Even audio and visual CAPTCHAs will require manual intervention for some users,
such as those who have disabilities. There have been various attempts at creating
more accessible CAPTCHAs, including the use of JavaScript, mathematical questions
("how much is 1+1") and common sense questions ("what color is the sky on a clear
day"). However, these types of CAPTCHAs do not meet the criteria for a successful
CAPTCHA. They are not automatically generated and they do not present a new
problem or test for each attack.
Circumvention
There are a few approaches to defeating CAPTCHAs:
Insecure Implementation
As with any security system, design flaws in an implementation can prevent the
theoretical security from being realized. Many CAPTCHA implementations, especially
those which have not been designed and reviewed by experts in the field of security,
are prone to common attacks.
Some CAPTCHA protection systems can be bypassed without using OCR simply by
reusing the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does
not allow multiple solution attempts at one CAPTCHA. This prevents both the reuse of
a correct CAPTCHA solution and a second guess after an incorrect OCR attempt.
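One way to enforce a single attempt per challenge is to delete the stored answer as
soon as any response is checked. The sketch below assumes an in-memory store; the
ChallengeStore class and its method names are hypothetical, and a real deployment
would also need expiry and shared storage.

```python
import secrets


class ChallengeStore:
    """Keeps each CAPTCHA answer only until the first verification attempt."""

    def __init__(self):
        self._answers = {}  # challenge id -> expected answer

    def issue(self, answer):
        """Register a new challenge and return its unguessable identifier."""
        challenge_id = secrets.token_urlsafe(16)
        self._answers[challenge_id] = answer
        return challenge_id

    def verify(self, challenge_id, response):
        # pop() removes the challenge whether or not the response is right,
        # so neither a correct answer nor a failed guess can be retried.
        expected = self._answers.pop(challenge_id, None)
        if expected is None:
            return False
        return secrets.compare_digest(expected.encode(), response.encode())
```

Because the identifier is discarded on first use, replaying the session ID of an
already solved image fails, and so does a second guess after a wrong OCR reading.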
Other CAPTCHA implementations use a hash (such as an MD5 hash) of the solution as
a key passed to the client to validate the CAPTCHA. Often the space of possible
solutions is small enough that this hash can be cracked by brute force. Further, the
hash could assist an OCR-based attempt by letting the attacker confirm a guess
without contacting the server. A more secure scheme would use an HMAC.
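The difference can be sketched in Python. The first function shows why an unkeyed MD5
of a short solution is weak: every candidate can simply be tried. The other two
functions illustrate an HMAC-based token under the assumption that the server holds a
secret key the client never sees. All function names, the token layout, and the
four-letter lowercase alphabet are hypothetical choices for this example.

```python
import hashlib
import hmac
import itertools
import string


def crack_md5(token, length=4, alphabet=string.ascii_lowercase):
    """Brute-force an unkeyed MD5 token; 26**4 is only 456,976 candidates."""
    for chars in itertools.product(alphabet, repeat=length):
        guess = "".join(chars)
        if hashlib.md5(guess.encode()).hexdigest() == token:
            return guess
    return None


def make_token(secret_key, challenge_id, solution):
    """Keyed token: without secret_key the client cannot test guesses offline."""
    message = f"{challenge_id}:{solution}".encode()
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def check_token(secret_key, challenge_id, response, token):
    """Recompute the token for the submitted response and compare in constant time."""
    expected = make_token(secret_key, challenge_id, response)
    return hmac.compare_digest(expected, token)
```

Binding the challenge identifier into the message keeps a token from being moved to
a different challenge. An HMAC on its own does not stop replay of a solved challenge,
so a scheme like this would presumably still need single-use enforcement.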
Finally, some implementations use only a small fixed pool of CAPTCHA images.
Eventually, when enough CAPTCHA image solutions have been collected by an
attacker over a period of time, the CAPTCHA can be broken by simply looking up
solutions in a table, based on a hash of the challenge image.
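The table lookup described above needs very little code, which is what makes a small
fixed pool dangerous. The following sketch assumes the server sends byte-identical
image files each time; the SolutionTable class is a hypothetical name used only for
illustration.

```python
import hashlib


def image_key(image_bytes):
    """Identify a challenge image by the SHA-256 digest of its raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


class SolutionTable:
    """Maps previously seen challenge images to their known solutions."""

    def __init__(self):
        self._known = {}

    def record(self, image_bytes, solution):
        self._known[image_key(image_bytes)] = solution

    def lookup(self, image_bytes):
        # Returns None for an image that has not been collected yet.
        return self._known.get(image_key(image_bytes))
```

Once every image in the pool has been recorded, each challenge is answered by a
dictionary lookup with no OCR at all. Generating a fresh image per challenge removes
the exact-match key this attack depends on.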
Computer Character Recognition
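A number of research projects have attempted (often with success) to beat visual
CAPTCHAs by creating programs that contain the following functionality:
1. Pre-processing: removal of background clutter and noise.
2. Segmentation: splitting the image into regions which each contain a single
character.
3. Classification: identifying the character in each region.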
Steps 1 and 3 are easy tasks for computers. The only step where humans still
outperform computers is segmentation. If the background clutter consists of shapes
similar to letter shapes, and the letters are connected by this clutter, the
segmentation becomes nearly impossible with current software. Hence, an effective
CAPTCHA should focus on making segmentation difficult.
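To see why connecting clutter matters, consider one of the simplest segmentation
techniques, splitting a binarized image wherever a column contains no ink. This is a
minimal sketch of that one technique, not the method used by any of the research
projects mentioned here, and the function name segment_columns is hypothetical.

```python
def segment_columns(bitmap):
    """Split a binary image (rows of 0/1 pixels) into spans at empty columns.

    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(bitmap[0])
    # A column is "inked" if any row has a set pixel in it.
    inked = [any(row[x] for row in bitmap) for x in range(width)]
    spans = []
    start = None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x  # a new run of inked columns begins
        elif not has_ink and start is not None:
            spans.append((start, x))  # the run ended at an empty column
            start = None
    if start is not None:
        spans.append((start, width))
    return spans
```

For two separate glyphs such as [[1,1,0,0,1,1],[1,1,0,0,1,1]] the function returns
[(0, 2), (4, 6)]. If a single stroke of clutter joins them, as in
[[1,1,0,0,1,1],[1,1,1,1,1,1]], it returns [(0, 6)] and the two characters can no
longer be told apart by this method.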
Several research projects have broken real world CAPTCHAs, including one of Yahoo's
early CAPTCHAs called "EZ-Gimpy" and the CAPTCHAs used by popular sites such as
PayPal, LiveJournal, phpBB, and other services. In January 2008 Network Security
Research released their program for automated Yahoo! CAPTCHA recognition.
The CAPTCHAs of Windows Live Hotmail and Gmail, the other two major free email
providers, were cracked shortly after.
In February 2008 it was reported that spammers had achieved a success rate of 30%
to 35%, using a bot, in responding to CAPTCHAs for Microsoft's Live Mail service and a
success rate of 20% against Google's Gmail CAPTCHA. A Newcastle University
research team has defeated the segmentation part of Microsoft's CAPTCHA with a
90% success rate, and claims that this could lead to a complete crack with a success
rate greater than 60%.
Human Solvers
CAPTCHA is vulnerable to a relay attack that uses humans to solve the puzzles. One
approach involves relaying the puzzles to a group of human operators who can solve
CAPTCHAs. In this scheme, a computer fills out a form and when it reaches a
CAPTCHA, it gives the CAPTCHA to the human operator to solve.
Spammers pay about $0.80 to $1.20 for each 1,000 solved CAPTCHAs to companies
employing human solvers in Bangladesh, China and India.
Another approach involves copying the CAPTCHA images and using them as
CAPTCHAs for a high-traffic site owned by the attacker. With enough traffic, the
attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target
site. In October 2007, a piece of malware appeared in the wild which enticed users to
solve CAPTCHAs in order to see progressively further into a series of striptease
images. A more recent view is that this is unlikely to work due to unavailability of
high-traffic sites and competition by similar sites.
These methods have been used by spammers to set up thousands of accounts on free
email services such as Gmail and Yahoo!. Since Gmail and Yahoo! are unlikely to be
blacklisted by anti-spam systems, spam sent through these compromised accounts is
less likely to be blocked.
Legal concerns
The circumvention of CAPTCHAs may violate the anti-circumvention clause of the
Digital Millennium Copyright Act (DMCA) in the United States. In 2007, Ticketmaster
sued software maker RMG Technologies for its product which circumvented the ticket
seller's CAPTCHAs on the basis that it violated the anti-circumvention clause of the
DMCA. In October 2007, an injunction was issued stating that Ticketmaster would
likely succeed in making its case. In June 2008, Ticketmaster filed for default
judgment against RMG. The court granted the default and entered an $18.2 million
judgment in favor of Ticketmaster.
Image-recognition CAPTCHAs
Some researchers (e.g., Professor James Z. Wang of Penn State University) promote
image recognition CAPTCHAs as a possible alternative to text-based CAPTCHAs. In
2005, the Penn State research team published a research paper on their
IMAGINATION CAPTCHA system. The system uses carefully designed randomized
distortions of images to prevent automatic attacks based on broad-concept image
recognition systems such as the ALIPR (Automatic Linguistic Indexing of Pictures -
Real Time) system. The idea is that computer-based recognition algorithms require
the extraction of color, texture, shape, or special point features, which cannot be
correctly extracted after the designed distortions. Humans, however, can still
recognize the original concept depicted in the images even with these distortions.
A recent example of image recognition CAPTCHA is to present the website visitor with
a grid of random pictures and instruct the visitor to click on specific pictures to verify
that they are not a bot (such as “Click on the pictures of the airplane, the boat and the
clock”).
Image recognition CAPTCHAs face many potential problems which have not been fully
studied. It is difficult for a small site to acquire a large dictionary of images to
which an attacker does not have access. Without a means of automatically acquiring
new labelled images, an image-based challenge does not usually meet the definition
of a CAPTCHA. KittenAuth, by default, had only 42 images in its database. Microsoft's
"Asirra," provided as a free web service, attempts to address this through Microsoft
Research's partnership with Petfinder.com, which has provided it with more than
three million images of cats and dogs, classified by people at thousands of US animal
shelters. Researchers claim to have written a program that can break the Microsoft
Asirra CAPTCHA. The IMAGINATION CAPTCHA, however, uses a sequence of randomized
distortions on the original images to create the CAPTCHA images. Its original images
can be made public without risking image-retrieval or image-annotation based
attacks.
Human solvers are a potential weakness for strategies such as Asirra. If the database
of cat and dog photos can be downloaded, then paying workers $0.01 to classify each
photo as either a dog or a cat means that almost the entire database (about three
million photos) can be deciphered for $30,000. Photos that are subsequently added to
the Asirra database are then a relatively small data set that can be classified as
they first appear. Making minor changes to images each time they appear will not
prevent a computer from recognizing a repeated image, as there are robust image
comparator functions (e.g., image hashes, color histograms) that are insensitive to
many simple image distortions. Warping an image sufficiently to fool a computer will
likely also be troublesome to a human.
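As a small illustration of such a comparator, the sketch below implements a
simplified "average hash" over a grid of grayscale values: each pixel becomes one bit
depending on whether it is brighter than the image mean. Practical versions normally
shrink the image to a small fixed size first; that step is omitted here, and the
function names are hypothetical.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 where the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))
```

For the 2x2 image [[10, 200], [220, 30]] the hash is (0, 1, 1, 0). Brightening every
pixel by 20 gives [[30, 220], [240, 50]], which has the same hash, so the distance
between the two is 0. A genuinely different image such as [[200, 10], [30, 220]]
hashes to (1, 0, 0, 1), at distance 4. Uniform brightness shifts and small amounts of
noise therefore leave a repeated image recognizable.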
Many users of the phpBB forum software (which has suffered greatly from spam)
have implemented an open source image recognition CAPTCHA system in the form of
an addon called KittenAuth, which in its default form presents a question requiring
the user to select a stated type of animal from an array of thumbnail images of
assorted animals. The images (and the challenge questions) can be customized, for
example to present questions and images which would be easily answered by the
forum's target userbase. Furthermore, for a time, RapidShare free users had to get
past a CAPTCHA in which they had to enter only the letters attached to a cat, while
other letters were attached to dogs. This was later removed because (legitimate)
users had trouble entering the correct letters.
reCAPTCHA
reCAPTCHA supplies subscribing websites with images of words that optical character
recognition (OCR) software has been unable to read. The subscribing websites (whose
purposes are generally unrelated to the book digitization project) present these
images for humans to decipher as CAPTCHA words, as part of their normal validation
procedures. They then return the results to the reCAPTCHA service, which sends the
results to the digitization projects.
The system is reported to solve over 100 million CAPTCHAs every day (as of October
2010), and among its subscribers are such popular sites as Facebook,
Ticketmaster, Twitter, 4chan, CNN.com and StumbleUpon. Craigslist began using
reCAPTCHA in June 2008. The U.S. National Telecommunications and Information
Administration also uses reCAPTCHA for its digital TV converter box coupon program
website as part of the US DTV transition.
Origin
The reCAPTCHA program originated with Guatemalan computer scientist Luis von
Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he
had unwittingly created a system that was frittering away, in ten-second increments,
millions of hours of a most precious resource: human brain cycles."
Operation
reCAPTCHA tests are taken from the central site of the reCAPTCHA project, which
supplies the words to be deciphered. This is done through a JavaScript API with the
server making a callback to reCAPTCHA after the request has been submitted. The
reCAPTCHA project provides libraries for various programming languages and
applications to make this process easier. reCAPTCHA is a free service (that is, the
CAPTCHA images are provided to websites free of charge, in return for assistance with
the decipherment), but the reCAPTCHA software itself is not open source.
reCAPTCHA offers plugins for several web-application platforms, like ASP.NET or PHP,
to ease the implementation of the service.
Mailhide
The reCAPTCHA project has also created Mailhide, which protects email addresses on
Web pages from being harvested by spambots. By default, the email address is
converted into a format that does not allow a crawler to see the full email address:
part of the address is replaced with “...”. The visitor then clicks on the “...” and
solves the CAPTCHA in order to obtain the full email address. One can also edit the
popup code so that none of the address is visible.