Vijeta
ON
CAPTCHA
ABSTRACT
CAPTCHA is not a word in itself; it is an acronym that stands for Completely Automated Public Turing-test to tell Computers and Humans Apart. The full form describes the purpose of CAPTCHA well. A CAPTCHA is a simple puzzle that acts as a hurdle, keeping automated programs from signing up for e-mail accounts, cracking passwords, sending spam, violating privacy and so on. It challenges any automated program that tries to access a private area, and so helps prevent unauthorized automated spamming programs from reaching personal mail accounts. The challenge is not a complicated problem: it is simply a visual test or a simple puzzle. Almost any human can solve it easily, whereas an automated program should not be able to, and is therefore denied access. In this way a CAPTCHA distinguishes humans from automated programs and keeps those programs out of private areas. CAPTCHA tests present codes in the form of images, letters and numbers that intersect or overlap one another. Anyone who wants to gain access must read the code and retype it in the specified pattern. These codes are easy for a human to understand and retype. One example of a CAPTCHA is Gimpy. Gimpy selects random words from a dictionary and then creates 7 puzzles by placing the chosen words in a particular pattern. Pix is another type of CAPTCHA, which uses pictures to create a puzzle. It presents around 6 pictures, all related to some common topic, and the user is expected to identify that topic. CAPTCHA is now almost a standard security technology and has found widespread application in commercial websites.
A common type of CAPTCHA requires the user to type letters or digits from a distorted image that appears on the screen, and such tests are commonly used to prevent unwanted internet bots from accessing websites.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Motivation
1.2 History
2
3 TYPES OF CAPTCHAS
4
5 CONSTRUCTING CAPTCHAS
6 BREAKING CAPTCHA
6.1 Breaking visual CAPTCHA
6.2 Breaking Ez-Gimpy CAPTCHA
7
8 APPLICATIONS
9 FUTURE OF CAPTCHA
10 CONCLUSION
REFERENCES
CHAPTER 1
1. INTRODUCTION
Suppose you are trying to sign up for a free email service offered by Gmail or Yahoo. Before you can submit your application, you first have to pass a test. It is not a hard test; in fact, that is the point. For you, the test should be simple and straightforward. But for a computer, the test should be almost impossible to solve. This sort of test is a CAPTCHA. CAPTCHAs are also known as a type of Human Interaction Proof (HIP). You have probably seen CAPTCHA tests on lots of Web sites. The most common form of CAPTCHA is an image of several distorted letters. It's your job to type the correct series of letters into a form. If your letters match the ones in the distorted image, you pass the test. CAPTCHA is short for Completely Automated Public Turing test to tell Computers and Humans Apart. The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper (all of Carnegie Mellon University) and John Langford (then of IBM) [1]. CAPTCHAs are challenge-response tests to ensure that the users are indeed human. The purpose of a CAPTCHA is to block form submissions from spam bots, automated scripts that harvest email addresses from publicly available web forms. A common kind of CAPTCHA used on most websites requires users to enter the string of characters that appears in a distorted form on the screen.
CAPTCHAs are used because it is difficult for computers to extract the text from such a distorted image, whereas it is relatively easy for a human to understand the text hidden behind the distortions. Therefore, the correct response to a CAPTCHA challenge is assumed to come from a human, and the user is permitted into the website. Why would anyone need to create a test that can tell humans and computers apart? It's because of people trying to game the system: they want to exploit weaknesses in the computers running the site. While these individuals probably make up a minority of all the people on the Internet, their actions can affect millions of users and Web sites.
1.1 MOTIVATION
Programs (bots and spiders) are being created to steal services and to conduct fraudulent transactions. Some examples: Free online accounts are being registered automatically many times and are being used to distribute stolen or copyrighted material. eBay, a famous auction website, allows users to rate a product. Abusers can easily create bots that increase or decrease the rating of a specific product, possibly changing people's perception of the product. Spammers register for free email accounts such as those provided by Gmail or Hotmail and use their bots to send unsolicited mail to other users of that email service. Online polls are attacked by bots, which gives unfair mileage to those who benefit from the result. In light of the abuses listed above and many more, a need was felt for a facility that checks users and allows only human users to access services. It was in this direction that a tool like CAPTCHA was created.
1.2 HISTORY
The need for CAPTCHAs arose from the abuse of websites and search engines by bots [2]. In 1996, AltaVista sought ways to block and discourage the automatic submission of URLs to its search engine. Andrei Broder, Chief Scientist of AltaVista, and his colleagues developed a filter. Their method was to generate random printed text that humans could read but machine readers could not. Their approach was so effective that within a year spam submissions were reduced by 95%, and a patent was issued in 2001. In 2000, Yahoo's popular Messenger chat service was hit by bots that posted advertising links in chat rooms, annoying human users. Yahoo, along with Carnegie Mellon University, developed a CAPTCHA called EZ-GIMPY, which chose a dictionary word at random, distorted it with a wide variety of image occlusions and asked the user to type in the distorted word.
In November 1999, slashdot.com released a poll to vote for the best CS college in the US. Students from Carnegie Mellon University and the Massachusetts Institute of Technology created bots that repeatedly voted for their respective colleges. This incident created the urge to use CAPTCHAs for such online polls, to ensure that only human users are able to take part.
CHAPTER 2
CHAPTER 3
3. TYPES OF CAPTCHAS
CAPTCHAs are classified based on what is distorted and presented as a challenge to the user. They are:
Such questions are very easy for a human user to solve, but it is very difficult to program a computer to solve them. These are also friendly to people with visual disabilities, such as those with colour blindness. Other text CAPTCHAs involve text distortions, and the user is asked to identify the hidden text. The various implementations are:
3.1.2 GIMPY
Gimpy is a very reliable text CAPTCHA built by CMU in collaboration with Yahoo for their Messenger service. Gimpy is based on the human ability to read extremely distorted text and the inability of computer programs to do the same. Gimpy works by choosing ten words randomly from a dictionary, and displaying them in a distorted and overlapped manner. Gimpy then asks the users to enter a subset of the words in the image. The human user is capable of identifying the words correctly, whereas a computer program cannot.
Fig 3.1.2 Gimpy CAPTCHA
3.1.3 EZ GIMPY
This is a simplified version of the Gimpy CAPTCHA, adopted by Yahoo in their signup page. Ez Gimpy randomly picks a single word from a dictionary and applies distortion to the text. The user is then asked to identify the text correctly.
3.1.4 MSN CAPTCHA
Microsoft uses a different CAPTCHA for services provided under the MSN umbrella. These are popularly called MSN Passport CAPTCHAs. They use eight characters, made up of upper-case letters and digits. The foreground is dark blue and the background is grey. Warping is used to distort the characters and produce a ripple effect, which makes computer recognition very difficult.
Fig 3.1.4 MSN Passport CAPTCHA
3.2 GRAPHIC CAPTCHAS
Graphic CAPTCHAs are challenges that involve pictures or objects that have some sort of similarity that the users have to guess. They are visual puzzles. The computer generates the puzzles and grades the answers, but is itself unable to solve them.
3.2.1 BONGO
BONGO is named after M.M. Bongard, who published a book of pattern recognition problems in the 1960s. BONGO asks the user to solve a visual pattern recognition problem. It displays two series of blocks, the left and the right. The blocks in the left series differ from those in the right, and the user must find the characteristic that sets them apart. A possible left and right series is shown in Figure 3.2.1.
These two sets are different because everything on the left is drawn with thick lines and everything on the right with thin lines. After seeing the two series, the user is presented with a set of four single blocks and is asked to determine to which series each block belongs. The user passes the test if s/he assigns the blocks to the correct series. We have to be careful to see that the user is not confused by a large number of choices.
3.2.2 PIX
PIX is a program that has a large database of labeled images. All of these images are pictures of concrete objects (a horse, a table, a house, a flower). The program picks an object at random, finds six images of that object from its database, presents them to the user and then asks the question "what are these pictures of?" Current computer programs should not be able to answer this question, so PIX should be a CAPTCHA. However, PIX, as stated, is not a CAPTCHA: it is very easy to write a program that can answer the question "what are these pictures of?" Remember that all the code and data of a CAPTCHA should be publicly available; in particular, the image database that PIX uses should be public. Hence, writing a program that can answer the question is easy: search the database for the images presented and find their label.
Fig 3.2.2 (example answers: Dogs, Swimming pool)
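The weakness described above can be illustrated with a minimal sketch. It assumes a hypothetical public database of (image, label) pairs; the names PUBLIC_IMAGES, fingerprint and solve_pix are invented for illustration and are not part of PIX itself.

```python
import hashlib

def fingerprint(image_bytes: bytes) -> str:
    """Hash the raw image bytes so identical images get identical keys."""
    return hashlib.sha256(image_bytes).hexdigest()

# Hypothetical public database: (image bytes, label) pairs.
PUBLIC_IMAGES = [
    (b"dog-photo-1", "dog"),
    (b"dog-photo-2", "dog"),
    (b"pool-photo-1", "swimming pool"),
]

# The attacker indexes the public database once.
INDEX = {fingerprint(img): label for img, label in PUBLIC_IMAGES}

def solve_pix(challenge_images):
    """Return the label shared by all challenge images, or None if any is unknown."""
    labels = {INDEX.get(fingerprint(img)) for img in challenge_images}
    if len(labels) == 1 and None not in labels:
        return labels.pop()
    return None
```

An exact-match lookup like this only works while the challenge shows database images unchanged; it would fail as soon as the images were altered before being presented.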
3.3 AUDIO CAPTCHA
The final example we offer is based on sound. The program picks a word or a sequence of numbers at random, renders the word or the numbers into a sound clip and distorts the sound clip; it then presents the distorted sound clip to the user and asks the user to enter its contents. This CAPTCHA is based on the difference in ability between humans and computers in recognizing spoken language. Nancy Chan of the City University in Hong Kong was the first to implement a sound-based system of this type.
Fig 3.3 Audio CAPTCHA
The idea is that a human is able to efficiently disregard the distortion and interpret the characters being read out, while software would struggle with the distortion and would need to be effective at speech-to-text translation in order to succeed. This is a crude way to filter humans, and it is not so popular because the user has to understand the language and the accent in which the sound clip is recorded.
CHAPTER 4
Fig 4 RE-CAPTCHA
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
CHAPTER 5
5. CONSTRUCTING CAPTCHAS
The programmer should keep the following points in mind while constructing a CAPTCHA.
5.1 THINGS TO KNOW
The first step in creating a CAPTCHA is to look at the different ways humans and machines process information. Machines follow sets of instructions. A CAPTCHA designer has to take this into account when creating a test. For example, it's easy to build a program that looks at metadata, the information on the Web that is invisible to humans but that machines can read. If you create a visual CAPTCHA and the image's metadata includes the solution, your CAPTCHA will be broken in no time [5]. An undistorted series of characters isn't very secure. Many computer programs can scan an image and recognize simple shapes like letters and numbers. One way to create a CAPTCHA is to pre-determine the images and solutions it will use. Other CAPTCHA applications create random strings of letters and numbers. You aren't likely to ever get the same series twice. Using randomization makes a brute-force attack impractical: the odds of a bot entering the correct series of random letters are very low. The longer the string of characters, the less likely a bot will get lucky. Designers can also create puzzles or problems that are easy for humans to solve. Some CAPTCHAs rely on pattern recognition.
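The effect of string length on a bot's odds can be made concrete with a short sketch. The alphabet (upper-case letters and digits) and the function names are assumptions chosen for illustration; the sketch only generates the answer string, not the distorted image.

```python
import secrets
import string

# Assumed alphabet: 26 upper-case letters plus 10 digits = 36 symbols.
ALPHABET = string.ascii_uppercase + string.digits

def random_challenge(length: int = 6) -> str:
    """Pick each character independently with a cryptographically secure RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def blind_guess_probability(length: int) -> float:
    """Chance that one uniformly random guess matches a challenge of this length."""
    return 1 / len(ALPHABET) ** length
```

Under these assumptions a six-character string has 36^6 (about 2.2 billion) possible values, so a single blind guess succeeds with probability of roughly one in 2.2 billion, and each extra character divides that chance by 36.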
5.2 IMPLEMENTATION
Embeddable CAPTCHAs: The easiest way to add a CAPTCHA to a Website is to insert a few lines of CAPTCHA code, taken from an open source CAPTCHA builder, into the Website's HTML code; the builder then provides the authentication services remotely. Most such services are free. Popular among them is the RE-CAPTCHA project, a service provided by www.captcha.net.
Custom CAPTCHAs: These are less popular because of the extra work needed to create a secure implementation. Even so, they are popular among researchers who verify existing CAPTCHAs and suggest alternative implementations. There are advantages in building custom CAPTCHAs. A custom CAPTCHA can fit exactly into the design and theme of your site; it will not look like some alien element that does not belong there. We want to take away the perception of a CAPTCHA as an annoyance and make it convenient for the user. A custom CAPTCHA, unlike the major CAPTCHA mechanisms, also obscures you as a target for spammers, who have little interest in cracking a niche implementation. Finally, we want to learn how CAPTCHAs work, so it is best to build one ourselves.
Some systems allow a solution to the same CAPTCHA to be used multiple times, which makes the CAPTCHA vulnerable to so-called "replay attacks". Most CAPTCHA scripts found freely on the Web are vulnerable to these types of attacks.
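One common way to avoid replay attacks is to make every challenge single-use. The following is a minimal sketch of that idea, assuming an in-memory store on the server; the class and method names are hypothetical and a real site would also need expiry and persistent storage.

```python
import secrets

class ChallengeStore:
    """In-memory store of outstanding challenges (a sketch, not production code)."""

    def __init__(self):
        self._answers = {}  # challenge id -> expected answer

    def issue(self, answer: str) -> str:
        """Register a new challenge and return its unguessable id."""
        challenge_id = secrets.token_hex(16)
        self._answers[challenge_id] = answer
        return challenge_id

    def verify(self, challenge_id: str, response: str) -> bool:
        """Check a response; the challenge is consumed whether or not it is right."""
        expected = self._answers.pop(challenge_id, None)
        if expected is None:
            return False  # unknown id, or the challenge was already used
        return secrets.compare_digest(
            expected.lower().encode(), response.lower().encode()
        )
```

Because verify removes the challenge on the first attempt, a correct answer captured once cannot be submitted again, and a bot cannot keep guessing against the same image.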
CHAPTER 6
6. BREAKING CAPTCHAS
The challenge in breaking a CAPTCHA isn't figuring out what a message says; after all, humans should have at least an 80 percent success rate. In many cases, people who break CAPTCHAs concentrate not on making computers smarter, but on reducing the complexity of the problem posed by the CAPTCHA [6]. A programmer wishing to break a CAPTCHA could approach the problem in phases. He or she would need to write an algorithm, a set of instructions that directs a machine to follow a certain series of steps. In this scenario, one step might be to convert the image to greyscale. That means the application removes all the color from the image, taking away one of the levels of obfuscation the CAPTCHA employs. Next, the algorithm might tell the computer to detect patterns in the black and white image. The program compares each pattern to a normal letter, looking for matches. If the program can only match a few of the letters, it might cross-reference those letters with a database of English words. Then it would plug likely candidates into the submit field. This approach can be surprisingly effective. It might not work 100 percent of the time, but it can work often enough to be worthwhile to spammers.
What about more complex CAPTCHAs? The Gimpy CAPTCHA displays 10 English words with warped fonts across an irregular background. The CAPTCHA arranges the words in pairs, and the words of each pair overlap one another. Users have to type in three correct words in order to move forward. How reliable is this approach? As it turns out, with the right CAPTCHA-cracking algorithm, it's not terribly reliable. Greg Mori and Jitendra Malik published a paper detailing their approach to cracking the Gimpy version of CAPTCHA. One thing that helped them was that the Gimpy approach uses actual words rather than random strings of letters and numbers. With this in mind, Mori and Malik designed an algorithm that tried to identify words by examining the beginning and end of the string of letters.
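The cross-referencing step mentioned above can be sketched in a few lines. The sketch assumes that an earlier recognition stage has produced a partial reading in which "?" marks each letter it could not identify; the word list and the function name are placeholders, not part of any published attack.

```python
import re

# Placeholder dictionary; a real attack would load a full English word list.
WORDS = ["profit", "prophet", "tame", "time", "tone"]

def candidates(partial: str, words=WORDS):
    """Return dictionary words matching a partial reading ('?' = unread letter)."""
    pattern = re.compile(
        "".join("." if ch == "?" else re.escape(ch) for ch in partial)
    )
    return [w for w in words if pattern.fullmatch(w)]
```

For instance, a reading of "p?of?t" leaves only "profit" in this small list, which is why CAPTCHAs built on dictionary words give an attacker more to work with than random strings do.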
6.1 BREAKING A VISUAL CAPTCHA
Greg Mori and Jitendra Malik of the University of California at Berkeley's Computer Vision Group evaluate image-based CAPTCHAs [7] for reliability. They test whether the CAPTCHA can withstand bots that masquerade as humans. Approach: The fundamental ideas behind our approach to solving Gimpy are the same as those we are using to solve generic object recognition problems. Our solution to the Gimpy CAPTCHA is just an application of a general framework that we have used to compare images of everyday objects and even find and track people in video sequences. The essences of these problems are similar. Finding the letters "T", "A", "M", "E" in an image and connecting them to read the word "TAME" is like finding hands, feet, elbows, and faces and connecting them up to find a human.
6.2 BREAKING AN EZ-GIMPY CAPTCHA
The algorithm consists of 3 main steps:
Fig 6.2 Breaking CAPTCHAs
Locate possible (candidate) letters at various locations: The first step is to hypothesize a set of candidate letters in the image. This is done using our shape matching techniques. The comparison is done in a way that is very robust to background clutter and deformation of the letters. The process usually results in 3-5 candidate letters per actual letter in the image. In the example, the "p" of profit matches well to either an "o" or a "p", the border between the "p" and the "r" looks a bit like a "u", and so forth. At this stage we keep many candidates, to be sure we don't miss anything for later steps.
Construct graph of consistent letters: Next, we analyze pairs of letters to see whether or not they are "consistent", or can be used consecutively to form a word.
Look for plausible words in the graph: There are many possible paths through the graph of letters constructed in the previous step. However, most of them do not form real words. We select out the real words in the graph, and assign scores to them based on how well their individual letters match the image.
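The second and third steps can be sketched as follows. This is a simplified illustration, not Mori and Malik's implementation: it assumes each candidate letter carries a horizontal position and a match score, treats two letters as consistent when the second lies a short distance to the right of the first, and scores a word by the average score of its letters. All names and the max_gap threshold are assumptions.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    letter: str
    x: int        # horizontal position in the image (assumed, in pixels)
    score: float  # match quality in [0, 1], higher is better (assumed)

def consistent(a: Candidate, b: Candidate, max_gap: int = 30) -> bool:
    """b may follow a if it lies to the right of a and not too far away."""
    return 0 < b.x - a.x <= max_gap

def best_path_score(word, cands, max_gap=30):
    """Best average letter score over consistent paths spelling word, or None."""
    # layer maps each candidate to the best total score of a path ending there.
    layer = {c: c.score for c in cands if c.letter == word[0]}
    for ch in word[1:]:
        nxt = {}
        for c in cands:
            if c.letter != ch:
                continue
            prev = [s for p, s in layer.items() if consistent(p, c, max_gap)]
            if prev:
                nxt[c] = max(prev) + c.score
        layer = nxt
    if not layer:
        return None
    return max(layer.values()) / len(word)

def plausible_words(cands, dictionary, max_gap=30):
    """Rank the dictionary words that can be read off the candidate graph."""
    scored = [(w, best_path_score(w, cands, max_gap)) for w in dictionary]
    found = [(w, s) for w, s in scored if s is not None]
    return sorted(found, key=lambda ws: ws[1], reverse=True)
```

Keeping several candidates per position, including wrong ones such as a spurious "o" or "u", costs little here because paths that do not spell a dictionary word are simply never scored.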
CHAPTER 7
Fig 7.1 Distortion in CAPTCHA
Content is an issue when the string length becomes too long or when the string is not a dictionary word. Care should be taken not to include offensive words. Presentation should not confuse the users. The font and colour chosen should be user friendly.
CHAPTER 8
8. APPLICATIONS
CAPTCHAs are used in various Web applications to identify human users and to restrict access to human users only. Some of them are [8]:
Protecting Web Registration: Several companies offer free email and other services. These service providers suffered from a serious problem: bots. These bots would take advantage of the service and sign up for a large number of accounts. This often created problems in account management and also increased the burden on their servers. CAPTCHAs can effectively be used to filter out the bots and ensure that only human users are allowed to create accounts.
Preventing comment spam: Most bloggers are familiar with programs that submit large numbers of automated posts with the intention of increasing the search engine rank of some site. CAPTCHAs can be used before a post is submitted to ensure that only human users can create posts. A CAPTCHA won't stop someone who is determined to post a rude message or harass an administrator, but it will help prevent bots from posting messages automatically.
Search engine bots: It is sometimes desirable to keep web pages unindexed to prevent others from finding them easily. There is an HTML tag to prevent search engine bots from reading web pages. The tag, however, doesn't guarantee that bots won't read a web page; it only serves to say "no bots, please." To truly guarantee that bots won't enter a web site, CAPTCHAs are needed.
E-Ticketing: Ticket brokers like Ticketmaster also use CAPTCHA applications. These applications help prevent ticket scalpers from bombarding the service with massive ticket purchases for big events. It's possible for a scalper to use a bot to place hundreds or thousands of ticket orders in a matter of seconds. Scalpers then try to sell the tickets above face value. While CAPTCHA applications don't prevent scalping, they do make it more difficult to scalp tickets on a large scale.
Email spam: CAPTCHAs also present a plausible solution to the problem of spam emails. All we have to do is use a CAPTCHA challenge to verify that a human has indeed sent the email.
Preventing Dictionary Attacks: CAPTCHAs can also be used to prevent dictionary attacks in password systems. The idea is simple: prevent a computer from being able to iterate through the entire space of passwords by requiring it to solve a CAPTCHA after a certain number of unsuccessful logins. This is better than the classic approach of locking an account after a sequence of unsuccessful logins, since doing so allows an attacker to lock accounts at will.
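A minimal sketch of this policy is shown below. The threshold of three failures, the class name and the plain-text password table are assumptions made to keep the example short; a real system would store password hashes and verify the CAPTCHA itself rather than take a boolean.

```python
from collections import defaultdict

MAX_FREE_ATTEMPTS = 3  # assumed number of failures allowed before a CAPTCHA

class LoginGuard:
    def __init__(self, passwords):
        self._passwords = passwords        # username -> password (sketch only)
        self._failures = defaultdict(int)  # username -> consecutive failures

    def captcha_required(self, user: str) -> bool:
        """True once the account has reached the failure threshold."""
        return self._failures[user] >= MAX_FREE_ATTEMPTS

    def login(self, user: str, password: str, captcha_ok: bool = False) -> bool:
        """Attempt a login; captcha_ok says whether a CAPTCHA was just solved."""
        if self.captcha_required(user) and not captcha_ok:
            return False  # do not even check the password without a CAPTCHA
        if self._passwords.get(user) == password:
            self._failures[user] = 0
            return True
        self._failures[user] += 1
        return False
```

The account is never locked: its legitimate owner can still log in after solving a CAPTCHA, while a script that cannot solve one is unable to keep iterating through passwords.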
As a tool to verify digitized books: This is a way of increasing the value of CAPTCHA as an application. An application called RE-CAPTCHA harnesses users' responses in CAPTCHA fields [9] to verify the contents of a scanned piece of paper. Because computers aren't always able to identify words from a digital scan, humans have to verify what a printed page says. Then it's possible for search engines to search and index the contents of a scanned document. This is how it works: The application already recognizes one of the words. If the visitor types that word into a field correctly, the application assumes the second word the user types is also correct. That second word goes into a pool of words that the application will present to other users. As each user types in a word, the application compares the word to the original answer. Eventually, the application receives enough responses to verify the word with a high degree of certainty. That word can then go into the verified pool.
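The pooling logic described above can be sketched as a small vote counter. The agreement threshold of three matching answers and all names below are assumptions for illustration; the text does not say how many responses RE-CAPTCHA actually requires.

```python
from collections import Counter, defaultdict

REQUIRED_AGREEMENT = 3  # assumed number of matching answers needed

class WordVerifier:
    def __init__(self):
        self.votes = defaultdict(Counter)  # unknown word id -> typed answers
        self.verified = {}                 # unknown word id -> accepted text

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Return True if the user passes; record their reading of the unknown word."""
        if control_typed.strip().lower() != control_answer.lower():
            return False  # failed the known word, so the other answer is ignored
        guess = unknown_typed.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= REQUIRED_AGREEMENT:
            self.verified[unknown_id] = guess
        return True
```

Only answers from users who pass the known word are counted, so a bot that cannot read the control word has no way to influence the transcription of the unknown one.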
CHAPTER 9
9. FUTURE OF CAPTCHA
Google researchers released a new approach to CAPTCHAs. Instead of entering a string of letters, the user is asked to rotate a picture into an upright position.
9.1 3D-CAPTCHA
The idea behind these recent 3D CAPTCHA designs is that a human can recognize an object and manipulate it spatially in his or her mind [10]. This is a step beyond character recognition and repetition, and involves an additional level of understanding. 3D-CAPTCHA is described as the "captcha nice to humans, bad to machines": a new approach to CAPTCHAs that uses human abilities to differentiate people from machines.
CHAPTER 10
10. CONCLUSION
We have now explored the many incremental improvements made to CAPTCHAs over time. Herein, we propose another CAPTCHA system that builds upon known good techniques used in RE-CAPTCHA. CAPTCHAs are an effective way to counter bots and reduce spam. They serve a dual purpose, since they also help advance AI knowledge. Their applications range from stopping bots to character recognition and pattern matching. Some issues with current implementations represent challenges for future improvements.
REFERENCES
[1] Luis von Ahn, M. Blum and J. Langford. Telling Humans and Computers Apart Automatically, CACM, Vol. 47, No. 2, 2004.
[2] Luis von Ahn, Personal Communications, Oct 2007.
[3] http://www.w3.org/TR/turingtest/
[4] http://recaptcha.net/
[5] K. Chellapilla, K. Larson, P. Simard and M. Czerwinski. Building Segmentation Based Human-friendly Human Interaction Proofs, 2nd Intl Workshop on Human Interaction Proofs, Springer-Verlag, LNCS 3517, 2005.
[6] R. Santamarta. Breaking Captcha, http://blog.wintercore.com/?p=11, 2008.
[7] Greg Mori and Jitendra Malik. Recognising Objects in Adversarial Clutter: Breaking a Visual CAPTCHA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), Vol. 1, June 2003, pp. 134-141.
[8] Lindsay W. MacDonald. IEEE CAPTCHA and its Applications, July/August 1999.
[9] www.captcha.net