Why didn’t I think of this?

I saw an interesting article on Yahoo! the other day. It appears that someone at Carnegie Mellon came up with the idea of using a CAPTCHA script to mask e-mail addresses. Basically, they provide you with a special link to put in place of your normal mailto link. You can use the link any way you want, but the script automatically generates some HTML code that looks similar to:

user<a href="http://mailhide.recaptcha.net/d?k=01quxf658CWzRNQC34kj75Ug==&c=5Sc_I7orZzXLkOX6E7fontrPSKXj6NS2QC0a-5mV5Gk=" onclick="window.open('http://mailhide.recaptcha.net/d?k=01quxf658CWzRNQC34kj75Ug==&c=5Sc_I7orZzXLkOX6E7fontrPSKXj6NS2QC0a-5mV5Gk=', '', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); return false;" title="Reveal this e-mail address">...</a>@example.com

In the example above, the whole e-mail address would actually be [email protected], but the rest of the username is obscured by the script. Of course, they also supply you with just the address to the CAPTCHA page, so that you can build your own link any way you want.
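If you'd rather build the link yourself, something like the following rough Python sketch would do it. The function name `build_mailhide_link` and its parameters are my own invention for illustration, not anything Mailhide supplies; it only assumes you already have the CAPTCHA page address that Mailhide gave you, and it leaves out the pop-up `onclick` handler from the generated snippet.

```python
from html import escape


def build_mailhide_link(visible_prefix: str, domain: str, reveal_url: str) -> str:
    """Return HTML that shows `prefix...@domain`, with the dots linking to reveal_url.

    Hypothetical helper: reveal_url is the CAPTCHA page address supplied by Mailhide.
    """
    # Escape the URL so the "&" between query parameters is valid inside an attribute.
    href = escape(reveal_url, quote=True)
    return (
        f"{escape(visible_prefix)}"
        f'<a href="{href}" title="Reveal this e-mail address">...</a>'
        f"@{escape(domain)}"
    )


if __name__ == "__main__":
    # The k and c values here are placeholders, not real Mailhide keys.
    print(build_mailhide_link("user", "example.com",
                              "http://mailhide.recaptcha.net/d?k=abc&c=def"))
```

Without the `onclick` handler, the CAPTCHA page simply opens as a normal link instead of in a small pop-up window.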

The link, in turn, leads to a page with a CAPTCHA script. Once you correctly answer the CAPTCHA question, you’re led to another page with the complete e-mail address. The script is called Mailhide, and it appears to be a completely free utility. It’s such a simple idea that I can’t believe no one thought of it before.

That’s not the whole story, though. The real story behind the CAPTCHA script used within this utility is actually extremely interesting on its own.

The project is reCAPTCHA, and it’s actually an effort to digitize books. According to the reCAPTCHA Web site, an entire archive of books has been digitized using OCR technology. However, as anyone who’s used OCR in the past knows, it’s far from perfect.

The goal behind reCAPTCHA is to use the answers from those CAPTCHA scripts to correct any mistakes made by the OCR technology.

Basically, it shows you two scanned words from an actual book. One is a word that the computer already identified correctly. The other is a word that the OCR software failed to convert to text correctly.

If the human interacting with the CAPTCHA script identifies the known word correctly, the computer assumes that the human also identified the unknown word correctly, and assigns the user-typed value to that image. Then, as more and more people identify the same image as that word, the computer increases its confidence in that identification. They even offer pre-built plug-ins for WordPress and MediaWiki, as well as a script you can use in any PHP application.
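To make the two-word check a little more concrete, here is a minimal Python sketch of how I imagine the logic working. This is not reCAPTCHA's actual code: the class name `WordPairChecker`, the `min_votes` threshold and the "share of votes" confidence measure are all assumptions of mine, picked only to illustrate the idea.

```python
from collections import Counter, defaultdict


class WordPairChecker:
    """Toy model of the known-word / unknown-word check (hypothetical design)."""

    def __init__(self, known_answers, min_votes=3):
        self.known_answers = known_answers   # image id -> text the computer trusts
        self.min_votes = min_votes           # assumed threshold before accepting a word
        self.votes = defaultdict(Counter)    # unknown image id -> tally of typed words

    def submit(self, known_id, known_guess, unknown_id, unknown_guess):
        """Return True if the user passes; record their reading of the unknown word."""
        if known_guess.strip().lower() != self.known_answers[known_id].lower():
            return False  # failed the control word, so the other answer is ignored
        self.votes[unknown_id][unknown_guess.strip().lower()] += 1
        return True

    def best_guess(self, unknown_id):
        """Return (word, share_of_votes) once enough people agree, else None."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        if count < self.min_votes:
            return None
        return word, count / sum(tally.values())


if __name__ == "__main__":
    checker = WordPairChecker({"img1": "morning"}, min_votes=2)
    checker.submit("img1", "morning", "img2", "overlooks")
    checker.submit("img1", "Morning", "img2", "overlooks")
    print(checker.best_guess("img2"))
```

The point of the control word is that an answer for the unknown image only counts when the person has already proven they can read the known one.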

The whole concept is really quite extraordinary, and it actually has me considering adding their CAPTCHA script to some of my Web sites rather than installing a local copy of someone else’s CAPTCHA.