How My Robot Fooled Captcha

Here’s How My Robot Bypassed Captcha, and Why It Shouldn’t

We estimate that 50% of today’s internet traffic comes from web robots

In this article, I will show you how Captcha systems block web robots and what techniques are used to bypass detection. This article is solely for educational purposes. Our main goal is to reflect on how Captcha systems can be improved to protect businesses and individuals.

Web Bots (robots or web scrapers) have existed since the beginning of the web. The majority of them have one of these two goals:
1- Extract relevant information from websites: this could be anything like the stock of an online store, the Facebook information of your friends, or news from well-known websites. We usually call this category of robots scrapers.
2- Automate certain processes on the web: signing in to a platform, sending emails automatically, or updating your status on Facebook or Twitter. Automation can also be used for testing a web app before deployment.

Moreover, robots have made our lives simpler in many different ways. Remember the last time you tried to find the cheapest flight or hotel reservation? Or the last time you compared which equipment or insurance policy to buy? There is a great chance that the websites you used employ bots to retrieve relevant information for you. Google itself uses bots to index website content so that it can appear in your search results.

However, like any piece of technology, bots are not inherently good or bad; it all depends on the goals of the software developers or users who run them. Bots can add tremendous value for companies and individuals, as we have seen in the previous examples, but they can also be detrimental to other businesses and a threat to the privacy of internet users.

Therefore, it is essential to have systems in place that can block malicious bots when needed.

Historically, the first solution that emerged to this problem was IP blacklisting. This means implementing a system on the server, or as a proxy in front of it, that detects massive numbers of requests from a single IP address. If these requests are not separated by a plausible delay, the system considers it a red flag. The reason is that people usually take at least a couple of seconds to look at a page before interacting with it and triggering further requests from the same IP. A robot, on the other hand, is inherently very fast and fires requests milliseconds apart, which makes it easy to detect.
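To make the idea concrete, here is a minimal sketch (not from the original system, and purely illustrative) of what such a naive per-IP check could look like on the server side; the one-second threshold and the function name are arbitrary assumptions:

import time

MIN_INTERVAL = 1.0        # assumed minimum "human-like" gap between requests, in seconds
last_seen = {}            # maps an IP address to the timestamp of its last request

def looks_like_a_bot(ip_address):
    """Return True if this IP sends consecutive requests faster than MIN_INTERVAL."""
    now = time.time()
    previous = last_seen.get(ip_address)
    last_seen[ip_address] = now
    return previous is not None and (now - previous) < MIN_INTERVAL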

You can test this using Python, the ‘selenium’ library, and Chromium, the open-source engine behind the Chrome browser.

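The snippet originally embedded here is not reproduced, but a minimal sketch of such a test, assuming chromedriver is installed and using a placeholder URL, could look like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

op = Options()
driver = webdriver.Chrome(options=op)

# Fire requests back to back, far faster than any human would browse.
for _ in range(20):
    driver.get("https://example.com")  # placeholder URL for the site you are testing

driver.quit()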

If the website you use to test is protected against bots, it will normally trigger a Captcha or some warning or error.

It is very easy to bypass this Captcha if it is using a ‘dumb’ system like the one we previously described. All you need to do is programmatically introduce a time lag to emulate human behavior. In Python, for example, you can use the ‘time’ library and add time.sleep(5) to your code.
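Applied to the sketch above, and with a randomized delay instead of a fixed one (the 3 to 7 second range is an arbitrary choice for illustration), that could look like this:

import random
import time

for _ in range(20):
    driver.get("https://example.com")    # placeholder URL
    time.sleep(random.uniform(3, 7))     # pause a few seconds, like a human reading the page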

Another reason robots are much faster than human beings is that they use headless browsers. Headless browsers are browsers that run without a visual interface. Since a bot is doing the browsing, there is no need to render the HTML file returned by the server; the bot only needs access to the HTML itself, and the developer writes code to interact with it.

In the previous code, you can add the option to run the browser ‘headlessly’ by adding the following line before instantiating the driver instance:
# op.add_argument('headless')

On some systems, especially if you are using a docker container to run the code, you need to add another two lines for it to work:
# op.add_argument(" — no-sandbox")
# op.add_argument(“ — disable-dev-shm-usage”)
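Putting those options together, a headless setup could look roughly like this; it is only a sketch, the flags are standard Chromium arguments, and the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

op = Options()
op.add_argument('--headless')               # run Chromium without a visual interface
op.add_argument('--no-sandbox')             # often needed inside Docker containers
op.add_argument('--disable-dev-shm-usage')  # avoids /dev/shm size issues in containers

driver = webdriver.Chrome(options=op)
driver.get("https://example.com")           # placeholder URL
print(driver.title)                         # the page loads even though nothing is displayed
driver.quit()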

Newer systems have developed other, more efficient ways to detect robots. I will list some of them below and propose methods that can be used to bypass these new red-flag detectors.

1- Advanced non-human-like behavior:
It is not entirely true that programmatically adding a time lag completely mimics human behavior. When we visit a website, we usually scroll up and down the page, hover over elements, and click on others. Modern detectors look for this kind of behavior, which makes a bare 5- or 10-second wait in the code useless.
To bypass such detectors, you can, for example, introduce scrolling every time you open the page. This can be easily done using selenium and Chromium:
# driver.execute_script(f"window.scrollTo({some_element_X_location}, {some_element_Y_location})")
Even better, you can implement random scrolling by choosing two random numbers between 0 and the page height and width respectively.
You can also easily locate text in the page, <p>paragraph tags</p> for example, and trigger a click on it, which won’t send any request to the server but mimics human behavior, as in the sketch below.
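Here is a minimal sketch of that kind of behavior, reusing the driver from the earlier snippets; the element choice is illustrative, and a click can fail if the chosen element is not visible:

import random
from selenium.webdriver.common.by import By

# Scroll to a random position in the page to imitate a human skimming it.
page_width = driver.execute_script("return document.body.scrollWidth")
page_height = driver.execute_script("return document.body.scrollHeight")
driver.execute_script(f"window.scrollTo({random.randint(0, page_width)}, {random.randint(0, page_height)})")

# Click on a harmless piece of text; this sends no extra request but looks human.
paragraphs = driver.find_elements(By.TAG_NAME, "p")
if paragraphs:
    random.choice(paragraphs).click()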

2- Same User Agent:
Some pages rely on the user-agent property of your GET or POST requests to check whether you’re browsing the website from a ‘legit’ browser like Chrome or Safari or from some kind of weird browser like ‘bad-browser-for-bots’ (I hope this browser doesn’t exist XD). You can change your user-agent easily in Python with the help of the ‘fake_useragent’ library. You can add this code to your bot so it spoofs the user-agent property with some well-known user-agent values:
# ua = UserAgent()
# userAgent = ua.random
# options.add_argument(f'user-agent={userAgent}')

Even better, you can spoof a different user-agent at every request like this:

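The code originally embedded here is not reproduced, but the idea is roughly the following sketch, assuming chromedriver and the fake_useragent package are installed and using placeholder URLs:

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

ua = UserAgent()

for url in ["https://example.com", "https://example.org"]:  # placeholder URLs
    op = Options()
    op.add_argument(f"user-agent={ua.random}")  # pick a fresh, well-known user-agent string
    driver = webdriver.Chrome(options=op)       # each iteration spawns a new browser
    driver.get(url)
    driver.quit()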

Instantiating a new driver for each request spawns a fresh browser with a different user-agent property. This is the method I like best when combined with some human-like behavior.

3- Same IP address:
Some detectors are very aggressive and block consecutive requests from the same IP address despite human-like behavior. This usually degrades the user experience on the website: imagine that every time you want to buy something from Amazon, you have to look for bridges or cars in Captcha pictures.
However, such detectors can still be bypassed with the use of proxies and VPNs. Some free VPNs have Python libraries that are easy to use; VPNGate is one example.
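As one illustration, Chromium accepts a --proxy-server argument, so routing the bot through a proxy can be sketched like this; the proxy address below is a placeholder you would replace with a real proxy or VPN endpoint:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

op = Options()
op.add_argument("--proxy-server=http://203.0.113.10:8080")  # placeholder proxy address
driver = webdriver.Chrome(options=op)
driver.get("https://example.com")  # the request now appears to come from the proxy's IP
driver.quit()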

Modern web bots are becoming increasingly dominant on the web. It is very important for us to understand how these robots work and to try to limit harmful or abusive bots, both for the software developer community and for the general public.


