Regular internet users sometimes run into an annoying pop-up called a CAPTCHA. Solving a Google reCAPTCHA when you’re simply browsing the internet is easy. For bots, however, the same pop-up can be devastating, since CAPTCHAs are designed to stop automation software.
Triggering a CAPTCHA is often the worst thing that can happen to a bot. Its user will have to either solve it manually, give up on automation, or use a CAPTCHA solving service. Luckily, there are some ways to avoid most CAPTCHAs in the first place.
What is CAPTCHA?
CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart, is an anti-bot solution commonly implemented on websites. There have been various iterations of it over the years, but the most common one in use now is Google reCAPTCHA.
Early versions of the test had users type in scrambled words, solve equations, or do other mundane tasks. Over time, sophisticated bots learned to solve these CAPTCHAs, making the test less useful.
Google took it upon itself to create new CAPTCHA types that would be significantly harder for automation software to solve. Nowadays, the tests no longer ask you to solve an equation or input a word or two. Trigger a CAPTCHA today and you will be met with various picture-based puzzles.
These puzzles aren’t accidental either. Google uses them to train machine learning and AI models to recognize pictures. Since users most commonly have to identify things like trains, planes, and bridges, the models can then learn to do so as well.
Additionally, there are a few other types of CAPTCHAs in use. There’s the aptly named “Invisible CAPTCHA”, which does exactly what the name says. You can visit websites where it’s integrated and never see a challenge.
Invisible CAPTCHAs track your mouse movements, clicking patterns, and other data to decide whether the visitor is a human or a bot. Since bots, at least back in the day, would have highly unusual patterns (instant mouse movements, no scrolling, quick browsing), the invisible CAPTCHA would be able to catch them without burdening a regular user.
Honeypots are another common method that might be classified as a CAPTCHA. These are links or other elements in the source code of the website that are hidden, usually with CSS, so they are invisible to the user. Bots, however, can find them with ease and, once they click one, will be presented with a CAPTCHA.
Finally, there are sound-based CAPTCHAs. While these are generally added to some of the regular tests for those with vision impairments, sometimes you’ll get a purely audio CAPTCHA. These will often have you type in numbers or words according to the sound file.
6 ways to avoid CAPTCHAs
Note that none of these methods are mutually exclusive, so use as many as you can in combination with each other. Using all of them at once will minimize the number of CAPTCHAs you get while using bots.
There are also ways to solve CAPTCHAs automatically, such as using services that do it for you. These, however, usually aren’t worth the hassle. Some of the methods here will let you sidestep the test even after you have triggered one, making automatic CAPTCHA solving services less useful than you might think.
1. Change user-agents
If you use any web scraping solution that you’ve created yourself, it will likely have some default user-agent (UA). Since it is sent automatically with each request, it’s something that can be used to track your activity.
Additionally, some default user-agents are often blocked by websites, since they’re a dead giveaway that someone’s using a bot. So, get a list of legitimate user-agents and rotate through them. Experiment to find out how often you need to switch user-agents to keep CAPTCHAs from triggering at all.
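Here is a minimal sketch of user-agent rotation using only Python's standard library. The `USER_AGENTS` list, the `pick_user_agent` and `build_requests` helpers, and the `switch_every` parameter are all made up for illustration; the strings are examples of the browser format, so swap in a current list of your own.

```python
import random
import urllib.request

# Example strings only (assumption): replace with an up-to-date list
# of real browser user-agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_user_agent(previous=None):
    """Pick a random UA, avoiding an immediate repeat when possible."""
    choices = [ua for ua in USER_AGENTS if ua != previous] or USER_AGENTS
    return random.choice(choices)


def build_requests(urls, switch_every=1):
    """Yield Request objects, switching the UA every `switch_every` requests."""
    ua = pick_user_agent()
    for index, url in enumerate(urls):
        if index and index % switch_every == 0:
            ua = pick_user_agent(previous=ua)
        yield urllib.request.Request(url, headers={"User-Agent": ua})
```

Each request can then be passed to `urllib.request.urlopen`. Tuning `switch_every` is the "experiment" part: start by switching on every request and relax it if the site tolerates more.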
2. Use rotating proxies
Your IP address is another way most websites can track you. If the same IP address keeps sending connection requests en masse, the website knows you’re botting or using other automation software.
Rotating proxies are the solution to the issue. They give you a pool of IP addresses that can be changed after every request. Additionally, rotating proxies often come from devices that are located in regular households (as opposed to business-grade servers), so the connection seems genuine and legitimate.
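If your provider hands you a list of proxy addresses, switching between them on every request could look roughly like this. The addresses below are placeholders from a documentation-only IP range, and `opener_for_next_proxy` and `fetch` are hypothetical helper names, so treat it as a sketch rather than a drop-in solution.

```python
import itertools
import urllib.request

# Placeholder addresses (assumption): use the pool your proxy provider gives you.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# Walk through the pool in order and start over when it runs out.
proxy_cycle = itertools.cycle(PROXIES)


def opener_for_next_proxy():
    """Return the next proxy and an opener that routes traffic through it."""
    proxy = next(proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch a URL, using a different proxy on every call."""
    _proxy, opener = opener_for_next_proxy()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

If your provider exposes a single endpoint that rotates IPs for you, the list shrinks to one entry and the rest stays the same.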
3. Randomize request delay
Sending requests on a consistent basis without any delay is the oldest trick in the book, and websites are wise to it. If you keep changing pages or going to URLs at set intervals that never change, that’s clearly a bot doing things for you.
As such, one of the most important and easiest methods to avoid CAPTCHAs is to add randomization to request times. If you’ve coded your own scraper, adding randomization is a piece of cake.
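As a rough illustration, a randomized pause between requests takes only a few lines. The 2 to 7 second range and the `random_delay` and `crawl` names are assumptions, so tune the range to the site you are working with.

```python
import random
import time


def random_delay(min_seconds=2.0, max_seconds=7.0):
    """Return a random pause length between the two bounds."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, min_seconds=2.0, max_seconds=7.0, sleep=time.sleep):
    """Call `fetch` on each URL, pausing a random amount between requests."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(random_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

Here `fetch` is whatever function your scraper already uses to download a page. The `sleep` parameter only exists so the pause can be swapped out when testing.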
4. Avoid direct links
Another way websites frequently discover bots is that bots most often go through a fixed list of URLs. People, however, will often visit the homepage and then browse around somewhat randomly. As such, many websites have implemented homepage cookies that help them discover bots.
So, try to collect URLs on the go instead of constantly going through direct links on websites. It helps if you also use a headless browser to collect cookies along the way instead of simply sending direct requests to the website.
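The idea of starting at the homepage and collecting URLs on the go can be sketched as below. This is not a headless browser: it is a plain cookie-aware opener from the standard library, which keeps cookies but won't run JavaScript. The `LinkCollector`, `extract_links`, `make_session`, and `browse` names are made up for this example.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the links on a page that stay on the same site."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]


def make_session():
    """Build an opener that stores and resends cookies, like a browser would."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def browse(homepage, fetch, max_pages=5):
    """Start at the homepage and follow links discovered along the way."""
    queue, seen, visited = [homepage], {homepage}, []
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

To use it against a live site, create `opener, jar = make_session()` and pass a `fetch` function such as `lambda url: opener.open(url).read().decode("utf-8", "replace")`. That way any cookie set on the homepage travels with every later request.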
6. Avoid honeypots
Avoiding honeypots can be a bit simpler than it may seem at first glance. Since these elements have to be invisible to regular users, they will often carry the “hidden” attribute or CSS such as “display: none” or “visibility: hidden”.
So, check the elements and source code of the page. Pay special attention to all the URLs, since these will often hold the honeypot. If there’s a URL that’s hidden in one of these ways, you can be almost sure that it’s going to be a honeypot.
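A simple check along these lines might look like the sketch below. It only inspects each link's own `hidden` attribute and inline `style`, so it will miss links hidden through an external stylesheet or a hidden parent element. The `looks_hidden`, `LinkAuditor`, and `split_links` names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Inline CSS that makes an element invisible to a regular visitor.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


def looks_hidden(attr_map):
    """Return True if the tag's own attributes hide it from the user."""
    if "hidden" in attr_map:
        return True
    return bool(HIDDEN_STYLE.search(attr_map.get("style") or ""))


class LinkAuditor(HTMLParser):
    """Sort the links on a page into safe ones and likely honeypots."""

    def __init__(self):
        super().__init__()
        self.safe_links = []
        self.suspect_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        if looks_hidden(attr_map):
            self.suspect_links.append(href)
        else:
            self.safe_links.append(href)


def split_links(html):
    """Return (safe_links, suspect_links) for a page's HTML."""
    auditor = LinkAuditor()
    auditor.feed(html)
    return auditor.safe_links, auditor.suspect_links
```

Only follow the links in the first list and leave the suspect ones alone.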