Detecting "stealth" web crawlers

What options are there to detect web crawlers that don't want to be detected?

(I know that listing detection techniques will help the smart stealth-crawler programmer make a better spider, but I don't think we will ever be able to block the smart stealth crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers like Googlebot and Yahoo! Slurp. I consider a bot nice if it:

  1. identifies itself as a bot in the user-agent string
  2. reads robots.txt (and obeys it)

I'm talking about the bad crawlers: the ones hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be built (updated list, thanks Chris and gs); a minimal sketch of a couple of them follows the list:

  1. Adding a directory only listed (marked as disallow) in robots.txt,
  2. Adding invisible links (possibly marked as rel="nofollow"?),
    • style="display: none;" on the link or its parent container
    • placed underneath another element with a higher z-index
  3. Detecting crawlers that don't understand CaPiTaLiSaTioN,
  4. Detecting crawlers that try to post a reply but always fail the captcha,
  5. Detecting GET requests to POST-only resources,
  6. Detecting the interval between requests,
  7. Detecting the order of the pages requested,
  8. Detecting who (consistently) requests HTTPS resources over HTTP,
  9. Detecting who does not request image files (in combination with a list of user agents of known image-capable browsers this works surprisingly well).
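
Here is a minimal sketch (in Python) of how a couple of these trapdoors could be scored server-side. The trap paths, canonical URLs and the strike threshold are made up for illustration:

    # Sketch only: score requests against a couple of the trapdoors above.
    # TRAP_PATHS, CANONICAL_PATHS and the strike threshold are invented.
    from collections import defaultdict

    TRAP_PATHS = {"/secret-not-linked/", "/bait/do-not-click"}  # listed only in robots.txt / behind invisible links
    CANONICAL_PATHS = {"/Articles/Index", "/Contact"}           # the real, mixed-case URLs of the site
    strikes = defaultdict(int)

    def looks_like_stealth_crawler(ip: str, path: str) -> bool:
        if path in TRAP_PATHS:
            strikes[ip] += 3                  # walked straight into a trap
        elif path not in CANONICAL_PATHS and path.lower() in {p.lower() for p in CANONICAL_PATHS}:
            strikes[ip] += 2                  # doesn't understand CaPiTaLiSaTioN
        return strikes[ip] >= 3               # beyond this, block or serve a captcha

    # looks_like_stealth_crawler("203.0.113.7", "/articles/index")    -> False (2 strikes)
    # looks_like_stealth_crawler("203.0.113.7", "/bait/do-not-click") -> True  (5 strikes)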

Some traps would be triggered by both "good" and "bad" bots. You could combine those with a whitelist (see the sketch after this list):

  1. It triggers a trap
  2. It requests robots.txt?
  3. It doesn't trigger another trap because it obeys robots.txt
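
A minimal sketch of that whitelist combination, assuming you track robots.txt fetches per IP (the data structures and names are invented):

    # Sketch of the whitelist combination: only treat a trap hit as "bad"
    # if the client never fetched robots.txt.
    fetched_robots = set()
    suspects = set()

    def on_request(ip: str, path: str, hit_trap: bool) -> None:
        if path == "/robots.txt":
            fetched_robots.add(ip)      # candidate "good" bot
        elif hit_trap and ip not in fetched_robots:
            suspects.add(ip)            # trap hit without ever reading robots.txt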

One other important thing here: please consider blind people using screen readers. Give people a way to contact you, or a (non-image) captcha to solve in order to continue browsing.

What methods are there to automatically detect web crawlers trying to mask themselves as normal human visitors?

The question is not: how do I catch every crawler? The question is: how can I maximize the chance of detecting a crawler?

Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, and so on. I have no illusions: I won't be able to beat them.

You would, however, be surprised how stupid some crawlers are. The most stupid example, in my opinion: casting all URLs to lower case before requesting them.

And then there's a whole bunch of crawlers that are just "not good enough" to avoid the various trapdoors.


One thing you didn't list, which is commonly used to detect bad crawlers: hit speed.

Good web crawlers will break their hits up so they don't deluge a site with requests. Bad ones will do one of three things:

  1. hit sequential links one after the other
  2. hit sequential links in some parallel sequence (2 or more at a time)
  3. hit sequential links at a fixed interval

Also, some offline browsing programs will slurp up a large number of pages; I'm not sure what kind of threshold you'd want to use before you start blocking by IP address.

This method will also catch mirroring programs like fmirror or wget.

If the bot randomizes the time interval, you could check to see if the links are traversed in a sequential or depth-first manner, or you can see if the bot is traversing a huge amount of text (as in words to read) in a too-short period of time. Some sites limit the number of requests per hour, also.
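
A hedged sketch of those timing checks in Python; the /page/N pattern, the window size and the variance cutoff are assumptions, not something from a production system:

    # Sketch only: flag clients whose request intervals are suspiciously
    # regular, or whose paths walk a strictly sequential /page/N pattern.
    import re
    import statistics
    from collections import defaultdict, deque

    history = defaultdict(lambda: deque(maxlen=20))   # ip -> recent (timestamp, path)

    def looks_automated(ip: str, now: float, path: str) -> bool:
        hist = history[ip]
        hist.append((now, path))
        if len(hist) < 10:
            return False
        times = [t for t, _ in hist]
        gaps = [b - a for a, b in zip(times, times[1:])]
        too_regular = statistics.pstdev(gaps) < 0.05      # near-fixed interval between hits
        pages = [int(m.group(1)) for _, p in hist
                 if (m := re.search(r"/page/(\d+)$", p))]
        sequential = len(pages) >= 5 and pages == list(range(pages[0], pages[0] + len(pages)))
        return too_regular or sequential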

Actually, I heard an idea somewhere, I don't remember where, that if a user gets too much data, in terms of kilobytes, they can be presented with a captcha asking them to prove they aren't a bot. I've never seen that implemented though.
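
For what it's worth, a minimal sketch of what that kilobyte-quota idea might look like; I haven't seen it deployed either, and the one-hour window and 5 MB budget are guesses:

    # Sketch of the data-quota idea; numbers are invented.
    import time
    from collections import defaultdict

    WINDOW = 3600                 # one hour, in seconds
    BUDGET = 5 * 1024 * 1024      # 5 MB per window

    usage = defaultdict(list)     # ip -> [(timestamp, bytes_sent), ...]

    def needs_captcha(ip: str, bytes_sent: int) -> bool:
        now = time.time()
        usage[ip] = [(t, b) for t, b in usage[ip] if now - t < WINDOW]
        usage[ip].append((now, bytes_sent))
        return sum(b for _, b in usage[ip]) > BUDGET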

Update on Hiding Links

As far as hiding links goes, you can put one div under another with CSS (placing it first in the draw order) and possibly setting the z-order. A bot could not ignore that without parsing all your JavaScript to see whether it is a menu. To some extent, links inside invisible DIV elements also can't be ignored without the bot parsing all the JavaScript.

Taking that idea to completion, uncalled JavaScript which could potentially show the hidden elements would possibly fool a subset of JavaScript-parsing bots. And it is not a lot of work to implement.

An easy solution is to create a link and make it invisible

<a href="iamabot.script" style="display:none;">Don't click me!</a>

Of course you should expect that some people who look at the source code follow that link just to see where it leads. But you could present those users with a captcha...

Valid crawlers would, of course, also follow the link. But rather than adding rel="nofollow", look for the signs of a valid crawler (such as the user agent).

It's not actually that easy to keep up with the good user-agent strings. Browser versions come and go. Gathering statistics on user-agent strings, grouped by behavior, can reveal interesting things.

I don't know how far this could be automated, but at least it is one differentiating thing.

See Project Honeypot - they're setting up bot traps on a large scale (and have a DNSRBL with their IPs).

Use tricky URLs and HTML:

<a href="//example.com/"> = http://example.com/ on http pages.
<a href="page&amp;&#x23;hash"> = page& + #hash

In HTML you can use plenty of tricks with comments, CDATA elements, entities, etc:

<a href="foo<!--bar-->"> (comment should not be removed)
<script>var haha = '<a href="bot">'</script>
<script>// <!-- </script> <!--><a href="bot"> <!-->
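
On the server side you could then flag anyone who requests the URLs that only a naive regex scraper would extract from that markup. A rough sketch; the exact bait paths are assumptions (and assume /foo and /bot are not real pages on your site):

    # Sketch: a client that resolves the markup above correctly never requests
    # these paths, so any hit on them is a naive scraper.
    NAIVE_BAIT_PATHS = {
        "/page&amp;&#x23;hash",   # HTML entities not decoded
        "/foo",                   # comment wrongly stripped out of the href
        "/bot",                   # link harvested from <script> text or comments
    }

    def is_naive_scraper(path: str) -> bool:
        return path in NAIVE_BAIT_PATHS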

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were updated automatically as needed by checking claimed user-agent strings: if a client claimed to be a legitimate spider but wasn't on the whitelist, the system performed DNS/reverse-DNS lookups to verify that the source IP address corresponded to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to black/whitelist the address in case of an incorrect assessment.
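
The DNS/reverse-DNS part of that check is straightforward to sketch; here is roughly what it could look like in Python for a client claiming to be Googlebot (the allowed host suffixes are an assumption):

    # Rough sketch of forward-confirmed reverse DNS for a claimed spider.
    import socket

    def claimed_bot_is_genuine(ip: str,
                               allowed_suffixes=(".googlebot.com", ".google.com")) -> bool:
        try:
            host = socket.gethostbyaddr(ip)[0]                                  # reverse lookup
            forward_ips = {ai[4][0] for ai in socket.getaddrinfo(host, None)}   # forward-confirm
        except OSError:
            return False
        return host.endswith(allowed_suffixes) and ip in forward_ips

    # Call this only when the UA claims to be a spider that isn't on the IP
    # whitelist yet, and email the admin with the verdict as a failsafe.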

I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.

Side point: If you're thinking about doing a similar detection system based on hit-rate-limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. I see a lot of people talking about these kinds of schemes who want to block anyone who tops 5-10 hits in a second, which may generate false positives on image-heavy pages (unless images are excluded from the tally) and will generate false positives when someone like me finds an interesting site that he wants to read all of, so he opens up all the links in tabs to load in the background while he reads the first one.
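
A minimal sketch of such a five-minute tally that excludes image/static requests; the 300-hits-per-window threshold is a guess:

    # Sketch of a per-IP hit counter over five-minute windows.
    import time
    from collections import defaultdict, deque

    WINDOW = 300            # five minutes, in seconds
    THRESHOLD = 300         # page hits per window before considering a block
    STATIC_EXT = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico")

    hits = defaultdict(deque)   # ip -> timestamps of recent page hits

    def over_limit(ip: str, path: str) -> bool:
        if path.lower().endswith(STATIC_EXT):
            return False                      # excluded from the tally
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        return len(q) > THRESHOLD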

One simple bot detection method I've heard of for forms is the hidden input technique. If you are trying to secure a form, put an input in the form with an id that looks completely legit. Then use CSS in an external file to hide it. Or, if you are really paranoid, set up something like jQuery to hide the input box on page load. If you do this right, I imagine it would be very hard for a bot to figure out. You know those bots have it in their nature to fill out everything on a page, especially if you give your hidden input an id of something like id="fname", etc.
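
The server-side check is then trivial; a sketch, reusing the "fname" id from the example above (and assuming you give the input the same name attribute so it shows up in the submitted data):

    # Sketch: the field is hidden by CSS (or by jQuery on load), so a human
    # never fills it in.
    def form_submitted_by_bot(form_data: dict) -> bool:
        # any non-empty value in the honeypot field means something filled in
        # every input it could find - almost certainly a bot
        return bool(form_data.get("fname", "").strip())

    # form_submitted_by_bot({"email": "a@b.c", "fname": "John"})  -> True
    # form_submitted_by_bot({"email": "a@b.c", "fname": ""})      -> False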

Untested, but here is a nice list of user-agents you could make a regular expression out of. Could get you most of the way there:

ADSARobot|ah-ha|almaden|aktuelles|Anarchie|amzn_assoc|ASPSeek|ASSORT|ATHENS|Atomz|attach|attache|autoemailspider|BackWeb|Bandit|BatchFTP|bdfetch|big.brother|BlackWidow|bmclient|Boston\ Project|BravoBrian\ SpiderEngine\ MarcoPolo|Bot\ mailto:craftbot@yahoo.com|Buddy|Bullseye|bumblebee|capture|CherryPicker|ChinaClaw|CICC|clipping|Collector|Copier|Crescent|Crescent\ Internet\ ToolPak|Custo|cyberalert|DA$|Deweb|diagem|Digger|Digimarc|DIIbot|DISCo|DISCo\ Pump|DISCoFinder|Download\ Demon|Download\ Wonder|Downloader|Drip|DSurf15a|DTS.Agent|EasyDL|eCatch|ecollector|efp@gmx\.net|Email\ Extractor|EirGrabber|email|EmailCollector|EmailSiphon|EmailWolf|Express\ WebPictures|ExtractorPro|EyeNetIE|FavOrg|fastlwspider|Favorites\ Sweeper|Fetch|FEZhead|FileHound|FlashGet\ WebWasher|FlickBot|fluffy|FrontPage|GalaxyBot|Generic|Getleft|GetRight|GetSmart|GetWeb!|GetWebPage|gigabaz|Girafabot|Go\!Zilla|Go!Zilla|Go-Ahead-Got-It|GornKer|gotit|Grabber|GrabNet|Grafula|Green\ Research|grub-client|Harvest|hhjhj@yahoo|hloader|HMView|HomePageSearch|http\ generic|HTTrack|httpdown|httrack|ia_archiver|IBM_Planetwide|Image\ Stripper|Image\ Sucker|imagefetch|IncyWincy|Indy*Library|Indy\ Library|informant|Ingelin|InterGET|Internet\ Ninja|InternetLinkagent|Internet\ Ninja|InternetSeer\.com|Iria|Irvine|JBH*agent|JetCar|JOC|JOC\ Web\ Spider|JustView|KWebGet|Lachesis|larbin|LeechFTP|LexiBot|lftp|libwww|likse|Link|Link*Sleuth|LINKS\ ARoMATIZED|LinkWalker|LWP|lwp-trivial|Mag-Net|Magnet|Mac\ Finder|Mag-Net|Mass\ Downloader|MCspider|Memo|Microsoft.URL|MIDown\ tool|Mirror|Missigua\ Locator|Mister\ PiX|MMMtoCrawl\/UrlDispatcherLLL|^Mozilla$|Mozilla.*Indy|Mozilla.*NEWT|Mozilla*MSIECrawler|MS\ FrontPage*|MSFrontPage|MSIECrawler|MSProxy|multithreaddb|nationaldirectory|Navroad|NearSite|NetAnts|NetCarta|NetMechanic|netprospector|NetResearchServer|NetSpider|Net\ Vampire|NetZIP|NetZip\ Downloader|NetZippy|NEWT|NICErsPRO|Ninja|NPBot|Octopus|Offline\ Explorer|Offline\ Navigator|OpaL|Openfind|OpenTextSiteCrawler|OrangeBot|PageGrabber|Papa\ Foto|PackRat|pavuk|pcBrowser|PersonaPilot|Ping|PingALink|Pockey|Proxy|psbot|PSurf|puf|Pump|PushSite|QRVA|RealDownload|Reaper|Recorder|ReGet|replacer|RepoMonkey|Robozilla|Rover|RPT-HTTPClient|Rsync|Scooter|SearchExpress|searchhippo|searchterms\.it|Second\ Street\ Research|Seeker|Shai|Siphon|sitecheck|sitecheck.internetseer.com|SiteSnagger|SlySearch|SmartDownload|snagger|Snake|SpaceBison|Spegla|SpiderBot|sproose|SqWorm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|SurfWalker|Szukacz|tAkeOut|tarspider|Teleport\ Pro|Templeton|TrueRobot|TV33_Mercator|UIowaCrawler|UtilMind|URLSpiderPro|URL_Spider_Pro|Vacuum|vagabondo|vayala|visibilitygap|VoidEYE|vspider|Web\ Downloader|w3mir|Web\ Data\ Extractor|Web\ Image\ Collector|Web\ Sucker|Wweb|WebAuto|WebBandit|web\.by\.mail|Webclipping|webcollage|webcollector|WebCopier|webcraft@bea|webdevil|webdownloader|Webdup|WebEMailExtrac|WebFetch|WebGo\ IS|WebHook|Webinator|WebLeacher|WEBMASTERS|WebMiner|WebMirror|webmole|WebReaper|WebSauger|Website|Website\ eXtractor|Website\ Quester|WebSnake|Webster|WebStripper|websucker|webvac|webwalk|webweasel|WebWhacker|WebZIP|Wget|Whacker|whizbang|WhosTalking|Widow|WISEbot|WWWOFFLE|x-Tractor|^Xaldon\ WebSpider|WUMPUS|Xenu|XGET|Zeus.*Webster|Zeus [NC]

Taken from: http://perishablepress.com/press/2007/10/15/ultimate-htaccess-blacklist-2-compressed-version/
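
If you'd rather do the check in application code than in .htaccess, a sketch with just a small excerpt of that list (paste in the full alternation yourself for real use):

    # Sketch: turn the user-agent blacklist into a compiled regex check.
    import re

    BAD_UA = re.compile(
        r"HTTrack|WebZIP|Wget|EmailSiphon|BlackWidow|Teleport Pro|larbin|LeechFTP",
        re.IGNORECASE,   # the [NC] flag in the .htaccess version
    )

    def blacklisted(user_agent: str) -> bool:
        return bool(BAD_UA.search(user_agent or ""))

    # blacklisted("Wget/1.21.3")              -> True
    # blacklisted("Mozilla/5.0 (Windows...)") -> False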

I currently work for a company that scans web sites in order to classify them. We also check sites for malware.

In my experience, the number one blockers of our web crawler (which of course uses an IE or Firefox UA and does not obey robots.txt. Duh.) are sites intentionally hosting malware. That's a pain, because the site then falls back to a human who has to manually load it, classify it and check it for malware.

I'm just saying, by blocking web crawlers you're putting yourself in some bad company.

Of course, if they are horribly rude and suck up tons of your bandwidth it's a different story because then you've got a good reason.

You can also check referrers. No referrer could raise bot suspicion. A bad referrer almost certainly means it is not a browser.

Adding invisible links (possibly marked as rel="nofollow"?),

* style="display: none;" on link or parent container
* placed underneath another element with higher z-index

I wouldn't do that. You can end up blacklisted by Google for black-hat SEO :)

People keep addressing broad crawlers but not crawlers that are specialized for your website.

I write stealth crawlers, and if they are individually built, no amount of honeypots or hidden links will have any effect whatsoever - the only real way to detect specialised crawlers is by inspecting connection patterns.

The best systems (e.g. LinkedIn) use AI to address this.
The easiest solution is to write log parsers that analyze IP connections and simply blacklist those IPs or serve a captcha, at least temporarily.

e.g. if IP X is seen connecting to foo.com/cars/*.html every 2 seconds, but not to any other pages, it's most likely a bot or a hungry power user.
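
A rough sketch of such a log parser, assuming common/combined log format and using the /cars/*.html example; the thresholds are made up:

    # Sketch: group hits by IP and flag clients whose traffic is both frequent
    # and concentrated on one URL pattern.
    import re
    from collections import defaultdict

    LINE = re.compile(r'^(\S+) .*?"[A-Z]+ (\S+) HTTP')
    PATTERN = re.compile(r"^/cars/.*\.html$")

    def suspicious_ips(log_lines, min_hits=100, min_ratio=0.9):
        per_ip = defaultdict(lambda: [0, 0])      # ip -> [total hits, pattern hits]
        for line in log_lines:
            m = LINE.match(line)
            if not m:
                continue
            ip, path = m.groups()
            per_ip[ip][0] += 1
            per_ip[ip][1] += bool(PATTERN.match(path))
        return [ip for ip, (total, matched) in per_ip.items()
                if total >= min_hits and matched / total >= min_ratio]

    # with open("access.log") as f:
    #     print(suspicious_ips(f))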

Alternatively, there are various JavaScript challenges that act as protection (e.g. Cloudflare's anti-bot system), but those are easily solvable; you can write something custom, and that might be enough of a deterrent to make it not worth the effort for the crawler.

However, you must ask yourself whether you are willing to false-positive legitimate users and inconvenience them in order to prevent bot traffic. Protecting public data is an impossible paradox.

Short answer: if a mid-level programmer knows what he's doing, you won't be able to detect a crawler without affecting real users. If your information is public, you won't be able to defend it against a crawler... it's like a 1st Amendment right :)