HN.zip

Wikipedia is struggling with voracious AI bot crawlers

64 points by bretpiatt - 5 comments
diggan 5 mins ago
This has to be one of the strangest targets to crawl, since they themselves make database dumps available for download (https://en.wikipedia.org/wiki/Wikipedia:Database_download), and if that wasn't enough, there are 3rd-party dumps as well (https://library.kiwix.org/#lang=eng&category=wikipedia) that you could use if the official ones aren't good enough for some reason.

Why would you crawl the web interface when the data is so readily available in an even better format?
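For concreteness, here's a minimal sketch of what consuming the official dump might look like. The URL below is an assumption based on the usual layout of dumps.wikimedia.org (check the index page for the current file); it streams the compressed XML export and yields page titles and wikitext without ever touching the web interface.

    # Sketch only: DUMP_URL is assumed from the usual dumps.wikimedia.org layout;
    # verify the current filename on the index page before relying on it.
    import bz2
    import urllib.request
    import xml.etree.ElementTree as ET

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    def iter_pages(url):
        """Stream the compressed dump and yield (title, wikitext) pairs."""
        with urllib.request.urlopen(url) as resp:
            with bz2.open(resp, "rb") as xml_stream:
                for _, elem in ET.iterparse(xml_stream):
                    if elem.tag.endswith("}page"):
                        ns = elem.tag[:elem.tag.index("}") + 1]  # XML namespace prefix
                        title = elem.findtext(ns + "title")
                        text = elem.findtext(ns + "revision/" + ns + "text")
                        yield title, text
                        elem.clear()  # free the processed page to keep memory flat

    if __name__ == "__main__":
        for i, (title, _) in enumerate(iter_pages(DUMP_URL)):
            print(title)
            if i >= 9:  # just peek at the first ten titles
                break

A one-time download of a dump this size (tens of gigabytes compressed for English Wikipedia) is still far kinder to the servers than hammering every article URL.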

graemep 5 mins ago
Wikipedia provides dumps, which are probably cheaper and easier to use than crawling it. Given the size of Wikipedia, it would be well worth a little extra code. It also avoids the risk of getting blocked, and is more reliable.

It suggests to me that people running AI crawlers are throwing resources at the problem with little thought.

perching_aix 5 mins ago
I thought all of Wikipedia could be downloaded directly if that's the goal? [0] Why scrape?

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download

skydhash 5 mins ago
The worst thing about this is that Wikipedia has dumps of all its data, which you can download.

wslh 5 mins ago
Wouldn't downloading the publicly available Wikipedia database (e.g. via Torrent [1]) be enough for AI training purposes? I get that this doesn't actually stop AI bots, but captchas and other restrictions would undermine the open nature of Wikipedia.

[1] https://en.wikipedia.org/wiki/Wikipedia:Database_download