Why would you crawl the web interface when the data is so readily available in an even better format?
graemep [3 hidden]5 mins ago
Wikipedia provides dumps. Probably cheaper and easier than crawling it. Given the size of Wikipedia, it would be well worth a little extra code. It also avoids the risk of getting blocked, and is more reliable.
It suggests to me that people running AI crawlers are throwing resources at the problem with little thought.
perching_aix [3 hidden]5 mins ago
I thought all of Wikipedia can be downloaded directly if that's the goal? [0] Why scrape?
The worst thing about it is that Wikipedia has dumps of all its data, which you can download.
wslh [3 hidden]5 mins ago
Wouldn't downloading the publicly available Wikipedia database (e.g. via Torrent [1]) be enough for AI training purposes? I get that this doesn't actually stop AI bots, but captchas and other restrictions would undermine the open nature of Wikipedia.
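The dump-based approach the commenters describe can be sketched briefly. This assumes the standard `dumps.wikimedia.org/<wiki>/<date>/` layout linked from the database download page; the exact filename pattern should be verified against the dump index before relying on it.

```python
# Sketch: construct the URL of a Wikipedia database dump instead of
# crawling article pages one by one.
# Assumption: the usual dumps.wikimedia.org layout, where the main
# article dump is named <wiki>-<date>-pages-articles.xml.bz2.

def dump_url(wiki: str = "enwiki", date: str = "latest") -> str:
    """Return the URL of the pages-articles dump for a wiki and dump date."""
    filename = f"{wiki}-{date}-pages-articles.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{filename}"

# The file can then be fetched once (HTTP or torrent) rather than
# generating millions of page requests against the live site.
```

One bulk download replaces the entire crawl, which is the commenters' point: it is cheaper for the downloader and far gentler on Wikipedia's servers.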
[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download
[1] https://en.wikipedia.org/wiki/Wikipedia:Database_download