I’ve recently updated my robots.txt file to block crawler bots used to train AI systems. It uses a master list from here, which I found thanks to Kevin. The idea is that I am asking for my content not to be used to train large language models such as ChatGPT.
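For illustration, the blocking entries look something like this. GPTBot and CCBot are the user-agents of OpenAI’s and Common Crawl’s crawlers; this is just a representative excerpt, not the full master list:

```text
# Disallow crawlers used to gather AI training data
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```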
I don’t mind my content being re-used – all of my blog posts carry a Creative Commons license, after all. But it’s the Attribution, Share Alike license, and this is important to note. If an AI were to generate a derivative work based on one of my blog posts, then to comply with the license it must:
- Include an attribution or citation, stating that I wrote it.
- Ensure that the derivative work is also made available under the same license.
AI models don’t really do this – at least not at present. Any text is just hoovered up and combined with billions of other sources. Until such a time that these AI models can respect the terms of the license that my content is published under, they’ll be told to go away in the robots.txt file.
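The effect of those rules can be checked with Python’s standard-library `urllib.robotparser`. This is a minimal sketch: the rules below are an illustrative excerpt (GPTBot is a real AI crawler user-agent), not my actual file.

```python
from urllib import robotparser

# Illustrative robots.txt: AI-training crawlers blocked, everyone else allowed
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved AI crawler should see that it may not fetch anything
print(rp.can_fetch("GPTBot", "https://example.com/some-post/"))    # False
# Other crawlers are unaffected
print(rp.can_fetch("SomeOtherBot", "https://example.com/some-post/"))  # True
```

Note that this only tells a bot what it *should* do – compliance is entirely voluntary.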
I haven’t yet gone as far as blocking these bots outright at the server level. After all, robots.txt is essentially asking nicely; it’s not enforcement, and many bots ignore it. I used to use a WordPress plugin called Bad Behavior to block such bots, but it seems to have been abandoned.
Incidentally, my robots.txt file isn’t a flat file – I’m using the DB robots.txt WordPress plugin to generate it dynamically. This is why it has many other lines in it, instructing other crawlers about what they can and can’t access.