I’ve recently updated my robots.txt file to block crawler bots used to train AI systems. It uses a master list from here, which I found thanks to Kevin. The idea is that I am asking for my content not to be used to train large language models such as ChatGPT.
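For illustration, the blocking entries look something like this. GPTBot and CCBot are the user-agents of OpenAI’s and Common Crawl’s crawlers; this is just a representative excerpt, not the full master list:

```text
# Disallow crawlers used to gather AI training data
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```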
I don’t mind my content being re-used – all of my blog posts carry a Creative Commons license, after all. But it’s the Attribution, Share Alike license, and this is important to note. If an AI were to generate a derivative work based on one of my blog posts, then to comply with the license it must:
- Include an attribution or citation, stating that I wrote it.
- Ensure that the derivative work is also made available under the same license.
AI models don’t really do this – at least not at present. Any text is just hoovered up and combined with billions of other sources. Until such a time that these AI models can respect the terms of the license that my content is published under, they’ll be told to go away in the robots.txt file.
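The effect of those rules can be checked with Python’s standard-library `urllib.robotparser`. This is a minimal sketch: the rules below are an illustrative excerpt (GPTBot is a real AI crawler user-agent), not my actual file.

```python
from urllib import robotparser

# Illustrative robots.txt: AI-training crawlers blocked, everyone else allowed
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved AI crawler should see that it may not fetch anything
print(rp.can_fetch("GPTBot", "https://example.com/some-post/"))    # False
# Other crawlers are unaffected
print(rp.can_fetch("SomeOtherBot", "https://example.com/some-post/"))  # True
```

Note that this only tells a bot what it *should* do – compliance is entirely voluntary.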
I haven’t yet gone as far as blocking these bots outright at the server level. After all, robots.txt is essentially asking nicely; it’s not enforcement, and many bots ignore it. I used to use a WordPress plugin called Bad Behavior to block such bots, but it seems to have been abandoned.
Incidentally, my robots.txt file isn’t a flat file – I’m using the DB robots.txt WordPress plugin to generate it dynamically. This is why it has many other lines in it, instructing other crawlers about what they can and can’t access.