Blocking AI crawlers

I’ve recently updated my robots.txt file to block the crawler bots used to train AI systems. It uses a master list from here, which I found thanks to Kevin. In effect, I’m asking for my content not to be used to train large language models such as ChatGPT.
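For illustration, here’s the shape of the entries involved – GPTBot (OpenAI’s crawler) and CCBot (Common Crawl’s) are two of the real user-agent tokens such lists include, though the full list covers many more:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Each block names a crawler by its user-agent string and disallows it from the whole site.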

I don’t mind my content being re-used – all of my blog posts carry a Creative Commons license, after all. But it’s the Attribution-ShareAlike license, and that’s important to note. If an AI were to generate a derivative work based on one of my blog posts, then to comply with the license, it must:

  1. Include an attribution or citation, stating that I wrote it.
  2. Ensure that the derivative work is also made available under the same license.

AI models don’t really do this – at least not at present. Any text is just hoovered up and combined with billions of other sources. Until such time as these AI models can respect the terms of the license that my content is published under, they’ll be told to go away in the robots.txt file.

I haven’t yet gone as far as blocking these bots entirely. After all, robots.txt is essentially asking nicely; it’s not enforcement, and many bots ignore it. I used to use a WordPress plugin called Bad Behavior to block such bots, but it seems to have been abandoned.
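For the record, actually blocking these bots entirely would mean rejecting requests at the server level by user-agent string. A minimal sketch, assuming an Apache host (the usual setup for WordPress) with mod_rewrite enabled – the two agent names are examples, not a complete list:

```
# Refuse requests from known AI crawlers with a 403 Forbidden
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot) [NC]
RewriteRule .* - [F,L]
```

Of course, this only catches bots that are honest about their user-agent; a determined scraper can simply lie.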

Incidentally, my robots.txt file isn’t a flat file – I’m using the DB robots.txt WordPress plugin to generate it dynamically. This is why it has many other lines in it, instructing other crawlers about what they can and can’t access.
