Proof-reading, with a bit of help from LLMs
What have I learnt over recent weeks, preparing a corrected version of Introducing Category Theory, invoking – at various stages – my friends Claude, Gemini and ChatGPT as proof-reading assistants? Not very helpfully, the take-home message is that LLMs are both extremely useful and also surprisingly bad at proof-reading. To explain …
Let me backtrack. At the beginning of the year I called a halt to tinkering with a revised version of my notes Introducing Category Theory, and set up a new Amazon print-on-demand paperback, officially as a third edition. In putting together that edition I had made use of all three of those LLMs, uploading the LaTeX files a chapter at a time, asking to hear in particular about (1) definite typos, definite grammatical errors, (2) obscure/ill-written sentences, and (3) mathematical and other errors. I made it clear that the relaxed prose style was entirely intentional, which didn’t stop the LLMs giving a lot of stylistic advice, which I largely ignored.
I was (foolishly, it turns out) fairly confident that the book was in good shape as far as typos and elementary thinkos were concerned. And so I’d planned to set up a hardback-for-libraries version as well. For reasons that don’t at all matter here, this got delayed. But eventually, in early April, I found myself starting to re-check the third edition of Introducing Category Theory before making a hardback edition. Initially I returned to Claude (and remember it was looking at files that had already been checked over by three LLMs, including an earlier version of itself). Rather depressingly, it found more errors. Speculatively, I then gave Gemini the same uncorrected files, and it found more errors too (with markedly less than 50% overlap). Hey ho! Off we go again …
So over recent weeks the whole book was checked again by both Claude and Gemini. On the positive side, they both spotted errors that were not exactly trivial (if a human proof-reader had spotted them, you would say that showed some level of understanding of what was supposed to be going on). On the downside, the failure of the two LLMs to spot the same mistakes wasn’t exactly confidence-inspiring.
So (neurotically?) when I’d cross-checked to the end of the book again, I decided to spot-sample a couple of chapters with ChatGPT. The prompt was brisk, basically “Please let me know about any definite typos, grammatical mistakes, logical or mathematical errors.” And back came a list of yet more errors missed by Claude and Gemini. So we went through the whole book yet again. Sigh. (Oddly, ChatGPT couldn’t stably settle on a format for its chapter-by-chapter reports – successive reports could be arranged quite differently. Claude, I think, wins the prize for elegant usability.)
A list of the worst mistakes in the original printing of the third edition, meaning mistakes that could have led a reader astray or caused puzzlement, is online here. That final trawl using ChatGPT is responsible for over a quarter of the long list, so – tedious though it was – I am glad I made that effort.
Along with those listed mistakes, our LLMs also found about three times as many trivial mistakes – missing punctuation, inverted word order, etc., the kind of mistake which it is so easy for a human reader to miss and which will cause no trouble. But remember, Claude and co. were working from a source text which they had already checked over previously. How come they had missed these errors before?
I am embarrassed (a little, anyway) that that printed third edition had so many errors. But at least my LLM friends found few outright thinkos. So there you have it. The LLMs were indeed very useful and found lots of typos, and also those few thinkos, mistakes that I had repeatedly missed (or which I had introduced by making other changes). And some of their reports of unnecessary obscurities were helpful too, leading to expositional improvements. But each of the three LLMs was quite surprisingly unreliable at finding even bog basic elementary typos: their pattern-matching skills are far from stunningly accurate.
My advice, then, if you are faced with a proof-reading task, is: use all three LLMs (certainly don’t rely on just one). And if it is a book-length and/or maths-heavy task, do get a Pro-level subscription to each for a month or so.
And, after all that, a new cleaner version of ICT (paperback, hardback and free PDF) will be out shortly. I can then get back to real logical matters …
Rowsety Moid comments: When “Claude and co. were working from a source text which they had already checked over previously”, was it the very same, character-by-character identical, text, or had some changes been made (such as fixing the errors they’d found earlier)? (I think it would be an interesting experiment to give them the very same, unaltered, text, if you haven’t already done that.)
Even when they’re given the very same text, I wouldn’t expect them to produce exactly the same output (or find exactly the same errors) as before. They’re stochastic parrots, after all, not deterministic ones.
In any case, I think that behaviour such as missing errors, then finding them on a later attempt, or being “unreliable at finding even bog basic elementary typos”, might seem less mysterious if, instead of thinking of them as having an error-finding or pattern-matching ability, you think of them just as generating text based on the input you’ve given them along with everything that’s in their model, with some randomness involved in the process.
Perhaps that just makes their error-finding behaviour mysterious in a different way, but I think that’s the mystery we actually have.
Pattern-matching is a form of Good Old-Fashioned AI (GOFAI). I don’t think it will be the best way to think about what LLMs are doing.
PS replies: You are right! Of course stochastic parrots will be … stochastic. After posting this, I had the conversation with Gemini and Claude and ChatGPT that I should have had before, about the chancy nature of their error-finding, which makes it unsurprising that they might miss errors in one pass and find them in another. I’ll write up what I learnt some time, if only to save other people from making the same mistake of expecting the LLMs to be more reliable editorial assistants than they are.
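To put rough numbers on that chanciness, here is a minimal sketch in Python. It assumes, purely for illustration, that each proof-reading pass catches any given error independently with the same probability p. Real LLM passes will not be that tidy, and the function names and the values of p are made up for the example, not measured from the reports described above.

```python
import random


def miss_probability(p: float, passes: int) -> float:
    """Chance that one error survives `passes` independent checks,
    each of which catches it with probability p (an idealised assumption)."""
    return (1.0 - p) ** passes


def simulate_passes(num_errors: int, p: float, passes: int, seed: int = 0) -> list[set[int]]:
    """Simulate `passes` proof-reading runs over errors numbered 0..num_errors-1.
    Returns, for each run, the set of error numbers that run happened to catch."""
    rng = random.Random(seed)  # seeded so that a run can be repeated exactly
    return [
        {e for e in range(num_errors) if rng.random() < p}
        for _ in range(passes)
    ]


def overlap(a: set[int], b: set[int]) -> float:
    """Share of the errors found by either run that were found by both."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Hypothetical figures: 200 errors, each pass catching half of them.
    runs = simulate_passes(num_errors=200, p=0.5, passes=3)
    found_by_any = set().union(*runs)
    print("overlap between first two passes:", round(overlap(runs[0], runs[1]), 2))
    print("errors still unfound after three passes:", 200 - len(found_by_any))
    print("predicted survival rate:", miss_probability(0.5, 3))
```

On this idealisation, with p = 0.5 three passes would still leave one error in eight unfound, and two passes that are each quite good can still agree on well under half of what they jointly find. That is at least in the spirit of each further trawl turning up something new.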