This month we have a contributed article from George Grätzer in which he describes (and demonstrates!) the use of ChatGPT for writing. As you probably know, this chatbot is based on a large language model (LLM). These models are mathematically interesting, and they succeed better than one might expect in producing natural-looking text.
I’m not an expert in LLMs, but what I do understand is perhaps most simply explained by analogy with a much smaller and simpler language model, dating back to the 20th century, sometimes known as “Dissociated Press.” This algorithm, easily implemented by small programs, is essentially a Markov process that repeatedly extends a string of words or characters $A_0 A_1 \cdots A_n$ based on the probabilities that the substring $A_{n-M} \cdots A_{n-1} A_n$ in some source text will be followed by various choices for $A_{n+1}$ (where $M$ is fixed and $n$ increases).
At the letter level, this yields alphabet soup for $M=0$ or $M=1$. As $M$ increases, the output becomes pronounceable, and eventually mostly recognizable as English (assuming that to be the language of the source text). For still larger values the grammar becomes mostly correct, and eventually the output is a sort of manifold built from patches of the source text, like a train running on rails with only very occasional switching points. The word-level version evolves similarly and somewhat faster. In each case, there is a “sweet spot” where the output is pleasantly surreal, somewhat in the style of works from the mathematically influenced Ouvroir de Littérature Potentielle (OuLiPo).
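For the curious, here is a minimal sketch of such a generator in Python (the function names and the toy interface are my own, not taken from any particular historical implementation). Following the convention above, it conditions on the $M+1$ symbols $A_{n-M} \cdots A_n$:

\begin{verbatim}
import random
from collections import defaultdict

def build_table(symbols, M):
    # Map each context A_{n-M}..A_n (M+1 consecutive symbols) to the
    # list of symbols that follow it in the source; duplicates in that
    # list make random.choice sample with the empirical probabilities.
    k = M + 1
    table = defaultdict(list)
    for i in range(len(symbols) - k):
        table[tuple(symbols[i:i + k])].append(symbols[i + k])
    return table

def dissociated_press(source, M=1, length=300, by_words=False):
    # Seed with a random window from the source, then repeatedly
    # extend it by sampling A_{n+1} given A_{n-M}..A_n.
    symbols = source.split() if by_words else list(source)
    table = build_table(symbols, M)
    start = random.randrange(len(symbols) - M - 1)
    out = list(symbols[start:start + M + 1])
    while len(out) < length:
        followers = table.get(tuple(out[-(M + 1):]))
        if not followers:  # this window occurs only at the end of the source
            break
        out.append(random.choice(followers))
    return (" " if by_words else "").join(out)

text = open("source.txt").read()
print(dissociated_press(text, M=3))                 # letter level
print(dissociated_press(text, M=1, by_words=True))  # word level
\end{verbatim}

Run at the letter level for increasing $M$, this reproduces the progression just described, from alphabet soup to patches of the source; the by_words flag gives the word-level version.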
Large language models do something like this, but with more sophisticated modelling algorithms, stacked many layers deep. They are normally trained not on one source text, but on as much writing as the creators can get their hands on. And here’s the first controversy. Using pirated material for this purpose, as for any purpose, is certainly unethical and illegal. But what about material, still in copyright, that has been made available to the public? Is the LLM plagiarizing it? It’s important to understand that an LLM does not store a copy of its training material: rather, it stores a host of “observations” about it, at various levels of abstraction. As this is basically what a human reader does, it might seem reasonable that the LLM should be permitted to do it too, unless the work was released with specific restrictions on its use.
This is complicated, though, by reports of generative AI spitting out half-chewed but recognizable versions of training material, a sort of “cryptomnesia.” It’s not entirely clear to me why this happens, but it might be a matter of certain words or names appearing only in one source work: if a string of text contains, say, a character called Humbert Humbert, the model is left with the “belief” that there are not very many ways for the text to continue, and so it ends up reassembling pieces of Nabokov’s novel. Somebody who had never read “Lolita” might have great difficulty spotting this: as a result, creative writing produced by a generative AI should be published (if at all) with great caution!
These generative AIs tend to write good sentences, which often fit together into plausible paragraphs. ChatGPT will probably not tell you that “colorless green ideas sleep furiously” unless it’s been asked to quote Noam Chomsky. However, as the scale gets larger, the attempt to emulate Searle’s “Chinese room” often starts to fall apart. That so-persuasive prose may state something that is completely wrong: this is known as “hallucination”. It should not be surprising that this happens: the program knows nothing about the world! We can imagine a Dissociated Press program, trained on (say) the text of Oliver Twist, that uses enough letters in its lookup table to yield mostly grammatical sentences involving the familiar characters, but not enough to capture the plot of the book. Users have got into serious trouble over this. There was a recent case in which generative AI was used to write a legal document; unfortunately, it made up some of the sources that it cited. The judge was not amused, and the lawyer was disciplined. In such contexts an old-fashioned template document would probably be much safer.
George (as you will see) is an enthusiastic early adopter, and will tell you something about the fun of using it. You will have to decide for yourself how much to use it for more serious purposes.