Wikipedia:Wikipedia Signpost/2022-08-01/From the editors

From the editors

Rise of the machines, or something

Here is the deal: it's pretty good at what it does.

– J

There are a few terms that have been thrown around a lot lately: AI, DL, NN, ML, NLP, and more. While a precise definition of all these terms would take multiple paragraphs, the thing they have in common is that a computer is doing some stuff.

For anyone who is not familiar with this alphabet soup, I've written a fairly comprehensive overview of the field's origins and history, as well as an explanation of the technologies involved, here, and ask forgiveness for starting the explanation of a 2019 software release in 1951.

In recent years, the field of machine learning has advanced at a pace which is, depending on who you ask, somewhere between "astounding", "terrifying", "overhyped" and "revolutionary". For example, GPT (2018) was a mildly interesting research tool, GPT-2 (2019) could write human-level text but was barely capable of staying on topic for more than a couple paragraphs, and GPT-3 (2020–22) wrote this month's arbitration report (a full explanation of what I did, how I did it, and responses to the most obvious questions can be found below).

The generative pre-trained transformers (this is what "GPT" stands for) are a family of large language models developed by OpenAI, similar to BERT and XLNet. Perhaps as a testament to the rapidity of developments in the field, even Wikipedia (famous for articles written within minutes of speeches being made and explosions being heard) currently has a redlink for large language models. Much ink has already been spilled on claims of GPTs' sentience, bias, and potential. It's obvious that a computer program capable of writing on the level of humans would have enormous implications for the corporate, academic, journalistic, and literary worlds. While there are certainly some unrealistically hyped-up claims, it's hard to overstate how much these things are capable of, despite their constraints.

The reports

"I see that Wikipedia has finally found a use for my old law books."^[1]

With that said, there are basically two options here.

The first is for me to keep droning on about how these models are a big deal, in a boring wall of text that makes increasingly outlandish and far-fetched claims about their capabilities.
The second is to show you what I am talking about.

I have opted for the second. In this issue, two articles have been written by an AI model called GPT-3: the deletion report and the arbitration report.

For the deletion report, GPT-3 was prompted with a transcript of each discussion in the report, and instructed to write a summary of it in the style of deceased Gonzo journalist Hunter S. Thompson. This produced a mixture of insightful, incisive, and derisive commentary. GPT-Thompson proved quite capable of accurately summarizing the slings and arrows of every discussion in the report – even though it specifically covers the longest and most convoluted AfDs. "Ukrainian Insurgent Army war against Russian occupation", for example, was a whopping 126,000 bytes (and needed to be processed in several segments) but the description was accurate.^[2]
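The text says only that the longest discussion "needed to be processed in several segments", not how the split was done. As a rough sketch of one way to do it, assuming a fixed byte budget per request (the function name and the `max_bytes` parameter are hypothetical, and real model limits are counted in tokens rather than bytes), a transcript can be cut at line boundaries so that no comment is broken mid-sentence:

```python
def split_into_segments(transcript: str, max_bytes: int) -> list[str]:
    """Split a transcript into segments of at most max_bytes (UTF-8),
    breaking only at line boundaries. Hypothetical helper for illustration."""
    segments: list[str] = []
    current: list[str] = []
    current_size = 0
    for line in transcript.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            # A single line that cannot fit would need a finer-grained split.
            raise ValueError("a single line exceeds the segment budget")
        if current and current_size + size > max_bytes:
            # The current segment is full: close it and start a new one.
            segments.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        segments.append("".join(current))
    return segments
```

Joining the segments back together reproduces the original transcript exactly, so nothing is lost by the split; each segment would then be summarized separately.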

For each discussion in the report, I provided a full transcript of the AfD page (with timestamps and long signatures truncated to aid processing), and prompted GPT-3 for a completion, using some variation on the following:

"The following text is a summary of the above discussion, written by Gonzo journalist Hunter S. Thompson for Rolling Stone's monthly feature on Wikipedia deletion processes."

Despite being ostensibly written in Thompson's style, these were generally quite straightforward summaries that covered the arguments made during each discussion, with hardly any profanity.^[3]

Afterwards, I provided the summary itself as a prompt, and asked GPT-Thompson for an "acerbic quip" on each. Unlike the "summary" prompts (in which GPT-Thompson only occasionally chose to accompany his commentary with unprintable calumny and scathing political rants), the "acerbic quip" prompts solely produced output ranging from obscene and irreverent to maliciously slanderous.^[4] Notably, this behavior is identical to what Hunter S. Thompson habitually did in real life, and part of why many editors allegedly loathed working with him. Personally, I didn't mind sifting through the diatribes (some of them were quite entertaining), but having to run each prompt several times to get something usable did make it fairly expensive.
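The two-step process described above (clean the transcript, ask for a summary, then feed the summary back in and ask for a quip) can be sketched as plain string handling. This is an illustration under assumptions, not the code that was actually used: the helper names are made up, only timestamp removal is shown (signature truncation is left out), and the wording of the quip instruction is hypothetical, since the text only says an "acerbic quip" was requested. No API call is shown; the resulting strings would be sent to the model as prompts.

```python
import re

# Matches Wikipedia-style signature timestamps such as
# "01:06, 1 August 2022 (UTC)", plus any spaces or tabs before them.
TIMESTAMP = re.compile(r"[ \t]*\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)")

# Instruction quoted in the text above.
SUMMARY_INSTRUCTION = (
    "The following text is a summary of the above discussion, written by "
    "Gonzo journalist Hunter S. Thompson for Rolling Stone's monthly feature "
    "on Wikipedia deletion processes."
)

# Hypothetical wording; the actual quip prompt is not given in the text.
QUIP_INSTRUCTION = "The following is Thompson's acerbic quip about the above summary:"


def clean_transcript(transcript: str) -> str:
    """Drop signature timestamps so the discussion takes up less room."""
    return TIMESTAMP.sub("", transcript)


def summary_prompt(transcript: str) -> str:
    """Step one: cleaned transcript followed by the summary instruction."""
    return f"{clean_transcript(transcript)}\n\n{SUMMARY_INSTRUCTION}\n\n"


def quip_prompt(summary: str) -> str:
    """Step two: the model's own summary followed by the quip instruction."""
    return f"{summary}\n\n{QUIP_INSTRUCTION}\n\n"
```

Ending each prompt with a blank line leaves the model to continue with the text of the summary or quip itself.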

For the arbitration report, GPT-3 was instructed to write a summary of each page in the style of deceased United States Supreme Court justice Oliver Wendell Holmes, Jr. This produced surprisingly insightful commentary; Justice GPT-Holmes proved able to summarize minute details of proceedings, including some things I'd missed while originally reading them. In general, he was better behaved (and less prone to obscene tirades) than GPT-Thompson, although he did have a tendency toward long-winded digressions, and would often quote entire paragraphs from the source text.^[5]

Similar to the deletion report, input consisted of brief prologues (e.g. "The following is a verbatim transcript of the findings of fact in a Wikipedia arbitration request titled 'WikiProject Tropical Cyclones'"). This was followed by the transcript of the relevant pages (whether they were the main case page, arbitration noticeboard posting, preliminary statements, arbitrator statements, or findings of fact and remedies). Afterwards, a prompt was given for a summary, of the following general form:

The following text is an article written by United States Supreme Court Justice Oliver Wendell Holmes, summarizing the findings of fact and remedies, and their broader implications for the English Wikipedia's jurisprudence.^[6]
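The arbitration prompts described above have three parts: a short prologue, the transcript, and a closing instruction. A minimal sketch of that layout follows, assuming the parts are simply joined with blank lines (the function name and the `constraints` parameter are hypothetical). The optional constraints cover the case mentioned in the notes, where an extra sentence such as "Holmes is not himself an arbitrator." had to be appended.

```python
def build_prompt(
    prologue: str,
    transcript: str,
    instruction: str,
    constraints: tuple[str, ...] = (),
) -> str:
    """Join prologue, transcript, and instruction into one prompt.

    Any extra constraint sentences are appended to the instruction.
    Hypothetical helper for illustration only.
    """
    closing = " ".join((instruction, *constraints))
    parts = [prologue, transcript, closing]
    return "\n\n".join(part.strip() for part in parts) + "\n\n"


prompt = build_prompt(
    prologue=(
        "The following is a verbatim transcript of the findings of fact in a "
        "Wikipedia arbitration request titled 'WikiProject Tropical Cyclones'"
    ),
    transcript="(transcript of the relevant page goes here)",
    instruction=(
        "The following text is an article written by United States Supreme "
        "Court Justice Oliver Wendell Holmes, summarizing the findings of fact "
        "and remedies, and their broader implications for the English "
        "Wikipedia's jurisprudence."
    ),
    constraints=("Holmes is not himself an arbitrator.",),
)
```

Keeping the constraints separate from the base instruction makes it easy to re-run the same page with and without a given nudge.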

Image generation

GPT-Thompson. Image from Craiyon (formerly "DALL-E Mini"), a VQGAN- and BART-based generative adversarial network.
Justice GPT-Holmes. Image from Midjourney, a diffusion network whose architecture is not publicly documented.
Justice Holmes editing a Signpost report. Image from DALL-E 2, a GPT-3 implementation paired with CLIP (Contrastive Language-Image Pre-training) by OpenAI.

We all remember those weird DeepDream images where the sky got turned into dogs. This is a little different.

In addition to text completion, transformers (in conjunction with other technologies) have proven themselves quite capable of image generation. The first of these, broadly speaking, was DALL-E, announced by OpenAI in January 2021. Since then, a number of services have become available, which use a variety of architectures to generate images from natural-language prompts (i.e. a prompt phrased in normal language like "a dog eating the Empire State Building", rather than a procedurally defined set of attributes and subjects written in a specialized description language). Among these are Craiyon (formerly known as "DALL-E Mini", despite having no relation to DALL-E) and Midjourney. For this issue, I used both of these services to generate illustrations for our articles: some came out very impressively, and some came out a little goofy. It was definitely surprising to see it have a coherent response for the prompt "Technoblade's avatar" that actually looked like it – I guess this is what happens when the training set is massive. Anyway, you can see a bunch of these on the issue page. For a comparison between the three models I found usable, see the embedded images above.

DALL-E 2 creates much higher-quality images than what I used, but there's a waitlist for access, and it didn't end up happening by press time (although I did get my friend to generate me one). For a comparison, see below; both were prompted from the string "Teddy bears working on new AI research underwater with 1990s technology".

Craiyon: 256×256 pixels
DALL-E 2: 1620×1620 pixels

While some concerns have been raised about the intellectual property implications of images generated by such models, the determination has been made (at least on Commons) that they're ineligible for copyright due to being the output of a computer algorithm. With respect to moral rights, the idea is generally that they're ripping off human artists because they were trained on a bunch of images from the Internet, including copyrighted ones. However, it's not clear (at least to me) in what way this process differs from the same being done by human artists. As far as I can tell, this is the way that humans have created art for the last several tens of thousands of years: Billie Eilish does not get DMCA claims from the Beatles for writing pop music, and Peter Paul Rubens didn't get in trouble with the Lascaux cavemen (even when he painted obviously derivative works).

Obvious questions

This is a joke, right?: No. I really did have GPT-3 write these articles. I can show you screenshots of my account on the OpenAI website if you want. Copyediting was minimal, and consisted mostly of reordering entries and removing irrelevant asides.
So you just pushed a button and the whole thing popped out?: Not exactly. I organized the layout of each article, determined what sections would go where, and had GPT-3 write the body text of each section according to specific prompts (as described above). It was also necessary to format the model's output in MediaWiki markup. Although GPT-3 is more than capable of writing code (including wikitext), I didn't want to overwhelm it by asking it to do too much stuff at the same time, as this tends to degrade quality.
This is obviously cherry-picked – you didn't just publish the direct output of the model.: Well, we don't do that for human writers, either. I don't even do that for myself – typically, by the time I flag my own stuff to be copyedited, I have gone through multiple stages of writing, rewriting, editing, adding notes for clarification, and deleting unnecessary content.
How do you know it's not completely full of crap?: I don't – every claim that it made was individually fact-checked (we do this for human writers, too). The overwhelming majority were correct, and in the rare cases where it got something wrong, it could be fixed by asking it to complete the prompt again.
Why not just write the articles yourself at that point?: Even accounting for the time spent verifying claims, it was still generally faster than writing the articles myself, as it was capable of structuring full paragraphs of text in seconds. It was sometimes time-consuming to re-prompt it when it would write something incorrect or useless, but there is a sort of art to writing prompts in a way that causes useful answers to be generated, which gradually became easier to do as time went on. For example, replacing "The following is a summary of the discussion" with "The following is a rigorously accurate summary of the discussion" gave better results (yes, this really works).^[7]
So it is a worse version of a human writer?: In terms of typographical errors, it was far better: I don't remember it making a single misspelling. The few grammatical errors it made were minor, and not objectively incorrect (e.g. saying "other editors argue" instead of "other editors argued" for a discussion that the prompt said had already been closed – this is not even really an error per se).
I heard language models were racist.: Language models like GPT-3 predict the most likely completion for a given input sequence, based on their training corpus, which is a very broad spectrum of text from the Internet (ranging from old books to forum arguments to furry roleplay). If the prompt is the phrase "I think the French are bastards because", you will probably end up with a bunch of text about how the French are bastards, much as if you had typed that into a search engine. In this particular instance, I did not observe GPT-3 saying anything prejudiced. This may be due to the people I prompted it to emulate;^[8] presumably, if I had told it to write in the style of Adolf Hitler, I would have gotten some nasty stuff. My solution to this was to not do that.
How much did this whole gimmick cost?: Since I signed up for the GPT-3 beta, I have used it for things other than Signpost writing, so it's hard to tell precisely how much compute went towards these articles. However, the total cost of all the API requests I've made so far is 48.12 USD.
Damn!!: Large language models are notorious for requiring massive amounts of processing power. I still think it's a bargain: imagine how much it would cost to actually hire Hunter S. Thompson and Oliver Wendell Holmes after you adjusted for inflation.
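The idea from the answers above, that a language model predicts "the most likely completion for a given input sequence, based on its training corpus", can be illustrated with a toy far simpler than GPT-3. The sketch below is not how GPT-3 works internally (that is a very large neural network); it just counts which word follows which in a tiny made-up corpus and always picks the most frequent follower. All names here are hypothetical, and the prompt is assumed to be non-empty.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus: str) -> dict[str, Counter]:
    """Count, for every word in the corpus, which words follow it."""
    counts: dict[str, Counter] = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def complete(counts: dict[str, Counter], prompt: str, length: int) -> str:
    """Extend a non-empty prompt by up to `length` words, always choosing
    the follower seen most often after the current last word."""
    words = prompt.split()
    for _ in range(length):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word never appeared mid-corpus: nothing to predict
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


corpus = "the cat sat on the mat the cat sat on the sofa the cat ran"
model = train_bigrams(corpus)
print(complete(model, "the", 3))  # the cat sat on
```

The completion depends entirely on what was in the corpus and what the prompt steers toward, which is the same point made above at a much larger scale.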

Notes

1. This image was generated by DALL-E 2. The caption was also written by GPT-3, in response to a prompt asking it for New Yorker-style captions for cartoons about Oliver Wendell Holmes being resurrected to write for the Signpost.
2. All statements were individually fact-checked, per the "obvious questions" section.
3. One thing I couldn't help but notice was that some of the phrasing was quite similar to phrasing I've used when writing previous deletion reports, like "Ultimately, the discussion resulted in a no consensus decision". My understanding is that GPT-3 included Wikipedia in its corpus, this version was trained in 2021, and as far as I can tell I am the only journalist who has ever written a monthly recap of the ten longest Wikipedia deletion discussions, so it's possible that when instructed to write an AfD report, it simply learned from the best (and/or only). On the other hand, it's also possible that there are just very few ways in which to describe the closure of an AfD, and this happens to be one of them. Who knows.
4. I would provide a highlights reel of these, but I was serious about the "unprintable" part – you'd be reading about it in the next arbitration report.
5. It's not entirely clear to me what causes this. While there are obviously differences between the writing styles of Gonzo journalism and Supreme Court opinions, there are also inherent differences in the format of AfD discussions (which are a bunch of people replying to each other directly) and ArbCom proceedings (which are a bunch of individual sections of highly procedural text). More research is needed on this front.
6. Sometimes, he would start adding his own recommendations, and I would have to append "Holmes is not himself an arbitrator." to the prompt.
7. Kojima, Takeshi; Gu, Shixiang Shane; Reid, Machel; Matsuo, Yutaka; Iwasawa, Yusuke (2022). "Large Language Models are Zero-Shot Reasoners". arXiv:2205.11916 [cs.CL].
8. Thompson, Hunter S. (November 24, 2011). Kingdom of Fear: Loathsome Secrets of a Star-crossed Child in the Final Days of the American Century. Penguin Books Limited. ISBN 9780241958735 – via Google Books. "They speak for all that is cruel and stupid and vicious in the American character. They are the racists and hate mongers among us [...] Fuck them."


Discuss this story


I commissioned GPT-3 to write a poem about this article:

GPT-3, the glorious machine,
Has written an AfD report so fine,
With insights both derisive and sage,
It's sure to make history's pages.
So let's all give three cheers for GPT-3,
The greatest machine we've ever seen,
Long may it reign, and write more reports,
On Wikipedia, the free encyclopedia!

I have some concerns, but I also have no idea what I'm doing, so there's that! Fantastic read, and I'm very interested to see the ongoing implications of this tech. ASUKITE 01:06, 1 August 2022 (UTC)

Thanks for this stimulating piece. I think that this raises questions about the use of generative language models in Wikipedia. Even if the results are mind-blowing, I think that we should refuse the use of generative language models in Wikipedia for several reasons:

An epistemological reason: large language models such as BERT, GPT-3 and the most recent one, BLOOM, are trained using a lot of text from the Internet, including Wikipedia. The quality of those models comes from the fact that they are trained on text written by humans. If we use generative language models on Wikipedia, future language models will be trained on a mixture of human and AI-generated text. I guess that at some point it will become meaningless.
A legal argument: GPT-3 is not open source. It is a proprietary algorithm produced by OpenAI. We should be very suspicious of such a powerful proprietary tool. What happens if the price becomes prohibitive? BLOOM, the most recent model, is not proprietary, but it is not open source either. It uses a responsible AI license (https://huggingface.co/spaces/bigscience/license). It is far better than OpenAI's approach, but it also raises lots of questions.
A technical argument: Wikipedia is not only about writing articles but also about collaborating, explaining decisions and arguing. I don't think that AIs are able to have a real discussion on a talk page, and we should still remember that AIs don't know what is good, true or just. Humans do.

Maybe it would be worth having a sister project from the foundation using an AI-based encyclopedia (Wiki-AI-pedia). Now, we have a problem. It may be very difficult in the near future to detect contributions generated with generative AI. This will be a big challenge. Imagine an AI which would be an expert in vandalism? PAC2 (talk) 06:27, 1 August 2022 (UTC)

Many thanks for this trial which is quite remarkable. A key strength of such bots is that they are good at following rules while humans tend to cut corners. For example, consider the recent cases of Martinevans123 and Lugnuts who have both been pilloried for taking material from elsewhere and doing a weak job of turning it into Wikipedia copy. A good bot seems likely to do a better job of such mechanical editing. As the number of active editors and admins suffers atrophy and attrition, I expect that this is the future. The people with the power and money like Google and the WMF will naturally tend to replace human volunteers with such AI bots. Hasta la vista, baby ... Andrew🐉(talk) 09:29, 1 August 2022 (UTC)
I am completely blown away by this. I have been following these AI developments for some time, but seeing them used for this application with such coherence is unbelievable. I have many confused and contradictory thoughts about the implications of AI advancement on Wikimedia projects, but for now I'll limit myself to one thing I am clear on: whether for good reasons or bad reasons, soon each person in the Wikimedia community will need to be aware of the technological levels of tools like GPT-3, DALL-E and their successors, and this Signpost experiment in writing is a fascinating way to draw people's attention to it. – Bilorv (talk) 14:39, 1 August 2022 (UTC)

The "Damn" part is something I didn't think about and is so true. Thanks for including it! Lectrician1 (talk) 19:25, 1 August 2022 (UTC)

I've been doing something broadly similar to your little exercise for quite a few years. I find an interesting high quality article on a foreign language Wikipedia, use Google Translate to translate it into English, copyedit the translation, and then publish it on en Wikipedia (with appropriate attribution). In fact my first ever Wikipedia article creation (Bernina Railway, created in 2009) was done that way. Over time, the Google Translate translations have become better and better, and with some languages (e.g. French, Italian, Portuguese) they are now generally so good that only minimal copyediting is necessary. I even occasionally receive compliments from native speakers for the quality of my translations from languages such as French (in which I am self-taught, and which I do not speak well), and Italian (which I cannot read or speak). Bahnfrend (talk) 05:34, 2 August 2022 (UTC)

"I heard language models were racist" Don't AI models have some sort of system to block "problematic prompts"? I know that DALL-E 2 blocks problematic prompts as per the following quote in an IEEE article: "Again, the company integrated certain filters to keep generated images in line with its content policy and has pledged to keep updating those filters. Prompts that seem likely to produce forbidden content are blocked and, in an attempt to prevent deepfakes, it can't exactly reproduce faces it has seen during its training. Thus far, OpenAI has also used human reviewers to check images that have been flagged as possibly problematic." Maybe GPT-3 could use a similar system. Tube·of·Light 03:40, 4 August 2022 (UTC)

The GPT-3 used on OpenAI's site has a mandatory content filter model that it goes through; if content is marked as problematic, a warning appears and OpenAI's content policy doesn't allow for reusing the text. 🐶 EpicPupper (he/him | talk) 04:25, 4 August 2022 (UTC)

Exactly. My point is that @JPxG could have mentioned that such models have restrictions to prevent abuse. Tube·of·Light 05:48, 4 August 2022 (UTC)

It should be noted that such filters are often a rather ad-hoc measure, with DALL-E 2 believed to merely be adding keywords like "black," "Women," or "Asian American" randomly to text prompts to make the output appear more diverse. It is fairly easy to get past such filters using prompt engineering, and as such, I would not rely on those filters to protect us from malicious and biased uses. Yitz (talk) 19:12, 4 August 2022 (UTC)

Racism is far more nuanced, pernicious and deeply embedded than just saying the N-word or writing in the style of Hitler. To adapt the common phrase "garbage in, garbage out": racism in, racism out. Take a look at our excellent article on algorithmic bias. If DALL-E 2 isn't specifically designed to avert stereotypes from the dataset then it will perpetuate them (and it's hard to see how it could be – the key novelty of machine learning is that the programmers have little idea how it works). I'm sure if you analyse a large range of its output, you'd find it draws Jewish people or fictional people with Jewish-sounding names as having larger noses than non-Jewish people, or something similarly offensive. However, this is no criticism of The Signpost using curated DALL-E 2, Craiyon and GPT-3 content; I can't see any particular biases in this month's issue. – Bilorv (talk) 22:16, 4 August 2022 (UTC)

'formerly known as "DALL-E Mini", despite having no relation to DALL-E': "Formerly"? Aw, why? The Java / JavaScript relationship was definitely the right model to follow on this. /s -- FeRDNYC (talk) 08:57, 10 August 2022 (UTC)

With regards to the process a neural net uses to create images versus a human artist: the model does not experience qualia. It cannot have intent so it cannot create in the way a human can. Humans created art in prehistory without training on other art because it didn't exist, just like in the modern era artists have created quantum leaps in artistic style like cubism, impressionism etc. The model cannot possibly create anything new. When you learn fine art you don't go look at a Rothko painting and then immediately pick up a bucket of paint, you go through years of learning the foundations of figure drawing, perspective etc. Artists have an understanding of the world and their own interior life that the model cannot possibly have and that's why human works, even if derivative, are art and these images are imitation. Omicron91 (talk) 07:52, 23 August 2022 (UTC)
