Hey readers,
It's Kelsey Piper here. What should we make of large language models (LLMs)? It’s quite literally a billion-dollar question.
It’s one addressed this week in an analysis by former OpenAI employee Leopold Aschenbrenner, in which he makes the case that we may be only a few years away from large language model-based general intelligence that can serve as a “drop-in remote worker,” doing any task human remote workers do. (He thinks that we need to push ahead and build it so that China doesn’t get there first.)
His (very long but worth reading) analysis is a good encapsulation of one strand of thinking about large language models like ChatGPT: that they are a larval form of artificial general intelligence (AGI) and that as we run larger and larger training runs and learn more about how to fine-tune and prompt them, their notorious errors will largely go away.
It's a view sometimes glossed as “scale is all you need,” meaning more training data and more computing power. GPT-2 was not very good, but then the bigger GPT-3 was much better, the even bigger GPT-4 is better yet, and our default expectation ought to be that this trend will continue. Have a complaint that large language models simply aren’t good at something? Just wait until we have a bigger one. (Disclosure: Vox Media is one of several publishers that has signed partnership agreements with OpenAI. Our reporting remains editorially independent.)
Among the most prominent skeptics of this perspective are two AI experts who otherwise rarely agree: Yann LeCun, Meta’s chief AI scientist, and Gary Marcus, an NYU professor and vocal LLM skeptic. They argue that some of the flaws in LLMs — their difficulty with logical reasoning tasks, their tendency toward “hallucinations” — are not vanishing with scale. They expect diminishing returns from scale in the future and say we probably won’t get to fully general artificial intelligence by just doubling down on our current methods with billions more dollars.
Who’s right? Honestly, I think both sides are wildly overconfident.
Scale does make LLMs a lot better at a wide range of cognitive tasks, and it seems premature and sometimes willfully ignorant to declare that this trend will suddenly stop. I’ve been reporting on AI for six years now, and I keep hearing skeptics declare that there’s some straightforward task LLMs are unable to do and will never be able to do because it requires “true intelligence.” Like clockwork, years (or sometimes just months) later, someone figures out how to get LLMs to do precisely that task.
I used to hear from experts that programming was the kind of thing that deep learning could never be used for, and it’s now one of the strongest aspects of LLMs. When I see someone confidently asserting that LLMs can’t do some complex reasoning task, I bookmark that claim. Reasonably often, it immediately turns out that GPT-4 or its top-tier competitors can do it after all.
I tend to find the skeptics thoughtful and their criticisms reasonable, but their decidedly mixed track record makes me think they should be more skeptical about their skepticism.
We don’t know how far scale can take us
As for the people who think it’s quite likely we’ll have artificial general intelligence inside a few years, my instinct is that they, too, are overstating their case. Aschenbrenner’s argument features the following illustrative graphic:
[Aschenbrenner’s chart: a straight-line extrapolation of model scale over time, with a right-hand axis assigning models rough human equivalents, from GPT-2 up through an “automated AI researcher/engineer”]
I don’t want to wholly malign the “straight lines on a graph” approach to predicting the future; at minimum, “current trends continue” is always a possibility worth considering. But I do want to point out (and other critics have as well) that the right-hand axis here is ... completely invented.
GPT-2 is in no respects particularly equivalent to a human preschooler. GPT-3 is much, much better than elementary schoolers at most academic tasks and, of course, much worse than them at, say, learning a new skill from a few exposures. LLMs are sometimes deceptively human-like in their conversations and engagements with us, but they are fundamentally not very human; they have different strengths and different weaknesses, and it’s very challenging to capture their capabilities by straight comparisons to humans.
Furthermore, we don’t really have any idea where on this graph “automated AI researcher/engineer” belongs. Does it require as many advances as going from GPT-3 to GPT-4? Twice as many? Does it require advances of the sort that didn’t particularly happen when you went from GPT-3 to GPT-4? Why place it six orders of magnitude above GPT-4 instead of five, or seven, or ten? “AGI by 2027 is plausible ... because we are too ignorant to rule it out ... because we have no idea what the distance is to human-level research on this graph's y-axis,” AI safety researcher and advocate Eliezer Yudkowsky responded to Aschenbrenner.
That’s a stance I’m far more sympathetic to. Because we have very little understanding of which problems larger-scale LLMs will be capable of solving, we can’t confidently declare strong limits on what they’ll be able to do before we’ve even seen them. But that means we also can’t confidently declare capabilities they’ll have.
Prediction is hard — especially about the future
Anticipating the capabilities of technologies that don’t yet exist is extraordinarily difficult. Most people who have been doing it over the last few years have gotten egg on their face. For that reason, the researchers and thinkers I respect the most tend to emphasize a wide range of possibilities.
Maybe the vast improvements in general reasoning we saw between GPT-3 and GPT-4 will hold up as we continue to scale models. Maybe they won’t, but we’ll still see vast improvements in the effective capabilities of AI models due to improvements in how we use them: figuring out systems for managing hallucinations, cross-checking model results, and better tuning models to give us useful answers.
Maybe we’ll build generally intelligent systems that have LLMs as a component. Or maybe OpenAI’s hotly anticipated GPT-5 will be a huge disappointment, deflating the AI hype bubble and leaving researchers to figure out what commercially valuable systems can be built without vast improvements on the immediate horizon.
Crucially, you don’t need to believe that AGI is likely coming in 2027 to believe that the possibility and surrounding policy implications are worth taking seriously. I think that the broad strokes of the scenario Aschenbrenner outlines — in which an AI company develops an AI system it can use to aggressively automate ever more of its own internal AI research, leading to a world in which small numbers of people wielding vast numbers of AI assistants and servants can pursue world-altering projects at a speed that doesn’t permit much oversight — describe a real and scary possibility. Many people are spending tens of billions of dollars to bring that world about as fast as possible, and many of them think it’s on the near horizon.
That’s worth a substantive conversation and substantive policy response, even if we think those leading the way on AI are too sure of themselves. Marcus writes of Aschenbrenner — and I agree — that “if you read his manuscript, please read it for his concerns about our underpreparedness, not for his sensationalist timelines. The thing is, we should be worried, no matter how much time we have.”
But the conversation will be better, and the policy response more appropriately tailored to the situation, if we’re candid about how little we know — and if we take that confusion as an impetus to get better at measuring and predicting what we care about when it comes to AI.
— Kelsey Piper, senior writer
10 big things we think will happen in the next 10 years
Longtime newsletter readers know our annual tradition of making predictions for the year ahead — but we’re doing something a little different this week. To mark Vox's 10th anniversary, Vox’s staff dug into the turning points of the last decade, those moments in the news when history shifted course. We also took the opportunity to try to proactively predict what might be some of the meaningful turning points of the next decade.
MDMA’s federal approval drama, briefly explained
If federal approval for MDMA therapy for PTSD happens, it will mark the true beginning of the end of the psychedelic prohibition that has stifled research since 1970. And for a moment, it looked entirely possible that it would happen by August. But a vote this week from an FDA advisory committee might prove otherwise, staff writer Oshan Jarow explains.
Sam Cole at 404 Media reminded me that, as we continue having conversations about nonconsensual AI-generated images, there are at least two people in every deepfake. There’s the person whose face is being artificially spliced in — perhaps Taylor Swift’s — and there’s the person whose faceless, exposed body gets taken over. That person, Cole writes, is almost always a sex worker. Regulations like FOSTA-SESTA, which claims to fight online human trafficking, can inadvertently endanger sex workers by taking away the online platforms they need to network, promote themselves, and screen clients. As state and federal governments debate AI regulation in the coming years, I’ll be paying close attention to what sex workers have to say. —Celia Ford, Future Perfect fellow
One of the most useful sources of info on the current state of AI, and where it’s heading, is the research group Epoch. Their staff keeps tabs on the supply of GPUs and other AI-optimized chips; of machine learning researchers and research output; and of training data, which seems like the most limited resource at this point. Will Henshall at Time has a lovely profile of Epoch’s head and founder, Jaime Sevilla, who seems poised to be the most important AI forecaster (the Nate Silver of AI, if you will) going forward. —Dylan Matthews, senior correspondent
If you think of Martin Van Buren at all, you probably think of two things: He had impeccable mutton chops, and he is the only US president whose first language was not English (it was Dutch). But I quickly developed a new appreciation for Van Buren while listening to Ezra Klein talk with political scientists Daniel Schlozman and Sam Rosenfeld, authors of the new book The Hollow Parties: The Many Pasts and Disordered Present of American Party Politics. Van Buren is revered by poli-sci scholars for being the force behind the first modern political party, the Jacksonian Democrats. Repulsive as parts of their namesake’s policy agenda were, I can’t help thinking, as I watch the modern GOP devolve into a cult of personality, that we would be better off if Van Buren’s vision of political parties as sturdy institutions that could act as a check on messianic candidates had endured. —Dylan Scott, senior correspondent and editor
A New Yorker story with the headline "Are We Doomed?" Yes, please. The novelist and essayist Rivka Galchen sat in on a University of Chicago seminar with precisely that title, where an array of students grappled with the current state of existential risk, from climate change to AI to nuclear weapons. I’m obviously a weirdo about this subject, but this class, led by an astrophysicist and a computational scientist, is an experience I wish I’d had. But Galchen’s focus is rightly on the students who, because of their youth, have to live in the time of existential risk. You’ll be surprised by how they handle it. —Bryan Walsh, editorial director