How did Anthropic measure AI's "theoretical capabilities" in the job market? - Ars Technica

Overview

How did Anthropic measure AI’s “theoretical capabilities” in the job market?

2023 study made a lot of assumptions about future “anticipated LLM-powered software.”

Details

If you follow the ongoing debate over AI’s growing economic impact, you may have seen the graphic below floating around this month. It comes from an Anthropic report on the labor market impacts of AI and is meant to compare the current “observed exposure” of occupations to LLMs (in red) to the “theoretical capability” of those same LLMs (in blue) across 22 job categories.

While the current “observed exposure” area is interesting in its own right, it’s the blue “theoretical capability” that jumps out. At a glance, the graph implies that LLM-based systems could perform at least 80 percent of the individual “job tasks” across a shockingly wide range of human occupations, at least theoretically. It looks like Anthropic is predicting that LLMs will eventually be able to do the vast majority of jobs in broad categories ranging from “Arts & Media” and “Office & Admin” to “Legal, Business & Finance,” and even “Management.”

That “theoretical AI coverage” area seems like it’s destined to eat a huge swath of the US job market!

Digging into the basis for those “theoretical capability” numbers, though, provides a much less chilling image of AI’s future occupational impacts. When you drill down into the specifics, that blue field represents some outdated and heavily speculative educated guesses about where AI is likely to improve human productivity and not necessarily where it will take over for humans altogether.

The LLM “theoretical capability” baseline Anthropic is citing here isn’t based on the company’s own empirical testing of its current models or quantifiable projections of performance increases over time. Instead, Anthropic cites an August 2023 report titled “GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models” co-authored by researchers at Open AI, Open Research, and the University of Pennsylvania.

The researchers start with O*NET’s Detailed Work Activity reports, which break down the individual tasks involved in many jobs at an extremely granular level. They then use a mixture of human annotation and GPT-4-assisted labeling to judge whether “the most powerful Open AI large language model” at the time could reduce the time needed for that individual task by at least 50 percent “with equivalent quality.” If not, they also judged whether access to “anticipated LLM-powered software” might achieve a similar time savings in the future.

Crucially, the humans consulted for this labeling weren’t the ones who actually perform these jobs, or even ones familiar with those jobs. Instead, they were people familiar with the state of the art in AI in 2023, being asked to make broad guesses about where LLMs and future LLM-powered software would be most useful.

The rubric researchers used to make their educated guesses about AI’s job impacts. The broadest impacts were seen in E2, which assumes future “additional software” developed atop LLMs.

The researchers acknowledge that since the human annotators were “mostly unaware of the specific occupations” being evaluated, the “subjectivity of the labeling” forms “a fundamental limitation of our approach.” The results of that labeling show what the researchers call an “unclear logic for aggregating tasks and occupations, as well as some evident discrepancies in labels.” Those are some pretty big caveats for trying to create an objective-looking measure of AI’s occupational impacts.

Digging into the detailed rubric used by the researchers, we can also see the kinds of assumptions they made about occupations that could have the most “direct exposure” to LLMs at the time. That rubric provides many handy examples of the kinds of tasks that LLMs could perform, including:

Writing and transforming text and code according to complex instructions

Providing edits to existing text or code following specifications

Writing code that can help perform a task that used to be done by hand

Generating questions a user might want to ask about a document

All in all, this isn’t a bad list of the kinds of tasks LLMs were best at in 2023. But just because an LLM could perform these tasks to some degree doesn’t necessarily mean it could do so in a way that “can reduce the time it takes to complete the task with equivalent quality by at least half.”

Keep in mind, for instance, that a 2025 study found that open source coders using AI were 19 percent slower than those not using AI once time spent writing prompts and reviewing generated code were taken into account. Also keep in mind LLMs’ well-known penchant for hallucination and sycophancy before assuming that their output would be “of equivalent quality” to a human’s.

The promise of “anticipated LLM-powered software”

Even with this generous reading of 2023-era LLMs’ job-related capabilities, the researchers estimated only about 15 percent of all job-related tasks could be made at least 50 percent more efficient by LLMs at the time. All told, only about 2.3 percent of occupations saw at least 50 percent of their O*NET tasks “exposed” to LLMs of the time in this way.

To get to the scarier numbers shown in the chart from the beginning of this story, the researchers had to start projecting the impact of “anticipated LLM-powered software” on various jobs.

Think back for a second to the state of the AI industry in August 2023, just after the release of Open AI’s GPT-4 model. That moment in time might mark something of a high point for AI hype. Around this time, Elon Musk and others were calling for a six-month pause in AI development out of fears that we “risk loss of control of our civilization,” and Eliezer Yudkowsky was warning that we should be willing to “destroy a rogue datacenter by airstrike” if a superhuman AI entity threatened all life on Earth. Geoffrey Hinton was quitting Google so he could speak out about fears that AI “could actually get smarter than people” and “become impossible to control.” And high-profile work impacts of AI hallucinations were just beginning to gain widespread attention.

This was the environment in which AI experts were being asked to project the future job-altering capabilities of LLM-powered software.

The 𝛽 and 𝜁 lines here assume much larger potential LLM impacts on jobs by incorporating forward-looking projections for “anticipated LLM-powered software.”

Importantly, the researchers didn’t even have a self-imposed time limit for when these effects would be seen in future software. “We do not make predictions about the development or adoption timeline of such LLMs,” the researchers write, creating an essentially unbounded horizon that limits the predictive power of this kind of projection.

Digging in to some of the examples given shows how much the labelers are assuming about LLM capabilities going forward, too. For instance, the researchers predict that negotiating purchases or contracts could be impacted by LLMs because “you could have each party transcribe their point of view and then feed this to an LLM to resolve any disputes.” While we suppose some people might use LLMs in this way at some point, even the researchers blithely admit that “many people would need to buy into using new technological tools to accomplish this.”

It’s these forward-looking assumptions about LLM-powered software that generate the more eye-popping “theoretical capability” numbers, such as those cited by Anthropic. By the most generous read of this measure, the researchers predict that “between 47 and 56 percent of all tasks” will eventually be made at least 50 percent faster by LLMs, and that 19 percent of all workers “are in an occupation where over half of its tasks are labeled as exposed.” That expands to 100 percent of all job-related tasks for some “fully exposed” occupations, including “Mathematicians,” “Writers and Authors,” and “Web and Digital Interface Designers,” according to the researchers.

Even here, though, it’s important to note that the researchers are not suggesting LLMs will be able to replace humans or work unassisted at these tasks. Using LLM-powered software to speed up a human job task is not the same as wholly replacing human labor with that same software.

Sometimes the researchers even make the continued need for human labor explicit. When it comes to prescribing medications, for instance, the researchers note that “the model can provide guesses for different diagnoses and write prescriptions and case notes. However, it still requires a human in the loop using their judgment and knowledge to make the final decision.” The researchers also explicitly note they are performing their analysis “without distinguishing between labor-augmenting or labor-displacing effects.”

Thus far, Anthropic hasn’t found any obvious job market impacts in fields more exposed to current AI use.

When looking at current unemployment statistics, Anthropic says it hasn’t seen any differential impact in the jobs most exposed to current LLM use and those least exposed. But Anthropic also warns that AI’s job impacts could be slow to show up in the job data—much like the impacts of Chinese manufacturing or the Internet—and could be hard to distinguish from regular business cycle concerns.

In any case, Anthropic warns that, while the current AI use it observes does correlate somewhat with these 2023-era projections, that current usage “is far from reaching its theoretical capability: actual coverage remains a fraction of what’s feasible.” But that “feasible” capability is, at this point, based on outdated guesses that even the original researchers admit are highly limited in their usefulness.

“Accurately predicting future LLM applications remains a significant challenge, even for experts,” they wrote at the time. “Some tasks that seem unlikely for LLMs or LLM-powered software to impact today might change with the introduction of new model capabilities. Conversely, tasks that appear exposed might face unforeseen challenges limiting language model applications.”

Because how employers call their employees...has no bearing on their actual duties (which is what ONet and BLS and OEWS actually base their occupational categorization on)

How exactly you choose to "round" LLM impact on a job field. If you use Copilot in Excel once a day--are you impacted? Sort of....but it isn't taking your job necessarily...unless your boss fires you and then discovers LLM work sucks in quality.

         After 16 years and $8 billion, the military's new GPS software still doesn't work

         Water utility announces it's ditching fluoride—then reveals it did so years ago

         F1 in Japan: Oh no, what have they done to all the fast corners?

         Trump convenes "God Squad" to override Endangered Species Act, up oil production

         Judge halts Nexstar/Tegna merger after FCC let firms exceed TV ownership limit

Ars Technica has been separating the signal from the noise for over 25 years. With our unique combination of technical savvy and wide-ranging interest in the technological arts and sciences, Ars is the trusted source in a sea of information. After all, you don’t need to know everything, only what’s important.

Key Takeaways

How did Anthropic measure AI’s “theoretical capabilities” in the job market
2023 study made a lot of assumptions about future “anticipated LLM-powered software
If you follow the ongoing debate over AI’s growing economic impact, you may have seen the graphic below floating around this month
While the current “observed exposure” area is interesting in its own right, it’s the blue “theoretical capability” that jumps out
That “theoretical AI coverage” area seems like it’s destined to eat a huge swath of the US job market