We see dazzling demos and case studies where models ace professional tasks, and vendors frame AI like an ever-fresh superstar intern. In controlled tests, the assignments are clear, the files are attached, and the goal is obvious, so the results look magical. Back at work, we copy that pattern: short prompt, long output. It feels efficient, even considerate. Except the missing ingredient, context, is the hard part, and that's still our job.
OpenAI's new GDPval evaluation measures models on 1,320 real tasks across 44 occupations, created by professionals averaging 14 years of experience. Expert graders compare AI deliverables to human ones in blind reviews. The upshot: today's best models are approaching industry experts on quality, and they can be dramatically cheaper and faster on pure inference time. However, those figures exclude the human oversight and iteration real offices require. Crucially, GDPval tasks ship with rich context, files, and clear deliverables. More context and scaffolding measurably improve performance. In the wild, those conditions are rare.
Meanwhile, the opposite signal is loud at work: "workslop." A Stanford Social Media Lab and BetterUp study (covered in HBR) finds roughly 40% of U.S. desk workers received AI-generated slop recently, with clean-up taking nearly two hours per instance. Analysts peg the drag at around $186 per employee per month, about $9M per year for a 10,000-person firm.
Both can be true. Consider Klarna: its AI assistant handled two-thirds of customer service chats in early rollouts, the work of roughly 700 FTEs, when plugged into well-defined workflows and data. With structured context and tight goals, results followed. (We dig into this a little further in our recent post breaking down Agentic AI.)
What People Think         What Actually Happens
AI "does the job"         You + context do it
One prompt is enough      Brief, then iterate
Speed equals quality      Speed needs oversight
The allure isn't fake. Given the full picture (references, constraints, audience, and success criteria), frontier models often produce drafts that look and score like expert work. That's what GDPval approximates: realistic deliverables plus built-in context and scoring. In those conditions, more reasoning, more task context, and more scaffolding lift quality further. The catch is that most office tasks begin without that clarity. Deciding what to do, pulling the right files, and reconciling conflicts are human moves. Treat AI like a speed amplifier for already-well-framed work and you'll see the lift; treat it like an autonomous coworker for ambiguous problems and you'll ship slop. The difference is the briefing, not the model.
If you believed the myth, here's what changes. Treat AI as a partner, not as a dump-and-run inbox. Start by assembling the story your work needs to tell: the outcome you're aiming for, the audience you must persuade, the tone that fits, and the sources the answer must honor. Your prompt should read like a creative brief. Once you have an output, work in short loops: read what you get, respond with pointed feedback, refine, and only then ship. In recent client work, teams that made this rhythm a habit cut rework while keeping quality steady across people and projects. So build the skill before you spread the tools: begin where outputs and ownership are clear, and save the fuzzy, cross-functional work for later, once the contextual-prompt habit is built.
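To make the creative-brief habit concrete, here is a minimal sketch of the "brief, then iterate" loop in Python. It assumes the official OpenAI Python SDK with an API key in the environment; the brief fields, helper names, and model choice are illustrative assumptions for this post, not a prescribed interface.

```python
# A minimal sketch of the "brief, then iterate" workflow.
# Assumes the official OpenAI Python SDK (pip install openai) and an
# OPENAI_API_KEY set in the environment. The brief fields, helper names,
# and model choice are illustrative assumptions, not a prescribed API.
from openai import OpenAI

client = OpenAI()

def build_brief(outcome, audience, tone, sources, success_criteria):
    """Assemble a creative-brief-style prompt instead of a one-line ask."""
    source_list = "\n".join(f"- {s}" for s in sources)
    return (
        f"Outcome: {outcome}\n"
        f"Audience: {audience}\n"
        f"Tone: {tone}\n"
        f"Sources the answer must honor:\n{source_list}\n"
        f"Success criteria: {success_criteria}\n"
        "Draft the deliverable and flag any assumptions you had to make."
    )

def draft(prompt, model="gpt-4o"):
    """One generation pass; called once per loop iteration."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# First pass: brief the model the way you would brief a new colleague.
brief = build_brief(
    outcome="One-page rollout recommendation for the support team",
    audience="Non-technical VP of Operations",
    tone="Plain and direct, no hype",
    sources=["Q3 support-ticket export", "vendor pricing sheet"],
    success_criteria="Cites both sources; ends with one clear recommendation",
)
version = draft(brief)

# Short loop: pointed feedback, refine, and only then ship.
feedback = "Good structure, but the recommendation ignores the pricing sheet."
version = draft(f"{brief}\n\nPrevious draft:\n{version}\n\nFeedback: {feedback}")
print(version)
```

The shape is the point: structured context up front, then tight feedback loops, rather than one prompt and a prayer.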
Want your team to turn AI from "slop machine" into a real collaborator, with simple, durable habits? Let's talk.