Ask ChatGPT “what was Q1 revenue?” against your warehouse on Monday and you get $42.3M. Ask it the same question on Tuesday and you get $38.1M. Ask it Wednesday with a different phrasing and you get a number that isn’t revenue at all. Your dashboards still tie out. Your finance team is still closing the books on schedule. So what’s going on?
On the same business questions, models without a semantic layer get the right answer about 70% of the time. With a governed semantic layer in front of them, that rises into the 95% to 100% range. This isn’t hallucination and it isn’t a model defect. It’s what happens when you ask a probabilistic system to invent its own definition of every business term, every time. The model sees a question, looks at your warehouse, finds a table that looks like revenue, picks a join, picks a filter, and returns a number. The SQL is valid. The number is wrong. A semantic layer fixes this by resolving “revenue” to one governed definition before the model writes any SQL at all.
What the Public Benchmarks Say About Accuracy
The accuracy numbers are worse than most data leaders assume.
- BIRD benchmark, academic schemas, frontier models: Top text-to-SQL systems land at roughly 75 to 82% execution accuracy against a human baseline near 93%, according to recent benchmarking by Datost. The original BIRD paper from NeurIPS 2023 reported GPT-4 at only 54.89% execution accuracy against the 92.96% human score.
- Enterprise schemas, real warehouses: Performance drops sharply. Research on enterprise text-to-SQL published in 2025 notes that while state-of-the-art models exceed 80% accuracy on academic datasets, “their performance drops dramatically in the face of large, heterogeneous enterprise schemas.” A 2026 VLDB paper from researchers at Tsinghua and others found annotation error rates as high as 52.8% on BIRD and 66.1% on Spider 2.0-Snow, calling into question whether the leaderboard numbers reflect real conditions at all.
- Production curated metrics, with and without a semantic layer: Industry-average accuracy without a semantic layer sits around 70%, according to AtScale’s production benchmark with a UK Tier 1 bank. The bank was getting the answer wrong about 30% of the time. With a commercial semantic layer in front of the model, accuracy rose into the 95% to 100% range depending on the query and how ambiguous the underlying terms were, against the same warehouse and the same metrics.
Forrester analyst Jayesh Chaurasia, in a March 2025 blog post on agentic AI:
“Without explicit context, they guess. And when agents guess, they get joins wrong, misinterpret metrics, and act on flawed assumptions.”
Gartner is making the same call with numbers behind it. At the May 2026 Gartner Data & Analytics Summit in London, Distinguished VP Analyst Rita Sallam told the audience that
“by 2027, organizations that prioritize semantics in AI-ready data will increase agentic AI accuracy by up to 80% and reduce costs by up to 60%.”
Gartner’s survey numbers show the market is already moving: 44% of data and analytics leaders have implemented a semantic layer, and another 48% plan to by 2027. Sallam’s advice to the room:
“establish a context layer as a core component of your data infrastructure.”
Six Failure Modes That Produce Inconsistent Answers
When the same question resolves to different numbers, it’s almost always one of these six things, and usually more than one at the same time.
- Table ambiguity. Your warehouse has thirty-seven tables containing some flavor of the word “revenue.” The model picks one based on column names, descriptions, and surrounding context. Different session, different pick.
- Join ambiguity. The model has to choose how to connect the revenue table to the date table and the customer table. An inner join, a left join, and a CTE with a window function can each produce a different number from the same source data.
- Filter ambiguity. “Q1” can mean calendar Q1, fiscal Q1, the most recent closed quarter, or the trailing 90 days. The model picks based on whatever date column it found.
- Time grain ambiguity. Revenue can be booked on order date, recognized on shipment date, or recognized by accounting period. All three are defensible. Only one is what your CFO calls revenue.
- Definition drift across sessions. A system prompt that defined revenue in session one is gone by session two. The model has no memory.
- Definition drift across users. A prompt the data team wrote isn’t the prompt the sales operations team wrote. Two people ask the same question and get two answers because they have two different upstream instructions.
None of these are model defects. They’re infrastructure gaps. A better foundation model next quarter changes none of them.
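To make the join and date-column failure modes concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The tables, rows, and the q1_revenue helper are invented for illustration and do not come from any benchmark cited here. Three queries that are all valid SQL answer “Q1 revenue” three different ways.

```python
import sqlite3

# Hypothetical toy warehouse: three orders, two known customers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL,
                         order_date TEXT, ship_date TEXT);
    CREATE TABLE customers (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES
        (1, 10,   100.0, '2025-01-15', '2025-01-20'),
        (2, 11,   200.0, '2025-03-30', '2025-04-02'),
        (3, NULL,  50.0, '2025-02-10', '2025-02-12');
    INSERT INTO customers VALUES (10, 'enterprise'), (11, 'smb');
""")
# Order 2 is placed in Q1 but ships in Q2; order 3 has no matching customer.


def q1_revenue(join: str, date_col: str) -> float:
    """Sum order amounts for calendar Q1 2025 under one join type and one date column."""
    sql = f"""
        SELECT SUM(o.amount)
        FROM orders o
        {join} customers c ON o.customer_id = c.customer_id
        WHERE o.{date_col} >= '2025-01-01' AND o.{date_col} < '2025-04-01'
    """
    return conn.execute(sql).fetchone()[0]


results = {
    "inner join, order date": q1_revenue("JOIN", "order_date"),
    "left join, order date": q1_revenue("LEFT JOIN", "order_date"),
    "inner join, ship date": q1_revenue("JOIN", "ship_date"),
}
for label, value in results.items():
    print(f"{label}: {value}")
```

On this toy data the three totals are 300.0, 350.0, and 100.0. Nothing in the schema says which one is “revenue”, and a model has to pick one every time it is asked.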
Why a Semantic Layer Fixes This and SKILL Files Don’t
Most teams reach for a SKILL file. A markdown document, sometimes YAML, that tells the agent what revenue means, what active customer means, where trailing twelve months ends. Drop it in the agent’s context, point the agent at the warehouse, ship it.
A SKILL file is a description. The model reads it and then writes its own SQL anyway. Anthropic published numbers on this from their own internal-reporting work. Pointed at raw data, their agent was right about 21% of the time. With the AtScale semantic layer in front of the same warehouse, accuracy went to 95% with no prompt engineering. Add proper prompts on top of the semantic layer and accuracy approached 100%. The semantic layer did the heavy lifting; the SKILL and the prompts added polish. System prompts, RAG, and fine-tuning have the same problem. They sit at the instruction layer, while the model still chooses the tables, joins, filters, and time grain on every run.
A semantic layer sits between the agent and the warehouse and resolves each business term to one fixed query, regardless of the prompt, the SKILL, the conversation length, or the model version. It enforces access policy on every query, because permissions don’t live in prose. It routes every query to the cheapest correct path instead of rescanning a year of raw data. Write the SKILL if you want one, but build the semantic layer first.
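As a rough sketch of what “resolves each business term to one fixed query” could look like, the snippet below keeps a small metric registry in plain Python. Everything in it (Metric, REGISTRY, SYNONYMS, resolve, the table name certified.revenue_monthly, the role names) is hypothetical and is not AtScale’s or any other vendor’s API. It only illustrates the idea that the term, not the prompt, decides the SQL and the access check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    sql: str                   # the single governed query for this term
    allowed_roles: frozenset   # roles permitted to run it


# Hypothetical registry: one entry per business term.
REGISTRY = {
    "revenue": Metric(
        name="revenue",
        sql="SELECT SUM(recognized_amount) FROM certified.revenue_monthly",
        allowed_roles=frozenset({"finance", "executive"}),
    ),
}

# Different phrasings map to the same governed term.
SYNONYMS = {"sales": "revenue", "net revenue": "revenue"}


def resolve(term: str, role: str) -> str:
    """Return the one governed SQL statement for a term, or refuse."""
    key = term.strip().lower()
    key = SYNONYMS.get(key, key)
    metric = REGISTRY.get(key)
    if metric is None:
        raise KeyError(f"no governed definition for {term!r}")
    if role not in metric.allowed_roles:
        raise PermissionError(f"role {role!r} may not query {metric.name!r}")
    return metric.sql


# Two phrasings, one query.
print(resolve("Revenue", "finance") == resolve("net revenue", "finance"))
```

The design point is that an unknown term or an unauthorized role fails loudly instead of letting the model improvise a query.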
A Concrete Example
Consider the kind of question a finance or product analyst at a SaaS company runs every quarter: API revenue, token consumption, and active workspaces by model and plan tier for the trailing twelve months. (I am borrowing this example from an AtScale post on SKILL files versus semantic layers, which uses it to illustrate the failure pattern. It is not Anthropic’s actual benchmark query.)
Without a semantic layer, the model writes its own SQL: joining the usage logs to the account and product tables, choosing where the trailing-twelve-month window starts, and running a COUNT(DISTINCT) across a year of raw usage data. The first run might be right. The second run almost certainly isn’t, because nothing in the warehouse pins down what “trailing twelve months” means, what counts as an active workspace, or which revenue table is the source of truth. The finance ops analyst who runs this every Monday gets a different answer than she got last Monday, and a different answer than the product analytics lead pulling the same numbers against the same warehouse.
With a semantic layer in front of the model, “trailing twelve months” resolves to one fiscal definition, “active workspace” resolves to one certified count rule, and “API revenue” maps to a maintained aggregate instead of a fresh year-long scan. The analyst gets the same number on Monday as she does on Tuesday, and the same number the product lead gets in Excel, ChatGPT, Claude, Power BI, or anything else pointed at the layer.
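The “trailing twelve months” part of this can be sketched in a few lines. The definition below (the twelve complete calendar months before the month of the request) is an assumption chosen for illustration, not the definition any company in this article uses, and the function names are hypothetical. The contrast is with a naive rolling window, which shifts every day the question is asked.

```python
from datetime import date, timedelta


def naive_window(as_of: date) -> tuple[date, date]:
    """What an ad hoc query often does: the last 365 days ending today."""
    return as_of - timedelta(days=365), as_of


def governed_ttm(as_of: date) -> tuple[date, date]:
    """Assumed governed rule: the 12 complete months before as_of's month.

    Returns a half-open range [start, end).
    """
    end = as_of.replace(day=1)                # first day of the current month
    start = end.replace(year=end.year - 1)    # same month, one year earlier
    return start, end


monday, tuesday = date(2025, 6, 16), date(2025, 6, 17)

print(naive_window(monday) == naive_window(tuesday))    # False: window moved
print(governed_ttm(monday) == governed_ttm(tuesday))    # True: same window
print(governed_ttm(monday))
```

Under the assumed rule, both days resolve to 2024-06-01 through 2025-06-01 (exclusive), so the Monday and Tuesday numbers are computed over the same rows.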
What This Looks Like in One Production Benchmark
The UK commercial banking division of a global Tier 1 bank tested its AI on five everyday analyst questions, at a volume of more than 6,000 queries a day, as documented in a June 2026 AtScale post. Without a semantic layer, the bank’s AI was wrong about 30% of the time. The cost of rediscovery, meaning the model rescanning the warehouse to figure out what “revenue” and “active customer” meant on every query, came to nearly $9 million a year.
With a semantic layer in front of the model, accuracy on the same metrics moved into the 95% to 100% range. Compute per question dropped from $17.93 to less than a tenth of a cent, and data scanned per query went from 3.15 terabytes to 144 megabytes, a 21,903x reduction. Two-thirds of those 6,000 daily questions came back in under a second.
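A quick check of the arithmetic, using only the rounded figures quoted above and assuming decimal units (1 TB = 1,000,000 MB). The rounded inputs give a scan reduction of 21,875x, close to the published 21,903x; the small gap presumably comes from rounding in the published inputs, which is an assumption on my part.

```python
# Rounded figures quoted in the text; decimal units are an assumption.
scan_before_mb = 3_150_000      # 3.15 TB expressed in MB
scan_after_mb = 144             # 144 MB

scan_reduction = scan_before_mb / scan_after_mb
print(f"scan reduction: about {scan_reduction:,.0f}x")

cost_before = 17.93             # dollars per question without the layer
cost_after_ceiling = 0.001      # "less than a tenth of a cent", used as an upper bound

# Because the after-cost is only an upper bound, this ratio is a lower bound.
min_cost_reduction = cost_before / cost_after_ceiling
print(f"cost reduction: at least about {min_cost_reduction:,.0f}x")
```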
The model didn’t get better. The infrastructure underneath it did.
What to Ask Your Data Team
If your AI is giving inconsistent answers on the same business question, the problem isn’t your model. You’ve asked a probabilistic system to make a deterministic decision without a source of truth.
Three questions to put to the people who own your data stack:
- Are our top business metrics defined once, in one place, where AI can use them?
- When AI runs a query, is it guessing and scanning, or using our pre-agreed definitions?
- Can we see what each AI query costs us in compute and what the accuracy floor is on the answers?
If those answers are all no, you’re paying the same tax that bank was paying. The fix is a governed metric layer the model is required to query before it answers. Put the definitions where the model has to find them, and the inconsistency stops.
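One way to start answering the third question is to measure how often the same question returns the same number. The sketch below is a minimal harness under stated assumptions: ask stands in for whatever function sends a question to your AI stack and returns a number, and the two fake backends are invented so the example runs on its own. Agreement is not accuracy; a consistent answer still has to be checked against a certified reference.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def consistency_report(ask: Callable[[str], float], question: str, runs: int = 5) -> dict:
    """Ask the same question several times and summarize how much the answers agree."""
    answers = [ask(question) for _ in range(runs)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "distinct_answers": len(counts),
        "agreement": modal_count / runs,   # share of runs returning the most common answer
        "modal_answer": modal_answer,
    }


# Hypothetical stand-ins for a real backend.
_flaky_values = cycle([42.3, 38.1, 42.3])


def flaky_ask(question: str) -> float:
    return next(_flaky_values)     # answer changes from run to run


def governed_ask(question: str) -> float:
    return 42.3                    # same governed definition every time


flaky = consistency_report(flaky_ask, "what was Q1 revenue?")
governed = consistency_report(governed_ask, "what was Q1 revenue?")
print(flaky)
print(governed)
```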
Frequently Asked Questions
Why does AI give different answers to the same business question?
Without a semantic layer, the model chooses the table, the join, the filter, and the time grain on every run. “Revenue” resolves to a different SQL statement each time. That’s not a bug in the model. It’s a missing layer of infrastructure between the model and the warehouse.
Can prompts or SKILL files fix the inconsistency on their own?
No. Prompts and SKILL files describe the definition but they don’t enforce it at query time. Anthropic’s own agent was accurate about 21% of the time against a raw warehouse. Adding a semantic layer underneath moved that to 95% with no prompt changes.
How much does a semantic layer improve accuracy?
On curated business metrics, accuracy rises from an industry average around 70% into the 95% to 100% range, depending on how ambiguous the underlying terms are. Gartner projects an accuracy improvement of up to 80% and a cost reduction of up to 60% by 2027 for organizations that prioritize semantics in their AI-ready data.