What I Learned About MCP by Fact-Checking Dinner Table Economics

A learning journal from building factcheck.rachithasuresh.com


One agent with 4 tasks failed. Four agents with 1 task each worked.

I started with a single agent that did everything: pick the dataset, choose indicators, select filters, interpret the result. It kept breaking. Not because the model was dumb, but because it had too many decisions to make at once.

The most common failure: over-specifying. I'd ask it to fact-check "inflation is out of control" and instead of querying broad CPI data, it would drill into a specific group, category, and sub-category. The API would return no data. The agent would then confidently explain why there was no data, as if that was the answer.

Too many decisions. Too many places to go wrong.

So I split it into four agents, each with one narrow job. And then I discovered something I didn't expect about model selection.

But first, some context.


What I Built

factcheck.rachithasuresh.com

Every Indian family has that uncle who makes confident economic claims at dinner. I built a tool to fact-check him.

A couple of weeks ago, MoSPI (India's Ministry of Statistics and Programme Implementation) released India's official statistics as an MCP server. Seven datasets covering employment, inflation, GDP, industrial output, and more. I built The Dinner Table Economist on top of it.

Type a claim. Get a verdict backed by real government data.

Busted: "Women are leaving the workforce"
Female LFPR rose from 24.8% to 31.7% in two years. The trend is up, not down.

Busted: "Rural India has it worse than cities"
Rural unemployment was 2.5% vs urban 5.1% in 2023-24. At least on jobs, rural India is doing better.

Complicated: "Only services sector is growing"
Services GVA grew from ₹46.8L crore to ₹56.4L crore, but total GVA also grew. Other sectors are contributing too.

Every verdict comes with actual numbers, a chart, the source dataset, and a backstage panel showing exactly how AI fetched the data through MCP.

My previous project was a parenting game where AI agents debated what you should do with a crying baby at 2am. That taught me about agent coordination. But those agents only had one tool: language. This time I wanted agents that connect to real data, call real APIs, and return real numbers.


The Architecture

Four agents, each with one narrow job:

User claim: "Inflation is out of control"

1. Classifier (GPT-4o-mini): pick the right dataset → CPI. Handoff: dataset=CPI
2. Selector A (GPT-4o-mini): pick indicators → CPI General Index. Handoff: indicator=CPI_General
3. Selector B (GPT-4o-mini): pick filters → All India, 2022-2025. Handoff: raw data (12 rows)
4. Interpreter (GPT-4o): verdict → "It's complicated" + explanation

The claim flows through the pipeline sequentially. Each agent's output feeds the next. Get any step wrong and you get a confidently wrong answer.
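The pipeline can be sketched in a few lines. The agent functions below are illustrative stubs, not the app's actual code; in the real system each one is an LLM call, and the handoff fields match the diagram above:

```python
from dataclasses import dataclass, field

# Stub agents: each makes exactly one narrow decision.
def classify(claim: str) -> str:
    return "CPI" if "inflation" in claim.lower() else "PLFS"

def select_indicator(claim: str, dataset: str) -> str:
    return "CPI_General" if dataset == "CPI" else "LFPR"

def select_filters(claim: str, dataset: str, indicator: str) -> dict:
    return {"geography": "All India", "year": "2022,2023,2024"}

def interpret(claim: str, rows: list) -> str:
    return "It's complicated" if rows else "No data"

@dataclass
class Handoff:
    dataset: str = ""
    indicator: str = ""
    filters: dict = field(default_factory=dict)
    rows: list = field(default_factory=list)

def fact_check(claim: str, fetch) -> str:
    h = Handoff()
    h.dataset = classify(claim)                                # Agent 1
    h.indicator = select_indicator(claim, h.dataset)           # Agent 2
    h.filters = select_filters(claim, h.dataset, h.indicator)  # Agent 3
    h.rows = fetch(h.dataset, h.indicator, h.filters)          # MCP fetch
    return interpret(claim, h.rows)                            # Agent 4
```

The point of the shape: each function receives only what it needs and emits one decision, so a failure is always traceable to a single step.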

I need to be honest about something. This is a pipeline, not a debate. My parenting game had agents with competing objectives that saw each other's outputs and responded. Genuine agent-to-agent interaction. The Dinner Table Economist is sequential specialization. The classifier doesn't know what the interpreter will do. The selectors don't debate which filters are best.

Both are valid patterns. The parenting game taught me about coordination. This one taught me about reliability. In production, most "multi-agent" systems are actually pipelines. And that's fine, because reliability usually matters more than sophistication.


The Insight I Didn't Expect: Small Models Work

I started with GPT-4o for all four agents. It worked but was slow and expensive.

Then I tried an experiment. The classifier's job is simple: read a claim, pick one of 7 datasets. Does it really need a frontier model for that? I switched the classifier to GPT-4o-mini.

Worked just as well.

So I extended mini to the selector agents too. Their job is also narrow: given a dataset, pick the right indicators and filters. One decision per agent.

Still worked. Output quality was indistinguishable from the costlier model's.

The only agent that genuinely needed a stronger reasoning model was the interpreter. Its job is nuanced: read raw data, judge whether a claim is confirmed or busted, write an explanation that's accurate without overstating. That requires real comprehension, not just selection.

The pattern: narrow task + small model = reliable. Complex reasoning task + stronger model = necessary. Match the model to the job, not to the system.
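That matching can live in a plain config table rather than scattered through the code. A sketch, with the routing helper hypothetical (the model names follow the experiment described above):

```python
# One model per agent, sized to the task: small models for the three
# narrow selection jobs, a stronger model only for interpretation.
AGENT_MODELS = {
    "classifier":  "gpt-4o-mini",  # pick one of 7 datasets
    "selector_a":  "gpt-4o-mini",  # pick indicators
    "selector_b":  "gpt-4o-mini",  # pick filters
    "interpreter": "gpt-4o",       # judge the claim, write the verdict
}

def model_for(agent: str) -> str:
    # Fail loudly on an unknown agent instead of silently defaulting
    # to the expensive model.
    return AGENT_MODELS[agent]
```

A table like this also makes the cost structure legible: one glance shows that three quarters of the calls run on the small model.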

After I shipped this, I found that NVIDIA had published a research paper arguing exactly the same thing: small language models are the future of agentic AI. Their recommendation: design modular systems, use SLMs for routine narrow tasks, reserve LLMs for complex reasoning.

I arrived at the same conclusion by breaking things for two weeks.

To scale agentic applications: give each agent the narrowest possible job, and right-size the model. That's the whole insight.


What MCP Actually Looks Like

Before this project, MCP was a buzzword to me. Now I can tell you what it actually means in practice.

MCP (Model Context Protocol) is a standardized way for AI to connect to external data sources. Instead of downloading spreadsheets or scraping websites, your AI tool connects to an MCP server and queries data through a structured protocol.

MoSPI's MCP server exposes four tools, used in sequence:

Step 1: Discover
What datasets does MoSPI have? (7 datasets: PLFS, CPI, NAS, IIP, WPI, ASI, Energy)
Step 2: Explore
What indicators exist in this dataset? (e.g., unemployment rate, LFPR, CPI index)
Step 3: Prepare
What filters are available? (years, states, gender, age groups, sectors)
Step 4: Fetch
Get the actual data with those filters applied

Tools must be called in order. Skip the Prepare step and you'll send invalid filter codes to the Fetch step. The protocol enforces discipline.
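The four steps can be sketched as a strict chain, where each step only uses values the previous step returned. The tool names and the `call_tool` signature here are illustrative, modeled on the step list above; MoSPI's actual tool names may differ:

```python
def run_mcp_chain(call_tool, dataset: str, indicator: str, wanted: dict) -> list:
    # Step 1: Discover — confirm the dataset actually exists
    datasets = call_tool("discover")
    if dataset not in datasets:
        raise ValueError(f"unknown dataset: {dataset}")
    # Step 2: Explore — confirm the indicator exists in that dataset
    indicators = call_tool("explore", dataset=dataset)
    if indicator not in indicators:
        raise ValueError(f"unknown indicator: {indicator}")
    # Step 3: Prepare — get the valid filter codes; never invent codes
    valid = call_tool("prepare", dataset=dataset, indicator=indicator)
    filters = {k: v for k, v in wanted.items() if k in valid and v in valid[k]}
    # Step 4: Fetch — only now touch actual data
    return call_tool("fetch", dataset=dataset, indicator=indicator, filters=filters)
```

The discipline the protocol enforces shows up in the shape of the code: the Fetch call can only ever see filter codes that Prepare vouched for.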

The backstage panel in the app shows this entire chain for every claim. What was asked at each step, what came back, how long it took. If you're curious about what MCP looks like in practice, that panel is the whole point.


What Else Building Revealed

Constraints create reliability

I kept thinking, "if I just prompt better, it'll work."

What actually fixed things were hard rules. No commas in filter values except for years and months. Always pick the latest 3 years. Never include sub-categories unless explicitly requested. Always include required API parameters.

Once those rules existed, the system got boring. And boring turned out to mean correct.
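Rules like these live better as code than as prompt text, because code can refuse. A minimal sketch of the rules listed above; the function name and filter keys are hypothetical:

```python
def apply_hard_rules(filters: dict, available_years: list[str],
                     allow_subcategories: bool = False) -> dict:
    cleaned = dict(filters)
    # Rule: no commas in filter values, except for years and months
    for key, value in cleaned.items():
        if key not in ("year", "month") and "," in str(value):
            raise ValueError(f"comma not allowed in filter {key!r}")
    # Rule: always pick the latest 3 years
    cleaned["year"] = ",".join(sorted(available_years)[-3:])
    # Rule: drop sub-categories unless explicitly requested
    if not allow_subcategories:
        cleaned.pop("sub_category", None)
    return cleaned
```

The agent still proposes the filters; the rules decide what actually reaches the API.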

Connection to the parenting game: That project taught me constraints reveal personality (5-word limits gave agents distinct voices). This project taught me constraints create reliability. Same principle, different application.

Tools shift the failure mode

Once you give agents access to tools, the problem changes completely. It stops being about reasoning and becomes about selection.

Which dataset? Which indicator? Which filter? How granular?

The hardest errors were not hallucinations. They were valid tool calls that returned the wrong slice of data. The single-agent version would query CPI with a specific sub-category for a broad claim and get zero rows back. The pipeline version, with one agent per decision, rarely made that mistake because each agent only had one thing to get right.

The failure mode shifted from "the model said something wrong" to "the model asked for the wrong thing." Narrow agents fix this because there's only one thing to ask for.

Observability is your debugging tool

The biggest improvements came from watching the system break. Logging the raw API responses at each step let me see exactly where things went wrong. An indicator that didn't match the claim. A filter code that was valid but returned the wrong granularity. A data response with 200 rows when I needed 5.
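That logging can be as simple as a wrapper that records the request, the response size, and the latency for every step; this is the raw material for a backstage panel. A sketch, with all names hypothetical:

```python
import json
import time

def logged_call(step: str, fn, **kwargs):
    # Wrap any pipeline step: record what was asked, what came back,
    # and how long it took.
    start = time.perf_counter()
    result = fn(**kwargs)
    entry = {
        "step": step,
        "request": kwargs,
        "rows": len(result) if isinstance(result, list) else None,
        "ms": round((time.perf_counter() - start) * 1000, 1),
    }
    print(json.dumps(entry))  # in the app, this would feed the backstage panel
    return result
```

One structured line per step is enough to answer the three debugging questions: wrong indicator, wrong granularity, or wrong row count.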

Multi-agent work isn't about clever prompting. It's about being able to see what each agent did and why.

The backstage panel in the app started as my debugging tool. It turned out to be the most interesting feature for users too.

Frontend was easy, backend was hard

The frontend took a day. Magic Patterns generated the UI from a prompt, I refined it, done.

The backend took two weeks. Not because of code, but because of everything I described above. Getting the MCP chain right. Making each agent reliable in its narrow job. Constraining the interpreter. Handling a beta government API that sometimes returns unexpected formats.

The hard part of AI products isn't the interface. It's the pipeline behind it.


The Learning Arc So Far

The parenting game taught me what multi-agent AI looks like: competing objectives, agent communication, context-aware responses.

The Dinner Table Economist taught me what it feels like in production: tool use, messy data, constraint design, model selection, and the unsexy work of making things reliable.

The simple insight I'll keep from both: the fewer decisions you give a model, the better it behaves. And the narrower the decision, the smaller the model that can make it.

Try it yourself

Type a claim. See the verdict. Explore MCP in the backstage.

Fact-check a claim