Building Specialist Agents — Weight Specialization vs. Context Specialization
From the same architecture and the same base weights, there are two routes to a specialist. They are not mutually exclusive but orthogonal — and combined, they are strongest.
About This Document
When you set out to build "an agent specialized for a specific task," the first design fork is this: do you train the weights (at train-time), or do you arm the model with context and tools (at inference-time)? Choosing by fashion — "just fine-tune it," "just bolt on RAG" — without noticing this fork leads to retraining on every update, or to stuffing style into the prompt until reproducibility collapses.
This page lets you decide on principle. The whole point reduces to one question: where does the specialization live — in the weights (parametric), or in the context (non-parametric)?
Positioning of This Document
This is a strategy-layer document that builds on the three-layer model in 03-architecture and the pattern selection in 04-ai-design-patterns. Where composition-patterns addresses how to combine and local-llm-workspace-mapping addresses where to place, this page addresses whether to bake specialization into the weights or inject it into the context.
Meta Information
| Item | Description |
|---|---|
| What this page establishes | The selection axis, decision heuristic, and hybrid design for weight specialization (train-time) vs. context specialization (inference-time) |
| What this page does NOT cover | Specific fine-tuning procedures or per-model training hyperparameters (see each framework's primary docs) |
| Dependencies | 03-architecture, 04-ai-design-patterns, 08-memory-and-knowledge |
| Common misuse | Trying to bake fresh facts into the weights; trying to write tacit style endlessly into the prompt (see Anti-patterns) |
In One Sentence
From the same architecture and the same base weights there are two routes to a specialist, and because they are orthogonal rather than mutually exclusive, they are strongest combined.
Terminology: Parametric vs. Non-parametric Knowledge
- Parametric knowledge: knowledge and behavior baked into the weights through training. It cannot be changed at inference-time (frozen).
- Non-parametric knowledge: knowledge injected into the context window from outside at inference-time. Supplied each time by retrieval, tools, and documents.
NOTE
The structural reasons — why weights are frozen at inference-time and why context consumes tokens — connect to Knowledge Boundary / Context Window budget on the sister site understanding-llm-through-claude-code (see the links at the end). This page takes those constraints as given and addresses how to design around them.
The Two Routes
Route A: Specialize via Context & Tools (inference-time)
Leave the weights untouched and turn the model into a specialist at inference-time with System Prompt, Skill, MCP, retrieval, and RAG. The brain (weights) stays general; you specialize through equipment. "Custom sub-agents" and "context engineering" belong here. This site's Skills, MCP, and sub-agents are all Route A means.
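To make "specializing through equipment" concrete, here is a minimal sketch of a Route A specialist as plain data: a system prompt, a small document store, and tool descriptions, assembled into the context a frozen model would receive. The class name `RouteAAgent`, the keyword-overlap retrieval, and the prompt layout are all illustrative assumptions, not this site's Skill or MCP formats.

```python
from dataclasses import dataclass, field


@dataclass
class RouteAAgent:
    """Hypothetical specialist defined only by context and tools; no weights change."""
    system_prompt: str
    documents: list[str] = field(default_factory=list)   # retrievable knowledge
    tools: dict[str, str] = field(default_factory=dict)  # tool name -> description

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Toy retrieval: rank documents by the number of words shared with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(doc.lower().split())), doc) for doc in self.documents]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked[:k] if score > 0]

    def build_prompt(self, query: str) -> str:
        """Assemble the context the frozen model would see for this request."""
        tool_lines = [f"- {name}: {desc}" for name, desc in self.tools.items()]
        parts = [
            self.system_prompt,
            "Tools:\n" + "\n".join(tool_lines),
            "Context:\n" + "\n".join(self.retrieve(query)),
            "Question: " + query,
        ]
        return "\n\n".join(parts)


agent = RouteAAgent(
    system_prompt="You are a billing specialist.",
    documents=["Refunds are processed within 5 days", "Invoices are issued monthly"],
    tools={"lookup_invoice": "Fetch an invoice by id"},
)
print(agent.build_prompt("refunds processed"))
```

Updating this specialist means editing `documents` or `system_prompt`; the next call to `build_prompt` reflects the change with no training step.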
Route B: Specialize by Training the Weights (train-time)
Keep the same architecture but train the weights themselves to bake in the specialization. Continued pre-training, fine-tuning, LoRA, and instruction tuning all fall here. Variants like Instruct / Code / Reasoning are all products of Route B.
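As a rough illustration of why LoRA is the lightweight end of Route B: LoRA keeps a `d x k` weight matrix frozen and trains a low-rank update built from a `d x r` and an `r x k` matrix, so the trainable parameter count per matrix is `r * (d + k)` instead of `d * k`. The dimensions below are hypothetical and chosen only to show the arithmetic.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d x k weight matrix."""
    full = d * k          # every entry of the matrix is trainable
    lora = r * (d + k)    # only the two low-rank factors are trainable
    return full, lora


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```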
Comparison
| Aspect | Route A (context & tools) | Route B (training the weights) |
|---|---|---|
| What changes | Input, context, available actions | The weights themselves |
| Where knowledge lives | Context window / external (non-parametric) | Inside the weights (parametric) |
| When knowledge enters | At inference-time (every time) | At train-time (once) |
| Freshness | As current as the source queried at request time | Frozen at training time |
| Update cost | Swap the prompt or document; the change takes effect immediately | Needs GPU, data, retraining |
| Runtime cost | Tokens spent every time + tool round-trip latency | No extra context tokens or tool round-trips per request, so latency stays low |
| Transparency | Traceable what was retrieved (auditable) | Dissolved into the weights (opaque) |
| Main risk | Prompt injection / tool errors | Overfitting / catastrophic forgetting |
| Strong suit | Facts, fresh info, actions (API execution) | Behavior, manners, tone, tacit knowledge |
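The "Update cost" and "Runtime cost" rows trade off against each other: Route A pays a little on every request, Route B pays once up front. A back-of-the-envelope sketch of that break-even follows. The function name and every number are assumptions for illustration, and the model deliberately ignores retraining, which multiplies Route B's cost whenever the content changes.

```python
import math


def break_even_requests(training_cost: float,
                        extra_tokens_per_request: int,
                        price_per_1k_tokens: float) -> int:
    """Number of requests at which a one-time training cost matches the
    cumulative cost of the extra context tokens Route A sends each time."""
    per_request = extra_tokens_per_request / 1000 * price_per_1k_tokens
    if per_request <= 0:
        raise ValueError("Route A must add some per-request cost for a break-even to exist")
    return math.ceil(training_cost / per_request)


# Hypothetical figures: a 500-unit training run vs. 2,000 extra tokens per request.
print(break_even_requests(training_cost=500.0,
                          extra_tokens_per_request=2000,
                          price_per_1k_tokens=0.5))  # 500
```

Below the break-even volume, or whenever the content changes often enough to force retraining, the per-request cost of Route A is usually the cheaper side of the table.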
Decision Heuristic
The single most useful sentence:
TIP
If what you want to teach is "facts, fresh info, actions," use Route A; if it is "behavior, manners, tone, tacit knowledge," use Route B.
Why:
- Baking fresh facts (e.g. today's stock price, the current state of your private repo) into the weights is hopeless. Go fetch it with tools (A).
- Tacit knowledge that is hard to spell out (e.g. house style, reasoning format, fluency in tool use) suits the weights (B) better than writing it endlessly into the prompt.
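The heuristic can be written down as a lookup, which also makes the first design step explicit: list what you want to give the agent, then sort each item by kind. This is a sketch only; the names `choose_route` and `plan` and the example requirements are hypothetical, and the categories are the ones named in the TIP above.

```python
ROUTE_A = "A (context & tools)"
ROUTE_B = "B (train the weights)"

# Categories taken directly from the heuristic.
ROUTE_BY_KIND = {
    "facts": ROUTE_A, "fresh info": ROUTE_A, "actions": ROUTE_A,
    "behavior": ROUTE_B, "manners": ROUTE_B, "tone": ROUTE_B, "tacit knowledge": ROUTE_B,
}


def choose_route(kind: str) -> str:
    """Map one kind of specialization to the route the heuristic suggests."""
    key = kind.strip().lower()
    if key not in ROUTE_BY_KIND:
        raise ValueError(f"unknown kind {kind!r}; expected one of {sorted(ROUTE_BY_KIND)}")
    return ROUTE_BY_KIND[key]


def plan(requirements: dict[str, str]) -> dict[str, list[str]]:
    """Group named requirements by route; two non-empty groups suggest a hybrid."""
    grouped: dict[str, list[str]] = {ROUTE_A: [], ROUTE_B: []}
    for name, kind in requirements.items():
        grouped[choose_route(kind)].append(name)
    return grouped


print(plan({
    "current repo state": "fresh info",
    "house style": "tone",
    "open a pull request": "actions",
}))
```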
Combining Them (Hybrid)
In practice the strongest setup takes both — a specialist variant honed by B, armed with equipment from A (RAG, MCP).
IMPORTANT
A key dependency: the tool-calling capability itself is honed by Route B. Models that call tools reliably have typically been post-trained on tool-calling data, and not every instruction-tuned model has. In other words, "when B is good, A runs well." A and B are not competitors but foundation and superstructure: for Route A's equipment to pay off, the base model's tool-calling aptitude (a product of B) is the prerequisite.
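One way to check that prerequisite before investing in Route A equipment is a small smoke test: send prompts that should trigger a tool and measure how often the output parses as a call to a known tool. The sketch below assumes a hypothetical JSON shape (`{"tool": ..., "arguments": {...}}`) and uses a stub in place of a real model; real tool-call formats vary by model and provider.

```python
import json
from typing import Callable


def tool_call_success_rate(model: Callable[[str], str],
                           prompts: list[str],
                           tool_names: set[str]) -> float:
    """Fraction of prompts for which the model emits a parseable call to a known tool."""
    if not prompts:
        raise ValueError("need at least one prompt")
    ok = 0
    for prompt in prompts:
        try:
            call = json.loads(model(prompt))
        except json.JSONDecodeError:
            continue  # free-form text instead of a structured call
        if (isinstance(call, dict)
                and isinstance(call.get("tool"), str)
                and call["tool"] in tool_names
                and isinstance(call.get("arguments"), dict)):
            ok += 1
    return ok / len(prompts)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: emits a tool call only for weather prompts."""
    if "weather" in prompt:
        return json.dumps({"tool": "get_weather", "arguments": {"city": "Tokyo"}})
    return "I think it is sunny."


rate = tool_call_success_rate(stub_model, ["weather in Tokyo?", "hello"], {"get_weather"})
print(rate)  # 0.5
```

A low rate on prompts that clearly require a tool suggests the limitation sits in the weights, which no amount of additional context equipment will fix.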
Anti-patterns
| Common misdesign | Why it fails | Correct choice |
|---|---|---|
| Fine-tune the model to memorize fresh facts | Frozen at training time, quickly stale; retrain on every update | A (RAG / tools) |
| Write tacit style endlessly into the prompt | Eats context, low reproducibility | B (small fine-tune / LoRA) |
| Try to correct behavior with RAG | Retrieved examples vary per request and do not reliably instill consistent manners | B |
| Dissolve an audit-critical domain into the weights | Can't trace what an answer was based on | A (the retrieved basis remains) |
Design Checklist
- [ ] Have you separated what you want to give — knowledge/actions vs. behavior/manners?
- [ ] Does that knowledge require freshness, volume, or audit? (If so, A.)
- [ ] Is it tacit knowledge that can't be spelled out in words? (If so, B.)
- [ ] How often does it update? (High → A; fixed → B is fine.)
- [ ] If choosing B, have you accounted for catastrophic forgetting and retraining cost?
- [ ] If choosing A, have you accounted for the context budget and prompt injection?
- [ ] For a hybrid, have you verified the base model's tool-calling aptitude (B)?
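For the context-budget item in the checklist, a simple tally is often enough to see whether Route A's equipment fits. The sketch below assumes you have already measured token counts with the target model's own tokenizer; the function name, the labels, and the numbers are hypothetical.

```python
def context_budget(window: int,
                   reserved_for_output: int,
                   components: dict[str, int]) -> dict[str, int]:
    """Summarize how much of the context window Route A's equipment consumes.

    `components` maps a label (system prompt, retrieved docs, tool schemas, ...)
    to a token count measured elsewhere.
    """
    if reserved_for_output >= window:
        raise ValueError("reserved_for_output must be smaller than the window")
    available = window - reserved_for_output
    used = sum(components.values())
    return {"available": available, "used": used, "remaining": available - used}


report = context_budget(
    window=8192,
    reserved_for_output=1024,
    components={"system_prompt": 600, "retrieved_docs": 3000, "tool_schemas": 900},
)
print(report)  # {'available': 7168, 'used': 4500, 'remaining': 2668}
```

A negative `remaining` means something has to be trimmed, retrieved more selectively, or moved into the weights via Route B.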
Related Documents
- 03-architecture — the three-layer model (the layers Route A's equipment lives in)
- 04-ai-design-patterns — which pattern to choose when (WHICH)
- 08-memory-and-knowledge — parametric / non-parametric knowledge and the Memory layer
- composition-patterns — combining Route A's equipment (MCP × Skill × Agent)
- local-llm-workspace-mapping — consuming variants and arming agents in a local LLM environment
- Skills vs MCP — choosing between non-parametric equipment
🔗 Going Deeper: Why Are Weights Frozen, and Why Does Context Cost Budget?
This page addressed the design judgment (What/How) of weight vs. context specialization. For the structural reasons why weights are frozen at inference-time and why context consumes token budget, see the sister site.
- understanding-llm / Knowledge Boundary — the boundary of weight-baked knowledge and its frozen nature
- understanding-llm / Part 2: Context Window — why non-parametric knowledge eats the token budget
Last updated: June 2026