Physical AI — Extending the Three-Layer Architecture to the Edge
The Agent / Skills / MCP three-layer model established in the cloud maintains structural integrity when deployed to edge devices and robotics.
INFO
Does the Agent / Skills / MCP three-layer model that works in cloud environments also hold up in edge devices operating in the physical world?
The answer is "yes" — and moreover, it holds up without changing the structure.
Target audience: Engineers who want to understand AI agent architecture beyond the boundaries of software. Also useful for teams designing interfaces with edge AI, IoT, and robotics.
Position of This Page
01-vision (WHY — Why we need authoritative references)
→ 02-reference-sources (WHAT — What to use as references)
→ 03-architecture (HOW — How to structure the system)
→ 04-ai-design-patterns (WHICH — Which pattern to choose and when)
→ 05-solving-ai-limitations (REALITY — How to face real-world constraints)
→ This page (EXTENSION — Extending the three-layer model to the physical world)
Meta Information
| What this chapter establishes | Structural consistency of the three-layer model's edge extension, cloud↔edge symmetry |
| What this chapter does NOT cover | Robotics control details (Motion Planner and below), specific hardware implementation guides |
| Dependencies | 03-architecture (three-layer model), 07-doctrine-and-intent (Doctrine Layer) |
| Common misuse | Treating BitNet as the only edge inference technology. This chapter's claim is "the structure doesn't change" — not dependence on a specific technology |
What Is Physical AI?
Physical AI refers to the technology domain where AI perceives the physical world, makes decisions, and directly acts upon it. Autonomous driving, industrial robots, drones, and humanoids are typical application areas.
Relationship with Embodied AI
In academic contexts, the term Embodied AI is widely used. Embodied AI focuses on "learning through physical embodiment and interaction with the environment," and Physical AI can be positioned as one of its implementation forms. This document uses the term "Physical AI" to align with the context of extending the three-layer architecture to the edge.
The Importance of World Models
What fundamentally distinguishes Physical AI from information-space AI is the need for a World Model — an internal representation of physical world laws. Without understanding gravity, friction, collision, and inertia, a robot cannot operate safely.
Information-space AI: Text/data processing → Physical laws are irrelevant
Physical AI: Real-world action → Understanding gravity, friction, collision is essential
The World Model functions as part of the domain knowledge embedded in the Skills layer, providing physical plausibility to Agent layer decisions.
Traditionally, Physical AI has been discussed as a separate world from software AI. However, the following technological advances are dissolving this boundary:
- BitNet (1.58-bit quantized LLM) — Enables large language model inference on edge devices
- MCP (Model Context Protocol) — Standardizes tool connectivity
- Edge computing evolution — Improves real-time processing capabilities on devices
BitNet b1.58 — Making Edge Inference a Reality
The key to establishing the Agent layer for Physical AI is BitNet b1.58, published by Microsoft Research.
Why "1.58-bit"?
Conventional LLMs store weights as 16-bit (FP16) or 32-bit (FP32) floating-point values. BitNet b1.58 compresses these to an extreme — only three values: {-1, 0, 1}. The number "1.58" derives from the information required to encode three equiprobable values: log₂(3) ≈ 1.58.
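The figure is easy to verify: encoding one of n equiprobable values takes log₂(n) bits, so three values cost log₂(3) bits per weight. A quick check in plain Python:

```python
import math

# Bits required to encode one of n equiprobable values: log2(n).
# BitNet b1.58's weights take one of three values: {-1, 0, 1}.
bits_per_weight = math.log2(3)
print(round(bits_per_weight, 2))  # → 1.58
```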
Conventional LLM: weights = arbitrary floating-point values (FP16: 65,536 possibilities)
BitNet b1.58: weights = only three values {-1, 0, 1}
How It Differs from Conventional Quantization
Existing methods such as GPTQ, AWQ, and QLoRA compress pre-trained models after the fact. A trade-off between precision and compression ratio is unavoidable. In contrast, BitNet b1.58 replaces the Transformer's linear layers with BitLinear and trains from scratch at 1.58-bit. This structurally avoids the quality degradation inherent in post-hoc compression.
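The ternarization step itself can be sketched in a few lines using absmean scaling, as described in the BitNet b1.58 paper. This is a simplified illustration only: the real BitLinear also quantizes activations and applies this inside the training loop, not as a one-shot conversion.

```python
def ternarize(weights, eps=1e-8):
    """Absmean quantization: scale each weight by the mean absolute
    value, then round and clip into the ternary set {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]

print(ternarize([0.4, -1.2, 0.05, 2.0, -0.3, 0.9]))  # → [0, -1, 0, 1, 0, 1]
```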
Existing quantization: Train (FP16) → Post-compress → Quality degradation is unavoidable
BitNet b1.58: Train at 1.58-bit from scratch → Structurally optimized
Concrete Performance
The 70B-parameter BitNet b1.58 model shows the following results compared to LLaMA (FP16) of equivalent scale:
| Metric | BitNet b1.58 vs LLaMA (FP16) |
|---|---|
| Inference speed | 4.1x faster |
| Batch capacity | 11x |
| Throughput | 8.9x |
| Matrix operation energy efficiency | 71.4x |
| ARM CPU speedup | 1.37–5.07x |
| x86 CPU speedup | 2.37–6.17x |
| x86 CPU energy reduction | 71.9–82.2% |
Notably, a 100B-parameter model runs on a single CPU, achieving processing speeds equivalent to human reading speed (5–7 tokens per second).
No Special Hardware Required
The greatest significance of BitNet b1.58 is that it requires no special hardware.
The inference framework BitNet.cpp is built on llama.cpp and has been verified on the following architectures:
| Architecture | Verified Hardware Examples | Use Case |
|---|---|---|
| x86-64 (AVX2) | Intel i7-13800H (laptop), AMD EPYC | Desktop / Server |
| ARM (NEON) | Apple M2, Cobalt 100 | Laptop / Tablet |
| ARM (DOTPROD) | ARM v8.2 and later | Mobile / Edge devices |
Running an LLM at practical speeds on a laptop CPU — this means "edge AI" is no longer confined to research labs. It can be started on the hardware you already have.
The latest parallel kernel optimizations (January 2026) introduced configurable tiling, achieving an additional 1.15–2.1x speedup. Embedding layer quantization (Q6_K format) is also supported, improving memory usage and inference speed while nearly maintaining accuracy.
GPU's Role Changes but Doesn't Disappear
BitNet b1.58 drastically reduces GPU dependency for inference, but a GPU is still required for model training. The accurate framing is not "GPUs become unnecessary" but "inference becomes practical on CPUs." Additionally, being built on llama.cpp makes integration with existing inference pipelines straightforward.
Current Limitations of BitNet b1.58
BitNet is a promising technology, but the following constraints should be recognized:
- Limited pre-trained model selection — Unlike FP16 models, there is no abundant ecosystem of pre-trained BitNet models
- Text generation quality — FP16 models remain superior for high-precision natural language generation tasks
- Ecosystem maturity — Toolchains and community support are still developing
- Fine-tuning methods not established — Domain adaptation techniques for 1.58-bit models are still in the research stage
While BitNet has sufficient precision for Physical AI control tasks (discrete decisions, binary classification), it is not a universal solution. Identifying the right application domain is critical.
Other Edge Inference Approaches Beyond BitNet
This document highlights BitNet b1.58 as a representative example, but other technologies also enable edge inference.
| Approach | Characteristics | Maturity |
|---|---|---|
| GGUF Quantization (llama.cpp) | Post-training quantization (Q4_K_M / Q5_K_M etc.). Largest model selection | High |
| Apple MLX | Inference framework optimized for Apple Silicon | Medium–High |
| TinyLlama / Phi-3-mini | Small-by-design models. Can run on edge without quantization | Medium |
| MediaPipe LLM Inference | Google's mobile / edge inference API | Medium |
The three-layer model's structural claim (separation of responsibilities holds at the edge) remains valid regardless of what powers the Agent layer's inference engine. BitNet stands out among these for being "structurally optimized from scratch rather than post-compressed," aligning well with this document's design philosophy.
Affinity with Physical AI
This efficiency holds particular significance in the context of Physical AI. Robots and drones are fundamentally battery-powered, and 71.4x energy efficiency means not just "it can run on the edge" but "it can operate autonomously on battery for extended periods."
Furthermore, physical world control tasks often don't require the same precision as language generation:
Text generation : Expressing subtle nuances → High precision required
Code generation : Syntactic accuracy → High precision required
──────────────────────────────────────────
Robot control : "Rotate 30 degrees right" → Discrete decisions suffice
Anomaly detection : "Normal / Abnormal" → Close to binary classification
Route selection : "A / B / or C" → Limited choices
1.58-bit weight precision may be insufficient for text generation, but it is fully practical for physical control. This is what makes the combination of quantized models and Physical AI viable.
Three-Layer Mapping: Cloud and Edge
Structural Symmetry
The three-layer model established in the cloud maps directly to the edge:
Layer Correspondence
| Layer | Cloud | Edge / Physical |
|---|---|---|
| Agent | LLMs such as Claude, GPT | Local inference via quantized LLM (BitNet etc.) |
| Skills | Markdown documents, guidelines | Embedded domain knowledge, safety standards, physical parameters |
| MCP | Web API, DB, external services | Sensor input, actuator control, physical device I/O |
Why It Holds Up "Without Changing the Structure"
The essence of the three-layer model is not technical implementation but separation of responsibilities:
Agent = "Decide what should be done"
Skills = "Hold the knowledge needed for decisions"
MCP = "Connect to the outside world and execute"
This separation of responsibilities doesn't change whether the connection target is a Web API or a sensor. What changes is each layer's implementation, not its structure.
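This separation can be made concrete with a small sketch. The class and function names below are illustrative (not from any MCP SDK); the point is that the Agent-layer logic is byte-for-byte identical whether the MCP layer wraps a Web API or a physical sensor.

```python
from typing import Protocol

class Tool(Protocol):
    """MCP layer: connect to the outside world and execute."""
    def read(self) -> dict: ...

class WeatherAPI:
    """Cloud: the tool wraps a Web API call (stubbed here)."""
    def read(self) -> dict:
        return {"source": "web_api", "temp_c": 21.5}

class TempSensor:
    """Edge: the same interface wraps a physical sensor read (stubbed here)."""
    def read(self) -> dict:
        return {"source": "gpio_sensor", "temp_c": 22.1}

def agent_decide(tool: Tool, threshold_c: float) -> str:
    """Agent layer: decide what should be done.
    The decision logic never knows which kind of tool it is talking to."""
    reading = tool.read()
    return "alert" if reading["temp_c"] > threshold_c else "ok"

print(agent_decide(WeatherAPI(), 25.0))  # → ok
print(agent_decide(TempSensor(), 20.0))  # → alert
```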
Implementation Differences
| Aspect | Cloud | Edge |
|---|---|---|
| Inference model | Full-size LLM (tens to hundreds of B parameters) | Quantized model (1.58-bit, several B or less) |
| Knowledge storage | File system / API | Embedded ROM / local storage |
| Tool connectivity | HTTP / JSON-RPC | GPIO / CAN / serial communication |
| Latency requirements | Seconds to minutes | Milliseconds to seconds |
| Connectivity | Always-connected assumed | Offline operation is essential |
| Energy constraints | Data center power | Battery-powered (efficiency is a survival condition) |
Edge AI-Specific Latency Design Considerations
Edge environments require consideration of latency characteristics that differ fundamentally from the cloud.
| Type | Target Range | Design Impact |
|---|---|---|
| Inference latency | 10–100ms | Depends on model size and hardware. Reducible with quantized models |
| Sensor fusion | 1–10ms | Synchronization timing across multiple sensors directly affects decision accuracy |
| Control loop | 1–10ms | Real-time requirements for PID control etc. May require an RTOS (Real-Time OS) |
| Network round-trip | 50–500ms | Round-trip to cloud. Unusable for emergency decisions, necessitating decision distribution design |
| Degraded mode transition | Immediate | Switching to fallback behavior upon communication loss or sensor failure |
These latency constraints are the rationale behind the "decision distribution" collaboration pattern (real-time decisions at edge, advanced analysis in cloud).
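The decision-distribution rationale reduces to a simple routing rule, sketched here with hypothetical numbers drawn from the latency ranges in the table (the function name and defaults are illustrative, not from any framework).

```python
def route_decision(latency_budget_ms: float, cloud_rtt_ms: float = 200.0) -> str:
    """Decision distribution: any decision that cannot wait for a
    cloud round-trip must be made locally on the edge device."""
    return "edge" if latency_budget_ms < cloud_rtt_ms else "cloud"

print(route_decision(latency_budget_ms=100))   # emergency stop → edge
print(route_decision(latency_budget_ms=5000))  # route optimization → cloud
```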
The Indispensability of the Doctrine Layer
For AI operating autonomously in the physical world, the importance of the Doctrine Layer is even higher than in the cloud.
Cloud: Decision error → Data misprocessing, degraded user experience
Edge: Decision error → Physical accidents, potential human casualties
Physical AI doctrine includes elements absent from software:
- Safety Constraints — Hard limits to prevent physical harm
- Fail-safe — Retreat behavior during communication loss or anomalies
- Ethical Constraints — Inviolable rules that prioritize human safety above all
- Real-time Constraints — Acceptable limits for decision latency
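A doctrine layer can be sketched as a hard filter between the Agent's proposed action and the actuators. Every specific below (limit values, action schema) is hypothetical; the structure is the point: inviolable constraints override the Agent's choice, and violations fall back to a safe behavior rather than failing open.

```python
# Hypothetical hard limits; in practice these come from safety standards.
SAFETY_LIMITS = {"max_speed_mps": 1.5, "min_human_distance_m": 0.5}

def enforce_doctrine(action: dict, sensed: dict) -> dict:
    """Doctrine layer: inviolable constraints checked before any
    agent-proposed action reaches the actuators."""
    if sensed["human_distance_m"] < SAFETY_LIMITS["min_human_distance_m"]:
        return {"type": "emergency_stop"}  # human safety overrides everything
    if action.get("speed_mps", 0) > SAFETY_LIMITS["max_speed_mps"]:
        # Clamp rather than reject: degrade the action to a safe version.
        action = {**action, "speed_mps": SAFETY_LIMITS["max_speed_mps"]}
    return action

print(enforce_doctrine({"type": "move", "speed_mps": 3.0},
                       {"human_distance_m": 2.0}))  # speed clamped to 1.5
```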
Irreversibility — Autonomous Decisions in the Physical World
In the software world, decision errors can be "undone," but in the physical world, they can produce irreversible consequences. The most important principle in Physical AI design is recognizing this irreversibility.
- Latency constraints: Emergency stop decisions for robots must complete within 100ms. Round-trips to the cloud are unacceptable — immediate edge-based decisions are mandatory
- Safety margins: Because the cost of decision errors is orders of magnitude higher, Doctrine layer constraints function as safeguards
- Mandatory fail-safe: Retreat behavior during communication loss or sensor anomalies is not optional — it is a design requirement
The nature of decisions doesn't change — what changes is the severity of consequences and latency constraints. This is the essential design challenge of Physical AI.
Correspondence with the OODA Cycle
Physical AI is the most intuitive implementation of the OODA cycle:
| OODA Phase | Three-Layer Correspondence | Physical AI Examples |
|---|---|---|
| Observe | MCP Layer (input) | Data acquisition from cameras, LiDAR, temperature sensors |
| Orient | Skills Layer + Doctrine | Referencing safety standards, classifying situations, determining priorities |
| Decide | Agent Layer | Selecting actions such as "stop," "evade," or "continue" |
| Act | MCP Layer (output) | Motor control, alert notification, communication transmission |
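The mapping above can be expressed as a single loop iteration. The callables are toy stand-ins for the real layers, wired with an illustrative overheat scenario.

```python
def ooda_step(sensors, skills, agent, actuators):
    """One pass of the OODA cycle mapped onto the three layers.
    Each argument is a plain callable standing in for a layer."""
    observation = sensors()            # Observe — MCP layer (input)
    situation = skills(observation)    # Orient  — Skills layer + doctrine
    action = agent(situation)          # Decide  — Agent layer
    return actuators(action)           # Act     — MCP layer (output)

# Toy wiring: a temperature reading classified against a safety standard.
result = ooda_step(
    sensors=lambda: {"temp_c": 85},
    skills=lambda obs: "overheat" if obs["temp_c"] > 80 else "normal",
    agent=lambda situation: "stop" if situation == "overheat" else "continue",
    actuators=lambda action: f"motor:{action}",
)
print(result)  # → motor:stop
```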
From Agent to Robot — The Control Flow
Between the Agent layer's decision and the robot's physical action, the signal passes through multiple control layers. The three-layer model does not replace the control system — it provides decision and knowledge layers above it.
Control Layer — Full Hierarchy from Agent to Robot
| Layer | Responsibility | This Document's Scope |
|---|---|---|
| Agent | High-level intent determination ("Move to Shelf A") | ✅ Three-layer model's Agent layer |
| Task Planner | Decompose intent into subtasks ("Split path into 3 stages") | ⚠️ Boundary area (uses Skills layer knowledge) |
| Motion Planner | Path planning and collision avoidance in physical space | ❌ Conventional robotics domain |
| Controller | Real-time control such as PID and torque control | ❌ Conventional robotics domain |
| Robot | Actuator drive and physical action | ❌ Hardware domain |
This document's scope primarily covers the Agent layer and its boundary with the Task Planner. Motion Planner and below remain the domain of conventional robotics engineering; the three-layer model supplies the decision and knowledge layers above them rather than replacing them.
Cloud × Edge Collaboration Patterns
In actual Physical AI systems, edge devices do not operate in isolation — collaboration with the cloud is assumed:
Collaboration Patterns
| Pattern | Description | Example |
|---|---|---|
| Knowledge Sync | Reflect cloud Skills to the edge | Safety standard updates, distributing new operating parameters |
| Decision Distribution | Real-time decisions at edge, advanced analysis in cloud | Emergency stop is local, route optimization is cloud |
| Status Reporting | Aggregate edge sensor data to the cloud | Anomaly detection log transmission, remote monitoring |
Multi-Agent Systems and A2A
In Physical AI deployments, multi-agent configurations where multiple robots or drones coordinate on tasks are becoming commonplace. Agent-to-Agent (A2A) protocols for inter-agent communication are being put into practice in warehouse robot swarm control, drone formation flight, and similar applications.
However, in the physical world, inter-agent communication is not always guaranteed. Radio interference, distance limitations, and jamming can cause communication blackouts. Whether each agent can act safely and independently during these periods becomes the most critical design challenge.
Communication available: Agent A → Agent B "Avoid Shelf 3" → Direct coordination
Communication lost: Each Agent decides independently based on shared doctrine
→ "Collision avoidance is top priority" "Stop for unknown obstacles"
"Retry connection after 30 seconds"
Here, the role of doctrine elevates from "constraints on individual agents" to "the foundation for distributed consensus." Even without communication, if all agents share the same doctrine, each agent's behavior becomes mutually predictable. This is the very archetype of military doctrine — the same structure as a unit that has lost contact with its commander acting autonomously according to pre-shared principles of action (see 07-doctrine-and-intent).
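A minimal sketch of doctrine as the basis for distributed consensus. The rule names and actions below are illustrative; the mechanism is what matters: an ordered, pre-shared rule table that every agent evaluates identically, so isolated agents remain mutually predictable.

```python
# Pre-shared doctrine: rules ordered by priority, identical on every agent.
DOCTRINE = [
    ("collision_risk", "avoid"),       # collision avoidance is top priority
    ("obstacle_unknown", "stop"),      # stop for unknown obstacles
    ("link_lost", "retry_in_30s"),     # retry connection later
]

def decide_offline(events):
    """With communication lost, the first matching doctrine rule wins."""
    for condition, action in DOCTRINE:
        if condition in events:
            return action
    return "continue_mission"

# Two isolated agents observing the same events reach the same decision,
# and higher-priority rules dominate when events overlap.
assert decide_offline({"collision_risk", "link_lost"}) == "avoid"
print(decide_offline({"link_lost"}))  # → retry_in_30s
```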
Digital Twins and MCP
By exposing a physical device's digital twin (virtual replica) as an MCP server, cloud-side Agents can reference and control the physical device's state. This is a natural extension of MCP's concept of "standardized external tool connectivity" and further blurs the boundary between the physical world and software.
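A digital twin exposed this way can be sketched as an object offering the same read interface as any other tool a cloud Agent uses. Class and field names here are hypothetical; a real deployment would serve this state through an MCP server.

```python
# Hypothetical digital twin: a virtual replica that mirrors device state.
class DeviceTwin:
    def __init__(self):
        self._state = {"battery_pct": 100, "position": (0, 0)}

    def sync(self, telemetry: dict) -> None:
        """Edge device pushes its latest telemetry into the twin."""
        self._state.update(telemetry)

    def read(self) -> dict:
        """Cloud-side Agent reads the mirrored state — never the device directly."""
        return dict(self._state)

twin = DeviceTwin()
twin.sync({"battery_pct": 42, "position": (3, 7)})
print(twin.read()["battery_pct"])  # → 42
```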
What This Means for Software Engineers
Physical AI is not "a different world." With an understanding of the three-layer model, software engineers can participate in architecture design:
The MCP server you're designing right now —
its connection target just changed from a Web API to a sensor.
The Skill domain knowledge you're writing right now —
it just changed from translation guidelines to safety standards.
The Agent decision logic you're configuring right now —
it just changed from text processing to motor control.
The three-layer structure is the same. Only the implementation details change.
Summary
Physical AI is not a "special case" of the three-layer architecture — it is its most direct extension.
| Aspect | Core Message |
|---|---|
| Structural consistency | The separation of responsibilities across Agent / Skills / MCP holds in the physical world as-is |
| Edge inference realized | BitNet b1.58 has made local inference on edge devices practical (71.4x energy efficiency vs LLaMA) |
| Addressing irreversibility | The importance of the Doctrine layer is most pronounced in the physical world, where consequences can be irreversible |
| Design framework | The OODA cycle functions naturally as a design framework for Physical AI |
| Essential difference | The nature of decisions doesn't change — what changes is the severity of consequences and latency constraints |
Architecture learned in the cloud becomes a bridge to the physical world.
Those who understand the structure can adapt regardless of the deployment target.
References
- BitNet.cpp — Microsoft Research — Official 1-bit LLM inference framework repository
- BitNet CPU Inference Optimization — Technical details on parallel kernel implementations, supported architectures, and benchmark results
- BitNet: A 1-bit Model Solving LLM Challenges — Background on BitNet and comparison with existing quantization methods
- The Mechanism and Significance of BitNet b1.58 — Naming rationale for b1.58, performance data, and edge device applicability
Related Documents
- 03-architecture — HOW: Three-layer model structure definition (the foundation for this page's edge extension)
- 04-ai-design-patterns — WHICH: Pattern selection guidelines (relevant to pattern application in edge environments)
- 05-solving-ai-limitations — REALITY: AI constraints and countermeasures (latency and safety constraints are even stricter in the physical world)
- 07-doctrine-and-intent — DOCTRINE: Doctrine and intent design (essential for autonomous decisions in the physical world)