
Physical AI — Extending the Three-Layer Architecture to the Edge

The Agent / Skills / MCP three-layer model established in the cloud maintains structural integrity when deployed to edge devices and robotics.

INFO

Does the Agent / Skills / MCP three-layer model that works in cloud environments also hold up in edge devices operating in the physical world?

The answer is "yes" — and moreover, it holds up without changing the structure.

Target audience: Engineers who want to understand AI agent architecture beyond the boundaries of software. Also useful for teams designing interfaces with edge AI, IoT, and robotics.

Position of This Page

01-vision (WHY — Why we need authoritative references)
02-reference-sources (WHAT — What to use as references)
03-architecture (HOW — How to structure the system)
04-ai-design-patterns (WHICH — Which pattern to choose and when)
05-solving-ai-limitations (REALITY — How to face real-world constraints)
This page (EXTENSION — Extending the three-layer model to the physical world)

Meta Information
| Item | Description |
|---|---|
| What this chapter establishes | Structural consistency of the three-layer model's edge extension; cloud↔edge symmetry |
| What this chapter does NOT cover | Robotics control details (Motion Planner and below); specific hardware implementation guides |
| Dependencies | 03-architecture (three-layer model), 07-doctrine-and-intent (Doctrine Layer) |
| Common misuse | Treating BitNet as the only edge inference technology. This chapter's claim is "the structure doesn't change", not dependence on a specific technology |

Position Within the Document Series

What Is Physical AI?

Physical AI refers to the technology domain where AI perceives the physical world, makes decisions, and directly acts upon it. Autonomous driving, industrial robots, drones, and humanoids are typical application areas.

Relationship with Embodied AI

In academic contexts, the term Embodied AI is widely used. Embodied AI focuses on "learning through physical embodiment and interaction with the environment," and Physical AI can be positioned as one of its implementation forms. This document uses the term "Physical AI" to align with the context of extending the three-layer architecture to the edge.

The Importance of World Models

What fundamentally distinguishes Physical AI from information-space AI is the need for a World Model — an internal representation of physical world laws. Without understanding gravity, friction, collision, and inertia, a robot cannot operate safely.

Information-space AI: Text/data processing → Physical laws are irrelevant
Physical AI:          Real-world action → Understanding gravity, friction, collision is essential

The World Model functions as part of the domain knowledge embedded in the Skills layer, providing physical plausibility to Agent layer decisions.
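A world model need not be elaborate to add value. As a minimal sketch (all constants and function names are illustrative, not from any real system), even a single physical formula held as Skills-layer knowledge lets the system veto commands the physical world would reject:

```python
# Minimal "world model" sketch: physical parameters held in the Skills
# layer give Agent decisions a plausibility check. Values are illustrative.
GRAVITY = 9.81        # m/s^2
FRICTION_COEFF = 0.6  # assumed tire/floor static friction

def min_stopping_distance(speed_mps: float) -> float:
    """Friction-limited stopping distance: v^2 / (2 * mu * g)."""
    return speed_mps ** 2 / (2 * FRICTION_COEFF * GRAVITY)

def plausible(command: dict, obstacle_m: float) -> bool:
    """Reject commands that physics makes unsafe at the current distance."""
    return min_stopping_distance(command["speed_mps"]) < obstacle_m

print(plausible({"speed_mps": 2.0}, obstacle_m=1.0))  # → True (can stop in ~0.34 m)
```

The point is structural: the formula lives in the Skills layer as domain knowledge, while the Agent layer only consumes its verdict.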

Traditionally, Physical AI has been discussed as a separate world from software AI. However, the following technological advances are dissolving this boundary:

  • BitNet (1.58-bit quantized LLM) — Enables large language model inference on edge devices
  • MCP (Model Context Protocol) — Standardizes tool connectivity
  • Edge computing evolution — Improves real-time processing capabilities on devices

BitNet b1.58 — Making Edge Inference a Reality

The key to establishing the Agent layer for Physical AI is BitNet b1.58, published by Microsoft Research.

Why "1.58-bit"?

Conventional LLMs store weights as 16-bit (FP16) or 32-bit (FP32) floating-point values. BitNet b1.58 compresses these to an extreme — only three values: {-1, 0, 1}. The number "1.58" derives from the information required to encode three equiprobable values: log₂(3) ≈ 1.58.

Conventional LLM: weights = arbitrary floating-point values (FP16: 65,536 possibilities)
BitNet b1.58:     weights = only three values {-1, 0, 1}
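The arithmetic behind the name can be checked directly:

```python
import math

# Information content of one ternary weight: three equiprobable values
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.2f} bits per weight")  # ~1.58

# Storage compression relative to FP16 (16 bits per weight)
print(f"~{16 / bits_per_weight:.1f}x smaller than FP16")  # ~10.1x
```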

How It Differs from Conventional Quantization

Existing methods such as GPTQ, AWQ, and QLoRA compress pre-trained models after the fact. A trade-off between precision and compression ratio is unavoidable. In contrast, BitNet b1.58 replaces the Transformer's linear layers with BitLinear and trains from scratch at 1.58-bit. This structurally avoids the quality degradation inherent in post-hoc compression.

Existing quantization: Train (FP16) → Post-compress → Quality degradation is unavoidable
BitNet b1.58:          Train at 1.58-bit from scratch → Structurally optimized
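The ternarization step itself is simple. A sketch of the absmean scheme described for BitNet b1.58 (function name and sample values are illustrative; real training applies this inside every BitLinear layer with a straight-through estimator so gradients still flow in full precision):

```python
def ternarize(weights):
    """Absmean ternarization: map each weight to {-1, 0, 1}.

    Scale by the mean absolute weight, then round and clip.
    """
    gamma = sum(abs(w) for w in weights) / len(weights) or 1e-8
    return [max(-1, min(1, round(w / gamma))) for w in weights]

print(ternarize([0.9, -0.05, -1.2, 0.4]))  # → [1, 0, -1, 1]
```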

Concrete Performance

The 70B-parameter BitNet b1.58 model shows the following results compared to LLaMA (FP16) of equivalent scale:

| Metric | BitNet b1.58 vs LLaMA (FP16) |
|---|---|
| Inference speed | 4.1x faster |
| Batch capacity | 11x |
| Throughput | 8.9x |
| Matrix operation energy efficiency | 71.4x |
| ARM CPU speedup | 1.37–5.07x |
| x86 CPU speedup | 2.37–6.17x |
| x86 CPU energy reduction | 71.9–82.2% |

Notably, a 100B-parameter model runs on a single CPU, achieving processing speeds equivalent to human reading speed (5–7 tokens per second).

No Special Hardware Required

The greatest significance of BitNet b1.58 is that it requires no special hardware.

The inference framework BitNet.cpp is built on llama.cpp and has been verified on the following architectures:

| Architecture | Verified Hardware Examples | Use Case |
|---|---|---|
| x86-64 (AVX2) | Intel i7-13800H (laptop), AMD EPYC | Desktop / Server |
| ARM (NEON) | Apple M2, Cobalt 100 | Laptop / Tablet |
| ARM (DOTPROD) | ARM v8.2 and later | Mobile / Edge devices |

Running an LLM at practical speeds on a laptop CPU — this means "edge AI" is no longer confined to research labs. It can be started on the hardware you already have.

The latest parallel kernel optimizations (January 2026) introduced configurable tiling, achieving an additional 1.15–2.1x speedup. Embedding layer quantization (Q6_K format) is also supported, improving memory usage and inference speed while nearly maintaining accuracy.

GPU's Role Changes but Doesn't Disappear

BitNet b1.58 drastically reduces GPU dependency for inference, but GPUs are still required for model training. The accurate framing is not "GPUs become unnecessary" but "inference becomes practical on CPUs". Additionally, being built on llama.cpp makes integration with existing inference pipelines straightforward.

Current Limitations of BitNet b1.58

BitNet is a promising technology, but the following constraints should be recognized:

  • Limited pre-trained model selection — Unlike FP16 models, there is no abundant ecosystem of pre-trained BitNet models
  • Text generation quality — FP16 models remain superior for high-precision natural language generation tasks
  • Ecosystem maturity — Toolchains and community support are still developing
  • Fine-tuning methods not established — Domain adaptation techniques for 1.58-bit models are still in the research stage

While BitNet has sufficient precision for Physical AI control tasks (discrete decisions, binary classification), it is not a universal solution. Identifying the right application domain is critical.

Other Edge Inference Approaches Beyond BitNet

This document highlights BitNet b1.58 as a representative example, but other technologies also enable edge inference.

| Approach | Characteristics | Maturity |
|---|---|---|
| GGUF quantization (llama.cpp) | Post-training quantization (Q4_K_M / Q5_K_M etc.); largest model selection | High |
| Apple MLX | Inference framework optimized for Apple Silicon | Medium–High |
| TinyLlama / Phi-3-mini | Small-by-design models; can run on edge without quantization | Medium |
| MediaPipe LLM Inference | Google's mobile / edge inference API | Medium |

The three-layer model's structural claim (separation of responsibilities holds at the edge) remains valid regardless of what powers the Agent layer's inference engine. BitNet stands out among these for being "structurally optimized from scratch rather than post-compressed," aligning well with this document's design philosophy.

Affinity with Physical AI

This efficiency holds particular significance in the context of Physical AI. Robots and drones are fundamentally battery-powered, and 71.4x energy efficiency means not just "it can run on the edge" but "it can operate autonomously on battery for extended periods."

Furthermore, physical world control tasks often don't require the same precision as language generation:

Text generation     : Expressing subtle nuances → High precision required
Code generation     : Syntactic accuracy → High precision required
──────────────────────────────────────────
Robot control       : "Rotate 30 degrees right" → Discrete decisions suffice
Anomaly detection   : "Normal / Abnormal" → Close to binary classification
Route selection     : "A / B / or C" → Limited choices

1.58-bit weight precision may be insufficient for text generation, but it is fully practical for physical control. This is what makes the combination of quantized models and Physical AI viable.
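To make the claim concrete, here is a toy sketch (action names, features, and weight values are all illustrative): a selector whose weights are restricted to {-1, 0, 1} can still separate a small discrete action space, which is exactly the regime physical control tasks occupy.

```python
# Toy ternary-weight action selector over a tiny discrete action space.
ACTIONS = ["stop", "evade", "continue"]

# One ternary weight row per action over binary features:
# [obstacle_near, obstacle_left, path_clear]
W = {
    "stop":     [1, 0, -1],
    "evade":    [1, 1, -1],
    "continue": [-1, 0, 1],
}

def decide(features):
    """Score each action with a ternary dot product; pick the best."""
    scores = {a: sum(w * x for w, x in zip(W[a], features)) for a in ACTIONS}
    return max(scores, key=scores.get)

print(decide([1, 1, 0]))  # obstacle near and to the left → evade
print(decide([0, 0, 1]))  # path clear → continue
```

No subtle nuance is being expressed here, so nothing is lost by giving up floating-point precision.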

Three-Layer Mapping: Cloud and Edge

Structural Symmetry

The three-layer model established in the cloud maps directly to the edge:

Layer Correspondence

| Layer | Cloud | Edge / Physical |
|---|---|---|
| Agent | LLMs such as Claude, GPT | Local inference via quantized LLM (BitNet etc.) |
| Skills | Markdown documents, guidelines | Embedded domain knowledge, safety standards, physical parameters |
| MCP | Web API, DB, external services | Sensor input, actuator control, physical device I/O |

Why It Holds Up "Without Changing the Structure"

The essence of the three-layer model is not technical implementation but separation of responsibilities:

Agent  = "Decide what should be done"
Skills = "Hold the knowledge needed for decisions"
MCP    = "Connect to the outside world and execute"

This separation of responsibilities doesn't change whether the connection target is a Web API or a sensor. What changes is each layer's implementation, not its structure.
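The claim that only implementations change can be stated directly in code. A sketch (interface and class names are illustrative) using Python protocols: the Agent logic is written once against the three layer interfaces, and only the bindings differ between cloud and edge.

```python
from typing import Protocol

class Skills(Protocol):
    def lookup(self, topic: str) -> str: ...

class MCP(Protocol):
    def observe(self) -> dict: ...
    def act(self, command: str) -> None: ...

def agent_step(skills: Skills, mcp: MCP) -> str:
    """Decision logic: identical whether bound to cloud or edge."""
    state = mcp.observe()
    policy = skills.lookup("safety")
    command = "stop" if state.get("obstacle") and "stop" in policy else "continue"
    mcp.act(command)
    return command

# Edge binding: sensors in, actuators out (stubbed here for illustration)
class EdgeSkills:
    def lookup(self, topic): return "on obstacle: stop"

class EdgeMCP:
    def observe(self): return {"obstacle": True}  # e.g. a LiDAR reading
    def act(self, command): pass                  # e.g. a motor controller

print(agent_step(EdgeSkills(), EdgeMCP()))  # → stop
```

A cloud binding would implement the same two interfaces over HTTP and a document store; `agent_step` would not change.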

Implementation Differences

| Aspect | Cloud | Edge |
|---|---|---|
| Inference model | Full-size LLM (tens to hundreds of billions of parameters) | Quantized model (1.58-bit, a few billion parameters or less) |
| Knowledge storage | File system / API | Embedded ROM / local storage |
| Tool connectivity | HTTP / JSON-RPC | GPIO / CAN / serial communication |
| Latency requirements | Seconds to minutes | Milliseconds to seconds |
| Connectivity | Always-connected assumed | Offline operation is essential |
| Energy constraints | Data center power | Battery-powered (efficiency is a survival condition) |

Edge AI-Specific Latency Design Considerations

Edge environments require consideration of latency characteristics that differ fundamentally from the cloud.

| Type | Target Range | Design Impact |
|---|---|---|
| Inference latency | 10–100ms | Depends on model size and hardware. Reducible with quantized models |
| Sensor fusion | 1–10ms | Synchronization timing across multiple sensors directly affects decision accuracy |
| Control loop | 1–10ms | Real-time requirements for PID control etc. May require an RTOS (Real-Time OS) |
| Network round-trip | 50–500ms | Round-trip to cloud. Unusable for emergency decisions, necessitating decision distribution design |
| Degraded mode transition | Immediate | Switching to fallback behavior upon communication loss or sensor failure |

These latency constraints are the rationale behind the "decision distribution" collaboration pattern (real-time decisions at edge, advanced analysis in cloud).
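One way these numbers shape code (the threshold and helper names are illustrative): the edge control loop enforces its own deadline locally and falls back to degraded behavior rather than waiting on anything network-bound.

```python
import time

CONTROL_DEADLINE_S = 0.010  # 10 ms control-loop budget (illustrative)

def control_step(read_sensors, infer, fail_safe):
    """One loop iteration with a hard local deadline.

    The three callables are supplied by the caller; the cloud is never
    on this path — it only receives status reports afterwards.
    """
    start = time.monotonic()
    state = read_sensors()
    command = infer(state)
    if time.monotonic() - start > CONTROL_DEADLINE_S:
        # Deadline missed: switch to degraded behavior immediately
        return fail_safe(state)
    return command

cmd = control_step(lambda: {"ok": True},   # fast local sensor read
                   lambda s: "continue",   # fast local inference
                   lambda s: "stop")       # degraded-mode fallback
print(cmd)  # → continue (budget met)
```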

The Indispensability of the Doctrine Layer

For AI operating autonomously in the physical world, the importance of the Doctrine Layer is even higher than in the cloud.

Cloud: Decision error → Data misprocessing, degraded user experience
Edge:  Decision error → Physical accidents, potential human casualties

Physical AI doctrine includes elements absent from software:

  • Safety Constraints — Hard limits to prevent physical harm
  • Fail-safe — Retreat behavior during communication loss or anomalies
  • Ethical Constraints — Inviolable rules that prioritize human safety above all
  • Real-time Constraints — Acceptable limits for decision latency

Irreversibility — Autonomous Decisions in the Physical World

In the software world, decision errors can be "undone," but in the physical world, they can produce irreversible consequences. The most important principle in Physical AI design is recognizing this irreversibility.

  • Latency constraints: Emergency stop decisions for robots must complete within 100ms. Round-trips to the cloud are unacceptable — immediate edge-based decisions are mandatory
  • Safety margins: Because the cost of decision errors is orders of magnitude higher, Doctrine layer constraints function as safeguards
  • Mandatory fail-safe: Retreat behavior during communication loss or sensor anomalies is not optional — it is a design requirement

The nature of decisions doesn't change — what changes is the severity of consequences and latency constraints. This is the essential design challenge of Physical AI.
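A doctrine layer of this kind can be sketched as a wrapper that every agent decision must pass through before reaching the actuators (constraint names and values are illustrative):

```python
# Illustrative doctrine layer: hard constraints checked before any
# agent decision reaches the actuators. All values are examples only.
DOCTRINE = {
    "max_speed_mps": 1.5,       # safety constraint: hard velocity cap
    "min_obstacle_m": 0.5,      # safety constraint: keep-out distance
    "max_decision_age_s": 0.1,  # real-time constraint: stale decisions invalid
}

def enforce_doctrine(command: dict, state: dict) -> dict:
    """Return the command if it satisfies doctrine, else a fail-safe stop."""
    if state["obstacle_m"] < DOCTRINE["min_obstacle_m"]:
        return {"action": "stop", "reason": "keep-out distance violated"}
    if state["decision_age_s"] > DOCTRINE["max_decision_age_s"]:
        return {"action": "stop", "reason": "decision too stale"}
    if command.get("speed_mps", 0) > DOCTRINE["max_speed_mps"]:
        # Clamp rather than reject: degrade gracefully within limits
        command = {**command, "speed_mps": DOCTRINE["max_speed_mps"]}
    return command

safe = enforce_doctrine({"action": "move", "speed_mps": 3.0},
                        {"obstacle_m": 2.0, "decision_age_s": 0.02})
print(safe)  # → speed clamped to 1.5
```

Note that the doctrine never generates decisions; it only bounds them, which is what keeps the Agent layer's structure unchanged.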

Correspondence with the OODA Cycle

Physical AI is the most intuitive implementation of the OODA cycle:

| OODA Phase | Three-Layer Correspondence | Physical AI Examples |
|---|---|---|
| Observe | MCP Layer (input) | Data acquisition from cameras, LiDAR, temperature sensors |
| Orient | Skills Layer + Doctrine | Referencing safety standards, classifying situations, determining priorities |
| Decide | Agent Layer | Selecting actions such as "stop," "evade," or "continue" |
| Act | MCP Layer (output) | Motor control, alert notification, communication transmission |
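The mapping above is compact enough to write out as one loop body (all four callables are illustrative stand-ins for real layer implementations):

```python
def ooda_cycle(mcp_in, skills, agent, mcp_out):
    """One pass of the OODA loop mapped onto the three layers."""
    observation = mcp_in()             # Observe : MCP layer (input)
    orientation = skills(observation)  # Orient  : Skills layer + doctrine
    decision = agent(orientation)      # Decide  : Agent layer
    mcp_out(decision)                  # Act     : MCP layer (output)
    return decision

result = ooda_cycle(
    lambda: {"lidar_m": 0.4},                                   # LiDAR read
    lambda obs: "too_close" if obs["lidar_m"] < 0.5 else "clear",
    lambda situation: "stop" if situation == "too_close" else "continue",
    lambda cmd: None,                                           # motor stub
)
print(result)  # → stop
```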

From Agent to Robot — The Control Flow

Between the Agent layer's decision and the robot's physical action, the signal passes through multiple control layers. The three-layer model does not replace the control system — it provides decision and knowledge layers above it.

Control Layer — Full Hierarchy from Agent to Robot
| Layer | Responsibility | This Document's Scope |
|---|---|---|
| Agent | High-level intent determination ("Move to Shelf A") | ✅ Three-layer model's Agent layer |
| Task Planner | Decompose intent into subtasks ("Split path into 3 stages") | ⚠️ Boundary area (uses Skills layer knowledge) |
| Motion Planner | Path planning and collision avoidance in physical space | ❌ Conventional robotics domain |
| Controller | Real-time control such as PID and torque control | ❌ Conventional robotics domain |
| Robot | Actuator drive and physical action | ❌ Hardware domain |

This document's scope primarily covers the Agent layer (and its boundary with the Task Planner). Motion Planner and below are handled by conventional robotics engineering — the three-layer model provides decision-making and knowledge above them, not as a replacement.

Cloud × Edge Collaboration Patterns

In actual Physical AI systems, edge devices do not operate in isolation — collaboration with the cloud is assumed:

Collaboration Patterns

| Pattern | Description | Example |
|---|---|---|
| Knowledge Sync | Reflect cloud Skills to the edge | Safety standard updates, distributing new operating parameters |
| Decision Distribution | Real-time decisions at edge, advanced analysis in cloud | Emergency stop is local, route optimization is cloud |
| Status Reporting | Aggregate edge sensor data to the cloud | Anomaly detection log transmission, remote monitoring |

Multi-Agent Systems and A2A

In Physical AI deployments, multi-agent configurations where multiple robots or drones coordinate on tasks are becoming commonplace. Agent-to-Agent (A2A) protocols for inter-agent communication are being put into practice in warehouse robot swarm control, drone formation flight, and similar applications.

However, in the physical world, inter-agent communication is not always guaranteed. Radio interference, distance limitations, and jamming can cause communication blackouts. Whether each agent can act safely and independently during these periods becomes the most critical design challenge.

Communication available: Agent A → Agent B "Avoid Shelf 3" → Direct coordination
Communication lost:      Each Agent decides independently based on shared doctrine
                         → "Collision avoidance is top priority" "Stop for unknown obstacles"
                           "Retry connection after 30 seconds"

Here, the role of doctrine elevates from "constraints on individual agents" to "the foundation for distributed consensus." Even without communication, if all agents share the same doctrine, each agent's behavior becomes mutually predictable. This is the very archetype of military doctrine — the same structure as a unit that has lost contact with its commander acting autonomously according to pre-shared principles of action (see 07-doctrine-and-intent).
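Doctrine-as-distributed-consensus can be sketched as a shared, priority-ordered rule table that each agent evaluates locally (rules and state fields are illustrative):

```python
# Each agent carries the same doctrine; when communication drops,
# behavior stays mutually predictable. Rules are illustrative.
SHARED_DOCTRINE = [
    # (predicate over local state, action) — evaluated in priority order
    (lambda s: s["collision_risk"], "avoid"),
    (lambda s: s["unknown_obstacle"], "stop"),
    (lambda s: not s["link_up"], "retry_link_in_30s"),
]

def decide_locally(state: dict) -> str:
    """Pick the highest-priority doctrine rule that fires."""
    for predicate, action in SHARED_DOCTRINE:
        if predicate(state):
            return action
    return "continue"

# Two agents, no link between them, same doctrine → same decision
a = decide_locally({"collision_risk": False, "unknown_obstacle": True, "link_up": False})
b = decide_locally({"collision_risk": False, "unknown_obstacle": True, "link_up": False})
print(a, a == b)  # → stop True
```

Because the table is ordered, every agent resolves conflicting conditions identically, which is what makes peer behavior predictable without a channel.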

Digital Twins and MCP

By exposing a physical device's digital twin (virtual replica) as an MCP server, cloud-side Agents can reference and control the physical device's state. This is a natural extension of MCP's concept of "standardized external tool connectivity" and further blurs the boundary between the physical world and software.

What This Means for Software Engineers

Physical AI is not "a different world." With an understanding of the three-layer model, software engineers can participate in architecture design:

The MCP server you're designing right now —
its connection target just changed from a Web API to a sensor.

The Skill domain knowledge you're writing right now —
it just changed from translation guidelines to safety standards.

The Agent decision logic you're configuring right now —
it just changed from text processing to motor control.

The three-layer structure is the same. Only the implementation details change.

Summary

Physical AI is not a "special case" of the three-layer architecture — it is its most direct extension.

| Aspect | Core Message |
|---|---|
| Structural consistency | The separation of responsibilities across Agent / Skills / MCP holds in the physical world as-is |
| Edge inference realized | BitNet b1.58 has made local inference on edge devices practical (71.4x energy efficiency vs LLaMA) |
| Addressing irreversibility | The Doctrine layer matters most in the physical world, where consequences can be irreversible |
| Design framework | The OODA cycle functions naturally as a design framework for Physical AI |
| Essential difference | The nature of decisions doesn't change; what changes is the severity of consequences and latency constraints |

Architecture learned in the cloud becomes a bridge to the physical world.
Those who understand the structure can adapt regardless of the deployment target.

References

  • 03-architecture (HOW): Three-layer model structure definition (the foundation for this page's edge extension)
  • 04-ai-design-patterns (WHICH): Pattern selection guidelines (relevant to pattern application in edge environments)
  • 05-solving-ai-limitations (REALITY): AI constraints and countermeasures (latency and safety constraints are even stricter in the physical world)
  • 07-doctrine-and-intent (DOCTRINE): Doctrine and intent design (essential for autonomous decisions in the physical world)

Released under the MIT License.