Section 02

Split-Compute Pipeline

The architecture was selected because it enables open-vocabulary visual reasoning on hardware that cannot support a full VLM locally. The contract between onboard and edge is deliberately narrow: an image, a prompt, and a structured response.

01 · Onboard

Lightweight segmentation

A Raspberry Pi 4 produces frame-by-frame segmentation masks. Segmentation is cheap to run and fast enough for reactive control, and it acts as the trigger that decides when richer reasoning is needed.

· Raspberry Pi 4
· Frame-rate segmentation
· Mask + ROI extraction

02 · Link

Compressed transfer

Selected frames and regions of interest are streamed over the local network to the edge node. The split keeps bandwidth modest while preserving the visual context the VLM needs.

· Local Wi-Fi link
· ROI + full frame
· Single-frame & dual-frame variants
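One way to picture the narrow onboard-to-edge contract is a small request envelope that carries one or two already-compressed frames plus the prompt. The sketch below is an assumption for illustration: the field names (`variant`, `images`, `prompt`) and the use of base64 inside JSON are not taken from the source, and the byte strings are placeholders for real compressed image data.

```python
import base64
import json
from typing import List


def build_request(frames: List[bytes], prompt: str) -> str:
    """Pack one or two already-compressed frames and the prompt into a JSON string."""
    if len(frames) not in (1, 2):
        raise ValueError("expected one frame (single) or two frames (dual)")
    return json.dumps({
        "variant": "single" if len(frames) == 1 else "dual",
        # base64 keeps binary image data safe inside a JSON string
        "images": [base64.b64encode(f).decode("ascii") for f in frames],
        "prompt": prompt,
    })


# Placeholder bytes stand in for compressed image data.
single = build_request([b"frame-0"], "Describe the scene.")
dual = build_request([b"frame-0", b"frame-1"], "Describe the scene.")
print(json.loads(single)["variant"], json.loads(dual)["variant"])  # single dual
```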

03 · Edge

VLM semantic reasoning

The VLM receives the image(s) plus a structured prompt and returns a JSON-shaped answer: scene description, obstacle classification, and navigation action. The answer is open-vocabulary by construction.

· MacBook Pro M2 Pro node
· Structured prompt
· Natural-language output
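A structured prompt of this kind might be assembled as shown here. The category and action lists come from the prompt contract in this section; the function name `build_prompt` and the exact wording of the prompt are assumptions made for the sketch, not the prompt used in the project.

```python
OBSTACLE_CLASSES = ("none", "static", "transient", "activity")
ACTIONS = ("CONTINUE", "SLOW", "STOP", "REROUTE")


def build_prompt(num_images: int = 1) -> str:
    """Assemble an illustrative structured prompt; the wording is an assumption."""
    subject = "the image" if num_images == 1 else f"the {num_images} images"
    return (
        f"Look at {subject} and reply with a single JSON object containing:\n"
        '1. "scene": a one-sentence description of the scene.\n'
        f'2. "obstacle_class": one of {", ".join(OBSTACLE_CLASSES)}.\n'
        f'3. "action": one of {", ".join(ACTIONS)}.\n'
    )


print(build_prompt())
```

Listing the allowed values in the prompt itself keeps the response shape fixed while leaving the scene description free-form.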

The prompt contract

Each request asks the VLM to produce three decision fields, in order: a scene description, an obstacle classification (none / static / transient / activity), and a navigation action (CONTINUE / SLOW / STOP / REROUTE). The example response also carries a short free-text reasoning field. The shape is fixed; the vocabulary is not.

{
  "scene": "A small dog is running across the mowing path from the left.",
  "obstacle_class": "transient",
  "reasoning": "Animal is moving and likely to clear the path within seconds.",
  "action": "STOP"
}
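Because the shape is fixed, a reply can be checked mechanically before it reaches the controller. The following is a minimal sketch under assumptions: `parse_response` is a hypothetical helper, and raising `ValueError` on any violation is one possible policy rather than the documented behaviour.

```python
import json

OBSTACLE_CLASSES = ("none", "static", "transient", "activity")
ACTIONS = ("CONTINUE", "SLOW", "STOP", "REROUTE")


def parse_response(raw: str) -> dict:
    """Parse a VLM reply and check the fixed fields; raise ValueError on violations."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    scene = data.get("scene")
    if not isinstance(scene, str) or not scene.strip():
        raise ValueError("missing or empty 'scene'")
    if data.get("obstacle_class") not in OBSTACLE_CLASSES:
        raise ValueError(f"unexpected obstacle_class: {data.get('obstacle_class')!r}")
    if data.get("action") not in ACTIONS:
        raise ValueError(f"unexpected action: {data.get('action')!r}")
    return data


reply = (
    '{"scene": "A small dog is running across the mowing path from the left.", '
    '"obstacle_class": "transient", '
    '"reasoning": "Animal is moving and likely to clear the path within seconds.", '
    '"action": "STOP"}'
)
print(parse_response(reply)["action"])  # STOP
```

A caller might treat `ValueError` as a reason to fall back to a conservative action; the source does not specify such a policy.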

Why split-compute, not local-only

Compute

Even small VLMs exceed what an onboard Pi can run at usable rates.

Open-vocabulary

Edge VLM reasons about objects and activities never seen at training time.

Diagnosability

Natural-language output makes failures attributable to specific behaviours or prompts.