Split-Compute Pipeline
The architecture was selected because it enables open-vocabulary visual reasoning on hardware that cannot support a full VLM locally. The contract between onboard and edge is deliberately narrow: an image, a prompt, and a structured response.
Lightweight segmentation
A Raspberry Pi 4 produces frame-by-frame segmentation masks. Segmentation is cheap to run and fast enough for reactive control, and it acts as the trigger that decides when richer reasoning is needed.
- Raspberry Pi 4
- Frame-rate segmentation
- Mask + ROI extraction
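The trigger step described above can be sketched as follows. This is a minimal illustration rather than the project's code: it assumes the segmentation output is a binary obstacle mask held as rows of 0/1 values, and the name mask_to_roi and the min_pixels threshold are both hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # rows of 0/1 values, one entry per pixel
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive


def mask_to_roi(mask: Mask, min_pixels: int = 4) -> Optional[Box]:
    """Return the bounding box of foreground pixels, or None if too few.

    min_pixels is a hypothetical trigger threshold: below it the frame is
    treated as clear and is not escalated to the edge node.
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if len(xs) < min_pixels:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# A 4x5 toy mask with a 2x2 blob of foreground pixels.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
roi = mask_to_roi(mask)  # a box means "escalate", None means "keep driving"
```

Returning None for small detections keeps the decision to call the VLM on the Pi, so the edge node is only contacted when there is something to reason about.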
Compressed transfer
Selected frames and regions of interest are streamed over the local network to the edge node. The split keeps bandwidth modest while preserving the visual context the VLM needs.
- Local Wi-Fi link
- ROI + full frame
- Single-frame & dual-frame variants
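One possible shape for the request sent over the link is sketched below. The field names (variant, frames, roi, prompt) and the choice of base64-in-JSON are assumptions made for illustration. The sketch also assumes the frames arrive already compressed (for example as JPEG bytes) and that the ROI travels as box coordinates alongside the full frame.

```python
import base64
import json
from typing import List, Optional, Tuple


def build_request(
    frames: List[bytes],
    roi: Optional[Tuple[int, int, int, int]],
    prompt: str,
) -> str:
    """Serialise one request: one or two encoded frames, an ROI, and a prompt."""
    if len(frames) not in (1, 2):
        raise ValueError("expected a single frame or a pair of frames")
    payload = {
        # Hypothetical field names; the source only fixes what is sent, not how.
        "variant": "single" if len(frames) == 1 else "dual",
        "frames": [base64.b64encode(f).decode("ascii") for f in frames],
        "roi": list(roi) if roi is not None else None,
        "prompt": prompt,
    }
    return json.dumps(payload)


# Placeholder bytes stand in for a real compressed frame.
body = build_request([b"compressed-frame"], (2, 1, 3, 2), "Describe the scene.")
```

Keeping the single-frame and dual-frame variants in one schema means the edge node can accept both without a separate endpoint.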
VLM semantic reasoning
The VLM receives the image(s) plus a structured prompt and returns a JSON-shaped answer: a scene description, an obstacle classification, and a navigation action. The answer is open-vocabulary by construction.
- MacBook Pro M2 Pro node
- Structured prompt
- Natural-language output
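A structured prompt of this kind might be assembled as in the sketch below. The wording of the prompt is invented for illustration and is not the prompt used in the project. Only the obstacle classes and actions are taken from the prompt contract in this document.

```python
OBSTACLE_CLASSES = ("none", "static", "transient", "activity")
ACTIONS = ("CONTINUE", "SLOW", "STOP", "REROUTE")


def build_prompt(num_frames: int = 1) -> str:
    """Compose a prompt that fixes the response shape but not its vocabulary."""
    view = "this image" if num_frames == 1 else f"these {num_frames} consecutive images"
    # Hypothetical wording: the scene description is left free-form on purpose.
    return (
        f"Look at {view} from the onboard camera.\n"
        "Reply with a single JSON object containing, in order:\n"
        '  "scene": a one-sentence description in your own words,\n'
        f'  "obstacle_class": one of {", ".join(OBSTACLE_CLASSES)},\n'
        f'  "action": one of {", ".join(ACTIONS)}.\n'
    )


prompt = build_prompt()
```

Constraining only the two decision fields leaves the scene description free to name anything the VLM sees.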
The prompt contract
Each request asks the VLM to produce three fields, in order: a scene description, an obstacle classification (none / static / transient / activity), and a navigation action (CONTINUE / SLOW / STOP / REROUTE). The shape is fixed; the vocabulary is not. The example reply below also includes a short free-text reasoning field.
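On the receiving side, the fixed shape can be checked before the action is used. The sketch below is one way to do that, under stated assumptions: the key names follow the example reply in this section, parse_response is a hypothetical helper, and falling back to STOP on a malformed reply is an assumed policy, not one stated by the source.

```python
import json

OBSTACLE_CLASSES = ("none", "static", "transient", "activity")
ACTIONS = ("CONTINUE", "SLOW", "STOP", "REROUTE")
FAILSAFE_ACTION = "STOP"  # assumption: stop when the reply cannot be trusted


def parse_response(raw: str) -> dict:
    """Check the fixed shape of a reply; leave free-text fields untouched."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": FAILSAFE_ACTION, "valid": False}
    if (
        not isinstance(reply, dict)
        or not isinstance(reply.get("scene"), str)
        or reply.get("obstacle_class") not in OBSTACLE_CLASSES
        or reply.get("action") not in ACTIONS
    ):
        return {"action": FAILSAFE_ACTION, "valid": False}
    # Extra keys such as "reasoning" pass through unchanged.
    return {**reply, "valid": True}
```

Only the two enumerated fields are validated against a closed set. The scene text is accepted as written, which is what keeps the vocabulary open.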
{
  "scene": "A small dog is running across the mowing path from the left.",
  "obstacle_class": "transient",
  "reasoning": "Animal is moving and likely to clear the path within seconds.",
  "action": "STOP"
}
Why split-compute, not local-only
Compute
Even small VLMs exceed what an onboard Pi can run at usable rates.
Open-vocabulary
The edge VLM can reason about objects and activities never seen at training time.
Diagnosability
Natural-language output makes failures attributable to specific behaviours or prompts.