01 Section

Overview

Research question

Is a modular, split-compute architecture based on a Vision-Language Model (VLM) a suitable approach for semantic obstacle detection on a resource-constrained autonomous lawn mower?

The problem

Autonomous service robots in unstructured outdoor environments rely on reactive sensing and fixed-vocabulary perception models. Such models can identify what is in a scene, but not what is happening in it. Knowing that a region is “a person” or “an animal” does not, by itself, determine the right navigation action · the answer depends on the obstacle's state, activity, and position, none of which a standard segmentation model can encode.

The proposal

Lightweight image segmentation runs on a constrained onboard computer, while visual reasoning is offloaded to a nearby edge node running a VLM. The VLM emits a structured natural-language response · scene description, obstacle classification, and navigation action · enabling open-vocabulary reasoning on hardware that cannot host a full VLM locally.
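
To make the structured response concrete, the sketch below shows one way the onboard computer might validate a reply from the edge node. It is an illustration under stated assumptions, not the project's implementation: the JSON encoding, the field names (scene_description, obstacle_class, navigation_action), the action vocabulary, and the fallback to "stop" are all hypothetical and do not come from the source.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary; the source does not enumerate the actions.
ALLOWED_ACTIONS = {"continue", "slow_down", "stop", "avoid"}


@dataclass(frozen=True)
class VlmVerdict:
    """The three parts of the structured response named in the proposal."""
    scene_description: str
    obstacle_class: str
    navigation_action: str


def parse_vlm_response(raw: str, fallback_action: str = "stop") -> VlmVerdict:
    """Parse the edge node's reply; use a conservative action if it is unusable."""
    try:
        data = json.loads(raw)
        verdict = VlmVerdict(
            scene_description=str(data["scene_description"]),
            obstacle_class=str(data["obstacle_class"]),
            navigation_action=str(data["navigation_action"]).strip().lower(),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, missing field, or a non-object payload.
        return VlmVerdict("unparseable response", "unknown", fallback_action)

    if verdict.navigation_action not in ALLOWED_ACTIONS:
        # Keep the description and class for later inspection, override the action.
        return VlmVerdict(
            verdict.scene_description, verdict.obstacle_class, fallback_action
        )
    return verdict


# Assumed example reply from the edge VLM.
example = (
    '{"scene_description": "A person is lying on the grass ahead.", '
    '"obstacle_class": "person", "navigation_action": "Stop"}'
)
verdict = parse_vlm_response(example)
print(verdict.navigation_action)  # stop
```

Keeping the scene description and obstacle class alongside the action, even when the action is overridden, means the description and class the model reported are still available when a navigation decision is reviewed afterwards.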

Two evaluation dimensions

A · Capability

Can the VLM layer reason about obstacles in situations where understanding goes beyond pixel-level classification?

B · Diagnosability

Does natural-language output make navigation errors more attributable than in classification-only systems?

Scope & delimitations

  • Deployed on a single platform · Husqvarna Automower 450X Nera · as a research prototype, not a production system.
  • The edge VLM runs on a nearby workstation (MacBook Pro with an M2 Pro chip) that stands in for a realistic edge-compute node; cloud infrastructure is not used.
  • Evaluation focuses on suitability for the task along the two dimensions above (capability and diagnosability); throughput, energy budget, and long-term reliability are out of scope.
  • Prompt portability across model families and temporal reasoning over longer horizons are flagged as future work.