Overview
Research question
Is a modular split-compute architecture based on a Vision-Language Model (VLM) a suitable approach for semantic obstacle detection on a resource-constrained autonomous lawn mower?
The problem
Autonomous service robots in unstructured outdoor environments rely on reactive sensing and fixed-vocabulary perception models. They can identify what is in a scene, but not what is happening in it. Knowing that a region is “a person” or “an animal” does not, by itself, determine the right navigation action: the answer depends on state, activity, and position, none of which a standard segmentation model can encode.
The proposal
Lightweight image segmentation runs on a constrained onboard computer, while visual reasoning is offloaded to a nearby edge node running a VLM. The VLM emits a structured natural-language response (scene description, obstacle classification, and navigation action), enabling open-vocabulary reasoning on hardware that cannot host a full VLM locally.
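To make the three-part response concrete, here is a minimal sketch of how the onboard computer might validate a reply from the edge node. It rests on assumptions that are not in the source: that the response is serialized as JSON, the field names (`scene_description`, `obstacle_class`, `navigation_action`), the action vocabulary in `ALLOWED_ACTIONS`, and the choice of `stop` as the fallback are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary; the source does not list the real actions.
ALLOWED_ACTIONS = {"continue", "slow_down", "stop", "avoid"}
# Assumed conservative default when the reply cannot be trusted.
FALLBACK_ACTION = "stop"


@dataclass(frozen=True)
class VlmResponse:
    scene_description: str   # free-text account of what is happening
    obstacle_class: str      # open-vocabulary label, e.g. "person"
    navigation_action: str   # always one of ALLOWED_ACTIONS after parsing


def parse_vlm_response(raw: str) -> VlmResponse:
    """Parse the edge node's reply; use the fallback action on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    # Normalize the action and reject anything outside the vocabulary.
    action = str(data.get("navigation_action", "")).strip().lower()
    if action not in ALLOWED_ACTIONS:
        action = FALLBACK_ACTION

    return VlmResponse(
        scene_description=str(data.get("scene_description", "")),
        obstacle_class=str(data.get("obstacle_class", "unknown")),
        navigation_action=action,
    )


# Illustrative reply, invented for this example only.
reply = (
    '{"scene_description": "A person is lying on the grass ahead.", '
    '"obstacle_class": "person", "navigation_action": "stop"}'
)
parsed = parse_vlm_response(reply)
print(parsed.obstacle_class, parsed.navigation_action)
```

Keeping the scene description next to the chosen action is what would let a reviewer trace a navigation error back to the reasoning that produced it; the fallback simply ensures that a malformed reply never yields an unvalidated action.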
Two evaluation dimensions
- Can the VLM layer reason about obstacles in situations where understanding goes beyond pixel-level classification?
- Does natural-language output make navigation errors more attributable than in classification-only systems?
Scope & delimitations
- Deployed on a single platform, the Husqvarna Automower 450X Nera, as a research prototype, not a production system.
- The edge VLM runs on a nearby workstation (MacBook Pro M2 Pro) that simulates a realistic edge-compute node; it does not run on cloud infrastructure.
- Evaluation focuses on suitability for the task; throughput, energy budget, and long-term reliability are out of scope.
- Prompt portability across model families and temporal reasoning over longer horizons are flagged as future work.