A walkthrough of how the InsectAgent app blends a classic vision model with a lightweight multimodal model (FastVLM) to identify insects on your iPhone, and only “thinks harder” when it needs to.
InsectAgent was presented at ISVLSI 2025.
This post is designed for readers with little to no coding experience. We’ll keep things visual and plain-language, and focus on the why and how of the app’s hybrid pipeline on iPhone.
I hope you find this exploration both accessible and useful!
Field reality: cell coverage can be patchy (or nonexistent) around traps and plots. When your model depends on the network, everything from retries to latency becomes unpredictable. That’s why running locally matters here, beyond privacy alone.
Spotting and identifying insects early can protect crops, reduce chemical sprays, and improve our understanding of seasonal dynamics. The challenge on mobile is achieving good accuracy with reliable, offline behavior.
InsectAgent takes a pragmatic approach: run a fast on-device classifier first, and escalate to a small multimodal model only when the classifier is unsure.
The result feels snappy on easy photos and smarter on ambiguous ones—no cloud round-trips and no dependency on spotty field networks.
InsectAgent is an iOS app (Swift/Xcode) (code) that demonstrates hybrid insect recognition on modern iPhones.
Why this design?
1) **Vision model → top-k candidates.** A compact CNN (here, ResNet-18) produces logits and a ranked list of species candidates. If the top-1 confidence is at least τ (a configurable threshold), we return that label immediately and stop.
2) **Dynamic information augmentation (conditional).** If confidence falls below τ, the system fetches short, visual-first cue cards for the top-k species (e.g., “yellow-black banding,” “hind-wing ocelli,” “clubbed antennae”) from an on-device knowledge base.
3) **Multimodal reasoning.** An MLLM (FastVLM) compares the input image against those cue cards, weighs the evidence, and selects the best match, often correcting near-misses among visually similar species. A minimal sketch of the whole gated pipeline follows this list.
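To make the three steps concrete, here is a minimal Swift sketch of the gated pipeline. Everything named below is illustrative rather than the app’s actual code: `InsectPipeline`, the `VisionLanguageModel` protocol and its `answer(image:prompt:)` method stand in for the real FastVLM wrapper, the cue-card dictionary is a stub, and τ = 0.75 is a placeholder you would tune on a validation set. Only the Vision/Core ML calls are standard Apple APIs.

```swift
import CoreML
import UIKit
import Vision

/// One ranked guess from the vision model.
struct Candidate {
    let species: String
    let confidence: Float
}

/// Hypothetical interface to the on-device multimodal model;
/// the app's real FastVLM wrapper will look different.
protocol VisionLanguageModel {
    func answer(image: UIImage, prompt: String) async throws -> String
}

final class InsectPipeline {
    let tau: Float = 0.75  // τ: illustrative value, tuned on a validation set
    let topK = 5
    let classifier: VNCoreMLModel      // compiled ResNet-18-style Core ML model
    let vlm: VisionLanguageModel
    /// Illustrative knowledge base: species name → visual-first cue card.
    let cueCards: [String: String] = [
        "Vespula vulgaris": "yellow-black banding; narrow waist; smoky wings"
        // ...one short entry per supported species
    ]

    init(classifier: VNCoreMLModel, vlm: VisionLanguageModel) {
        self.classifier = classifier
        self.vlm = vlm
    }

    func identify(_ image: UIImage) async throws -> String {
        let candidates = try rankedCandidates(for: image)

        // Step 1: fast path. Accept the CNN's answer when it is confident.
        if let best = candidates.first, best.confidence >= tau {
            return best.species
        }

        // Step 2: low confidence. Gather cue cards for the top-k candidates.
        let cues = candidates.prefix(topK)
            .map { "\($0.species): \(cueCards[$0.species] ?? "no cue card")" }
            .joined(separator: "\n")

        // Step 3: escalate to the multimodal model with image + cue cards.
        let prompt = """
        Which of these species best matches the insect in the photo? \
        Answer with the species name only.
        \(cues)
        """
        return try await vlm.answer(image: image, prompt: prompt)
    }

    /// Run the classifier; Vision returns observations sorted by confidence.
    private func rankedCandidates(for image: UIImage) throws -> [Candidate] {
        guard let cgImage = image.cgImage else { return [] }
        let request = VNCoreMLRequest(model: classifier)
        try VNImageRequestHandler(cgImage: cgImage).perform([request])
        let results = request.results as? [VNClassificationObservation] ?? []
        return results.map { Candidate(species: $0.identifier,
                                       confidence: $0.confidence) }
    }
}
```

The key design point is the early return in step 1: on clear photos the multimodal model never runs, which is what keeps easy cases fast while still leaving expert-style reasoning available for ambiguous ones.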
Benchmark: 100 random insect images from the IP102 dataset.
| Method | Latency (s) | Accuracy (%) |
|---|---|---|
| ResNet-18 | 0.150 | 46.47 |
| ResNet-18 + FastVLM 0.5B | 3.016 | 51.18 |
| ResNet-18 + FastVLM 1.5B | 7.205 | 57.06 |
| ResNet-18 + FastVLM 7B | out of memory | — |
Notes: accuracy improves with each larger FastVLM variant, but latency grows with it; the 7B model exceeded the iPhone’s memory and could not complete the benchmark.
Pairing a fast vision model with conditional multimodal reasoning gives the best of both worlds on mobile: quick answers when the photo is clear and expert-like judgment when species look alike—without relying on flaky field networks.
What’s next: