Jarvis uses two kinds of model: the ears that turn your speech into text, and the brain that reasons and acts.
The ears: transcription
Speech-to-text runs on-device by default, with no internet needed. Smaller models are faster and larger ones are more accurate, so you pick the trade-off. If you would rather use a cloud transcription service for extra accuracy, supply your own API key.
The brain: reasoning and action
For agent work, you choose the model that does the thinking:
- Cloud: Claude, OpenAI, or Gemini, using your own API key.
- Local: run an on-device model for fully offline, fully private operation.
Picking a trade-off
- Local models keep everything on your Mac and work offline.
- Cloud models are more capable for hard, multi-step work.
- You can mix the two: pair on-device transcription with a cloud brain, or keep everything local.
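
To make the combinations above concrete, here is a minimal Python sketch that models the two choices (ears and brain) and checks what each combination implies. The names `Location`, `ModelSetup`, `works_offline`, and `needs_api_key` are hypothetical and exist only for illustration; this is not Jarvis's actual settings format or API.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    LOCAL = "local"  # runs on-device
    CLOUD = "cloud"  # needs your own API key and a network connection


@dataclass(frozen=True)
class ModelSetup:
    """Hypothetical description of a setup; not Jarvis's real config format."""

    ears: Location   # transcription (speech-to-text)
    brain: Location  # reasoning and action

    def works_offline(self) -> bool:
        # Only an all-local setup runs with no internet at all.
        return self.ears is Location.LOCAL and self.brain is Location.LOCAL

    def needs_api_key(self) -> bool:
        # Any cloud component requires a key that you supply yourself.
        return Location.CLOUD in (self.ears, self.brain)


# The two combinations named in the list above.
all_local = ModelSetup(ears=Location.LOCAL, brain=Location.LOCAL)
mixed = ModelSetup(ears=Location.LOCAL, brain=Location.CLOUD)

print(all_local.works_offline(), all_local.needs_api_key())
print(mixed.works_offline(), mixed.needs_api_key())
```

Under these assumptions, the all-local setup is the only one that works offline, while the mixed setup keeps speech on-device but needs a key for the cloud brain.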