Jarvis uses two kinds of model: the ears that turn your speech into text, and the brain that reasons and acts.

The ears: transcription

Speech-to-text runs on-device by default, with no internet needed. Smaller models are faster, larger ones are more accurate, and you pick the trade-off. If you would rather use a cloud transcription service for extra accuracy, plug in your own API key.

The brain: reasoning and action

For agent work, you choose the model that does the thinking.

Cloud: models from Claude, OpenAI, or Gemini, using your own API key.
Local: run an on-device model for fully offline, fully private operation.

Picking a trade-off

Local models keep everything on your Mac and work offline.
Cloud models are more capable for hard, multi-step work.
You can mix: on-device transcription with a cloud brain, or all local.
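One way to picture the combinations above is as a pairing of two independent choices, one for the ears and one for the brain. The sketch below is purely illustrative and is not Jarvis's actual settings format: the names `Location`, `ModelSetup`, `works_offline`, and `needs_api_key` are all hypothetical, chosen only to show how the trade-off falls out of the two choices.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    """Where a model runs (hypothetical names, for illustration only)."""
    LOCAL = "local"  # on-device, no network needed
    CLOUD = "cloud"  # a cloud provider, reached with your own API key


@dataclass(frozen=True)
class ModelSetup:
    """Hypothetical pairing of the ears (transcription) and the brain."""
    transcription: Location
    brain: Location

    def works_offline(self) -> bool:
        # Fully offline only when neither component calls out to the cloud.
        return (self.transcription is Location.LOCAL
                and self.brain is Location.LOCAL)

    def needs_api_key(self) -> bool:
        # Any cloud component requires a key that you supply.
        return Location.CLOUD in (self.transcription, self.brain)


# The two setups named in the text: all local, and local ears with a cloud brain.
all_local = ModelSetup(transcription=Location.LOCAL, brain=Location.LOCAL)
mixed = ModelSetup(transcription=Location.LOCAL, brain=Location.CLOUD)

for name, setup in [("all local", all_local), ("mixed", mixed)]:
    print(name, "| offline:", setup.works_offline(),
          "| needs key:", setup.needs_api_key())
```

Under this sketch, only the all-local setup works offline, while the mixed setup keeps your audio transcription on the Mac but needs a key and a connection for the reasoning step.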