Discover providers and models
term-llm providers
term-llm providers --configured
term-llm providers anthropic
term-llm models --provider anthropic
term-llm models --provider openrouter
term-llm models --provider nearai
term-llm models --provider sambanova
term-llm models --provider ollama
term-llm models --json
Use providers when you want to know what is available and how it is configured. Use models when you want the concrete model names a provider currently exposes.
Provider categories
term-llm supports a mix of provider types:
- hosted API providers such as Anthropic, AWS Bedrock, OpenAI, xAI, Gemini, NEAR AI Cloud, SambaNova, and OpenRouter
- subscription-backed OAuth providers such as ChatGPT, Copilot, and Gemini CLI
- local or self-hosted OpenAI-compatible providers such as Ollama, LM Studio, vLLM, or custom endpoints
Credentials
Most providers use API keys via environment variables. Some use OAuth credentials from companion CLIs or locally stored auth files.
| Provider | Credentials source | Notes |
|---|---|---|
| anthropic | ANTHROPIC_API_KEY | API key |
| bedrock | AWS credential chain or explicit access_key_id / secret_access_key | Anthropic Claude via AWS Bedrock |
| openai | OPENAI_API_KEY | Standard OpenAI API key |
| chatgpt | ~/.config/term-llm/chatgpt_creds.json | ChatGPT Plus/Pro OAuth |
| copilot | ~/.config/term-llm/copilot_creds.json | GitHub Copilot OAuth |
| gemini | GEMINI_API_KEY | Google AI Studio key |
| gemini-cli | ~/.gemini/oauth_creds.json | gemini-cli OAuth |
| xai | XAI_API_KEY | xAI API key |
| venice | VENICE_API_KEY | Venice OpenAI-compatible API key |
| nearai | NEARAI_API_KEY | NEAR AI Cloud OpenAI-compatible TEE inference key |
| sambanova | SAMBANOVA_API_KEY | SambaNova Cloud OpenAI-compatible API key |
| openrouter | OPENROUTER_API_KEY | OpenRouter API key |
| vllm / custom type: vllm entries | VLLM_API_KEY or <PROVIDER_NAME>_API_KEY | Optional for unauthenticated local servers; vLLM OpenAI-compatible API plus reasoning controls for supported chat templates |
| zen | ZEN_API_KEY optional | empty is valid for free tier |
Examples:
term-llm ask --provider anthropic "question"
term-llm ask --provider chatgpt "question"
term-llm ask --provider copilot "question"
term-llm ask --provider gemini-cli "question"
WebSocket defaults
The built-in openai and chatgpt text providers use the Responses WebSocket transport by default. This improves latency in agentic/tool-heavy runs by reusing one connection and continuing compatible turns with previous_response_id plus only new input. If setup fails before streaming starts, term-llm falls back to HTTP/SSE; if a WebSocket continuation rejects the previous response ID, it retries once with full input.
To force HTTP/SSE for either built-in provider:
providers:
  openai:
    use_websocket: false
  chatgpt:
    use_websocket: false
OpenAI-compatible providers remain HTTP/SSE by default. WebSocket defaults are not applied to type: openai_compatible entries.
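The fallback rules in this section can be sketched as follows. This is a minimal illustration, not term-llm's implementation: the exception classes, the `send_turn` function, and the `ws_send` / `http_send` callables are all hypothetical stand-ins for the real transports.

```python
class WebSocketSetupError(Exception):
    """Hypothetical: WebSocket setup failed before any streaming began."""


class PreviousResponseRejected(Exception):
    """Hypothetical: the server refused the supplied previous_response_id."""


def send_turn(ws_send, http_send, history, new_input, previous_response_id=None):
    """Send one turn, preferring WebSocket and falling back as described above.

    ws_send(input_items, previous_response_id) and http_send(input_items)
    are caller-supplied callables (assumptions made for this sketch).
    """
    full_input = list(history) + list(new_input)
    try:
        if previous_response_id is not None:
            try:
                # Continuation: previous response ID plus only the new input.
                return ws_send(list(new_input), previous_response_id)
            except PreviousResponseRejected:
                # Retry once with the full input and no continuation ID.
                return ws_send(full_input, None)
        return ws_send(full_input, None)
    except WebSocketSetupError:
        # Setup failed before streaming started: fall back to HTTP/SSE.
        return http_send(full_input)
```

A second rejection on the retry is not caught here, which mirrors the "retries once" wording.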
SambaNova Cloud
SambaNova is available as a built-in OpenAI-compatible provider:
export SAMBANOVA_API_KEY=your-key
term-llm ask --provider sambanova:gpt-oss-120b "quick question"
term-llm models --provider sambanova
providers:
  sambanova:
    model: gpt-oss-120b
    fast_model: Meta-Llama-3.3-70B-Instruct
The provider uses https://api.sambanova.ai/v1, supports tool calls, and has a curated fallback model list. term-llm models --provider sambanova queries SambaNova’s /models endpoint; because the OpenAI-compatible model response does not generally include price metadata, term-llm annotates known SambaNova models with bundled public prices from https://cloud.sambanova.ai/plans/pricing. These prices are also used by term-llm usage cost calculation for matching SambaNova model IDs.
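As a rough illustration of how bundled prices could feed a usage cost estimate, here is a small sketch. It assumes prices are expressed per 1M tokens and split into input and output rates; the `Price` type, the `estimate_cost` function, and the figures in `EXAMPLE_PRICES` are placeholders invented for this example, not SambaNova's actual prices or term-llm's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float   # USD per 1M input tokens (assumed unit)
    output_per_million: float  # USD per 1M output tokens (assumed unit)


# Placeholder figures for illustration only; not real SambaNova prices.
EXAMPLE_PRICES = {
    "example-model": Price(input_per_million=0.50, output_per_million=1.50),
}


def estimate_cost(model_id, input_tokens, output_tokens, prices=EXAMPLE_PRICES):
    """Return the estimated USD cost, or None when the model ID has no price."""
    price = prices.get(model_id)
    if price is None:
        return None
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000
```

Returning `None` for unmatched IDs reflects the point above that prices apply only to matching SambaNova model IDs.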
NEAR AI Cloud
NEAR AI Cloud is available as a built-in OpenAI-compatible provider for TEE-backed private inference:
export NEARAI_API_KEY=your-key
term-llm ask --provider nearai:zai-org/GLM-5.1-FP8 "quick question"
term-llm models --provider nearai
providers:
  nearai:
    model: zai-org/GLM-5.1-FP8
    fast_model: Qwen/Qwen3.6-35B-A3B-FP8
The provider uses https://cloud-api.near.ai/v1, supports tool calls, and has a curated fallback list of TEE-hosted text models. term-llm models --provider nearai queries NEAR AI Cloud’s public /model/list catalog and filters it to chat-capable models, with token prices shown per 1M tokens when available.
vLLM reasoning providers
For vLLM servers running reasoning models, prefer type: vllm instead of generic openai_compatible. The vLLM provider still uses the OpenAI-compatible /v1/chat/completions API, but maps term-llm reasoning effort suffixes into model-family-specific vLLM request fields and replays prior assistant reasoning with vLLM’s current reasoning message field.
For Qwen-family models, term-llm sends Qwen chat-template controls:
providers:
  cdck_qwen:
    type: vllm
    base_url: https://gpu-server.example.com:8000/v1
    model: Qwen/Qwen3.5-122B-A10B
    api_key: ${CDCK_QWEN_API_KEY}
    context_window: 200000
    max_output_tokens: 50000
Use the normal provider flag, or append an effort suffix to the configured provider name:
term-llm ask -p cdck_qwen "hello" # default: no thinking
term-llm ask -p cdck_qwen-low "harder" # thinking budget 1024
term-llm ask -p cdck_qwen-medium "hard" # thinking budget 4096
term-llm ask -p cdck_qwen-high "very hard" # thinking budget 10000
The effort suffix is stripped before sending the model name upstream. For example, -p cdck_qwen-high sends model: Qwen/Qwen3.5-122B-A10B plus Qwen thinking controls, not a literal Qwen/Qwen3.5-122B-A10B-high model ID.
| Effort | Request fields sent to vLLM |
|---|---|
| default / empty | chat_template_kwargs.enable_thinking: false; no thinking_token_budget |
| low | enable_thinking: true, thinking_token_budget: 1024 |
| medium | enable_thinking: true, thinking_token_budget: 4096 |
| high / xhigh / max | enable_thinking: true, thinking_token_budget: 10000 |
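The Qwen mapping in the table above can be written as a small lookup. This is a sketch only: `qwen_thinking_fields` is a hypothetical helper, and placing `thinking_token_budget` at the top level of the request (rather than inside `chat_template_kwargs`) is an assumption based on how the table words it.

```python
# Budgets taken from the table above.
QWEN_BUDGETS = {
    "low": 1024,
    "medium": 4096,
    "high": 10000,
    "xhigh": 10000,
    "max": 10000,
}


def qwen_thinking_fields(effort=""):
    """Return the extra request fields for a Qwen-style vLLM request."""
    if not effort:
        # Default: thinking disabled and no budget field at all.
        return {"chat_template_kwargs": {"enable_thinking": False}}
    if effort not in QWEN_BUDGETS:
        raise ValueError(f"unknown effort: {effort!r}")
    return {
        "chat_template_kwargs": {"enable_thinking": True},
        # Assumed to be a top-level request field.
        "thinking_token_budget": QWEN_BUDGETS[effort],
    }
```

For example, `qwen_thinking_fields("medium")` yields `enable_thinking: true` with a budget of 4096, while the default call omits the budget entirely.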
For DeepSeek models served by vLLM, the official recipes use a different chat-template shape. term-llm auto-detects model names containing deepseek; if your served model name does not contain it (for example, a custom alias), set vllm_thinking_param: thinking explicitly:
providers:
  cdck_deepseek:
    type: vllm
    base_url: https://gpu-server.example.com:8000/v1
    model: ds31 # served-model alias; actual backend is DeepSeek
    vllm_thinking_param: thinking # use DeepSeek/vLLM chat_template_kwargs.thinking
DeepSeek effort mapping follows the official DeepSeek/vLLM behavior: DeepSeek exposes non-think, Think High, and Think Max rather than Qwen-style token budgets.
| term-llm effort | Request fields sent to vLLM for DeepSeek |
|---|---|
| default / empty / minimal / none | chat_template_kwargs.thinking: false |
| low / medium / high | chat_template_kwargs.thinking: true, chat_template_kwargs.reasoning_effort: high |
| xhigh / max | chat_template_kwargs.thinking: true, chat_template_kwargs.reasoning_effort: max |
DeepSeek requests do not send thinking_token_budget. The reasoning_effort value is nested inside chat_template_kwargs, matching vLLM’s DeepSeek recipes, rather than sent as top-level OpenAI reasoning_effort.
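A sketch of the DeepSeek branch, including template detection, might look like this. Both function names are hypothetical, and the case-insensitive match on `deepseek` is an assumption; the text above only says the model name must contain it.

```python
def uses_deepseek_template(model, vllm_thinking_param=None):
    """Decide whether DeepSeek-style controls apply to this provider."""
    if vllm_thinking_param is not None:
        # An explicit override wins over name-based detection.
        return vllm_thinking_param == "thinking"
    # Assumption: detection ignores case.
    return "deepseek" in model.lower()


def deepseek_thinking_fields(effort=""):
    """Return chat_template_kwargs for a DeepSeek-style vLLM request."""
    if effort in ("", "minimal", "none"):
        return {"chat_template_kwargs": {"thinking": False}}
    if effort in ("low", "medium", "high"):
        level = "high"
    elif effort in ("xhigh", "max"):
        level = "max"
    else:
        raise ValueError(f"unknown effort: {effort!r}")
    # reasoning_effort is nested, never sent as a top-level field.
    return {"chat_template_kwargs": {"thinking": True, "reasoning_effort": level}}
```

With the `ds31` alias from the config above, `uses_deepseek_template("ds31")` is false, which is why the explicit `vllm_thinking_param: thinking` setting is needed there.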
Notes:
- Start vLLM with the appropriate reasoning parser for your model, for example --reasoning-parser qwen3 for Qwen3-family reasoning output or --reasoning-parser deepseek_v3 / deepseek_v4 for DeepSeek variants.
- Qwen thinking_token_budget requires a vLLM server new enough to support it and, on recent vLLM, a server-side --reasoning-config. Plain/default Qwen requests omit the budget so they work without that extra server option.
- vLLM currently streams reasoning text in delta.reasoning, but may not report usage.completion_tokens_details.reasoning_tokens accurately. In that case term-llm can show reasoning in debug output while reasoning_tokens remains 0; this reflects vLLM usage metadata, not missing reasoning text.
- For multi-turn conversations, term-llm persists streamed reasoning and replays it as assistant reasoning on the next vLLM request. This lets vLLM’s chat template render the prior reasoning consistently and gives prefix caching the best chance to reuse shared prompt prefixes when the server has prefix caching enabled.
OpenAI-compatible providers
For local or custom backends that do not need vLLM chat-template thinking controls, use type: openai_compatible.
providers:
  ollama:
    type: openai_compatible
    base_url: http://localhost:11434/v1
    model: llama3.2:latest
  lmstudio:
    type: openai_compatible
    base_url: http://localhost:1234/v1
    model: deepseek-coder-v2
  cerebras:
    type: openai_compatible
    base_url: https://api.cerebras.ai/v1
    model: llama-4-scout-17b
    api_key: ${CEREBRAS_API_KEY}
Use base_url when the standard /chat/completions path should be appended automatically. Use url when you need to specify the full chat completions endpoint directly.
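The difference between the two fields can be sketched as a tiny resolver. `chat_completions_endpoint` is a hypothetical name; preferring `url` when both fields are set and trimming a trailing slash from `base_url` are assumptions of this sketch, not documented behavior.

```python
def chat_completions_endpoint(base_url=None, url=None):
    """Return the endpoint a chat completion request would be sent to."""
    if url:
        # Full endpoint: used exactly as written.
        return url
    if base_url:
        # Base URL: the standard path is appended.
        return base_url.rstrip("/") + "/chat/completions"
    raise ValueError("either base_url or url must be configured")
```

So `base_url: http://localhost:11434/v1` resolves to `http://localhost:11434/v1/chat/completions`, while a non-standard path such as `http://old-server:5000/api/chat` has to go in `url`.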
Configuration reference
| Field | Type | Description |
|---|---|---|
| type | string | Use openai_compatible for generic custom providers, or vllm for vLLM servers that should receive reasoning controls for Qwen/DeepSeek-style chat templates. Inferred automatically for known names like ollama, cerebras, groq, and vllm. |
| base_url | string | Base URL (e.g., http://localhost:11434/v1). /chat/completions is appended automatically. |
| url | string | Full chat completions URL, used as-is. Use this when your endpoint path differs from the standard. Supports srv:// for DNS SRV discovery and $() for command-based resolution. |
| api_key | string | API key. Supports ${ENV_VAR}, op://, file://, and $() resolution. If omitted, term-llm tries <PROVIDER_NAME>_API_KEY from the environment. |
| model | string | Default model name. For configured model objects, this may be either the upstream id or the friendly alias. |
| models | list | Optional list for model pickers and shell completion. Entries may be strings or objects with id, optional alias, context_window, max_output_tokens, parse_reasoning, include_reasoning, thinking_param, and reasoning_efforts. |
| fast_model | string | Lightweight model used for control-plane tasks (e.g., title generation) and the agent model: fast alias. This is separate from service-tier fast mode. Usually this is all you need. |
| fast_provider | string | Optional provider key to use when the fast_model should run on a different configured provider than this one. |
| service_tier | string | Optional Responses API service tier for built-in openai and chatgpt providers. Use fast or priority to request fast/priority service where the selected model supports it. Omit the field to send no service tier. |
| context_window | int | Override context window size in tokens. Use this for self-hosted models not in the built-in token limit tables. |
| max_output_tokens | int | Override maximum output tokens. Same use case as context_window. |
| no_stream_options | bool | When true, don’t send stream_options in the request. Use this for servers that reject the field. Default false; most OpenAI-compatible servers (vLLM, Ollama, LM Studio) support it and need it to report token usage. |
| parse_reasoning | bool | Send parse_reasoning for OpenAI-compatible APIs that can parse inline model thinking into reasoning_content (for example Friendli). |
| include_reasoning | bool | Send include_reasoning; useful with parse_reasoning: true when you want streamed delta.reasoning_content events. |
| thinking_param | string | Generic OpenAI-compatible chat-template control. When a reasoning effort is selected (for example a -high/-max suffix), term-llm sends chat_template_kwargs.<thinking_param>: true. Friendli GLM-5.2 uses enable_thinking. |
| vllm_thinking_param | string | type: vllm only. Override the chat-template thinking key when auto-detection is not possible: enable_thinking for Qwen-style templates, thinking for DeepSeek-style templates. |
| use_websocket | bool | Reserved for providers with native Responses WebSocket support. Defaults to true only for built-in openai and chatgpt; OpenAI-compatible providers default to HTTP/SSE. |
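The api_key lookup order from the table can be illustrated with a small sketch covering only the `${ENV_VAR}` form and the environment fallback. `resolve_api_key` is a hypothetical helper, and the way the provider name is turned into an environment variable (uppercased, with non-alphanumeric characters replaced by underscores) is an assumption; the `op://`, `file://`, and `$()` resolvers are deliberately left out.

```python
import os
import re

_ENV_REF = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_api_key(provider_name, configured=None, env=None):
    """Resolve an API key from config or from <PROVIDER_NAME>_API_KEY."""
    env = os.environ if env is None else env
    if configured:
        if configured.startswith(("op://", "file://", "$(")):
            raise NotImplementedError("resolver not covered by this sketch")
        match = _ENV_REF.match(configured)
        if match:
            # ${ENV_VAR}: read the named variable.
            return env.get(match.group(1))
        # Anything else is treated as a literal key.
        return configured
    # Omitted key: fall back to <PROVIDER_NAME>_API_KEY (naming rule assumed).
    fallback = re.sub(r"[^A-Za-z0-9]", "_", provider_name).upper() + "_API_KEY"
    return env.get(fallback)
```

Passing `env` explicitly keeps the function easy to test; by default it reads the process environment.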
Model object entries
models may mix plain strings and objects. Plain strings are enough for autocomplete/model picker entries. Object entries are for endpoints where the model you want to type locally differs from the model ID the API expects, or where each model needs its own metadata.
providers:
  custom:
    type: openai_compatible
    base_url: https://api.example.com/v1
    api_key: ${CUSTOM_API_KEY}
    model: friendly-name
    models:
      - simple-upstream-model
      - id: upstream/model-id
        alias: friendly-name
        context_window: 262144
        max_output_tokens: 32768
        # Optional: only set these for APIs/models that support them.
        parse_reasoning: true
        include_reasoning: true
        thinking_param: enable_thinking
        reasoning_efforts: [high, max]
| Model object field | Description |
|---|---|
| id | Upstream model ID sent in the API request. If alias is omitted, this is also the local name. |
| alias | Friendly local name for CLI use, shell completion, and model picker display. The provider default model may be either id or alias. |
| context_window | Per-model context window metadata. |
| max_output_tokens | Per-model output token cap metadata; OpenAI-compatible requests clamp explicit max_output_tokens to this value. |
| parse_reasoning | Per-model override for the provider-level parse_reasoning flag. |
| include_reasoning | Per-model override for the provider-level include_reasoning flag. |
| thinking_param | Per-model override for the provider-level thinking_param key. Sent as chat_template_kwargs.<thinking_param>: true only when a non-default effort is selected. |
| reasoning_efforts | Exact suffixes to expose for this model, for example [high, max]. The bare model/alias remains the default and sends no reasoning_effort. |
For -p custom:friendly-name-max, term-llm sends model: upstream/model-id plus reasoning_effort: max. If reasoning_efforts is empty or omitted, no effort-suffixed aliases are generated for that model.
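The alias and suffix rules above can be modeled with two small functions. This is a sketch under assumptions: `local_names` and `resolve` are hypothetical helpers, and effort suffixes are attached only to the alias (or to the id when no alias is set).

```python
def local_names(entry):
    """Names exposed for one models entry: the bare name plus declared efforts."""
    if isinstance(entry, str):
        return [entry]
    base = entry.get("alias") or entry["id"]
    return [base] + [f"{base}-{e}" for e in entry.get("reasoning_efforts", [])]


def resolve(models, name):
    """Return (upstream_id, effort_or_None) for a local name, or None if unknown."""
    for entry in models:
        if isinstance(entry, str):
            if name == entry:
                return entry, None
            continue
        upstream = entry["id"]
        base = entry.get("alias") or upstream
        if name in (base, upstream):
            # Bare alias or id: default model, no reasoning_effort.
            return upstream, None
        for effort in entry.get("reasoning_efforts", []):
            if name == f"{base}-{effort}":
                return upstream, effort
    return None


# Mirrors the models list from the config above.
MODELS = [
    "simple-upstream-model",
    {"id": "upstream/model-id", "alias": "friendly-name",
     "reasoning_efforts": ["high", "max"]},
]
```

With this data, `resolve(MODELS, "friendly-name-max")` gives the upstream ID and `max`, while `friendly-name-low` is not recognized because `low` is not declared.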
Friendli reasoning
Friendli is OpenAI-compatible, but reasoning-capable models such as GLM-5.2 need explicit parser flags to expose thinking as reasoning_content instead of leaving it inline. Configure those fields on the specific model entry:
providers:
  friendli:
    type: openai_compatible
    base_url: https://api.friendli.ai/serverless/v1
    api_key: ${FRIENDLI_API_KEY}
    model: glm52
    models:
      - id: zai-org/GLM-5.2
        alias: glm52
        context_window: 1048576
        max_output_tokens: 131072
        parse_reasoning: true
        include_reasoning: true
        thinking_param: enable_thinking
        reasoning_efforts: [high, max]
With that config, effort suffixes are generated only from the declared reasoning_efforts. For example -p friendli:glm52-max sends model: zai-org/GLM-5.2, reasoning_effort: max, parse_reasoning: true, include_reasoning: true, and chat_template_kwargs.enable_thinking: true. The bare -p friendli:glm52 sends no reasoning_effort; because only high and max are listed, term-llm does not offer glm52-low or glm52-medium completions.
You can set the same reasoning-parser fields at the provider level as defaults, then override them per model object when different models on the same endpoint need different behavior.
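To make the resulting request concrete, the sketch below assembles the reasoning-related fields for a model entry, with per-model values overriding provider-level defaults. `build_request_fields` is a hypothetical helper, and sending `parse_reasoning` / `include_reasoning` on bare (no-effort) requests is an assumption.

```python
def build_request_fields(entry, effort=None, provider_defaults=None):
    """Return the model and reasoning fields for one request."""
    # Per-model settings override provider-level defaults.
    cfg = {**(provider_defaults or {}), **entry}
    body = {"model": cfg["id"]}
    for flag in ("parse_reasoning", "include_reasoning"):
        if cfg.get(flag):
            body[flag] = True
    if effort is not None:
        if effort not in cfg.get("reasoning_efforts", []):
            raise ValueError(f"effort {effort!r} is not declared for {cfg['id']}")
        body["reasoning_effort"] = effort
        if cfg.get("thinking_param"):
            # Sent only when a non-default effort is selected.
            body["chat_template_kwargs"] = {cfg["thinking_param"]: True}
    return body


# Mirrors the glm52 entry from the config above.
GLM52 = {
    "id": "zai-org/GLM-5.2",
    "alias": "glm52",
    "parse_reasoning": True,
    "include_reasoning": True,
    "thinking_param": "enable_thinking",
    "reasoning_efforts": ["high", "max"],
}
```

Calling `build_request_fields(GLM52, "max")` reproduces the field set listed for `-p friendli:glm52-max`; omitting the effort drops `reasoning_effort` and `chat_template_kwargs`.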
Full example
providers:
  my-vllm:
    type: vllm
    base_url: http://gpu-server:8000/v1
    model: Qwen/Qwen3-30B-A3B
    api_key: ${VLLM_API_KEY}
    context_window: 32768
    max_output_tokens: 8192
    models:
      - Qwen/Qwen3-30B-A3B
      - Qwen/Qwen3-8B
  legacy-server:
    type: openai_compatible
    url: http://old-server:5000/api/chat
    model: custom-finetune
    no_stream_options: true # this server rejects stream_options
Service tiers and fast mode
Built-in openai and chatgpt text providers can send the Responses API service_tier field. To request fast/priority service for all turns through a provider, set service_tier in that provider config:
providers:
  openai:
    model: gpt-5.4
    service_tier: fast # alias for API value "priority"
  chatgpt:
    model: gpt-5.5-medium
    service_tier: priority # equivalent to "fast"
Leave service_tier unset to omit the field entirely. Only some models/accounts support fast service; unsupported requests may be ignored or rejected by the provider. In chat, /fast toggles the fast service tier for the current session. The status line shows fast when it is currently active.
This is different from fast_model / optional fast_provider, which choose a lightweight model for term-llm control-plane tasks such as summaries or title generation, and for agent configs that use model: fast.
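The service_tier handling can be summarized in a few lines. This is an illustrative sketch with a hypothetical `apply_service_tier` helper; it encodes only what is stated above, namely that fast is an alias for the API value priority and that an unset value omits the field.

```python
def apply_service_tier(request, configured=None):
    """Return a copy of the request with service_tier set when configured."""
    out = dict(request)
    if not configured:
        # Unset: the field is omitted entirely.
        return out
    # "fast" is an alias for the API value "priority".
    out["service_tier"] = "priority" if configured == "fast" else configured
    return out
```

For example, `apply_service_tier({"model": "gpt-5.4"}, "fast")` adds `service_tier: priority`, and the same call without a tier returns the request unchanged.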
Reasoning and model suffixes
Model/provider suffixes control how much reasoning a provider is asked to do. Display of the resulting reasoning is controlled separately by the top-level reasoning config. Non-encrypted provider-marked thinking is shown as collapsed Thinking... / Thought: <title> blocks by default; encrypted reasoning/signature payloads are replay-only and are never displayed.
OpenAI reasoning effort
For OpenAI models, append -low, -medium, -high, or -xhigh to control reasoning effort.
term-llm ask --provider openai:gpt-5.2-xhigh "complex question"
term-llm exec --provider openai:gpt-5.2-low "quick task"
providers:
  openai:
    model: gpt-5.2-high
| Effort | Meaning |
|---|---|
| low | faster, cheaper, less thorough |
| medium | balanced default |
| high | more thorough reasoning |
| xhigh | maximum reasoning on supported models |
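A provider spec such as openai:gpt-5.2-xhigh can be taken apart in two steps, sketched below. Both helpers are hypothetical and only model the naming convention described here; they are not term-llm's parser.

```python
OPENAI_EFFORTS = ("low", "medium", "high", "xhigh")


def split_provider_spec(spec):
    """Split 'provider:model' into (provider, model_or_None)."""
    provider, sep, model = spec.partition(":")
    return provider, (model if sep else None)


def split_effort(model, efforts=OPENAI_EFFORTS):
    """Split a trailing effort suffix off a model name."""
    for effort in efforts:
        suffix = "-" + effort
        if model.endswith(suffix) and len(model) > len(suffix):
            return model[: -len(suffix)], effort
    # No recognized suffix: the name is used as-is.
    return model, None
```

Splitting on the first colon only keeps model IDs that contain slashes or further punctuation intact.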
vLLM thinking suffixes
For configured providers with type: vllm, the same suffix parser can be applied to the provider name itself. This is useful when the model ID is long and already configured:
term-llm ask -p cdck_qwen-high "reason carefully"
With:
providers:
  cdck_qwen:
    type: vllm
    base_url: https://gpu-server.example.com:8000/v1
    model: Qwen/Qwen3.5-122B-A10B
term-llm sends the base model ID plus vLLM thinking controls. For Qwen-style templates:
| Suffix | Qwen/vLLM behavior |
|---|---|
| none | disable thinking by default (enable_thinking: false), no thinking_token_budget |
| -low | enable thinking, budget 1024 |
| -medium | enable thinking, budget 4096 |
| -high / -xhigh / -max | enable thinking, budget 10000 |
For DeepSeek-style templates, auto-detected from deepseek in the model name or forced with vllm_thinking_param: thinking:
| Suffix | DeepSeek/vLLM behavior |
|---|---|
| none | thinking: false |
| -low / -medium / -high | thinking: true, reasoning_effort: high |
| -xhigh / -max | thinking: true, reasoning_effort: max |
Reasoning replay uses vLLM’s reasoning assistant-message field. vLLM may still report reasoning_tokens: 0 in usage metadata even when reasoning text was streamed; this is a known vLLM-side accounting gap.
Anthropic extended thinking
For Anthropic models, append -thinking:
term-llm ask --provider anthropic:claude-sonnet-4-6-thinking "complex question"
providers:
  anthropic:
    model: claude-sonnet-4-6-thinking
AWS Bedrock
The bedrock provider routes Anthropic Claude models through AWS Bedrock. It supports the same model suffixes (-thinking, -1m) and has full feature parity with the direct anthropic provider.
Authentication uses the standard AWS credential chain (AWS_ACCESS_KEY_ID env var, ~/.aws/credentials, instance profiles), or explicit credentials in config:
providers:
  bedrock:
    region: us-west-2
    access_key_id: $(op-cache read "op://Private/AWS Bedrock/AWS_ACCESS_KEY_ID")
    secret_access_key: $(op-cache read "op://Private/AWS Bedrock/AWS_SECRET_ACCESS_KEY")
    model: claude-sonnet-4-6-thinking
Model resolution uses a 3-tier system. Friendly model names like claude-sonnet-4-6 are automatically translated to Bedrock cross-region IDs. Use model_map to override with application inference profile ARNs or specific Bedrock IDs:
providers:
  bedrock:
    region: us-west-2
    model: claude-sonnet-4-6-thinking
    model_map:
      claude-sonnet-4-6: arn:aws:bedrock:us-west-2:123456789:application-inference-profile/abc123
      claude-opus-4-6: us.anthropic.claude-opus-4-6-v1
Suffixes are stripped before lookup, so claude-sonnet-4-6-1m-thinking strips to claude-sonnet-4-6, resolves through model_map, then re-applies thinking and 1M context.
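The strip-then-lookup order can be sketched as follows. `strip_suffixes` and `resolve_bedrock_model` are hypothetical names, and the sketch covers only the model_map step; names missing from the map are returned unchanged instead of being translated to cross-region IDs.

```python
def strip_suffixes(model):
    """Return (base, thinking, one_million) with -thinking and -1m removed."""
    thinking = one_million = False
    changed = True
    while changed:
        changed = False
        if model.endswith("-thinking"):
            model, thinking, changed = model[: -len("-thinking")], True, True
        elif model.endswith("-1m"):
            model, one_million, changed = model[: -len("-1m")], True, True
    return model, thinking, one_million


def resolve_bedrock_model(model, model_map):
    """Look up the suffix-free name in model_map, keeping the suffix flags."""
    base, thinking, one_million = strip_suffixes(model)
    # Unmapped names fall through unchanged in this sketch.
    target = model_map.get(base, base)
    return target, thinking, one_million
```

The returned flags stand in for "re-applies thinking and 1M context": the caller would enable those features on the request after the ID has been resolved.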
The geographic prefix (us., eu., ap.) is derived from the configured region automatically. For example, eu-west-1 produces eu.anthropic.* IDs, ap-southeast-1 produces ap.anthropic.*, etc. This ensures data residency matches your region without manual override.
Raw Bedrock model IDs (us.anthropic.claude-sonnet-4-6, anthropic.claude-sonnet-4-6) and full ARNs are passed through without translation.
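The region-to-prefix rule and the pass-through check can be approximated like this. Both helpers are hypothetical; only the three prefixes named above are handled, and recognizing raw IDs by an `arn:` prefix or an embedded `anthropic.` segment is a heuristic assumed for this sketch.

```python
def geo_prefix(region):
    """Derive the geographic prefix (us, eu, ap) from an AWS region name."""
    head = region.split("-", 1)[0]
    if head in ("us", "eu", "ap"):
        return head
    raise ValueError(f"no known geographic prefix for region {region!r}")


def is_raw_bedrock_id(model):
    """Heuristic: full ARNs and anthropic.* IDs are passed through untranslated."""
    return model.startswith("arn:") or "anthropic." in model
```

Under these assumptions, `eu-west-1` maps to `eu` and `ap-southeast-1` to `ap`, matching the examples above, while a friendly name like `claude-sonnet-4-6` is not treated as raw.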
| Config field | Description |
|---|---|
| region | AWS region. Falls back to AWS_REGION env var, then us-east-1. |
| profile | AWS profile name from ~/.aws/credentials. |
| access_key_id | Explicit AWS access key. Supports $(), op://, ${ENV}. |
| secret_access_key | Explicit AWS secret key. Same resolution support. |
| session_token | Optional session token for temporary credentials. |
| model_map | Map of friendly names to Bedrock model IDs or ARNs. |
Native search support
Some providers support native web search. Others rely on external search tooling.
Native support is most relevant for:
- Anthropic
- Bedrock
- OpenAI
- xAI
- Gemini
You can override behavior with:
term-llm ask "latest news" -s --native-search
term-llm ask "latest news" -s --no-native-search
Or in config:
search:
  force_external: true
providers:
  gemini:
    use_native_search: false
See Search for the full routing model.
Recommendations by use case
- fast free experimentation: zen
- OpenAI ecosystem / Codex editing: openai
- Claude models: anthropic
- Claude models via AWS billing: bedrock
- broad model access: openrouter
- local inference: ollama or another OpenAI-compatible endpoint
- subscription-backed consumer access: chatgpt, copilot, or gemini-cli