Hugging Face Inference Providers: inference through the :cheapest variant does not work correctly

John6666 · December 23, 2025, 4:15am

Hmm… Is that just how :cheapest behaves…?

What you’re seeing is consistent with how HF Inference Providers is documented to work once you include the words “available right now” in the definition of :cheapest.

1) What `:cheapest` is supposed to do

Hugging Face documents :cheapest as a selection policy that “selects the provider with the lowest price per output token.” (Hugging Face)

So, in the abstract, if Hyperbolic has the lowest output price per 1M tokens for openai/gpt-oss-20b, it should win.
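Here is a toy model of that rule that makes the definition concrete. It reads one `provider price` pair per line (price per 1M output tokens) and picks the lowest. The provider names and prices are placeholders, and this is not HF's router code.

```shell
#!/usr/bin/env bash
# Toy model of the documented :cheapest rule (NOT HF's implementation):
# read "provider output_price_per_1M" lines on stdin, print the cheapest provider.
# Ties keep the first provider seen (strict "<" comparison).
pick_cheapest() {
  awk '
    NF >= 2 && (best == "" || $2 + 0 < min) { min = $2 + 0; best = $1 }
    END { if (best != "") print best }
  '
}
```

For example, `printf 'provider-a 0.40\nprovider-b 0.10\n' | pick_cheapest` prints provider-b.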

2) Why you can still land on Fireworks even if Hyperbolic is “cheapest” on the catalog page

Price is only one input. The catalog UI is a static comparison table, while the router is a runtime system that also applies eligibility and availability constraints.

A. Failover on availability and health (most common)

HF explicitly describes automatic failover: if the chosen provider is “flagged as unavailable by our validation system,” requests are routed to alternatives. (Hugging Face)

So the sequence can be:

Compute cheapest candidate (Hyperbolic, by output token price).
Check if Hyperbolic is currently considered available for that model + endpoint + your auth context.
If not available, fall back to the next eligible provider (Fireworks in your case).
Router returns x-inference-provider: fireworks-ai.

This exactly matches your symptom of “consistently resolves to Fireworks.” It suggests Hyperbolic is being excluded at runtime, not that the suffix is ignored.
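The next sketch is a hypothetical shell model of that sequence. It illustrates the behavior HF describes and is not its router code: sort candidates by output price, then return the first one flagged as available. The three-column input and the yes/no flag are made-up conventions for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical model of "cheapest first, fail over if unavailable" (not HF's code).
# Input lines: "provider output_price_per_1M available" (available = yes|no).
# Prints the chosen provider; returns 1 if no candidate is available.
route_cheapest() {
  # Ascending by price; C locale so "." is always the decimal point.
  LC_ALL=C sort -k2,2n |
    awk '
      $3 == "yes" { print $1; found = 1; exit }   # first available row wins
      END { if (!found) exit 1 }                  # nothing available at all
    '
}
```

If the cheapest row is marked `no`, the next-cheapest `yes` row is returned. That is the Hyperbolic → Fireworks pattern described above.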

B. Provider eligibility is not universal (account, org, and settings constraints)

Even when a provider is listed on the public pricing/catalog page, it may be disabled for you:

By default, HF routes requests according to the provider preference order in your Inference Provider settings. (Hugging Face)
Organizations can disable a set of Inference Providers in org settings. (Hugging Face)
Billing mode matters. “Routed by HF” vs “Custom Provider Key” changes what HF can do on your behalf. (Hugging Face)

Docs do not spell out whether :cheapest ignores provider allow/deny lists, but in practice routers almost always compute “cheapest among eligible providers.” If Hyperbolic is disabled (by you or your org), it will never be selected.
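The following sketch makes "cheapest among eligible providers" concrete with a hypothetical pre-filter that drops disabled providers before any price comparison. The deny list is passed as arguments. How HF actually stores user or org provider settings is not modeled here.

```shell
#!/usr/bin/env bash
# Hypothetical eligibility filter (not an HF API): drop lines whose first field
# (the provider) appears in the argument list, e.g. providers disabled by you or your org.
# Usage: drop_disabled PROVIDER... < candidates.txt
drop_disabled() {
  awk -v denied="$*" '
    BEGIN { n = split(denied, d, " "); for (i = 1; i <= n; i++) off[d[i]] = 1 }
    !($1 in off)   # print only providers that are not disabled
  '
}
```

Whatever survives this filter is all a price comparison would ever see. A disabled Hyperbolic can therefore never be "cheapest", however low its catalog price.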

C. Endpoint or feature compatibility filtering (chat vs “text generation” differences)

Your request uses the OpenAI-compatible chat endpoint (/v1/chat/completions), which HF notes is “chat tasks only.” (Hugging Face)

Even if a provider serves the model, it can still be excluded from a specific endpoint if the mapping for that provider-model pair does not support the task or the required parameters (structured outputs, tools, etc.). The catalog table shows per-provider capability columns (Tools, Structured), and these can differ from row to row. (Hugging Face)

Your payload uses neither tools nor structured outputs, but task-level mapping mismatches can still happen in practice.

D. Catalog price labels can lag backend routing metadata

The “cheapest” badge and displayed prices come from provider metadata used in the Hub UI. Providers register pricing metadata and HF uses it for comparison and selection features like :cheapest. (Hugging Face)
If backend routing uses a cached snapshot, or temporarily overrides a provider (maintenance, incident), the UI and the router can diverge.

3) Confirm what is happening in your case (fast checks)

Run these in order. Each one narrows the cause.

Step 1. Force Hyperbolic explicitly

If this fails, then :cheapest is doing the right thing by skipping Hyperbolic.

curl https://router.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b:hyperbolic",
    "messages": [{"role":"user","content":"ping"}],
    "stream": false
  }' -i

Interpretation:

200 OK + x-inference-provider: hyperbolic: Hyperbolic works. If :cheapest still picks Fireworks, that is likely a routing bug or an “eligibility” filter (Step 2).
4xx/5xx: Hyperbolic is unavailable to you on that route, which explains the fallback to Fireworks.
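If you save the `curl -i` output from Step 1, a small helper can apply that interpretation mechanically. It is a hypothetical sketch that relies only on what the original response already showed: an HTTP status line and an x-inference-provider header.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a raw `curl -i` response on stdin and print
# "<ok|fail> <status> <provider>".
# Assumes the serving provider is reported in the x-inference-provider header.
classify_response() {
  awk '
    { sub(/\r$/, "") }                        # strip the CR from CRLF line endings
    NR == 1 { status = $2 }                   # "HTTP/2 200" or "HTTP/1.1 200 OK"
    tolower($0) ~ /^x-inference-provider:/ {  # header names are case-insensitive
      sub(/^[^:]*:[ \t]*/, ""); provider = $0
    }
    /^$/ { exit }                             # blank line = end of headers
    END {
      if (provider == "") provider = "unknown"
      verdict = (status ~ /^2/) ? "ok" : "fail"
      print verdict, status, provider
    }
  '
}
```

For example, add `> hyperbolic.txt` to the Step 1 command and then run `classify_response < hyperbolic.txt`. The output `ok 200 hyperbolic` means Hyperbolic served the request. Any `fail` line means it is unavailable to you on that route.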

Step 2. Check your provider allowlist / org policy

If you are in an org, confirm the org didn’t disable Hyperbolic. HF explicitly allows org admins to disable providers. (Hugging Face)
Also check your personal Inference Provider settings: confirm that Hyperbolic is enabled and not excluded by your preference configuration. (Hugging Face)

Step 3. Ask the router what it thinks is available

HF documents that the router exposes GET /v1/models to list available models across providers. (Hugging Face)

curl https://router.huggingface.co/v1/models \
  -H "Authorization: Bearer $HF_TOKEN"

Look for:

Whether openai/gpt-oss-20b appears once or multiple times.
Whether the router exposes any metadata about “cheapest/fastest” or provider availability.

Step 4. Inspect provider mapping via Hub API (debug view)

HF’s Hub API supports an inferenceProviderMapping expansion that shows which providers are mapped to a model. Mapping entries can carry status-like fields, and HF’s provider docs describe “live” vs “staging” mappings. (Hugging Face)

Try:

curl -s \
  "https://huggingface.co/api/models/openai/gpt-oss-20b?expand=inferenceProviderMapping" \
  | jq .

You are looking for:

Whether Hyperbolic is present in the mapping.
Any “status” fields or hints that it is not live for chat completion.

4) Practical workarounds

If you need the cheapest provider deterministically

Use an explicit provider suffix instead of :cheapest:

openai/gpt-oss-20b:hyperbolic

This bypasses price routing logic and avoids “cheapest but filtered out” ambiguity. HF documents explicit provider selection via suffix. (Hugging Face)

If you want “cheapest with safe fallback”

Set your provider preference order with Hyperbolic first and a fallback provider second, then call without :cheapest to get “first available” behavior. That behavior is explicitly documented. (Hugging Face)

This is not identical to :cheapest, but it is operationally stable.
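A client-side alternative that does not depend on account settings is to try explicit suffixes in your own order and keep the first provider that answers. This is a sketch, not an HF feature. It spends one small request per probe, and the provider list is just the pair from this thread. It assumes HF_TOKEN is exported.

```shell
#!/usr/bin/env bash
# Client-side fallback sketch (not an HF feature): try "model:provider" suffixes
# in the order given and print the first provider that answers HTTP 200.
# Assumes HF_TOKEN is exported in the environment.

ROUTER_URL="https://router.huggingface.co/v1/chat/completions"

# Send a one-message "ping" chat request for "model:provider"; print the HTTP status.
call_status() {
  curl -s -o /dev/null -w '%{http_code}' "$ROUTER_URL" \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$1\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"stream\":false}"
}

# Usage: first_working_provider MODEL PROVIDER...
# Returns 1 if none of the providers answered with 200.
first_working_provider() {
  local model=$1 p
  shift
  for p in "$@"; do
    if [ "$(call_status "$model:$p")" = "200" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  return 1
}
```

For example, `first_working_provider openai/gpt-oss-20b hyperbolic fireworks-ai` prints hyperbolic if the forced call works and fireworks-ai otherwise. You would then send the real request with that explicit suffix.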

5) If this is a bug, what to include in a report

To help HF reproduce the issue quickly, include:

x-request-id (you already have it)
exact timestamp and region/POP (the cf-ray header hints at this)
result of :hyperbolic forced call
output of /v1/models filtered for that model
output of Hub API expand=inferenceProviderMapping

That distinguishes “router bug” from “provider unavailable/disabled.”
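Most of those fields come straight from the response headers of the forced call. A hypothetical helper that extracts them from saved `curl -i` output could look like the sketch below. It uses the header names mentioned in this thread, plus the standard HTTP date header for the timestamp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the bug-report headers (lowercased names) from a
# raw `curl -i` response on stdin. Only the header block is scanned.
report_headers() {
  awk '
    { sub(/\r$/, "") }                 # strip the CR from CRLF line endings
    /^$/ { exit }                      # blank line = end of headers
    {
      name = tolower($0); sub(/:.*/, "", name)      # header name before first ":"
      if (name == "date" || name == "x-request-id" || name == "cf-ray" ||
          name == "x-inference-provider") {
        value = $0; sub(/^[^:]*:[ \t]*/, "", value)   # keep everything after "name:"
        print name ": " value
      }
    }
  '
}
```

For example, run `report_headers < hyperbolic.txt` on the saved Step 1 response and paste the result into the report.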

Good references and background reading (high-signal)

HF Inference Providers overview and routing policies (:cheapest, :fastest, explicit provider, failover): https://huggingface.co/docs/inference-providers/en/index (Hugging Face)
HF Pricing and Billing (HF-routed vs custom provider keys, org-level provider disabling): https://huggingface.co/docs/inference-providers/en/pricing (Hugging Face)
Inspect guide that restates the meaning of :cheapest (“lowest price per output token”): https://huggingface.co/docs/inference-providers/en/guides/evaluation-inspect-ai (Hugging Face)
Hub API / provider mapping discussion showing expand=inferenceProviderMapping in the wild: https://github.com/crewAIInc/crewAI/issues/3038 (GitHub)
HF forum thread mentioning the router suffixes and how users select cheapest/fastest: https://discuss.huggingface.co/t/batch-inference-with-huggingface-hub-for-serverless-providers/170649 (Hugging Face Forums)

Summary

:cheapest means “lowest output-token price,” but only among providers the router considers eligible and available. (Hugging Face)
If Hyperbolic is unavailable, disabled, or incompatible for that endpoint, failover can route you to Fireworks. (Hugging Face)
Verify by forcing :hyperbolic, then check /v1/models and expand=inferenceProviderMapping. (Hugging Face)
