Hugging Face Inference Providers: inference through the :cheapest variant does not work correctly

Incorrect Provider Resolution for :cheapest Model Variant (openai/gpt-oss-20b)

We are observing an inconsistency with the :cheapest routing behavior for the openai/gpt-oss-20b model. When invoking the model using the :cheapest variant, requests are being routed to the fireworks-ai provider, even though hyperbolic is currently listed as the lowest-cost provider for this model on the Hugging Face Inference pricing page.

This can be verified on the model's Inference Providers pricing page, where Hyperbolic is shown as the cheapest available provider.

However, the following request consistently resolves to Fireworks AI:

curl https://huggingface.co/static-proxy/router.huggingface.co/v1/chat/completions \
  -H "Authorization: <redacted-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Hi there buddy" }
    ],
    "model": "openai/gpt-oss-20b:cheapest",
    "stream": false
  }' -i

The response headers confirm this routing to Fireworks AI: x-inference-provider: fireworks-ai

HTTP/2 200 
content-type: application/json
date: Mon, 22 Dec 2025 19:04:24 GMT
x-ratelimit-remaining-tokens-generated: 600000
x-ratelimit-remaining-tokens-prompt: 59925
x-powered-by: huggingface-moon
vary: Origin
access-control-allow-origin: *
access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
x-robots-tag: none
x-request-id: 1aaebb16-ebf7-42c0-8f44-6743c40fc694
cross-origin-opener-policy: same-origin
referrer-policy: strict-origin-when-cross-origin
x-inference-provider: fireworks-ai  <------- HERE
cf-cache-status: DYNAMIC
cf-ray: 9b21e27a7a858de7-IAD
fireworks-prompt-tokens: 75
fireworks-sampling-options: {"max_tokens": 2048, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "min_p": 0.0, "typical_p": 1.0, "frequency_penalty": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0, "mirostat_target": null, "mirostat_lr": 0.1}
fireworks-server-processing-time: 1.265
fireworks-server-time-to-first-token: 0.139
fireworks-speculation-prompt-matched-tokens: 0
server: cloudflare
via: 1.1 google, 1.1 a05ab23a60026e7a94dfc15016962b24.cloudfront.net (CloudFront)
x-envoy-upstream-service-time: 1271
x-ratelimit-limit-requests: 6000
x-ratelimit-limit-tokens-generated: 600000
x-ratelimit-limit-tokens-prompt: 60000
x-ratelimit-over-limit: no
x-ratelimit-remaining-requests: 5999
x-cache: Miss from cloudfront
x-amz-cf-pop: DXB53-P2
x-amz-cf-id: X6LPyjWO5rxJoeGTPaN_P-_n2ktrOQ88gXPR5sJcITbwLKgf50Jnyg==


# Response

{"id":"1aaebb16-ebf7-42c0-8f44-6743c40fc694","object":"chat.completion","created":1766430263,"model":"accounts/fireworks/models/gpt-oss-20b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! 👋 How can I help you today?","reasoning_content":"The user says \"Hi there buddy\". This is greeting. The assistant should respond politely. According to system message: \"You are ChatGPT, a large language model trained by OpenAI.\" There's also context on how to respond when missing info: if the user requests something that the assistant cannot do: reply with \"I'm sorry, but I can't assist with that.\", but the user is just greeting. So a friendly response. Then maybe ask what they need. \nAdditionally guidelines: \"When user says Hi: Reply with a greeting. ... You can ask if you'd like me to answer any question.\" There's no conflicting instructions. So answer: \"Hello! How can I help you today?\" or something. The prompt doesn't ask to start with anything else. Just respond with greeting and ask what they want."},"finish_reason":"stop"}],"usage":{"prompt_tokens":75,"total_tokens":258,"completion_tokens":183,"prompt_tokens_details":{"cached_tokens":0}}}

Can someone please help me understand this?

Thanks in advance!


Hmm… Is that just how :cheapest behaves…?


What you’re seeing is consistent with how HF Inference Providers is documented to work once you include the words “available right now” in the definition of :cheapest.

1) What :cheapest is supposed to do

Hugging Face documents :cheapest as a selection policy that “selects the provider with the lowest price per output token.” (Hugging Face)

So, in the abstract, if Hyperbolic is the lowest output $/1M for openai/gpt-oss-20b, it should win.
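In other words, the documented policy reduces to a minimum over output-token prices. A minimal sketch of that selection rule (the price table below is illustrative, not live Hugging Face pricing):

```python
# Illustrative price table: provider -> USD per 1M output tokens.
# These numbers are made up for the example, not real HF pricing.
prices = {
    "hyperbolic": 0.40,
    "fireworks-ai": 0.60,
    "together": 0.80,
}

def cheapest(prices: dict[str, float]) -> str:
    """Pick the provider with the lowest output-token price."""
    return min(prices, key=prices.get)

print(cheapest(prices))  # -> hyperbolic
```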

2) Why you can still land on Fireworks even if Hyperbolic is “cheapest” on the catalog page

There are several “filters” before pricing is even considered. The catalog UI is a static comparison table. The router is a runtime system that applies eligibility and availability constraints.

A. Failover on availability and health (most common)

HF explicitly describes automatic failover: if the chosen provider is “flagged as unavailable by our validation system,” requests are routed to alternatives. (Hugging Face)

So the sequence can be:

  1. Compute cheapest candidate (Hyperbolic, by output token price).
  2. Check if Hyperbolic is currently considered available for that model + endpoint + your auth context.
  3. If not available, fallback to the next eligible provider (Fireworks in your case).
  4. Router returns x-inference-provider: fireworks-ai.

This exactly matches your symptom of “consistently resolves to Fireworks.” It suggests Hyperbolic is being excluded at runtime, not that the suffix is ignored.
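The four-step sequence above can be sketched as a tiny selection function. The availability flags here are invented for illustration; the router's actual health-check logic is not public:

```python
def route_cheapest(prices: dict[str, float], available: dict[str, bool]) -> str:
    """Cheapest-first selection with availability fallback (illustrative).

    prices:    provider -> USD per 1M output tokens
    available: provider -> bool (runtime health/eligibility flag)
    """
    # Rank all candidates by price, cheapest first.
    ranked = sorted(prices, key=prices.get)
    for provider in ranked:
        if available.get(provider, False):
            return provider  # first eligible-and-available candidate wins
    raise RuntimeError("no provider available for this model")

prices = {"hyperbolic": 0.40, "fireworks-ai": 0.60}
# Hyperbolic flagged unavailable at runtime -> router falls back.
print(route_cheapest(prices, {"hyperbolic": False, "fireworks-ai": True}))
# -> fireworks-ai
```

With this model of the behavior, a "consistently Fireworks" result means the `available["hyperbolic"]` flag is false for you, not that the price ranking is wrong.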

B. Provider eligibility is not universal (account, org, and settings constraints)

Even when a provider is listed on the public pricing/catalog page, it may be disabled for you:

  • HF routes “by default” following your Inference Provider settings preference order. (Hugging Face)
  • Organizations can disable a set of Inference Providers in org settings. (Hugging Face)
  • Billing mode matters. “Routed by HF” vs “Custom Provider Key” changes what HF can do on your behalf. (Hugging Face)

Docs do not spell out whether :cheapest ignores provider allow/deny lists, but in practice routers almost always compute “cheapest among eligible providers.” If Hyperbolic is disabled (by you or your org), it will never be selected.

C. Endpoint or feature compatibility filtering (chat vs “text generation” differences)

Your request uses the OpenAI-compatible chat endpoint (/v1/chat/completions), which HF notes is “chat tasks only.” (Hugging Face)

Even if a provider serves the model, it can still be excluded for a specific endpoint if the mapping for that provider-model pair does not support that task or required parameters (structured outputs, tools, etc.). The catalog table mixes multiple capability signals (Tools, Structured) and can differ by provider row. (Hugging Face)

In your specific payload you are not using tools or structured outputs. But task-level mapping mismatches still happen in practice.

D. Catalog price labels can lag backend routing metadata

The “cheapest” badge and displayed prices come from provider metadata used in the Hub UI. Providers register pricing metadata and HF uses it for comparison and selection features like :cheapest. (Hugging Face)
If backend routing uses a cached snapshot or temporarily overrides a provider (maintenance, incident), UI and router can diverge.

3) Confirm what is happening in your case (fast checks)

Run these in order. Each one narrows the cause.

Step 1. Force Hyperbolic explicitly

If this fails, then :cheapest is doing the right thing by skipping Hyperbolic.

curl https://huggingface.co/static-proxy/router.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b:hyperbolic",
    "messages": [{"role":"user","content":"ping"}],
    "stream": false
  }' -i

Interpretation:

  • 200 OK + x-inference-provider: hyperbolic: Hyperbolic works. If :cheapest still picks Fireworks, that is likely a routing bug or an “eligibility” filter (Step 2).
  • 4xx/5xx: Hyperbolic is unavailable to you for that route. That explains Fireworks fallback.

Step 2. Check your provider allowlist / org policy

If you are in an org, confirm the org didn’t disable Hyperbolic. HF explicitly allows org admins to disable providers. (Hugging Face)
Also check your personal Inference Provider settings and whether Hyperbolic is enabled and not blocked by preference configuration. (Hugging Face)

Step 3. Ask the router what it thinks is available

HF documents that the router exposes GET /v1/models to list available models across providers. (Hugging Face)

curl https://huggingface.co/static-proxy/router.huggingface.co/v1/models \
  -H "Authorization: Bearer $HF_TOKEN"

Look for:

  • Whether openai/gpt-oss-20b appears once or multiple times.
  • Whether the router exposes any metadata about “cheapest/fastest” or provider availability.
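To turn that listing into a price ranking, you can post-process the JSON locally. The structure below matches what the router returned for DeepSeek-V3 later in this thread (a `providers` array with `status` and `pricing.output`); treat it as an observed shape, not a guaranteed schema:

```python
import json

# Shape observed from router.huggingface.co/v1/models/<model>;
# not an officially documented schema.
payload = json.loads("""
{
  "data": {
    "id": "deepseek-ai/DeepSeek-V3",
    "providers": [
      {"provider": "novita",   "status": "live", "pricing": {"input": 0.32, "output": 1.04}},
      {"provider": "together", "status": "live", "pricing": {"input": 1.25, "output": 1.25}}
    ]
  }
}
""")

# Keep only live providers and rank them by output-token price.
live = [p for p in payload["data"]["providers"] if p["status"] == "live"]
ranked = sorted(live, key=lambda p: p["pricing"]["output"])
for p in ranked:
    print(p["provider"], p["pricing"]["output"])
# novita 1.04
# together 1.25
```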

Step 4. Inspect provider mapping via Hub API (debug view)

HF’s Hub API supports an inferenceProviderMapping expansion that exposes which providers are mapped for a model and can include status-like fields (live vs staging patterns exist in HF’s provider ecosystem docs). (Hugging Face)

Try:

curl -s \
  "https://huggingface.co/api/models/openai/gpt-oss-20b?expand=inferenceProviderMapping" \
  | jq .

You are looking for:

  • Whether Hyperbolic is present in the mapping.
  • Any “status” fields or hints that it is not live for chat completion.
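A quick filter over that mapping surfaces providers that are registered but not live. The JSON shape follows the `expand=inferenceProviderMapping` output as observed; the field names are not a documented contract:

```python
import json

# Trimmed example of the inferenceProviderMapping value (observed shape).
mapping = json.loads("""
{
  "novita":       {"status": "live",  "task": "conversational"},
  "fireworks-ai": {"status": "error", "task": "conversational"}
}
""")

# Providers that are mapped for the model but not currently live.
not_live = {name: m["status"] for name, m in mapping.items() if m["status"] != "live"}
print(not_live)  # {'fireworks-ai': 'error'}
```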

4) Practical workarounds

If you need the cheapest provider deterministically

Use an explicit provider suffix instead of :cheapest:

  • openai/gpt-oss-20b:hyperbolic

This bypasses price routing logic and avoids “cheapest but filtered out” ambiguity. HF documents explicit provider selection via suffix. (Hugging Face)

If you want “cheapest with safe fallback”

Set provider preference order so Hyperbolic is first, then a fallback provider second, and call without :cheapest so you get “first available” behavior. That behavior is explicitly documented. (Hugging Face)

This is not identical to :cheapest, but it is operationally stable.
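If you want cheapest-with-fallback in your own client code, you can implement it explicitly with provider suffixes. A sketch; the transport is injected so the ordering logic stays testable, and `send` is a placeholder for whatever HTTP client you point at router.huggingface.co:

```python
def chat_with_fallback(model: str, providers: list[str], send):
    """Try explicit provider suffixes in preference order (cheapest first).

    model:     base model id, e.g. "openai/gpt-oss-20b"
    providers: preference order of provider suffixes
    send:      callable(model_with_suffix) -> response; raises on failure
    """
    last_err = None
    for provider in providers:
        try:
            return send(f"{model}:{provider}")
        except Exception as err:  # provider down/ineligible: try the next one
            last_err = err
    raise RuntimeError(f"all providers failed for {model}") from last_err

# Example with a stub transport: hyperbolic "fails", fireworks-ai succeeds.
def fake_send(model_id):
    if model_id.endswith(":hyperbolic"):
        raise ConnectionError("503 from provider")
    return {"model": model_id}

print(chat_with_fallback("openai/gpt-oss-20b",
                         ["hyperbolic", "fireworks-ai"], fake_send))
# {'model': 'openai/gpt-oss-20b:fireworks-ai'}
```

Unlike `:cheapest`, this keeps the preference order and the failure handling under your control, so a routing discrepancy on the server side cannot silently change which provider you pay for.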

5) If this is a bug, what to include in a report

To make HF able to reproduce quickly, include:

  • x-request-id (you already have it)
  • exact timestamp and region POP (cf-ray hints at this)
  • result of :hyperbolic forced call
  • output of /v1/models filtered for that model
  • output of Hub API expand=inferenceProviderMapping

That distinguishes “router bug” from “provider unavailable/disabled.”

Summary

  • :cheapest means “lowest output-token price,” but only among providers the router considers eligible and available. (Hugging Face)
  • If Hyperbolic is unavailable, disabled, or incompatible for that endpoint, failover can route you to Fireworks. (Hugging Face)
  • Verify by forcing :hyperbolic, then check /v1/models and expand=inferenceProviderMapping. (Hugging Face)

It looks like it's started to work for openai/gpt-oss-20b:cheapest now; the inference request is getting routed to hyperbolic, which is indeed the cheapest provider, so I am no longer able to reproduce this with openai/gpt-oss-20b.

But I see the same issue with deepseek-ai/DeepSeek-V3 now: it's getting routed to fireworks-ai with the :cheapest variant, while the cheapest provider is novita according to all three of:

  1. Pricing page
  2. /v1/models API
  3. Provider mapping

Please look below for all the details:

cURL with deepseek-ai/DeepSeek-V3:cheapest

curl https://huggingface.co/static-proxy/router.huggingface.co/v1/chat/completions \
    -H "Authorization: <redacted_token>" \
    -H 'Content-Type: application/json' \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "Hi there buddy"
            }
        ],
        "model": "deepseek-ai/DeepSeek-V3:cheapest",
        "stream": false
    }' -i
HTTP/2 404 
content-type: application/json
content-length: 197
date: Tue, 23 Dec 2025 17:03:31 GMT
x-inference-provider: fireworks-ai <-------- Provider
x-powered-by: huggingface-moon
vary: Origin
access-control-allow-origin: *
access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
x-robots-tag: none
x-request-id: Root=1-694acb63-634e7eb839a55f25565e0a3e <-------- Request ID
cross-origin-opener-policy: same-origin
referrer-policy: strict-origin-when-cross-origin
via: 1.1 google, 1.1 11dadbbdc784cc6bd07ca958a23917fa.cloudfront.net (CloudFront)
x-cache: Error from cloudfront
x-amz-cf-pop: DXB53-P2
x-amz-cf-id: R9gphI8ya1dP5TAkS_yso3hgPeNtAnzJy0ui63Ovzy3kYDTg351mtA==


# Response:

{"error":{"message":"Model not found, inaccessible, and/or not deployed","param":"model","code":"NOT_FOUND","type":"error"},"request_id":"Root=1-694acb63-634e7eb839a55f25565e0a3e-chatcmpl-bf09f54"}

Pricing Page
The cheapest provider is novita while fireworks-ai is down, yet the request is still being routed to the down provider:

Forced call to the cheapest provider i.e. novita is working fine:

curl https://huggingface.co/static-proxy/router.huggingface.co/v1/chat/completions \
    -H "Authorization: hf_token" \
    -H 'Content-Type: application/json' \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "Hi there buddy"
            }
        ],
        "model": "deepseek-ai/DeepSeek-V3:novita",  
        "stream": false
    }' -i
HTTP/2 200 
content-type: application/json
content-length: 436
date: Tue, 23 Dec 2025 17:06:19 GMT
x-inference-provider: novita  <-------- Provider
x-powered-by: huggingface-moon
vary: Origin
access-control-allow-origin: *
access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
x-robots-tag: none
x-request-id: Root=1-694acc0a-0ce7105e082464804a8b184d  <-------- Request ID
cross-origin-opener-policy: same-origin
referrer-policy: strict-origin-when-cross-origin
inference-id: 0-019b4c2d-0968-7690-883d-2667b5683be3
x-trace-id: f80d406197d1ab30b7d2506bc98e9d7c
x-cache: Miss from cloudfront
via: 1.1 fe143e4cfc20b38cefacea92e106cfac.cloudfront.net (CloudFront)
x-amz-cf-pop: DXB53-P2
x-amz-cf-id: 9f4ttVYtesNfuRyzfQ10jILwDD7nUNmsELTJvDhXSyIo1yNmumA4jQ==

# Response:

{"id":"f80d406197d1ab30b7d2506bc98e9d7c","object":"chat.completion","created":1766509578,"model":"deepseek/deepseek-v3-turbo","choices":[{"index":0,"message":{"role":"assistant","content":"Hey there! 👋 How's it going? What's on your mind today? 😊"},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":20,"total_tokens":26,"prompt_tokens_details":null,"completion_tokens_details":null},"system_fingerprint":""}

The Provider Mapping also shows fireworks-ai as down:

curl -s \
  "https://huggingface.co/api/models/deepseek-ai/DeepSeek-V3?expand=inferenceProviderMapping" \
  | jq .
{
  "_id": "676c000762cee1f3abc3ed5f",
  "id": "deepseek-ai/DeepSeek-V3",
  "inferenceProviderMapping": {
    "novita": {
      "status": "live",
      "providerId": "deepseek/deepseek-v3-turbo",
      "task": "conversational",
      "isModelAuthor": false
    },
    "together": {
      "status": "live",
      "providerId": "deepseek-ai/DeepSeek-V3",
      "task": "conversational",
      "isModelAuthor": false
    },
    "fal-ai": {
      "status": "error",
      "providerId": "deepseek-v3",
      "task": "conversational",
      "isModelAuthor": false
    },
    "fireworks-ai": {
      "status": "error",  <-------------------- HERE
      "providerId": "accounts/fireworks/models/deepseek-v3",
      "task": "conversational",
      "isModelAuthor": false
    }
  }
}

The https://huggingface.co/static-proxy/router.huggingface.co/v1/models/deepseek-ai/DeepSeek-V3 endpoint doesn’t list fireworks-ai for the model at all:

{
  "data": {
    "id": "deepseek-ai/DeepSeek-V3",
    "object": "model",
    "created": 1735131143,
    "owned_by": "deepseek-ai",
    "providers": [
      {
        "provider": "novita",
        "status": "live",
        "context_length": 64000,
        "pricing": {
          "input": 0.32,
          "output": 1.04
        },
        "supports_tools": true,
        "supports_structured_output": false,
        "is_model_author": false
      },
      {
        "provider": "together",
        "status": "live",
        "context_length": 131072,
        "pricing": {
          "input": 1.25,
          "output": 1.25
        },
        "supports_tools": true,
        "supports_structured_output": true,
        "is_model_author": false
      }
    ]
  }
}

Thanks for the detailed response, I appreciate every bit of it 🙂!

Let me confirm the other factors that you’ve mentioned.

Suggestion: if this is due to eligibility/availability filter gates in addition to the pricing one, then the pricing page should still respect those gates when showing cheapest/fastest and live information, or else not show questionable information that it cannot guarantee under those additional eligibility/availability gates.


Indeed. Or perhaps increasing the verbosity regarding the internal provider selection criteria during inference calls. 😅

I think this would mainly involve changes on Hugging Face’s server side, but since the huggingface_hub library is involved, raising an issue on https://github.com/huggingface/huggingface_hub/issues would be the fastest way to address it.