If the error is 504, it probably doesn’t mean you’ve hit an account-level limit. That said, the free serverless endpoints are essentially published for demos and experimentation, so you shouldn’t expect production-grade stability from them.
If you’re integrating this into a product, I recommend a dedicated Inference Endpoint or hosting the model yourself with TEI.
What a 504 means on router.huggingface.co/hf-inference/...
A 504 Gateway Time-out from router.huggingface.co usually means:
- the router (gateway) waited too long for an upstream backend to produce a response, then gave up; and
- your client-side timeout settings generally cannot override a server-side gateway cutoff.
A Hugging Face forum deep-dive on timeouts describes this pattern as a gateway/proxy cap, distinct from a normal application error body. (Hugging Face Forums)
This fits your symptoms (30–60s “hang” → 504) better than classic rate limiting does.
Important: make sure you are using the “pipeline” URL for this model
For sentence-transformers/all-MiniLM-L6-v2, the model maintainers pinned a notice that the inference URL moved to:
.../pipeline/feature-extraction (embeddings)
.../pipeline/sentence-similarity (similarity)
…and they provide curl examples. (Hugging Face)
If you are calling:
https://huggingface.co/static-proxy/router.huggingface.co/hf-inference/models/sentence-transformers/all-MiniLM-L6-v2
without /pipeline/..., fix that first. It won’t solve all 504s, but it eliminates a common source of “it used to work quickly, then became flaky”.
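For reference, a minimal call to the pipeline URL might look like the sketch below (assuming the standard `{"inputs": ...}` payload for the feature-extraction task and a token exported as HF_TOKEN):

```python
import os

import requests

# Pipeline-style URL from the pinned model discussion (feature-extraction = embeddings).
API_URL = (
    "https://huggingface.co/static-proxy/router.huggingface.co/hf-inference/models/"
    "sentence-transformers/all-MiniLM-L6-v2/pipeline/feature-extraction"
)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}  # assumes HF_TOKEN is set

resp = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": ["A short test sentence."]},  # standard feature-extraction payload
    timeout=30,
)
resp.raise_for_status()
embeddings = resp.json()  # one embedding vector per input string
```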
Answers to your questions
1) Are there rate limits or throttling on this endpoint?
Yes, there are rate limits in the Hugging Face ecosystem, and hitting them is normally expressed as HTTP 429 Too Many Requests with RateLimit* headers (5-minute windows, tiers by plan). (Hugging Face)
However:
- Rate limiting ≠ your current symptom. You’re seeing 504 after long waits, which typically indicates queueing / backend load / router timeouts, not a clean “you are over quota” response.
- Inference via “HF Inference” is serverless and shared; it’s documented as a serverless service (formerly “Inference API (serverless)”). (Hugging Face)
If you want to confirm whether any of your failures are rate-limit related, log the status codes: if you never see 429, rate limits are probably not the primary cause.
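A small logging wrapper makes that easy to verify. This is just a sketch; post_with_logging is an illustrative helper name, not an HF API:

```python
import logging

import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hf-embeddings")

def post_with_logging(session: requests.Session, url: str, headers: dict, payload: dict):
    """POST once and log the status code plus any RateLimit* headers,
    so 429 (quota) and 504 (gateway timeout) failures are distinguishable in your logs."""
    resp = session.post(url, headers=headers, json=payload, timeout=60)
    ratelimit_headers = {k: v for k, v in resp.headers.items() if k.lower().startswith("ratelimit")}
    log.info("status=%s ratelimit=%s", resp.status_code, ratelimit_headers)
    if resp.status_code == 429:
        log.warning("429 Too Many Requests: you really are rate limited")
    elif resp.status_code == 504:
        log.warning("504 Gateway Time-out: upstream too slow, not a quota problem")
    return resp
```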
2) Heavy load or cold starts?
Both are plausible on serverless:
- HF Inference is serverless and focuses mostly on CPU inference use-cases like embeddings / classification. (Hugging Face)
- Serverless systems can exhibit cold starts and/or capacity contention, which shows up to clients as timeouts.
There are multiple public reports of intermittent 504s on HF serverless / router paths, sometimes acknowledged as fixed by staff after reports (suggesting operational issues, not client misuse). (Hugging Face Forums)
Also note: the official status page can show “Operational” even while particular models/providers have trouble. The status page currently shows operational and “no incidents reported” for recent months (as of Feb 2, 2026). (Hugging Face Status)
3) Best practices for higher reliability
A. Reduce calls: use the sentence-similarity pipeline when you can
Instead of many “pairwise” requests, send:
- one source_sentence
- many other_sentences
in a single request to .../pipeline/sentence-similarity. The pinned model discussion explicitly mentions this endpoint. (Hugging Face)
A staff reply in a real outage thread shows using the Python client’s sentence_similarity for this model. (Hugging Face Forums)
This is often the single biggest reliability improvement because it reduces:
- request count
- per-request overhead
- total time spent waiting in upstream queues
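Here is a minimal sketch of that consolidated call, using the official client’s sentence_similarity method mentioned above (it assumes you are authenticated, e.g. via HF_TOKEN or huggingface-cli login; the sentences are placeholders):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="sentence-transformers/all-MiniLM-L6-v2")

# One request scores the source sentence against all candidates at once.
scores = client.sentence_similarity(
    "How do I reset my password?",            # source_sentence
    other_sentences=[
        "Steps to change your account password",
        "Shipping times for international orders",
        "Recovering access to a locked account",
    ],
)
print(scores)  # one similarity score per candidate
```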
B. Batch embeddings (if you must embed) and cache aggressively
If you embed chunks, send lists of texts per request (batching), and cache embeddings by (model_id, text_hash) so you don’t recompute.
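A rough sketch of that caching pattern (the embed_batch callable and the in-memory dict are placeholders; in production you would back the cache with Redis, SQLite, or similar):

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Cache keyed by (model_id, sha256(text)) so identical chunks are never re-embedded.
_cache: Dict[Tuple[str, str], List[float]] = {}

def _key(model_id: str, text: str) -> Tuple[str, str]:
    return model_id, hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_with_cache(
    texts: List[str],
    model_id: str,
    embed_batch: Callable[[List[str]], List[List[float]]],  # your batched API call
) -> List[List[float]]:
    # Deduplicate cache misses, preserving order, then embed them in one batched request.
    missing = list(dict.fromkeys(t for t in texts if _key(model_id, t) not in _cache))
    if missing:
        for text, vector in zip(missing, embed_batch(missing)):
            _cache[_key(model_id, text)] = vector
    return [_cache[_key(model_id, t)] for t in texts]
```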
C. Keep inputs short and chunk long documents
The model card states: inputs longer than 256 word pieces are truncated by default. (Hugging Face)
There is also a model discussion about this exact behavior and the need to split into meaningful parts. (Hugging Face)
If you send long document text repeatedly, you increase latency and cost while potentially not improving embedding quality past the truncation point.
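A simple word-based chunker is usually a good first approximation. Note this is a sketch: word pieces are not the same as words, so the limits below are deliberately conservative; for exact control you could tokenize with the model’s own tokenizer instead.

```python
from typing import List

def chunk_text(text: str, max_words: int = 180, overlap: int = 30) -> List[str]:
    """Split text into overlapping word windows that stay safely under
    the model's 256-word-piece truncation limit."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks
```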
D. Connection management + backoff (avoid “retry storms”)
- Use a single requests.Session() with connection pooling/keep-alive.
- Use bounded retries with exponential backoff + jitter.
- Add a circuit breaker: if the 504 rate spikes, stop hammering the endpoint for a short cooldown.
A forum post about repeated gateway timeouts notes that these are often intermittent and sometimes resolve after platform-side fixes—aggressive retries can worsen congestion. (Hugging Face Forums)
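A sketch of all three practices in one place (the retry limits, cooldown, and failure threshold are illustrative values, not anything Hugging Face prescribes):

```python
import random
import time

import requests

session = requests.Session()  # connection pooling + keep-alive across requests

_failures = 0            # consecutive retryable failures seen
_circuit_open_until = 0.0

def post_with_backoff(url: str, headers: dict, payload: dict, max_retries: int = 4):
    """Bounded retries with exponential backoff + jitter, plus a crude circuit breaker."""
    global _failures, _circuit_open_until
    if time.time() < _circuit_open_until:
        raise RuntimeError("circuit open: cooling down after repeated timeouts/5xx")
    for attempt in range(max_retries):
        try:
            resp = session.post(url, headers=headers, json=payload, timeout=60)
            if resp.status_code != 429 and resp.status_code < 500:
                _failures = 0
                return resp  # success, or a non-retryable 4xx the caller should inspect
        except requests.RequestException:
            pass  # network errors are treated like retryable failures
        _failures += 1
        if _failures >= 5:
            _circuit_open_until = time.time() + 120  # 2-minute cooldown
            raise RuntimeError("too many consecutive failures: opening circuit")
        time.sleep(min(2 ** attempt, 30) + random.uniform(0, 1))  # backoff + jitter
    raise RuntimeError(f"giving up after {max_retries} attempts")
```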
E. Prefer the official client when possible
huggingface_hub.InferenceClient is designed to work across:
- the (free) Inference API,
- Inference Endpoints,
- third-party Inference Providers. (Hugging Face)
In practice, it also reduces “URL drift” problems (wrong base URL / wrong task path) during platform transitions. (Hugging Face Forums)
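The same client code also moves with you if you later switch from serverless to a dedicated endpoint, since the model argument accepts either a Hub model id or an endpoint URL (the endpoint URL below is a placeholder):

```python
from huggingface_hub import InferenceClient

# Serverless HF Inference: address the model by its Hub id.
serverless = InferenceClient(model="sentence-transformers/all-MiniLM-L6-v2")

# Dedicated Inference Endpoint: pass your endpoint URL instead (placeholder shown).
dedicated = InferenceClient(model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud")

vector = serverless.feature_extraction("Minimal example sentence for an embedding.")
```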
4) Would paid or dedicated endpoints avoid these timeouts?
Paid (PRO) helps with billing/quotas; it does not inherently guarantee the serverless router won’t time out.
- Inference Providers pricing shows Free vs PRO monthly credits and pay-as-you-go eligibility. (Hugging Face)
- But 504 reports exist even when users suspect plan-related issues, and the more direct fix in several threads is “HF staff applied a fix.” (Hugging Face Forums)
If you need production-grade reliability, the typical step is:
Move to dedicated Inference Endpoints (recommended)
HF Inference docs explicitly distinguish HF Inference (serverless) from Inference Endpoints (dedicated + autoscaling). (Hugging Face)
For embeddings specifically: use Text Embeddings Inference (TEI) on a dedicated endpoint
TEI is designed for embeddings workloads and includes token-based dynamic batching to improve throughput and reduce tail latency under load. (Hugging Face)
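If you go that route, the calling code stays simple. A hedged sketch against TEI’s /embed route on your own deployment (both the endpoint URL and the auth token are placeholders):

```python
import os

import requests

# Placeholder URL for your own dedicated endpoint running TEI.
TEI_URL = "https://YOUR-TEI-ENDPOINT.endpoints.huggingface.cloud/embed"

resp = requests.post(
    TEI_URL,
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},  # whatever secures your endpoint
    json={"inputs": ["first chunk of a document", "second chunk of a document"]},
    timeout=30,
)
resp.raise_for_status()
vectors = resp.json()  # one embedding vector per input string
```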
Why your specific workload pattern tends to trigger 504s
You described:
- “Multiple similarity requests per document”
- “Sequential HTTPS POST”
- “Moderate load”
This often creates an “amplification” effect:
- Each document triggers many API calls.
- Under concurrency (multiple docs in flight), calls accumulate faster than they complete.
- The shared serverless backend queues up work.
- The router hits its max wait and returns 504.
If you switch to:
- one sentence-similarity call per document (or per chunk),
- or one batched embeddings call per N chunks,
you usually reduce request counts by 10×–100×, which directly lowers timeout risk.
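The batched-embeddings variant only needs a few lines of local math afterwards; a sketch with NumPy (cosine_similarities is an illustrative helper, not an HF API):

```python
import numpy as np

def cosine_similarities(source_vec, candidate_vecs):
    """Score one source embedding against many candidate embeddings locally,
    so computing similarity costs no extra API calls."""
    a = np.asarray(source_vec, dtype=np.float32)
    b = np.asarray(candidate_vecs, dtype=np.float32)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a  # one cosine score per candidate
```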
Similar cases/issues online (high-signal)
- Same model: intermittent 504 / multi-minute delay, and a staff-suggested workaround using the Python client. (Sep 2025) (Hugging Face Forums)
- Other embedding model (bge-large-en-v1.5) timing out on the same router path; staff applied a fix. (Sep 2025) (Hugging Face Forums)
- General HF inference 503/504 reports on serverless inference. (Mar–Sep 2025) (Hugging Face Forums)
- Discussion of gateway timeout mechanics and the difference between cold-start waits and gateway caps. (Hugging Face Forums)
- Platform transition notes: the legacy api-inference.huggingface.co/models/... path is being deprecated in favor of the router; mis-targeting endpoints is a recurring source of breakage. (Hugging Face Forums)
Good docs / guides / references (to implement fixes)
Official documentation
- HF Inference (serverless) overview + examples. (Hugging Face)
- Feature Extraction task docs (embeddings usage examples). (Hugging Face)
- huggingface_hub.InferenceClient reference. (Hugging Face)
- Model pinned “API URL moved” (use /pipeline/...). (Hugging Face)
- Model card truncation note (256 word pieces). (Hugging Face)
- TEI on Inference Endpoints (dynamic batching, optimized inference). (Hugging Face)
- Inference Providers pricing/billing (Free vs PRO, pay-as-you-go). (Hugging Face)
- Hub rate-limits behavior (429, headers, tiers) (useful to distinguish from 504). (Hugging Face)
Community / issues (useful for operational context)
- Timeout/availability threads showing intermittent 504s and fixes. (Hugging Face Forums)
- Recent router outage / errors reported on GitHub issues (not specific to your model but relevant to router stability). (GitHub)
Recommended “next actions” checklist
- Switch to the correct pipeline URL (/pipeline/sentence-similarity or /pipeline/feature-extraction). (Hugging Face)
- Collapse per-document calls into:
  - one sentence-similarity request, or
  - one batched embeddings request + local cosine similarity
- Chunk inputs to fit the model’s practical length constraints. (Hugging Face)
- Implement bounded retries + jitter and a circuit breaker.
- If reliability is a hard requirement: move embeddings to a dedicated endpoint using TEI. (Hugging Face)