If the error is 504, it probably doesn’t mean you’ve hit an account-level limit. That said, the free serverless endpoints are essentially published for demos and experimentation, so you shouldn’t expect production-grade stability from them.
If you’re integrating this into a product, I recommend a dedicated Inference Endpoint or hosting the model yourself with TEI.
What a 504 means on router.huggingface.co/hf-inference/...
A 504 Gateway Time-out from router.huggingface.co usually means:
- the router (gateway) waited too long for an upstream backend to produce a response, then gave up; and
- your client-side timeout settings generally cannot override a server-side gateway cutoff.
A Hugging Face forum deep-dive on timeouts describes this pattern as a gateway/proxy cap, distinct from a normal application error body. (Hugging Face Forums)
This fits your symptoms (30–60s “hang” → 504) better than classic rate limiting does.
Important: make sure you are using the “pipeline” URL for this model
For sentence-transformers/all-MiniLM-L6-v2, the model maintainers pinned a notice that the inference URL moved to:
.../pipeline/feature-extraction (embeddings)
.../pipeline/sentence-similarity (similarity)
…and they provide curl examples. (Hugging Face)
If you are calling:
https://huggingface.co/static-proxy/router.huggingface.co/hf-inference/models/sentence-transformers/all-MiniLM-L6-v2
without /pipeline/..., fix that first. It won’t solve all 504s, but it eliminates a common source of “it used to work quickly, then became flaky”.
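For reference, a minimal call to the pipeline URL might look like the sketch below (assuming the standard `{"inputs": ...}` payload for the feature-extraction task and a token exported as HF_TOKEN):

```python
import os

import requests

# Pipeline-style URL from the pinned model discussion (feature-extraction = embeddings).
API_URL = (
    "https://huggingface.co/static-proxy/router.huggingface.co/hf-inference/models/"
    "sentence-transformers/all-MiniLM-L6-v2/pipeline/feature-extraction"
)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}  # assumes HF_TOKEN is set

resp = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": ["A short test sentence."]},  # standard feature-extraction payload
    timeout=30,
)
resp.raise_for_status()
embeddings = resp.json()  # one embedding vector per input string
```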
Answers to your questions
1) Are there rate limits or throttling on this endpoint?
Yes, there are rate limits in the Hugging Face ecosystem, and hitting them is normally expressed as HTTP 429 Too Many Requests with RateLimit* headers (5-minute windows, tiers by plan). (Hugging Face)
However:
- Rate limiting ≠ your current symptom. You’re seeing 504 after long waits, which typically indicates queueing / backend load / router timeouts, not a clean “you are over quota” response.
- Inference via “HF Inference” is serverless and shared; it’s documented as a serverless service (formerly “Inference API (serverless)”). (Hugging Face)
If you want to confirm whether any of your failures are rate-limit related, log the status codes: if you never see 429, rate limits are probably not the primary cause.
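A small logging wrapper makes that easy to verify. This is just a sketch; post_with_logging is an illustrative helper name, not an HF API:

```python
import logging

import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hf-embeddings")

def post_with_logging(session: requests.Session, url: str, headers: dict, payload: dict):
    """POST once and log the status code plus any RateLimit* headers,
    so 429 (quota) and 504 (gateway timeout) failures are distinguishable in your logs."""
    resp = session.post(url, headers=headers, json=payload, timeout=60)
    ratelimit_headers = {k: v for k, v in resp.headers.items() if k.lower().startswith("ratelimit")}
    log.info("status=%s ratelimit=%s", resp.status_code, ratelimit_headers)
    if resp.status_code == 429:
        log.warning("429 Too Many Requests: you really are rate limited")
    elif resp.status_code == 504:
        log.warning("504 Gateway Time-out: upstream too slow, not a quota problem")
    return resp
```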
2) Heavy load or cold starts?
Both are plausible on serverless:
- HF Inference is serverless and focuses mostly on CPU inference use-cases like embeddings / classification. (Hugging Face)
- Serverless systems can exhibit cold starts and/or capacity contention, which shows up to clients as timeouts.
There are multiple public reports of intermittent 504s on HF serverless / router paths, sometimes acknowledged as fixed by staff after reports (suggesting operational issues, not client misuse). (Hugging Face Forums)
Also note: the official status page can show “Operational” even while particular models/providers have trouble. The status page currently shows operational and “no incidents reported” for recent months (as of Feb 2, 2026). (Hugging Face Status)
3) Best practices for higher reliability
A. Reduce calls: use the sentence-similarity pipeline when you can
Instead of many “pairwise” requests, send:
- one source_sentence
- many other_sentences
in a single request to .../pipeline/sentence-similarity. The pinned model discussion explicitly mentions this endpoint. (Hugging Face)
A staff reply in a real outage thread shows using the Python client’s sentence_similarity for this model. (Hugging Face Forums)
This is often the single biggest reliability improvement because it reduces:
- request count
- per-request overhead
- total time spent waiting in upstream queues
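Here is a minimal sketch of that consolidated call, using the official client’s sentence_similarity method mentioned above (it assumes you are authenticated, e.g. via HF_TOKEN or huggingface-cli login; the sentences are placeholders):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="sentence-transformers/all-MiniLM-L6-v2")

# One request scores the source sentence against all candidates at once.
scores = client.sentence_similarity(
    "How do I reset my password?",            # source_sentence
    other_sentences=[
        "Steps to change your account password",
        "Shipping times for international orders",
        "Recovering access to a locked account",
    ],
)
print(scores)  # one similarity score per candidate
```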
B. Batch embeddings (if you must embed) and cache aggressively
If you embed chunks, send lists of texts per request (batching), and cache embeddings by (model_id, text_hash) so you don’t recompute.
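A rough sketch of that caching pattern (the embed_batch callable and the in-memory dict are placeholders; in production you would back the cache with Redis, SQLite, or similar):

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Cache keyed by (model_id, sha256(text)) so identical chunks are never re-embedded.
_cache: Dict[Tuple[str, str], List[float]] = {}

def _key(model_id: str, text: str) -> Tuple[str, str]:
    return model_id, hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_with_cache(
    texts: List[str],
    model_id: str,
    embed_batch: Callable[[List[str]], List[List[float]]],  # your batched API call
) -> List[List[float]]:
    # Deduplicate cache misses, preserving order, then embed them in one batched request.
    missing = list(dict.fromkeys(t for t in texts if _key(model_id, t) not in _cache))
    if missing:
        for text, vector in zip(missing, embed_batch(missing)):
            _cache[_key(model_id, text)] = vector
    return [_cache[_key(model_id, t)] for t in texts]
```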
C. Keep inputs short and chunk long documents
The model card states: inputs longer than 256 word pieces are truncated by default. (Hugging Face)
There is also a model discussion about this exact behavior and the need to split into meaningful parts. (Hugging Face)
If you send long document text repeatedly, you increase latency and cost while potentially not improving embedding quality past the truncation point.
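A simple word-based chunker is usually a good first approximation. Note this is a sketch: word pieces are not the same as words, so the limits below are deliberately conservative; for exact control you could tokenize with the model’s own tokenizer instead.

```python
from typing import List

def chunk_text(text: str, max_words: int = 180, overlap: int = 30) -> List[str]:
    """Split text into overlapping word windows that stay safely under
    the model's 256-word-piece truncation limit."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks
```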
D. Connection management + backoff (avoid “retry storms”)
- Use a single requests.Session() with connection pooling/keep-alive.
- Use bounded retries with exponential backoff + jitter.
- Add a circuit breaker: if the 504 rate spikes, stop hammering the endpoint for a short cooldown.
A forum post about repeated gateway timeouts notes that these are often intermittent and sometimes resolve after platform-side fixes—aggressive retries can worsen congestion. (Hugging Face Forums)
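A sketch of all three practices in one place (the retry limits, cooldown, and failure threshold are illustrative values, not anything Hugging Face prescribes):

```python
import random
import time

import requests

session = requests.Session()  # connection pooling + keep-alive across requests

_failures = 0            # consecutive retryable failures seen
_circuit_open_until = 0.0

def post_with_backoff(url: str, headers: dict, payload: dict, max_retries: int = 4):
    """Bounded retries with exponential backoff + jitter, plus a crude circuit breaker."""
    global _failures, _circuit_open_until
    if time.time() < _circuit_open_until:
        raise RuntimeError("circuit open: cooling down after repeated timeouts/5xx")
    for attempt in range(max_retries):
        try:
            resp = session.post(url, headers=headers, json=payload, timeout=60)
            if resp.status_code != 429 and resp.status_code < 500:
                _failures = 0
                return resp  # success, or a non-retryable 4xx the caller should inspect
        except requests.RequestException:
            pass  # network errors are treated like retryable failures
        _failures += 1
        if _failures >= 5:
            _circuit_open_until = time.time() + 120  # 2-minute cooldown
            raise RuntimeError("too many consecutive failures: opening circuit")
        time.sleep(min(2 ** attempt, 30) + random.uniform(0, 1))  # backoff + jitter
    raise RuntimeError(f"giving up after {max_retries} attempts")
```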
E. Prefer the official client when possible
huggingface_hub.InferenceClient is designed to work across:
- the (free) Inference API,
- Inference Endpoints,
- third-party Inference Providers. (Hugging Face)
In practice, it also reduces “URL drift” problems (wrong base URL / wrong task path) during platform transitions. (Hugging Face Forums)
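The same client code also moves with you if you later switch from serverless to a dedicated endpoint, since the model argument accepts either a Hub model id or an endpoint URL (the endpoint URL below is a placeholder):

```python
from huggingface_hub import InferenceClient

# Serverless HF Inference: address the model by its Hub id.
serverless = InferenceClient(model="sentence-transformers/all-MiniLM-L6-v2")

# Dedicated Inference Endpoint: pass your endpoint URL instead (placeholder shown).
dedicated = InferenceClient(model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud")

vector = serverless.feature_extraction("Minimal example sentence for an embedding.")
```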
4) Would paid or dedicated endpoints avoid these timeouts?
Paid (PRO) helps with billing/quotas; it does not inherently guarantee the serverless router won’t time out.
- Inference Providers pricing shows Free vs PRO monthly credits and pay-as-you-go eligibility. (Hugging Face)
- But 504 reports exist even when users suspect plan-related issues, and the more direct fix in several threads is “HF staff applied a fix.” (Hugging Face Forums)
If you need production-grade reliability, the typical step is:
Move to dedicated Inference Endpoints (recommended)
HF Inference docs explicitly distinguish HF Inference (serverless) from Inference Endpoints (dedicated + autoscaling). (Hugging Face)
For embeddings specifically: use Text Embeddings Inference (TEI) on a dedicated endpoint
TEI is designed for embeddings workloads and includes token-based dynamic batching to improve throughput and reduce tail latency under load. (Hugging Face)
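If you go that route, the calling code stays simple. A hedged sketch against TEI’s /embed route on your own deployment (both the endpoint URL and the auth token are placeholders):

```python
import os

import requests

# Placeholder URL for your own dedicated endpoint running TEI.
TEI_URL = "https://YOUR-TEI-ENDPOINT.endpoints.huggingface.cloud/embed"

resp = requests.post(
    TEI_URL,
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},  # whatever secures your endpoint
    json={"inputs": ["first chunk of a document", "second chunk of a document"]},
    timeout=30,
)
resp.raise_for_status()
vectors = resp.json()  # one embedding vector per input string
```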
Why your specific workload pattern tends to trigger 504s
You described:
- “Multiple similarity requests per document”
- “Sequential HTTPS POST”
- “Moderate load”
This often creates an “amplification” effect:
- Each document triggers many API calls.
- Under concurrency (multiple docs in flight), calls accumulate faster than they complete.
- The shared serverless backend queues up work.
- The router hits its max wait and returns 504.
If you switch to:
- one sentence-similarity call per document (or per chunk),
- or one batched embeddings call per N chunks,
you usually reduce request counts by 10×–100×, which directly lowers timeout risk.
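The batched-embeddings variant only needs a few lines of local math afterwards; a sketch with NumPy (cosine_similarities is an illustrative helper, not an HF API):

```python
import numpy as np

def cosine_similarities(source_vec, candidate_vecs):
    """Score one source embedding against many candidate embeddings locally,
    so computing similarity costs no extra API calls."""
    a = np.asarray(source_vec, dtype=np.float32)
    b = np.asarray(candidate_vecs, dtype=np.float32)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a  # one cosine score per candidate
```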
Similar cases/issues online (high-signal)
- Same model: intermittent 504 / multi-minute delay, and a staff-suggested workaround using the Python client. (Sep 2025) (Hugging Face Forums)
- Other embedding model (bge-large-en-v1.5) timing out on the same router path; staff applied a fix. (Sep 2025) (Hugging Face Forums)
- General HF inference 503/504 reports on serverless inference. (Mar–Sep 2025) (Hugging Face Forums)
- Discussion of gateway timeout mechanics and the difference between cold-start waits and gateway caps. (Hugging Face Forums)
- Platform transition notes: the legacy api-inference.huggingface.co/models/... path is being deprecated in favor of the router; mis-targeting endpoints is a recurring source of breakage. (Hugging Face Forums)
Good docs / guides / references (to implement fixes)
Official documentation
- HF Inference (serverless) overview + examples. (Hugging Face)
- Feature Extraction task docs (embeddings usage examples). (Hugging Face)
- huggingface_hub.InferenceClient reference. (Hugging Face)
- Model pinned “API URL moved” (use /pipeline/...). (Hugging Face)
- Model card truncation note (256 word pieces). (Hugging Face)
- TEI on Inference Endpoints (dynamic batching, optimized inference). (Hugging Face)
- Inference Providers pricing/billing (Free vs PRO, pay-as-you-go). (Hugging Face)
- Hub rate-limits behavior (429, headers, tiers) (useful to distinguish from 504). (Hugging Face)
Community / issues (useful for operational context)
- Timeout/availability threads showing intermittent 504s and fixes. (Hugging Face Forums)
- Recent router outage / errors reported on GitHub issues (not specific to your model but relevant to router stability). (GitHub)
Recommended “next actions” checklist
- Switch to the correct pipeline URL (/pipeline/sentence-similarity or /pipeline/feature-extraction). (Hugging Face)
- Collapse per-document calls into:
  - one sentence-similarity request, or
  - one batched embeddings request + local cosine similarity
- Chunk inputs to fit the model’s practical length constraints. (Hugging Face)
- Implement bounded retries + jitter and a circuit breaker.
- If reliability is a hard requirement: move embeddings to a dedicated endpoint using TEI. (Hugging Face)