Hmm… Trying to do that with embedding seems to make it harder…?
Below is what I would do for your exact goal: “JSON with a variable-length list of entities → fixed-length representation → an MLP probe recovers per-entity (azimuth_deg, elevation_deg) with high fidelity.”
Core recommendation for your case
Treat this as a set-to-slots problem, not “text embedding”
- Your input is a set of structured records (category + two angles), not natural language.
- Your output needs per-entity recoverability, which is fundamentally different from “semantic similarity” embeddings.
So the best default is:
- Parse JSON into typed records (category, az, el); see the parsing sketch below
- Build per-entity numeric-first tokens (angles stay numeric)
- Use a set/slot encoder that outputs a fixed number of slots K
- Train with Hungarian matching (DETR-style) so “slot ↔ entity” alignment is stable
- Freeze the encoder; train your MLP probe per slot
This training pattern (“fixed number of predictions vs variable ground truth, matched by bipartite assignment”) is standard in set prediction and is one of the main ideas behind DETR. (arXiv)
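A minimal parsing sketch, assuming the JSON is an array of objects with category / azimuth_deg / elevation_deg fields (adjust the field names to your actual schema); it also does the angle canonicalization mentioned in the implementation order at the end:

```python
import json
from dataclasses import dataclass

@dataclass
class Entity:
    category: str
    az: float  # azimuth in degrees
    el: float  # elevation in degrees

def canonicalize_deg(a: float) -> float:
    """Wrap an angle into [-180, 180)."""
    return (a + 180.0) % 360.0 - 180.0

def parse_entities(json_str: str) -> list[Entity]:
    """Parse one JSON sample into typed records; angles stay as floats end to end."""
    records = json.loads(json_str)  # assumed schema: a JSON array of entity objects
    return [
        Entity(
            category=r["category"],
            az=canonicalize_deg(float(r["azimuth_deg"])),
            el=canonicalize_deg(float(r["elevation_deg"])),
        )
        for r in records
    ]
```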
Why this fits your constraints
1) Arbitrary input length
Set/slot encoders consume N entity tokens and produce K slots (fixed shape), so N can vary.
2) Numerical fidelity
You avoid “numbers-as-text” entirely (or keep text only for the category label), so floats are never degraded by tokenization. Tokenization pitfalls for floating point numbers are well documented—e.g., a float like "3.14159" can be split into multiple chunks by common LLM tokenizers, which is hostile to exact numeric recovery. (arXiv)
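If you want to see this for yourself, you can inspect how a standard subword tokenizer splits float strings (a sketch using the Hugging Face transformers library; the exact split depends on the tokenizer):

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer as a representative subword tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

for s in ["3.14159", "-125.5", "-125.50"]:
    pieces = tok.tokenize(s)
    # A single float is usually split into several subword pieces, and trivially
    # different formats of the same value produce different token sequences.
    print(s, "->", pieces)
```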
3) Fixed output size
You choose a capacity K and always output K×d (or flatten to K·d).
Important reality check: a finite fixed-size vector cannot losslessly encode an unbounded number of entities. In practice, you pick K large enough for your dataset and define an overflow policy (truncate, sample, or increase K). This is the same practical compromise used by set prediction models. (arXiv)
Q1) Encoders / “LLMs” you can use (ranked for your use case)
A. Best first choice: Set Transformer with PMA(K) (recommended)
What it is: A Transformer architecture designed for sets; it includes a pooling module (PMA) that uses K learnable seed vectors to produce exactly K outputs (your slots). (Proceedings of Machine Learning Research)
Why it’s a good fit
- Permutation handling is built-in (set input, order-agnostic)
- Produces exactly K outputs cleanly via PMA
- Easy to combine with a DETR-style matching loss
Good implementation starting point
- Official PyTorch implementation: (GitHub)
When it’s enough
- If typical N is up to a few hundred, this is usually fine.
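A minimal PMA-style pooling sketch in PyTorch (K learnable seed vectors cross-attend over the N entity tokens; this follows the idea in the paper, not the official code, and all names/dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PMA(nn.Module):
    """Pooling by Multihead Attention: K learnable seeds attend over N input tokens."""
    def __init__(self, d_model: int, num_slots: int, num_heads: int = 4):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, key_padding_mask=None) -> torch.Tensor:
        # x: (B, N, d) entity tokens; N may vary per sample (pad + mask within a batch)
        B = x.shape[0]
        q = self.seeds.unsqueeze(0).expand(B, -1, -1)                 # (B, K, d)
        slots, _ = self.attn(q, x, x, key_padding_mask=key_padding_mask)
        return slots                                                   # (B, K, d), fixed shape
```

Stacking a couple of standard self-attention blocks over the entity tokens before PMA, and over the slots after it, gets you close to the full Set Transformer.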
B. If N can be very large: Perceiver IO
What it is: A cross-attention architecture that scales linearly with input size and supports structured outputs via queries—i.e., you can decode K slots efficiently even when N is big. (arXiv)
When to choose it
- If N is frequently in the hundreds to thousands, Perceiver IO often scales better than full self-attention over inputs.
C. Strong baseline: Deep Sets
What it is: The canonical “set function” form: per-element encoder + permutation-invariant pooling. (arXiv)
Why it’s useful
- It’s the simplest sanity-check: if you can’t get good probe decoding with this, your numeric encoding/training pipeline likely has issues.
Limitation
- Vanilla pooling (sum/mean) tends to entangle entities; it’s better for “global properties of the set” than “recover each element”.
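A Deep Sets baseline is only a few lines (per-element MLP, masked mean pooling, set-level MLP); a sketch assuming padded inputs and a mask:

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """phi per element -> permutation-invariant mean pool -> rho on the pooled vector."""
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))
        self.rho = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_out))

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_in) padded entity features; mask: (B, N) with 1.0 for real entities
        h = self.phi(x) * mask.unsqueeze(-1)                  # zero out padded positions
        pooled = h.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)
        return self.rho(pooled)                               # (B, d_out) global set vector
```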
D. Slot-binding alternative: Slot Attention
Not in the citations above, but conceptually: Slot Attention iteratively binds inputs to K slots. It can produce very clean per-slot “object-like” representations, but it often takes more tuning than Set Transformer.
E. Numeric embedding modules you can (and should) borrow
Regardless of the set encoder, you should embed numeric scalars into higher-dimensional vectors before mixing.
A widely used reference is “On Embeddings for Numerical Features in Tabular Deep Learning” (periodic and piecewise-linear numeric embeddings) and its official implementation repo. (arXiv)
This is directly relevant because your “angles must be recoverable” requirement is basically “my numeric scalars shouldn’t get washed out by the backbone.”
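A sketch of a periodic numeric embedding in the spirit of that line of work (learned frequencies per scalar feature; dimensions and the sigma initialization are illustrative):

```python
import math
import torch
import torch.nn as nn

class PeriodicEmbedding(nn.Module):
    """Embed each scalar as [sin(2*pi*f_i*x), cos(2*pi*f_i*x)] with learned frequencies f_i."""
    def __init__(self, n_features: int, n_frequencies: int = 8, sigma: float = 1.0):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(n_features, n_frequencies) * sigma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, n_features) scalars -> (B, n_features, 2 * n_frequencies)
        angles = 2 * math.pi * self.freq.unsqueeze(0) * x.unsqueeze(-1)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```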
F. If you insist on an “LLM-like” encoder end-to-end
This is rarely the best path for your goal, but if you must:
- Tokenization-free / byte/char models
  - ByT5 processes raw bytes (no subword vocabulary). (arXiv)
  - CANINE processes characters without explicit tokenization/vocabulary. (arXiv)
  These avoid subword fragmentation, but you still need task-specific training to make the representation numerically decodable.
- Numeric tokenization research (if you’re training from scratch)
  - xVal proposes continuous numerical tokenization for numerically dense datasets. (arXiv)
  This is relevant if you’re building a foundation-model-like pipeline, not if you just want a practical probe-friendly encoder quickly.
Why not CLIP for this
- Standard CLIP uses a fixed text context length (commonly 77 tokens), and this is a known limitation in practice (errors when exceeding it, and model constraints around context_length). (GitHub)
- There is active work on extending CLIP beyond this limit (e.g., TULIP and other approaches), but that still doesn’t address the deeper issue: CLIP-style text embeddings are not trained to preserve precise numeric values. (ICLR Proceedings)
Q2) Should you convert JSON to natural language?
Recommendation: No, not for “MLP decodes floats”
Converting structured records into prose tends to:
- introduce formatting variability (“-125.5 degrees” vs “-125.50°”)
- reintroduce tokenizer issues for numbers
- encourage the encoder to focus on semantics rather than numeric exactness
If your success metric is “recover precise angles,” keep angles numeric and only embed categories as text if you truly need open-vocabulary behavior.
What to do instead: structured → tokens (numeric-first)
A practical representation per entity keeps the angles numeric (sin/cos features, optionally with harmonics) and embeds the category as a learned vector; see the sketch below.
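A minimal sketch of such a per-entity token builder, assuming PyTorch, a small closed category vocabulary, and illustrative dimensions:

```python
import math
import torch
import torch.nn as nn

class EntityTokenizer(nn.Module):
    """One token per entity: learned category embedding + numeric angle features."""
    def __init__(self, n_categories: int, d_cat: int = 64, d_model: int = 256,
                 n_harmonics: int = 4):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, d_cat)
        d_num = 4 * n_harmonics                # sin/cos of az and el at each harmonic
        self.proj = nn.Sequential(nn.Linear(d_cat + d_num, d_model), nn.ReLU(),
                                  nn.Linear(d_model, d_model))
        self.n_harmonics = n_harmonics

    def angle_features(self, deg: torch.Tensor) -> torch.Tensor:
        # deg: (B, N) angles in degrees -> sin/cos at harmonics 2^0 .. 2^(H-1)
        rad = deg * math.pi / 180.0
        ks = 2.0 ** torch.arange(self.n_harmonics, device=deg.device)
        a = rad.unsqueeze(-1) * ks             # (B, N, H)
        return torch.cat([torch.sin(a), torch.cos(a)], dim=-1)  # (B, N, 2H)

    def forward(self, category_ids, az_deg, el_deg):
        # category_ids: (B, N) int; az_deg, el_deg: (B, N) float -> (B, N, d_model)
        feats = torch.cat([self.cat_emb(category_ids),
                           self.angle_features(az_deg),
                           self.angle_features(el_deg)], dim=-1)
        return self.proj(feats)
```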
Why harmonics can help: Fourier feature mappings are known to help MLPs learn higher-frequency structure (reducing “spectral bias”). (arXiv)
If you must serialize to text (only for compatibility)
Use a two-stream approach:
- Text stream: stable placeholders, e.g. {"category":"chair","az":<NUM0>,"el":<NUM1>}
- Numeric stream: the real floats, e.g. [-125.5, -37.2, ...]
Then fuse them (concat + MLP) before the set encoder. This prevents your numeric fidelity from depending on the text tokenizer.
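A sketch of the placeholder-substitution step, re-using the Entity records from the parsing sketch above (the <NUM0>, <NUM1> markers follow the example; the exact marker format is up to you):

```python
def to_two_streams(entities):
    """Serialize entities with numeric placeholders plus a parallel list of real floats."""
    parts, numbers = [], []
    for e in entities:
        i, j = len(numbers), len(numbers) + 1
        parts.append(f'{{"category":"{e.category}","az":<NUM{i}>,"el":<NUM{j}>}}')
        numbers.extend([e.az, e.el])
    text = "[" + ",".join(parts) + "]"   # this string goes through the text tokenizer
    return text, numbers                  # the floats bypass tokenization entirely
```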
Q3) How to prevent subword tokenizers from breaking floats
The most reliable solution: don’t tokenize floats
Parse JSON → keep numbers as floats → embed numerically.
This completely avoids the class of problems illustrated by common tokenizers splitting floating point strings into multiple pieces. (arXiv)
If you must go through a tokenizer anyway
Use one of these, in descending order of reliability:
- Tokenization-free encoders (byte/char): ByT5, CANINE (arXiv)
- Placeholder substitution + numeric side-channel (best practical workaround with standard LLM tokenizers)
- Quantize angles to fixed-point integers (e.g., tenths of a degree) so the model learns an easier discrete mapping (see the sketch after this list)
- Train a custom tokenizer/vocab where numbers (or digit chunks) are handled consistently (only if you control the full training stack)
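For example, quantizing to tenths of a degree turns each angle into a small integer that can be given its own tokens or embedding entries (a sketch; the 0.1° step is just an illustration):

```python
def quantize_deg(angle_deg: float, step: float = 0.1) -> int:
    """Map an angle in [-180, 180) to an integer bin index at the given resolution."""
    return int(round((angle_deg + 180.0) / step))

def dequantize_deg(idx: int, step: float = 0.1) -> float:
    """Invert the quantization back to degrees (up to the quantization error)."""
    return idx * step - 180.0
```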
A concrete configuration I would start with (defaults)
If you want the simplest “works-first” version
- K = 32, d = 256
- Per-entity token:
  - category embedding (e.g., d_cat=64)
  - angles: [sin(az), cos(az), sin(el), cos(el)]
  - MLP → d
- Set encoder: Set Transformer + PMA(K) (Proceedings of Machine Learning Research)
- Training: DETR-style Hungarian matching + no-object class (arXiv); see the matching sketch below
- Freeze encoder; probe MLP predicts sin/cos, decode with atan2
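A sketch of the matching and decoding pieces, assuming each slot's head outputs [sin(az), cos(az), sin(el), cos(el)] (plus a separate "object present" logit, omitted here); SciPy's linear_sum_assignment does the bipartite matching:

```python
import torch
from scipy.optimize import linear_sum_assignment

def angle_cost(pred_sincos, gt_deg):
    """Pairwise L2 cost between K predicted (sin, cos) pairs and M ground-truth angles."""
    gt_rad = gt_deg * torch.pi / 180.0
    gt_sincos = torch.stack([gt_rad.sin(), gt_rad.cos()], dim=-1)   # (M, 2)
    diff = pred_sincos.unsqueeze(1) - gt_sincos.unsqueeze(0)        # (K, M, 2)
    return diff.norm(dim=-1)                                        # (K, M)

def match_slots(pred, gt_az, gt_el):
    # pred: (K, 4) = [sin_az, cos_az, sin_el, cos_el]; gt_az, gt_el: (M,)
    # A category term can be added to the cost exactly as DETR adds a class term.
    cost = angle_cost(pred[:, 0:2], gt_az) + angle_cost(pred[:, 2:4], gt_el)
    slot_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return slot_idx, gt_idx   # supervise regression + "object present" on matched slots

def decode_angles(pred):
    """Probe output -> degrees via atan2 (used at evaluation time)."""
    az = torch.atan2(pred[:, 0], pred[:, 1]) * 180.0 / torch.pi
    el = torch.atan2(pred[:, 2], pred[:, 3]) * 180.0 / torch.pi
    return az, el
```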
If you need tighter numeric fidelity
Add harmonics to the input numeric features (e.g., sin(2^kθ), cos(2^kθ) for k=0..3) and/or use numeric embedding modules from the “numerical features embeddings” line of work. (arXiv)
If you expect large N
Swap Set Transformer → Perceiver IO. (arXiv)
Practical online resources to implement quickly
- Set Transformer paper + official implementation (PMA(K) → fixed slots) (Proceedings of Machine Learning Research)
- DETR paper + matcher implementation (Hungarian set matching; adapt cost to angles) (arXiv)
- Numeric scalar embeddings (periodic + piecewise-linear modules; official repo) (NeurIPS Proceedings)
- Tokenization-free encoders if you insist on string-first: ByT5, CANINE (arXiv)
- Why floats in text break (concrete tokenization examples) (arXiv)
What I would do next (implementation order)
- Implement parser + angle canonicalization
- Implement per-entity numeric-first token builder (sin/cos)
- Implement Set Transformer PMA(K)
- Implement Hungarian matching loss (angle distance + optional category cost)
- Overfit a tiny dataset to verify near-zero error
- Train normally; freeze encoder; train probe