Structured JSON (numerical data) into fixed-length vectors for MLP probing

Problem: I need to encode JSON strings (containing object classes with their center positions) into fixed-length vectors. The goal is to use a downstream MLP to accurately recover the azimuth and elevation of each entity from the vector.

JSON Data example:

{
  "entities": [
    {"category": "chair", "centre": {"azimuth_deg": -125.5, "elevation_deg": -37.2}},
    {"category": "cup", "centre": {"azimuth_deg": 110.3, "elevation_deg": -29.9}}
  ]
}

I’ve tried using CLIP, but it struggles to preserve high-fidelity numerical information. Also, its 77-token context limit is too restrictive for my data.

Requirements for my task:

  1. Arbitrary Input Length: Must handle varying JSON sizes.

  2. Numerical Fidelity: The latent space must represent numbers precisely, not just semantically.

  3. Fixed Output: Must result in a consistent vector size for MLP probing.

Questions:

  1. Any recommended encoders or LLMs I could use?
  2. Should I convert the JSON into natural language (e.g., “A chair is at -125.5 degrees…”) before encoding, or is there a better way to embed raw structured data?
  3. How do I prevent subword tokenizers from breaking my floating-point numbers into meaningless fragments?

Hmm… Trying to do that with embeddings seems like it would make this harder…?


Below is what I would do for your exact goal: “JSON with a variable-length list of entities → fixed-length representation → an MLP probe recovers per-entity (azimuth_deg, elevation_deg) with high fidelity.”


Core recommendation for your case

Treat this as a set-to-slots problem, not “text embedding”

  • Your input is a set of structured records (category + two angles), not natural language.
  • Your output needs per-entity recoverability, which is fundamentally different from “semantic similarity” embeddings.

So the best default is:

  1. Parse JSON into typed records (category, az, el)
  2. Build per-entity numeric-first tokens (angles stay numeric)
  3. Use a set/slot encoder that outputs a fixed number of slots K
  4. Train with Hungarian matching (DETR-style) so “slot ↔ entity” alignment is stable
  5. Freeze the encoder; train your MLP probe per slot

This training pattern (“fixed number of predictions vs variable ground truth, matched by bipartite assignment”) is standard in set prediction and is one of the main ideas behind DETR. (arXiv)
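
For concreteness, here is a minimal sketch of what that matching step can look like, assuming each of K slots predicts a [sin(az), cos(az), sin(el), cos(el)] vector and the targets use the same layout; the function names are illustrative, not from DETR's codebase (a full DETR-style loss would also add a "no-object" classification term for unmatched slots):

```python
# Minimal DETR-style matching sketch (illustrative names, single example).
# preds: (K, 4) slot predictions, targets: (N, 4) ground truth, N <= K,
# both laid out as [sin(az), cos(az), sin(el), cos(el)].
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_slots_to_entities(preds: torch.Tensor, targets: torch.Tensor):
    cost = torch.cdist(preds, targets, p=2)                  # (K, N) pairwise cost
    slot_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return slot_idx, tgt_idx                                 # optimal one-to-one assignment

def matching_loss(preds: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    slot_idx, tgt_idx = match_slots_to_entities(preds, targets)
    # Regress only the matched slots; a full DETR-style loss would also push
    # the unmatched slots toward a "no-object" class.
    return F.l1_loss(preds[slot_idx], targets[tgt_idx])
```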


Why this fits your constraints

1) Arbitrary input length

Set/slot encoders consume N entity tokens and produce K slots (fixed shape), so N can vary.

2) Numerical fidelity

You avoid “numbers-as-text” entirely (or keep text only for the category label), so floats are never degraded by tokenization. Tokenization pitfalls for floating point numbers are well documented—e.g., a float like "3.14159" can be split into multiple chunks by common LLM tokenizers, which is hostile to exact numeric recovery. (arXiv)

3) Fixed output size

You choose a capacity K and always output K×d (or flatten to K·d).

Important reality check: a finite fixed-size vector cannot losslessly encode an unbounded number of entities. In practice, you pick K large enough for your dataset and define an overflow policy (truncate, sample, or increase K). This is the same practical compromise used by set prediction models. (arXiv)


Q1) Encoders / “LLMs” you can use (ranked for your use case)

A. Best first choice: Set Transformer with PMA(K) (recommended)

What it is: A Transformer architecture designed for sets; it includes a pooling module (PMA) that uses K learnable seed vectors to produce exactly K outputs (your slots). (Proceedings of Machine Learning Research)

Why it’s a good fit

  • Permutation handling is built-in (set input, order-agnostic)
  • Produces fixed K outputs cleanly via PMA
  • Easy to combine with a DETR-style matching loss

Good implementation starting point

  • Official PyTorch implementation: (GitHub)

When it’s enough

  • If typical N is up to a few hundred, this is usually fine.
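
If you want to see the shape of the idea before pulling in the official repo, here is a rough PMA-style pooling block in PyTorch; it uses nn.MultiheadAttention instead of the paper's exact MAB/ISAB blocks, so treat it as an approximation rather than the reference implementation:

```python
import torch
import torch.nn as nn

class PMAPooling(nn.Module):
    """Pooling by Multihead Attention (sketch): K learnable seed vectors attend
    over the N entity tokens, so the output is always (B, K, d) regardless of N."""
    def __init__(self, d_model: int = 256, num_slots: int = 32, num_heads: int = 4):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, padding_mask: torch.Tensor = None):
        # tokens: (B, N, d_model); padding_mask: (B, N), True where padded
        queries = self.seeds.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        slots, _ = self.attn(queries, tokens, tokens, key_padding_mask=padding_mask)
        return slots                                         # (B, K, d_model)
```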

B. If N can be very large: Perceiver IO

What it is: A cross-attention architecture that scales linearly with input size and supports structured outputs via queries—i.e., you can decode K slots efficiently even when N is big. (arXiv)

When to choose it

  • If N is frequently in the hundreds to thousands, Perceiver IO often scales better than full self-attention over inputs.
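
The pattern, very roughly (a simplified stand-in, not the official Perceiver IO code): cross-attend the N inputs into a small latent array, process the latents with self-attention, then decode exactly K slots with learned output queries.

```python
import torch
import torch.nn as nn

class TinyPerceiverIOStyle(nn.Module):
    """Sketch of the encode-process-decode pattern: compute scales with the
    latent/query counts, not with N, so large input sets stay affordable."""
    def __init__(self, d=256, num_latents=64, num_slots=32, heads=4, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d) * 0.02)
        self.out_queries = nn.Parameter(torch.randn(num_slots, d) * 0.02)
        self.encode = nn.MultiheadAttention(d, heads, batch_first=True)
        self.process = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), depth)
        self.decode = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, tokens):                               # tokens: (B, N, d)
        B = tokens.shape[0]
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)
        lat, _ = self.encode(lat, tokens, tokens)            # cost ~ O(N * num_latents)
        lat = self.process(lat)                              # self-attention over latents only
        q = self.out_queries.unsqueeze(0).expand(B, -1, -1)
        slots, _ = self.decode(q, lat, lat)                  # (B, K, d)
        return slots
```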

C. Strong baseline: Deep Sets

What it is: The canonical “set function” form: per-element encoder + permutation-invariant pooling. (arXiv)

Why it’s useful

  • It’s the simplest sanity-check: if you can’t get good probe decoding with this, your numeric encoding/training pipeline likely has issues.

Limitation

  • Vanilla pooling (sum/mean) tends to entangle entities; it’s better for “global properties of the set” than for “recovering each element”.
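
As a baseline, the whole architecture fits in a few lines (a sketch; sum pooling yields one fixed-size vector per set):

```python
import torch
import torch.nn as nn

class DeepSetsEncoder(nn.Module):
    """Deep Sets baseline: per-entity MLP (phi), permutation-invariant sum
    pooling, then a set-level MLP (rho). Useful as a sanity check; per-entity
    recovery is usually limited because pooling mixes entities together."""
    def __init__(self, in_dim: int, hidden: int = 256, out_dim: int = 256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x, mask=None):                # x: (B, N, in_dim)
        h = self.phi(x)                             # (B, N, hidden)
        if mask is not None:                        # mask: (B, N), 1 = real entity
            h = h * mask.unsqueeze(-1)
        return self.rho(h.sum(dim=1))               # (B, out_dim), fixed size
```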

D. Slot-binding alternative: Slot Attention

Not in the citations above, but conceptually: Slot Attention iteratively binds inputs to K slots. It can produce very clean per-slot “object-like” representations, but it often takes more tuning than Set Transformer.


E. Numeric embedding modules you can (and should) borrow

Regardless of the set encoder, you should embed numeric scalars into higher-dimensional vectors before mixing.

A widely used reference is “On Embeddings for Numerical Features in Tabular Deep Learning” (periodic and piecewise-linear numeric embeddings) and its official implementation repo. (arXiv)

This is directly relevant because your “angles must be recoverable” requirement is basically “my numeric scalars shouldn’t get washed out by the backbone.”
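
A rough approximation of the periodic-embedding idea, if you want something self-contained before adopting the official modules (the random-frequency initialization here is my simplification, not the paper's exact recipe):

```python
import math
import torch
import torch.nn as nn

class PeriodicScalarEmbedding(nn.Module):
    """Embeds a single scalar into 2 * num_freqs dims via learnable sin/cos
    frequencies, loosely following the periodic-embedding idea."""
    def __init__(self, num_freqs: int = 16, sigma: float = 1.0):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(num_freqs) * sigma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., 1) scalar feature; returns (..., 2 * num_freqs)
        angles = 2 * math.pi * x * self.freqs       # broadcasts over the last dim
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```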


F. If you insist on an “LLM-like” encoder end-to-end

This is rarely the best path for your goal, but if you must:

  1. Tokenization-free / byte/char models
  • ByT5 processes raw bytes (no subword vocabulary). (arXiv)
  • CANINE processes characters without explicit tokenization/vocabulary. (arXiv)

These avoid subword fragmentation, but you still need task-specific training to make the representation numerically decodable (a minimal usage sketch follows at the end of this subsection).

  2. Numeric tokenization research (if you’re training from scratch)
  • xVal proposes continuous numerical tokenization for numerically dense datasets. (arXiv)

This is relevant if you’re building a foundation-model-like pipeline, not if you just want a practical probe-friendly encoder quickly.
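
For option 1 above, here is roughly what a ByT5-based embedding looks like with Hugging Face transformers; the mean pooling at the end is my choice of readout, and the resulting vector still is not numerically decodable without task-specific training:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5EncoderModel.from_pretrained("google/byt5-small")

json_str = '{"entities": [{"category": "chair", "centre": {"azimuth_deg": -125.5, "elevation_deg": -37.2}}]}'
inputs = tokenizer(json_str, return_tensors="pt")   # byte-level, no subword splits

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, d)

# Mean-pool over bytes to get one fixed-length vector (a simple choice,
# not something ByT5 was trained to make numerically decodable).
vector = hidden.mean(dim=1)                         # (1, d)
```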


Why not CLIP for this

  • Standard CLIP uses a fixed text context length (commonly 77 tokens), and this is a known limitation in practice (errors when exceeding it, and model constraints around context_length). (GitHub)
  • There is active work on extending CLIP beyond this limit (e.g., TULIP and other approaches), but that still doesn’t address the deeper issue: CLIP-style text embeddings are not trained to preserve precise numeric values. (ICLR Proceedings)

Q2) Should you convert JSON to natural language?

Recommendation: No, not for “MLP decodes floats”

Converting structured records into prose tends to:

  • introduce formatting variability (“-125.5 degrees” vs “-125.50°”)
  • reintroduce tokenizer issues for numbers
  • encourage the encoder to focus on semantics rather than numeric exactness

If your success metric is “recover precise angles,” keep angles numeric and only embed categories as text if you truly need open-vocabulary behavior.

What to do instead: structured → tokens (numeric-first)

A practical representation per entity:

  • Category: embedding lookup (closed vocab) or a small text embedding (open vocab)

  • Angles:

    • encode azimuth/elevation as sin/cos (wrap-safe)
    • optionally add harmonics for higher precision

Why harmonics can help: Fourier feature mappings are known to help MLPs learn higher-frequency structure (reducing “spectral bias”). (arXiv)
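
A minimal per-entity angle-feature builder along these lines; the number of harmonics is a knob to tune, not a prescribed value:

```python
import numpy as np

def angle_features(azimuth_deg: float, elevation_deg: float, harmonics: int = 4) -> np.ndarray:
    """Wrap-safe angle features: sin/cos of each angle at frequencies 2^k.
    harmonics=1 gives the plain [sin, cos] encoding; more harmonics help the
    MLP resolve small angular differences."""
    feats = []
    for deg in (azimuth_deg, elevation_deg):
        theta = np.deg2rad(deg)
        for k in range(harmonics):
            feats.extend([np.sin((2 ** k) * theta), np.cos((2 ** k) * theta)])
    return np.asarray(feats, dtype=np.float32)

# Example: 2 angles * 4 harmonics * (sin, cos) = 16 features per entity
vec = angle_features(-125.5, -37.2)
```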

If you must serialize to text (only for compatibility)

Use a two-stream approach:

  • Text stream: stable placeholders
    {"category":"chair","az":<NUM0>,"el":<NUM1>}
  • Numeric stream: the real floats [-125.5, -37.2, ...]

Then fuse them (concat + MLP) before the set encoder. This prevents your numeric fidelity from depending on the text tokenizer.
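
A sketch of that fusion, assuming you already have some text encoder producing a vector per placeholder string (the helper and module names here are mine, not a specific API):

```python
import torch
import torch.nn as nn

def to_two_streams(entity: dict):
    """Replace floats with stable placeholders and collect them separately."""
    az, el = entity["centre"]["azimuth_deg"], entity["centre"]["elevation_deg"]
    text = f'{{"category":"{entity["category"]}","az":<NUM0>,"el":<NUM1>}}'
    numbers = torch.tensor([az, el], dtype=torch.float32)
    return text, numbers

class TwoStreamFusion(nn.Module):
    """Concatenate the text embedding with the raw numeric values (optionally
    pre-embedded), then project; numeric fidelity no longer depends on the
    text tokenizer."""
    def __init__(self, text_dim: int, num_numeric: int = 2, d_out: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + num_numeric, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out))

    def forward(self, text_emb: torch.Tensor, numbers: torch.Tensor):
        return self.fuse(torch.cat([text_emb, numbers], dim=-1))
```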


Q3) How to prevent subword tokenizers from breaking floats

The most reliable solution: don’t tokenize floats

Parse JSON → keep numbers as floats → embed numerically.

This completely avoids the class of problems illustrated by common tokenizers splitting floating point strings into multiple pieces. (arXiv)
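
Concretely, for your JSON format (a sketch; the closed-vocabulary category mapping is an assumption):

```python
import json
import numpy as np

def parse_entities(json_str: str, categories: dict):
    """JSON string -> (category ids, angles in degrees) with floats untouched."""
    doc = json.loads(json_str)
    cat_ids, angles = [], []
    for ent in doc["entities"]:
        cat_ids.append(categories[ent["category"]])
        c = ent["centre"]
        angles.append([float(c["azimuth_deg"]), float(c["elevation_deg"])])
    return np.asarray(cat_ids, dtype=np.int64), np.asarray(angles, dtype=np.float32)

cats = {"chair": 0, "cup": 1}
ids, angles = parse_entities(
    '{"entities": [{"category": "chair", "centre": {"azimuth_deg": -125.5, "elevation_deg": -37.2}}]}',
    cats)
# ids -> [0], angles -> [[-125.5, -37.2]]: no tokenizer ever sees the floats
```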

If you must go through a tokenizer anyway

Use one of these, in descending order of reliability:

  1. Tokenization-free encoders (byte/char): ByT5, CANINE (arXiv)
  2. Placeholder substitution + numeric side-channel (best practical workaround with standard LLM tokenizers)
  3. Quantize angles to fixed-point integers (e.g., tenths of a degree) so the model learns an easier discrete mapping
  4. Train a custom tokenizer/vocab where numbers (or digit chunks) are handled consistently (only if you control the full training stack)

A concrete configuration I would start with (defaults)

If you want the simplest “works-first” version (a wiring sketch follows this list)

  • K = 32, d = 256

  • Per-entity token:

    • category embedding (e.g., d_cat=64)
    • angles: [sin(az), cos(az), sin(el), cos(el)]
    • MLP → d
  • Set encoder: Set Transformer + PMA(K) (Proceedings of Machine Learning Research)

  • Training: DETR-style Hungarian matching + no-object class (arXiv)

  • Freeze encoder; probe MLP predicts sin/cos, decode with atan2
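
A sketch of how those defaults wire together, reusing the PMAPooling block from earlier; NUM_CATEGORIES and the layer sizes other than d_cat=64, d=256, K=32 are placeholder choices:

```python
import torch
import torch.nn as nn

NUM_CATEGORIES, D_CAT, D_MODEL, K = 100, 64, 256, 32   # NUM_CATEGORIES is a placeholder

class EntityTokenizer(nn.Module):
    """category id + [sin(az), cos(az), sin(el), cos(el)] -> one d_model token."""
    def __init__(self):
        super().__init__()
        self.cat_emb = nn.Embedding(NUM_CATEGORIES, D_CAT)
        self.proj = nn.Sequential(nn.Linear(D_CAT + 4, D_MODEL), nn.ReLU(),
                                  nn.Linear(D_MODEL, D_MODEL))

    def forward(self, cat_ids, angles_deg):             # (B, N), (B, N, 2)
        rad = torch.deg2rad(angles_deg)
        az, el = rad[..., 0], rad[..., 1]
        trig = torch.stack([torch.sin(az), torch.cos(az),
                            torch.sin(el), torch.cos(el)], dim=-1)   # (B, N, 4)
        return self.proj(torch.cat([self.cat_emb(cat_ids), trig], dim=-1))

# Probe head: per-slot MLP predicts sin/cos pairs, decoded with atan2.
probe = nn.Sequential(nn.Linear(D_MODEL, 128), nn.ReLU(), nn.Linear(128, 4))

def decode_angles(slot_vecs):                            # (B, K, D_MODEL)
    out = probe(slot_vecs)                               # (B, K, 4) = [sin_az, cos_az, sin_el, cos_el]
    az = torch.rad2deg(torch.atan2(out[..., 0], out[..., 1]))
    el = torch.rad2deg(torch.atan2(out[..., 2], out[..., 3]))
    return az, el
```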

If you need tighter numeric fidelity

Add harmonics to the input numeric features (e.g., sin(2^kθ), cos(2^kθ) for k=0..3) and/or use numeric embedding modules from the “numerical features embeddings” line of work. (arXiv)

If you expect large N

Swap Set Transformer → Perceiver IO. (arXiv)


Practical online resources to implement quickly

  • Set Transformer paper + official implementation (PMA(K) → fixed slots) (Proceedings of Machine Learning Research)
  • DETR paper + matcher implementation (Hungarian set matching; adapt cost to angles) (arXiv)
  • Numeric scalar embeddings (periodic + piecewise-linear modules; official repo) (NeurIPS Proceedings)
  • Tokenization-free encoders if you insist on string-first: ByT5, CANINE (arXiv)
  • Why floats in text break (concrete tokenization examples) (arXiv)

What I would do next (implementation order)

  1. Implement parser + angle canonicalization
  2. Implement per-entity numeric-first token builder (sin/cos)
  3. Implement Set Transformer PMA(K)
  4. Implement Hungarian matching loss (angle distance + optional category cost)
  5. Overfit a tiny dataset to verify near-zero error (see the toy loop below)
  6. Train normally; freeze encoder; train probe
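
And a toy single-example loop for step 5, wiring together the illustrative EntityTokenizer, PMAPooling, probe and matching_loss sketches from above (none of these are library APIs):

```python
import torch

# Single-example overfit check using the sketches defined earlier.
tok, pool = EntityTokenizer(), PMAPooling(D_MODEL, K)
params = list(tok.parameters()) + list(pool.parameters()) + list(probe.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

cat_ids = torch.tensor([[0, 1]])                                  # chair, cup
angles = torch.tensor([[[-125.5, -37.2], [110.3, -29.9]]])        # degrees
az, el = torch.deg2rad(angles[0, :, 0]), torch.deg2rad(angles[0, :, 1])
target = torch.stack([torch.sin(az), torch.cos(az),
                      torch.sin(el), torch.cos(el)], dim=-1)      # (N, 4)

for step in range(2000):
    slots = pool(tok(cat_ids, angles))                            # (1, K, d)
    preds = probe(slots)[0]                                       # (K, 4)
    loss = matching_loss(preds, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
# loss should drop toward ~0 if the pipeline is wired correctly
```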

Thank you for the detailed reply! I’ll try Deep Sets and Set Transformers first, and will separate the text and numbers for encoding 🙂

