Vespa - Custom tokenization in DocumentProcessor - best practices for sending processed tokens to content nodes

In Vespa, I need custom token-analysis chains for different fields in a schema, all in the same language. Should I perform custom Lucene analysis in a DocumentProcessor, and how should the processed tokens be sent to content nodes?

I’m migrating from Solr to Vespa and need to implement custom field-specific text analysis. I’m considering using Lucene libraries directly in DocumentProcessor (for indexing) and Searcher (for queries) to replicate our Solr analyzer chains (which include WordDelimiterGraph, pattern replacements, stemming, shingles, synonyms, etc.).

My current approach is to perform the analysis in DocumentProcessor and then send the processed data to content nodes. However, I’m unclear about several aspects:

  1. Format question: After performing custom tokenization/analysis in DocumentProcessor, what format should I use to send the data to content nodes?

    • Space-separated string of all tokens?

    • Array of tokens?

    • Token stream?

    • Any other form?

  2. Synonym handling: If “fabric” and “cloth” are synonyms, and the field value is “I want to buy cloth”, should I send:

    • A string: “i want to buy cloth fabric” (including the synonym)?

    • Some other structure that preserves synonym relationships?

  3. Re-processing concern: If I send processed tokens as a space-separated string to the content node, will it be tokenized, stemmed, etc. again by Vespa’s linguistic processing? How do I prevent double-processing?

  4. Architecture question: Is performing such operations in DocumentProcessor a recommended pattern in Vespa, or is there a better approach to achieve field-specific custom analysis?

Any guidance on the proper way to handle custom text analysis and token passing in Vespa would be greatly appreciated.



You generally should not pre-tokenize in a DocumentProcessor and “send tokens to content nodes”. In Vespa, content nodes index text fields by running linguistic processing (tokenization, normalization, stemming) to produce an internal token stream with positions, and that token stream is not something you normally serialize as “tokens” in the document you feed.

If you want Solr-like, field-specific analysis chains, the “Vespa-native” pattern is:

  • Configure linguistics (often LuceneLinguistics) to implement your analyzer chains.
  • Select the right chain per field via schema settings (and sometimes via multiple derived fields).
  • Do synonyms mostly at query time (query rewriting), not by stuffing extra tokens into the indexed text.

Below I answer your four questions directly, then give concrete architectures that map well from Solr.


Mental model: where Vespa keeps “tokens”

What Vespa stores

When you index a string field as an index field (indexing: index), Vespa uses linguistic processing to tokenize and transform the text (the default match: text performs tokenization and transformations). The resulting token stream is stored internally in the inverted index (positions, term stats, etc).

A key point from a commonly-cited Vespa team answer: the tokenizer output is stored as annotations; you can add annotations yourself, but most people use the existing tokenizer or implement a custom one (Stack Overflow).

What Vespa does not do

Vespa does not provide a supported “feed-time token stream” wire format (like Lucene TokenStream) that you pass to content nodes. You feed document field values, not low-level index terms.
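For example, a feed operation carries plain field values, nothing lower-level (document type and field name here are illustrative):

```json
{
    "put": "id:shop:product::1",
    "fields": {
        "description": "I want to buy cloth"
    }
}
```

The content node derives the token stream from that string at indexing time; there is no slot in the feed format for pre-built terms or positions.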


1) “What format should I send to content nodes after tokenization?”

Recommended answer

Send the original (or lightly normalized) text as a string field and let Vespa’s linguistics produce the token stream on the content nodes.

That means: do not send space-joined tokens, arrays of tokens, or any “token stream” representation as the primary way to populate a full-text index.

Why

If you send a “token string”, you are throwing away things Vespa’s index cares about:

  • true character offsets
  • token positions and gaps
  • phrase/proximity behavior
  • correct term statistics for ranking (BM25 length normalization, etc)

If you insist on sending tokens anyway

Only do this when you explicitly accept different search semantics:

  • Create a separate field such as array<string> or weightedset<string> that stores tokens.
  • Query it as a “bag of tokens” (often attribute-like semantics, limited phrase/proximity).
  • This is closer to a feature field than a full text field.

Also note: attribute matching does not use linguistic processing, while indexed string fields do. So moving to “token arrays” often changes how matching behaves and what ranking features you can rely on.
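A sketch of the two options in schema terms (schema and field names are illustrative); the first is the recommended full-text field, the second a bag-of-tokens side field you only add if you accept the different semantics:

```
schema product {
    document product {
        # Recommended: feed raw text; linguistics tokenizes on the content node
        field description type string {
            indexing: summary | index
        }
        # Only if you accept bag-of-tokens semantics:
        # pre-built tokens, matched without linguistic processing
        field description_tokens type array<string> {
            indexing: summary | attribute
        }
    }
}
```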


2) Synonyms: should I append synonyms into the indexed string?

Example: synonyms “fabric” ↔ “cloth”, document text is “I want to buy cloth”.

Recommended answer

Do not index as "i want to buy cloth fabric" in the same text field.

Instead, do synonyms as query rewriting, so you preserve the idea “these are alternatives” rather than “this document literally contains both words”.

Vespa’s query rewriting docs show two important operators:

  • EQUIV(a, b) for truly equivalent alternatives.
  • OR(a, b) when they are not strictly equivalent and you may want weighting or different behavior.

If you want “cloth” queries to also match “fabric”, typical rewrites are:

  • ... contains equiv("cloth", "fabric") (strong equivalence)
  • or two weighted alternatives: ... contains ({weight: w1}"cloth") or ... contains ({weight: w2}"fabric") (soft equivalence, tunable per term).
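As an illustrative YQL sketch (document type, field name, and weights are assumptions, not from your schema), the first query treats the two terms as one indexed term for ranking purposes, while the second keeps them as separate weighted terms:

```
select * from product where description contains equiv("cloth", "fabric")

select * from product where description contains ({weight: 100}"cloth")
    or description contains ({weight: 60}"fabric")
```

In practice a Searcher or semantic rule produces these rewrites from the user's raw query, so the synonym table lives in query-side configuration rather than in the index.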

Why appending synonyms is usually harmful

Appending synonyms into the indexed text:

  • distorts BM25 length normalization and term frequencies
  • breaks proximity semantics (“cloth fabric” wasn’t actually present)
  • makes debugging relevance harder
  • forces you to refeed/reindex when synonyms change

Query-time synonyms avoid most of that.


3) If I send a space-separated token string, will Vespa tokenize and stem again?

Yes, unless you change the field to avoid text tokenization

For a normal full-text index field (match: text, the default), Vespa tokenizes and applies transformations. So if you feed "cloth fabric" as a “token string”, Vespa will tokenize that string again, and may also normalize or stem it, depending on your field settings and linguistics configuration.

How to avoid “double processing”

The real fix is architectural:

  1. Do not pre-tokenize in DocumentProcessor for full-text indexing.
  2. Put your custom logic into linguistics (LuceneLinguistics or custom tokenizer) so indexing and querying share the same behavior.

There is also a query-side setting that can prevent extra processing: model.type.tokenization=linguistics combined with model.type.syntax=none makes Vespa “pass through” tokens from linguistics and apply no further stemming/normalization, even if configured in the schema. This is useful when your linguistics chain already does everything and you want to avoid layering additional transformations.
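Sketched as query parameters on the search endpoint (the model.* parameter names are from the Query API; the document type and query value are illustrative):

```
/search/?yql=select * from product where userQuery()
    &query=cloth
    &model.language=en
    &model.type.tokenization=linguistics
    &model.type.syntax=none
```

With this combination the query terms are produced by the configured linguistics component only, so a fully-normalizing Lucene chain is not stemmed or normalized a second time.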


4) Is DocumentProcessor the right place for this?

Usually no

DocumentProcessor is best for:

  • enrichment (lookup, classification, embeddings)
  • field rewriting at the document-value level
  • generating derived fields

It is not the best place to reimplement the full Solr/Lucene analysis pipeline and then attempt to “ship tokens”.

The Vespa docs explicitly describe a better extension point: implement custom linguistics (a custom Tokenizer or a full Linguistics implementation) (Vespa Documentation), and make sure the same linguistics configuration is used for indexing and query processing (the component must be configured in every container cluster involved, including the one doing indexing).


What to do for your Solr-to-Vespa migration

You said: same language, but different analyzer chains per field (WordDelimiterGraph, pattern replaces, stemming, shingles, synonyms).

The clean mapping in Vespa is usually a combination of:

  1. LuceneLinguistics for Lucene-like token filters and stemming.
  2. Query rewriting for synonyms.
  3. Derived fields (multiple indexes) for shingles/ngrams or alternative match behaviors.

A) Use LuceneLinguistics as your “analysis engine”

Vespa’s LuceneLinguistics is configured by a key consisting of:

  • a language code
  • optionally a stemming mode

…and it does not do language detection, so you typically set the language explicitly.

Practical trick: use stemming mode as an “analyzer selector”

Even if your fields are the same human language (English), you can treat the (language, stemMode) key as “variant slots” and configure different Lucene analyzer chains per slot.

Then, per field in schema, you set:

  • stemming: none|best|shortest|all|default (these map to the stem modes used by linguistics)

This gives you up to ~5 variants per language “for free”.

Example structure (illustrative)

<!-- services.xml -->
<container id="default" version="1.0">
  <search>
    <chain id="default" inherits="vespa">
      <!-- your Searchers for query rewriting, etc -->
    </chain>
  </search>

  <!-- bundle name may differ depending on how you package LuceneLinguistics -->
  <component id="linguistics"
             class="com.yahoo.language.lucene.LuceneLinguistics"
             bundle="lucene-linguistics">
    <config name="com.yahoo.language.lucene.lucene-analysis">
      <!-- multiple <analysis> blocks keyed by language + stemMode -->
    </config>
  </component>
</container>

Configuring the analyzer by key is the core design.
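In the schema, each field then selects its variant through its stemming setting (a sketch; field names and the chosen modes are illustrative, and the key mapping assumes you configured analyzers under those (language, stemMode) keys):

```
schema product {
    document product {
        field title type string {
            indexing: summary | index
            # selects the Lucene analyzer chain configured for key en/BEST
            stemming: best
        }
        field description type string {
            indexing: summary | index
            # selects a different chain, configured for key en/SHORTEST
            stemming: shortest
        }
    }
}
```

The stemming values no longer mean “how much stemming” here; they are just routing keys into your analyzer configuration, so document which chain each value stands for.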

B) Lock language explicitly (avoid detection surprises)

For indexing, you can hardcode the language per field using the set_language indexing expression.

This matters because query-side language detection can be unreliable for short queries. A Vespa issue documents that query language detection “returns English for 3 terms or less” and recommends setting language explicitly via query parameters (model.language or model.locale) (GitHub).
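A sketch of pinning the language on both sides (the literal-to-set_language form follows the linguistics docs; the field name is illustrative, and the field should be processed before the text fields it applies to):

```
field language type string {
    indexing: "en" | set_language
}
```

On the query side, pass model.language=en (or model.locale) explicitly instead of relying on detection, which is exactly the mitigation the GitHub issue recommends for short queries.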

C) Do synonyms at query time (semantic rules or Searcher)

Vespa’s query rewriting guide describes rewriting as either:

  • semantic rules (rule-based, declarative)
  • Searchers for more complex rewriting

Use equiv or OR rewrites for synonyms.

This keeps synonym maintenance out of indexing. You can change synonym rules without refeeding.

D) Handle shingles using derived fields or n-grams

Instead of injecting shingles into the main field text, create a derived field and index it appropriately. Vespa’s indexing language supports producing n-grams via ngram <size> (Vespa Documentation), and the text matching guide discusses n-gram matching as a pattern.

If you truly need Lucene ShingleFilter behavior, you can also put it in the Lucene analyzer chain for a variant key, but in many Vespa apps a dedicated ngram field is simpler to reason about and tune.
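A schema-level alternative is a derived field using gram matching (a sketch; note this produces character n-grams, not Lucene-style word shingles, and the field name and gram size are illustrative):

```
# Declared outside the document block, populated from the fed field
field description_gram type string {
    indexing: input description | index
    match {
        gram
        gram-size: 3
    }
}
```

Query the gram field alongside the main field (for example in a rank() or or-expression) so exact-ish substring recall comes from one index and normal text relevance from the other.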

E) Pattern replacements

If your Solr chain uses char filters or regex replacements, you have two reasonable homes:

  • In the Lucene analyzer (char filters + token filters), if you already commit to LuceneLinguistics.
  • In indexing language / docproc only for document-value transformations (not as a surrogate token stream).

Similar cases and useful online references

“How do I pass tokens / custom tokenization?”

  • Stack Overflow: Vespa team answer explains that tokenizer output is stored as annotations and suggests implementing a custom tokenizer rather than feeding tokens (Stack Overflow).
  • Your exact question also appears on Stack Overflow, unanswered at the time this was captured (Stack Overflow).

Query rewriting and synonym structure

  • Vespa query rewriting guide: explains EQUIV vs OR and when to use Searchers vs semantic rules.

Language detection pitfalls

  • GitHub issue #24265: documents the short-query language detection problem and recommends setting language explicitly (GitHub).

“I want flexible full text behavior like weakAnd, tokenization”

  • GitHub issue #12154: discussion about needing tokenization + weakAnd-style behavior without client-side tokenization. (Not the same as Solr analyzers, but very relevant to “where do tokens come from” in Vespa.)

Core docs you will keep returning to

  • Linguistics: language setting, custom linguistics/tokenizer, indexing container cluster configuration.
  • LuceneLinguistics: analyzer keying by language + stem mode, no language detection.
  • Text matching guide: match modes, indexing vs attribute behavior, n-gram patterns.
  • Indexing language reference: tokenize, ngram, set_language and related primitives (Vespa Documentation).
  • Query API: model.type.tokenization=linguistics and model.type.syntax=none to avoid extra transformations.

My recommendation for your exact situation (Solr analyzer chains per field)

Step 1: Stop trying to “send tokens”

Feed normal text. Put “analysis” into linguistics.

Step 2: Implement analyzers in LuceneLinguistics

  • Recreate your WordDelimiterGraph, pattern replacements, stemming, shingling filters as Lucene analyzer chains.
  • Use (language, stemMode) keys as “variant selectors” and map fields to variants via stemming: in the schema.

If you have more than ~5 variants, consider:

  • consolidating chains (many Solr installations have accidental differences)
  • using derived fields (separate indexes) rather than endless analyzer variants
  • using language tags plus set_language as additional key space, only if you confirm Vespa accepts the tags you choose.

Step 3: Do synonyms at query time

  • Implement synonyms via semantic rules or a Searcher.
  • Use EQUIV for strict equivalence, OR when you want weights or softer behavior.

Step 4: Make query processing symmetric

  • Set the query language explicitly (or at least be aware of short-query detection pitfalls) (GitHub).
  • If your Lucene chain already normalizes/stems fully, consider model.type.tokenization=linguistics + model.type.syntax=none to avoid double work.

Step 5: Use DocumentProcessor only for enrichment and derived fields

Use it to produce extra fields (exact-match field, ngram field, normalized field), not to “inject tokens” for the main full-text index.


Summary

  • Do not feed “tokens” to content nodes for full-text indexes. Feed text. Let linguistics produce internal tokens. (Stack Overflow)
  • Do synonyms via query rewriting using EQUIV or OR, not by appending synonym terms into indexed text.
  • If you pre-tokenize into a space-separated string, Vespa will tokenize again for normal text indexes. Avoid by moving logic into linguistics.
  • Best practice for Solr-like field analyzers is LuceneLinguistics with per-field selection via stem modes and derived fields for extra behaviors. (Vespa Documentation)
  • Set language explicitly to avoid detection pitfalls on short queries. (GitHub)