Custom KV Cache Steering Implementation Fails with IndexError in LLaVA Generation

1. Context & Goal

I am implementing “Visual KV Cache Steering” for LLaVA (based on llava-hf/llava-1.5-7b-hf). The goal is to:

  1. Run a prefill step to populate the KV cache for the prompt (USER: <image>\n{text}...).

  2. Intervene in the cache by adding steering vectors specifically to the visual token positions (the 576 tokens corresponding to the image).

  3. Generate the rest of the response using this modified cache.

2. Implementation Strategy

My current logic follows this pattern:

  1. Prefill: Call model(**inputs, use_cache=True) to get past_key_values.

  2. Modify: Convert past_key_values (which is a DynamicCache in newer transformers) to a legacy list/tuple format (or modify it in place) to inject vectors into the visual token indices.

  3. Generate: Call model.generate() passing the modified past_key_values and the last token as input_ids.

3. The Issue

When passing the modified cache back to model.generate, I encounter an IndexError. It appears that prepare_inputs_for_generation inside LlavaForConditionalGeneration is misinterpreting the cache length or structure, likely due to conflicts between the new DynamicCache format and the legacy tuple format I am trying to use.

Traceback (most recent call last):
…
File "transformers/generation/utils.py", line 2781, in _sample
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
File "transformers/models/llava/modeling_llava.py", line 466, in prepare_inputs_for_generation
model_inputs = super().prepare_inputs_for_generation(
File "transformers/generation/utils.py", line 574, in prepare_inputs_for_generation
inputs_embeds, input_ids = self._cache_dependant_input_preparation(
File "transformers/generation/utils.py", line 476, in _cache_dependant_input_preparation
or (cache_position[-1] >= input_ids.shape[1]) # Exception 3
IndexError: index -1 is out of bounds for dimension 0 with size 0

4. Reproduction Code

Here is the simplified reproduction script. The failure occurs at the final model.generate call.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, DynamicCache
from PIL import Image

# Setup
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id)

# Dummy Inputs
image = Image.new('RGB', (336, 336), color='red')
prompt_text = "Describe this image."
prompt = f"USER: <image>\n{prompt_text}\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors='pt').to("cuda", torch.float16)

# 1. Prefill
with torch.inference_mode():
    out = model(**inputs, use_cache=True, return_dict=True)
    cache = DynamicCache.from_legacy_cache(out.past_key_values)

# 2. Modify Cache (Simulated Steering)
# Converting to legacy format to iterate and modify
legacy = list(cache.to_legacy_cache())
for i, (k, v) in enumerate(legacy):
    # Example modification: simply cloning for reproduction
    # In real code, I add vectors to specific indices here
    legacy[i] = (k.clone(), v.clone())
    
# Re-wrap as DynamicCache (Attempt 1) or Tuple (Attempt 2)
# Both approaches lead to issues in LLaVA's generate step
steered_cache = DynamicCache.from_legacy_cache(tuple(legacy))

# 3. Generate
# We provide the last token of the prompt as the seed
seed_ids = inputs["input_ids"][:, -1:]

# Calculate cache position
past_len = steered_cache.get_seq_length()
cache_pos = torch.arange(past_len, past_len + seed_ids.shape[1], device=seed_ids.device)

# Construct Attention Mask
attn_mask = torch.cat(
    [inputs["attention_mask"], inputs["attention_mask"].new_ones((1, 1))],
    dim=-1
)

# FAILURE HAPPENS HERE
generated_ids = model.generate(
    input_ids=seed_ids,
    past_key_values=steered_cache, # Passing the modified cache
    cache_position=cache_pos,
    attention_mask=attn_mask,
    max_new_tokens=100
)

5. Questions

  1. How should we correctly modify the KV cache in-place between the prefill and generate steps for LLaVA models in the latest Transformers version?

  2. Does LlavaForConditionalGeneration require pixel_values to be passed to generate even if the image tokens are already present in the past_key_values?

  3. How do we resolve the IndexError regarding cache_position? It seems the model thinks the cache is empty or misaligned with the input_ids.


I could reproduce the error.


What’s going on (two separate failure modes)

A) LLaVA “image tokens vs image features” mismatch (ValueError)

In transformers’ LLaVA implementation, the image is not “implicitly in the cache.” Instead:

  1. Your prompt must contain image_token_id placeholders.
  2. During the forward pass, LLaVA computes image features from pixel_values.
  3. It then replaces the placeholder token embeddings with those image features.
  4. It hard-checks that placeholder-token count matches feature count and throws if not.

You can see the check in get_placeholder_mask(): it compares the number of <image> placeholder slots against the flattened image-feature length and raises a ValueError if they differ; this is the check behind failure mode A. (GitHub)

Also, LLaVA commonly drops the CLIP CLS token when vision_feature_select_strategy == "default", so the expected number of visual “patch tokens” is typically (H/patch)*(W/patch) (plus any “additional image tokens”). (GitHub)

Implication for cache steering: the KV entries you want to steer correspond to those placeholder positions after the merge, so you must correctly build the prompt/processor so that the placeholder positions exist and match what the vision tower produces.
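
A quick sanity check before any cache surgery (a sketch; it assumes the processor version that expands the single <image> placeholder into one token per patch, that inputs is the processor output from the reproduction script, and that the config exposes image_token_id — on some versions the attribute is image_token_index):

image_token_id = model.config.image_token_id          # id of the <image> placeholder token
n_placeholders = (inputs["input_ids"] == image_token_id).sum().item()

# For llava-1.5 at 336x336, patch size 14, and the "default" select strategy
# (CLS token dropped), the vision tower produces (336 // 14) ** 2 = 576 features.
expected_features = (336 // 14) ** 2

assert n_placeholders == expected_features, (
    f"placeholder tokens ({n_placeholders}) != image features ({expected_features})"
)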


B) cache_position becomes empty → IndexError: cache_position[-1] … size 0

This is a generation-side failure: when generate() is called with past_key_values, some builds/paths can end up with an empty cache_position tensor, which later crashes at cache_position[-1].

There are two relevant references:

  • HF docs: cache_position must always be valid and advances by 1 per token; e.g. if cache has 10 tokens, next token must use torch.tensor([10]). (Hugging Face)
  • A reported bug: “cache position incorrectly inferred for generation” when past_key_values is provided, leading to exactly your IndexError. (GitHub)
    (There are also forum reports of the same IndexError pattern. (Hugging Face Forums))

Implication: even if you pass a non-empty cache_position, you can still hit this if generate() internally overwrites/infers it incorrectly in some paths. The most robust workaround is: don’t rely on generate() for the “resume from modified cache” step—use a small manual decode loop (greedy or sampling) where you control cache_position.
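
To make the documented rule concrete, a toy illustration (not tied to the LLaVA setup; the numbers are made up):

import torch

past_len = 10                      # tokens already in the cache
new_ids = torch.tensor([[123]])    # one new token being fed this step

# The K new tokens get positions [past_len, past_len + K)
cache_position = torch.arange(past_len, past_len + new_ids.shape[1])
print(cache_position)              # tensor([10]) -- matches the documented rule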


Answers to your 3 questions

1) Correct way to modify the KV cache between prefill and generation (Transformers v5 / Cache objects)

Key points

  1. In Transformers v5, LLaVA returns past_key_values as a Cache instance (not the old tuple-of-tuples). (GitHub)
  2. The cache stores tensors shaped [batch, num_heads, seq_len, head_dim]. (Hugging Face)
  3. The cache is structured as layers; docs show cache.layers[idx].keys / cache.layers[idx].values. (Hugging Face)

Recommended cache-edit pattern (in-place)

Prefill → find visual positions → edit those positions in cache.layers[l].keys/values → resume decoding.

Important: avoid converting to legacy format unless you must. Conversions are a common place to lose metadata / end up with wrong “seen token” bookkeeping, which increases the chance generate() later infers cache_position incorrectly.

Practical steering index logic:

  • Get the token id used for images: image_token_id = model.config.image_token_id (LLaVA uses it in the forward merge). (GitHub)
  • Find visual token positions from the prefill input_ids (the sequence whose KV you actually cached): visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)

Then edit (per layer):

  • cache.layers[layer].keys[:, :, visual_pos, :] += delta_k
  • cache.layers[layer].values[:, :, visual_pos, :] += delta_v

Two critical alignment rules

Rule A (prompt split): If you will “resume” with seed_ids = last_token, then the cache must contain everything before that last token.
So prefill on input_ids[:, :-1], not the full prompt.

Rule B (visual positions): Compute positions against the sequence you actually cached (input_ids[:, :-1]), and if needed clamp to < prefill_len.
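
Putting the pattern and the two rules together, a minimal sketch (it assumes the cache.layers[i].keys/values layout from the docs cited above; on older 4.x releases the same tensors are reachable as cache.key_cache[i] / cache.value_cache[i]; delta_k and delta_v are hypothetical steering tensors, e.g. of shape [head_dim], broadcast over batch, heads, and the selected positions):

full_ids = inputs["input_ids"]
full_mask = inputs["attention_mask"]

# Rule A: cache everything except the last prompt token
prefill_ids = full_ids[:, :-1]
prefill_mask = full_mask[:, :-1]

with torch.inference_mode():
    out = model(
        input_ids=prefill_ids,
        pixel_values=inputs["pixel_values"],
        attention_mask=prefill_mask,
        use_cache=True,
        return_dict=True,
    )
    cache = out.past_key_values  # a Cache instance in recent transformers

    # Rule B: positions computed against the sequence that was actually cached
    image_token_id = model.config.image_token_id
    visual_pos = (prefill_ids[0] == image_token_id).nonzero().squeeze(-1)

    # In-place steering: add deltas to keys/values at the visual positions, per layer
    for layer in cache.layers:
        layer.keys[:, :, visual_pos, :] += delta_k.to(layer.keys.dtype)
        layer.values[:, :, visual_pos, :] += delta_v.to(layer.values.dtype)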


2) Do you need to pass pixel_values to generate() if image tokens are already in the cache?

If your prefill already ran with pixel_values and you are resuming from that cache:

  • Do not pass pixel_values again for the resume step.
  • Also do not include fresh <image> placeholders in the resume input_ids (your resume token should be normal text).

Why: LlavaForConditionalGeneration.prepare_inputs_for_generation() explicitly forwards pixel_values only on the first iteration, and notes that “first iteration” can mean “continue generate from cache.” (GitHub)
If you pass pixel_values on a resume call where your new input_ids contain 0 image tokens, you can trigger the mismatch check (“tokens: 0, features: …”). (GitHub)

So: prefill step uses pixel_values; resume step should not.
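
In call form, reusing the names from the sketch above (prefill_ids, prefill_mask, full_ids, full_mask) and keeping it to a single resume step, a sketch of which arguments belong to which call:

# Prefill: the image enters the model here (pixel_values + <image> placeholders in the prompt).
prefill = model(input_ids=prefill_ids, pixel_values=inputs["pixel_values"],
                attention_mask=prefill_mask, use_cache=True, return_dict=True)

# Resume: the visual tokens already live in the cache, so no pixel_values and
# no <image> placeholders -- just the held-back last prompt token.
step = model(input_ids=full_ids[:, -1:],
             past_key_values=prefill.past_key_values,
             attention_mask=full_mask,
             cache_position=torch.tensor([prefill_ids.shape[1]], device=full_ids.device),
             use_cache=True, return_dict=True)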


3) How to resolve the cache_position IndexError (model thinks cache is empty / misaligned)

What cache_position must be

Docs specify:

  • If cache already has N tokens and you are now processing K new tokens, cache_position must be [N, ..., N+K-1]. (Hugging Face)
  • Attention mask length must match past_kv_length + new_tokens_length. (Hugging Face)

The “most correct” resume formula for your setup

If you prefill on input_ids[:, :-1]:

  • prefill_len = input_ids.shape[1] - 1
  • seed_ids = input_ids[:, -1:]
  • cache_position = torch.tensor([prefill_len], device=...)
  • attention_mask should have length prefill_len + 1 (typically just the original attention_mask, unchanged, if there’s no padding)
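
As code (continuing the same names; assumes a single sequence with no left-padding):

prefill_len = inputs["input_ids"].shape[1] - 1          # N tokens already in the cache
seed_ids = inputs["input_ids"][:, -1:]                  # K = 1 new token
cache_position = torch.tensor([prefill_len], device=seed_ids.device)
attention_mask = inputs["attention_mask"]               # length prefill_len + 1: covers cache + new token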

Why you still may crash with generate()

There is a known report that generate() can infer/overwrite cache_position incorrectly when past_key_values is passed, leading to empty cache_position and then the same IndexError you saw. (GitHub)
This is consistent with your trace.

Robust fix: avoid generate() for the resume step

Implement a small decode loop using model() directly (the HF caching guide even shows a manual loop and emphasizes correct attention_mask + cache_position handling). (Hugging Face)

This also makes KV steering easier (you control exactly when cache is read/written).
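
A minimal greedy resume loop along these lines (a sketch: cache is the steered cache from Answer 1, seed_ids / cache_position / attention_mask as in the formula above; sampling, padded batches, and fancier stopping criteria are omitted):

eos_id = processor.tokenizer.eos_token_id
generated = []

cur_ids = seed_ids
cur_pos = cache_position
attn = attention_mask

with torch.inference_mode():
    for _ in range(100):                                   # max_new_tokens
        out = model(
            input_ids=cur_ids,
            past_key_values=cache,                         # updated in place each step
            attention_mask=attn,
            cache_position=cur_pos,
            use_cache=True,
            return_dict=True,
        )
        cache = out.past_key_values

        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        generated.append(next_id)
        if next_id.item() == eos_id:
            break

        cur_ids = next_id
        cur_pos = cur_pos + 1                              # advance exactly one position
        attn = torch.cat([attn, attn.new_ones((attn.shape[0], 1))], dim=-1)

text = processor.batch_decode(torch.cat(generated, dim=-1), skip_special_tokens=True)[0]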


Minimal “known-good” pipeline for LLaVA visual KV steering

  1. Build inputs correctly

    • Prompt must include <image> placeholders.
    • Ensure processor is configured so it expands placeholders consistently with the vision tower settings (patch size / strategy). The LLaVA docs explicitly expose these knobs on the processor. (Hugging Face)
    • Avoid truncation that could drop image placeholders.
  2. Prefill (cache prompt minus last token)

    • Run model(input_ids=input_ids[:, :-1], pixel_values=..., attention_mask=attention_mask[:, :-1], use_cache=True, return_dict=True)
  3. Locate visual token positions

    • visual_pos = where(prefill_input_ids == image_token_id)
    • Verify len(visual_pos) equals what the model expects for your resolution/patching.
  4. Edit cache in place

    • Add steering deltas at seq_len indices = visual_pos for each layer.
  5. Resume decoding with a manual loop

    • Start from seed_ids = input_ids[:, -1:]
    • Use cache_position = tensor([prefill_len]), increment by 1 each step.
    • Do not pass pixel_values.

Similar cases / references worth reading

  • Transformers caching guide: explains cache_position, cache tensor shapes, layer storage, and a manual decode loop. (Hugging Face)
  • LLaVA forward + generation code: shows the placeholder-token vs image-feature check and the “pixel_values only on first iteration” rule. (GitHub)
  • Bug report on empty cache_position during generate(past_key_values=...) leading to IndexError. (GitHub)
  • HF forum thread with the same IndexError signature. (Hugging Face Forums)