Model Card for LingoIITGN/COMI-LINGUA-LID

Model Description

This is a fine-tuned version of aya-expanse-8b for Token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It performs token-wise classification into three categories: en (English), hi (Hindi), or ot (Other).

The model supports mixed Roman and Devanagari scripts and is designed for identifying language switches at the word level in natural code-mixed content.

The model achieves 94.75 F1 on the COMI-LINGUA LID test set (5K instances), outperforming zero-shot inference from strong closed-source LLMs (e.g., GPT-4o: 92.7 F1) and traditional tools (Microsoft LID: 74.4 F1).

  • Model type: LoRA-adapted Transformer LLM (8B params, ~32M trainable)
  • Finetuned from model: CohereForAI/aya-expanse-8b

Uses

  • Token-level LID in Hinglish pipelines (e.g., preprocessing for downstream tasks in social media analysis and chatbots).

  • Helps detect language switches in code-mixed user-generated content.

  • Example inference prompt:

Identify the language of each token (en = English, hi = Hindi, ot = Other) in: "New Delhi/Alive News : आज के दिन में इंडिया की टीम ने ऑस्ट्रेलिया को हराया। #INDvAUS"
Output: [{'New': 'en'}, {'Delhi': 'en'}, {'/': 'ot'}, {'Alive': 'en'}, {'News': 'en'}, {':': 'ot'}, {'आज': 'hi'}, {'के': 'hi'}, {'दिन': 'hi'}, {'में': 'hi'}, {'इंडिया': 'en'}, {'की': 'hi'}, {'टीम': 'en'}, {'ने': 'hi'}, {'ऑस्ट्रेलिया': 'en'}, {'को': 'hi'}, {'हराया': 'hi'}, {'।': 'ot'}, {'#INDvAUS': 'ot'}]
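
A minimal inference sketch with transformers and peft follows; the prompt wording, generation settings, and output handling are illustrative assumptions rather than the exact training setup.

# Minimal inference sketch (assumed usage): load the base model, attach the
# LoRA adapter, and prompt for token-level language tags.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "CohereForAI/aya-expanse-8b"
adapter_id = "LingoIITGN/COMI-LINGUA-LID"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

prompt = (
    "Identify the language of each token (en = English, hi = Hindi, ot = Other) in: "
    '"New Delhi/Alive News : आज के दिन में इंडिया की टीम ने ऑस्ट्रेलिया को हराया। #INDvAUS"\n'
    "Output:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens (the predicted tag list).
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))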

Training Details

Training Data

The model was trained on the COMI-LINGUA dataset; see the COMI-LINGUA Dataset Card for details.

Training Procedure

Preprocessing

Text was tokenized with the base model's tokenizer and formatted with instruction templates plus few-shot examples. Instances were filtered to keep those with at least 5 tokens, predominantly code-mixed content, and no hateful or non-Hinglish text.
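
As a rough illustration of these filters, the sketch below keeps instances with at least five tokens, mixed scripts, and no flagged content. The script-based code-mixing heuristic and the contains_hate placeholder are assumptions; the released dataset relies on expert annotation rather than these heuristics.

import re

# Illustrative filtering sketch for the criteria above (assumptions, not the
# actual COMI-LINGUA pipeline).
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
ROMAN = re.compile(r"[A-Za-z]")

def looks_code_mixed(tokens):
    # Rough heuristic: the instance mixes Devanagari and Roman-script tokens.
    return any(DEVANAGARI.search(t) for t in tokens) and any(ROMAN.search(t) for t in tokens)

def keep_instance(tokens, contains_hate=lambda toks: False):
    # Keep instances with >=5 tokens, code-mixed content, and no hateful text.
    return len(tokens) >= 5 and looks_code_mixed(tokens) and not contains_hate(tokens)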

Training Hyperparameters

  • Regime: PEFT LoRA (rank=32, alpha=64, dropout=0.1)
  • Epochs: 3
  • Batch: 4 (accum=8, effective=32)
  • LR: 2e-4 (cosine + warmup=0.1)
  • Weight decay: 0.01
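
A configuration sketch matching the hyperparameters listed above, using peft and transformers; the target modules, output directory, and any other unlisted settings are assumptions.

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration matching the listed hyperparameters.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
)

# Optimization settings matching the listed hyperparameters.
training_args = TrainingArguments(
    output_dir="comi-lingua-lid-lora",   # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,       # effective batch size 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)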

Evaluation

Testing Data

COMI-LINGUA LID test set (5K instances).

Metrics

Macro Precision / Recall / F1 (token-level).
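
As a sketch, these macro-averaged token-level scores can be computed with scikit-learn once gold and predicted tags are flattened into parallel lists; the example tags below are illustrative.

from sklearn.metrics import precision_recall_fscore_support

# Illustrative gold and predicted tag sequences, flattened across tokens.
gold = ["en", "en", "ot", "hi", "hi"]
pred = ["en", "hi", "ot", "hi", "hi"]

p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=["en", "hi", "ot"], average="macro", zero_division=0
)
print(f"Macro P={p:.4f}  R={r:.4f}  F1={f1:.4f}")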

Results

Setting      Precision   Recall   F1
Zero-shot    51.08       70.55    59.05
One-shot     73.03       71.07    70.48
Fine-tuned   87.45       86.92    94.90

Summary: In zero- and one-shot settings, the model is competitive with or better than leading closed-source LLMs; after fine-tuning, it establishes a strong baseline (state of the art among open-weight fine-tuned models) for Hinglish token-level LID.

Bias, Risks, and Limitations

This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.

Model Card Contact

Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in

Citation

If you use this model, please cite the following work:

@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee  and
      Beniwal, Himanshu  and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}