NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within Low-Resource Languages
Paper: [arXiv:2412.15252](https://arxiv.org/abs/2412.15252)
This is a RoBERTa language model pre-trained on Central Kurdish (Sorani) text that provides high-quality contextual word embeddings. The model can serve as a feature extractor for downstream tasks such as named entity recognition (NER).
Table 1: AsoSoft Kurdish Text Corpus
| Source | Number of Tokens |
|---|---|
| Crawled From Websites | 95M |
| Text Books | 45M |
| Magazines | 48M |
| Sum | 188M |
Table 2: Muhammad Azizi and AramRafeq Text Corpus
| Source | Number of Tokens |
|---|---|
| Wikipedia | 13.5M |
| Wishe Website | 11M |
| Speemedia Website | 6.5M |
| Kurdiu Website | 19M |
| Dengiamerika Website | 2M |
| Chawg Website | 8M |
| Sum | 60M |
Table 3: The Kurdish Text Corpus Used to Pre-Train the Model
| Corpus Name | Number of Tokens |
|---|---|
| OSCAR 2019 corpus | 48.5M |
| AsoSoft corpus | 188M |
| Muhammad Azizi and AramRafeq corpus | 60M |
| Sum | 296.5M |
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pre-trained model
tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
model = AutoModel.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")

# Example Central Kurdish (Sorani) sentence
text = "لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت."

# Tokenize and run a forward pass without tracking gradients
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings, one vector per input token
embeddings = outputs.last_hidden_state
```
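Since this checkpoint is distributed as a base encoder, a token-classification head must be added before fine-tuning it for NER. Below is a minimal sketch using Hugging Face's `AutoModelForTokenClassification`; the BIO label set and the dummy labels are illustrative assumptions, not the configuration used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO tag set for persons, locations, and organizations
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
model = AutoModelForTokenClassification.from_pretrained(
    "abdulhade/RoBERTa-large-SizeCorpus_1B",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tokenize one sentence; real training would batch an annotated NER corpus
inputs = tokenizer("لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت.", return_tensors="pt")

# Dummy all-"O" labels, purely to demonstrate the loss interface;
# replace with gold per-token BIO tags aligned to the subword tokens
dummy_labels = torch.zeros_like(inputs["input_ids"])

outputs = model(**inputs, labels=dummy_labels)
outputs.loss.backward()  # an optimizer step (e.g. AdamW) would follow
```

Loading a base checkpoint this way initializes the classification head randomly, so the `transformers` warning about newly initialized weights is expected until the head has been trained.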
If you use our text corpus or model, please cite:
```bibtex
@article{abdullah2024ner,
  title={NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within Low-Resource Languages},
  author={Abdullah, Aso A and Abdulla, Sana H and Toufiq, Darya M and others},
  journal={arXiv preprint arXiv:2412.15252},
  year={2024}
}
```
Base model: [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)