Urdu-NERD: Urdu named entity recognition with BiGRU-based deep learning architecture

Zainab Rafiq; Muhammad Wasim; Fatema Sabeen Shaikh; Nahier Aldhafferi; Abdullah Alqahtani

doi:10.7717/peerj-cs.3678

Zainab Rafiq, Muhammad Wasim^*, Fatema Sabeen Shaikh, Nahier Aldhafferi, Abdullah Alqahtani

^*Corresponding author for this work

Computer Information Systems Department

University of Management and Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), focusing on identifying and extracting entities such as names, locations, organizations, and other specific labels from unstructured text data. It plays a crucial role in various NLP applications, including information retrieval, question answering, and sentiment analysis. However, while NER systems have been extensively developed for English, adapting them to languages like Urdu poses unique challenges due to linguistic differences and the scarcity of annotated data. In this research, we enhance data diversity and accessibility for Urdu NER by introducing the ZUNERA corpus, the most extensive Urdu NER dataset to date, comprising 1,189,614 tokens and 89,804 named entities. Additionally, we classify the entities into twenty-three named entity types. We meticulously annotate the corpus, providing clear guidelines and employing the Kappa coefficient to ensure high-quality annotations. Furthermore, we propose the Urdu-Named Entity Recognition with BiGRU-based Deep Learning Architecture (NERD) framework, which facilitates efficient entity recognition in Urdu text. The proposed framework achieves an F1-score of 94.6%. Comparing ZUNERA with the MK-PUCIT dataset underscores its robustness in accurately recognizing entities. Although this study centers on Urdu, the proposed NER framework and annotation pipeline are designed to be language-agnostic. They can be extended to other morphologically rich or low-resource languages, providing a replicable foundation for future cross-lingual research. Overall, our contributions advance Urdu NER research by providing a comprehensive dataset, evaluating state-of-the-art techniques, and introducing a novel framework for efficient Urdu entity recognition.

Original language	English
Article number	e3678
Journal	PeerJ Computer Science
Volume	12
DOIs	https://doi.org/10.7717/peerj-cs.3678
State	Published - 2026

Keywords

Asian languages
Low-resource languages
Name entity recognition
Urdu
Word embedding

Cite this

@article{5230b9f2cf97420e893ba60f7c4d922f,

title = "Urdu-NERD: Urdu named entity recognition with BiGRU-based deep learning architecture",

abstract = "Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), focusing on identifying and extracting entities such as names, locations, organizations, and other specific labels from unstructured text data. It plays a crucial role in various NLP applications, including information retrieval, question answering, and sentiment analysis. However, while NER systems have been extensively developed for English, adapting them to languages like Urdu poses unique challenges due to linguistic differences and the scarcity of annotated data. In this research, we enhance data diversity and accessibility for Urdu NER by introducing the ZUNERA corpus, the most extensive Urdu NER dataset to date, comprising 1,189,614 tokens and 89,804 named entities. Additionally, we classify the entities into twenty-three different named entities types. We meticulously annotate the corpus, providing clear guidelines and employing the Kappa coefficient to ensure high-quality annotations. Furthermore, we propose the Urdu-Named Entity Recognition with BiGRU-based Deep Learning Architecture (NERD) framework, which facilitates efficient entity recognition in Urdu text. The proposed framework achieves an impressive F1-score of 94.6\%. Comparing ZUNERA with the MK-PUCIT dataset underscores its robustness in accurately recognizing entities. Although this study centers on Urdu, the proposed NER framework and annotation pipeline are designed to be language-agnostic. They can be extended to other morphologically rich or low-resource languages, providing a replicable foundation for future cross-lingual research. Overall, our contributions significantly advance Urdu NER research by providing a comprehensive dataset, evaluating state-of-the-art techniques, and introducing a novel framework for efficient Urdu entity recognition.",

keywords = "Asian languages, Low-resource languages, Name entity recognition, Urdu, Word embedding",

author = "Zainab Rafiq and Muhammad Wasim and Shaikh, {Fatema Sabeen} and Nahier Aldhafferi and Abdullah Alqahtani",

year = "2026",

doi = "10.7717/peerj-cs.3678",

language = "English",

volume = "12",

journal = "PeerJ Computer Science",

issn = "2376-5992",

}

TY - JOUR

T1 - Urdu-NERD: Urdu named entity recognition with BiGRU-based deep learning architecture

AU - Rafiq, Zainab

AU - Wasim, Muhammad

AU - Shaikh, Fatema Sabeen

AU - Aldhafferi, Nahier

AU - Alqahtani, Abdullah

PY - 2026

Y1 - 2026

AB - Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), focusing on identifying and extracting entities such as names, locations, organizations, and other specific labels from unstructured text data. It plays a crucial role in various NLP applications, including information retrieval, question answering, and sentiment analysis. However, while NER systems have been extensively developed for English, adapting them to languages like Urdu poses unique challenges due to linguistic differences and the scarcity of annotated data. In this research, we enhance data diversity and accessibility for Urdu NER by introducing the ZUNERA corpus, the most extensive Urdu NER dataset to date, comprising 1,189,614 tokens and 89,804 named entities. Additionally, we classify the entities into twenty-three different named entities types. We meticulously annotate the corpus, providing clear guidelines and employing the Kappa coefficient to ensure high-quality annotations. Furthermore, we propose the Urdu-Named Entity Recognition with BiGRU-based Deep Learning Architecture (NERD) framework, which facilitates efficient entity recognition in Urdu text. The proposed framework achieves an impressive F1-score of 94.6%. Comparing ZUNERA with the MK-PUCIT dataset underscores its robustness in accurately recognizing entities. Although this study centers on Urdu, the proposed NER framework and annotation pipeline are designed to be language-agnostic. They can be extended to other morphologically rich or low-resource languages, providing a replicable foundation for future cross-lingual research. Overall, our contributions significantly advance Urdu NER research by providing a comprehensive dataset, evaluating state-of-the-art techniques, and introducing a novel framework for efficient Urdu entity recognition.

KW - Asian languages

KW - Low-resource languages

KW - Name entity recognition

KW - Urdu

KW - Word embedding

UR - https://www.scopus.com/pages/publications/105034981551

U2 - 10.7717/peerj-cs.3678

DO - 10.7717/peerj-cs.3678

M3 - Article

AN - SCOPUS:105034981551

SN - 2376-5992

VL - 12

JO - PeerJ Computer Science

JF - PeerJ Computer Science

M1 - e3678

ER -
