The Resource Archive
A curated library of datasets, models, and research for Nepali language technology. Search, filter, and explore.
Nepali Text Corpus (IRIISNEPAL)
A comprehensive collection of approximately 6.4 million articles (27.5 GB) from news sites, blogs, and other online platforms. It is described as the largest text dataset for the Nepali language.
By: Unknown
np20ng
A multi-class Nepali text classification dataset consisting of over 200,000 news documents categorized into 20 different Nepali news groups.
By: Unknown
Nepali Corpus (C4 Multilingual)
A massive, cleaned subset of the Common Crawl containing approximately 3.2 billion tokens (13 GB) of Nepali web text.
By: dirkgr, @adarob
16NepaliNews Corpus
A collection of 14,364 news documents partitioned across 16 different categories, inspired by the 20 Newsgroups dataset.
By: sndsabin
39K Nepali Wikipedia Articles
A cleaned dataset of 39,000 articles from Nepali Wikipedia, split into train and test sets. It provides a source of formal, encyclopedic text.
By: Gaurav
A LARGE SCALE NEPALI TEXT CORPUS
A large-scale text corpus for the Nepali language, available via IEEE Dataport.
By: Community
CC100 Nepali
A monolingual dataset from the Common Crawl, part of a larger collection covering 100 languages.
By: Community
350K Nepali Sentences
A collection of 350,000 Nepali sentences from various sources.
By: Unknown
Nepali Abstractive Summarization Corpus
A corpus of 286,000 article-title pairs from news sources, suitable for training abstractive summarization models.
By: Community
Nepali NER (EBIQUITY)
A dataset for Named Entity Recognition, released in two versions with IO and BIO tagging schemes. Version 2 is recommended.
By: Unknown
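The difference between the two tagging schemes can be sketched in a few lines. This is a minimal illustration, not code from the dataset: it assumes IO tags are written as bare entity types (for example "PER", "LOC") with "O" for non-entity tokens, and the function name io_to_bio is hypothetical. The actual label format in the released files may differ.

```python
def io_to_bio(tags):
    """Convert IO-style tags to BIO-style tags.

    Assumption: IO tags are bare entity types such as "PER" or "LOC",
    and "O" marks tokens outside any entity.
    """
    bio = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            bio.append("O")
        elif tag == prev:
            # Same type as the previous token: continue the entity.
            bio.append("I-" + tag)
        else:
            # A new entity starts here.
            bio.append("B-" + tag)
        prev = tag
    return bio


print(io_to_bio(["PER", "PER", "O", "LOC"]))
```

Note that IO tagging cannot mark the boundary between two adjacent entities of the same type, so this conversion merges them into one span. That limitation is the usual reason to prefer BIO-tagged data.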
Large Nepali ASR training data set
A large dataset for Automatic Speech Recognition containing approximately 157,000 transcribed utterances collected by Google.
By: Unknown
High quality TTS data for Nepali
A multi-speaker dataset for Text-to-Speech synthesis containing around 2,000 high-quality transcribed sentences.
By: Unknown
FLoRes Evaluation Datasets for Low-Resource Machine Translation
Standardized evaluation datasets for low-resource Nepali-English machine translation, based on Wikipedia.
By: Unknown
nepal-brihat-sabdakosh-json
A structured JSON dump of all 122,000 words from the Nepali Brihat Sabdakosh (a comprehensive dictionary).
By: bikashpadhikari
LINCE: Nepali-English Code Switching
A dataset containing Nepali-English code-switched language, valuable for studying language mixing phenomena.
By: Community
DHCD dataset
A dataset of Devanagari (Nepali) handwritten characters for handwritten character recognition tasks.
By: Prasanna1991
Nepali Characters Dataset (NCD)
A dataset containing images of Nepali characters.
By: InspiringLab
Nepali Fonts OCR Dataset
A dataset for Optical Character Recognition (OCR) of various Nepali fonts.
By: Unknown
Nepali Handwritten Digits
A dataset containing images of Nepali handwritten digits.
By: kcnishan
Nepali Stopwords
A list of common stop words in the Nepali language.
By: sanjaalcorps
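A stop word list like this is typically applied as a filter over tokenized text. The sketch below assumes whitespace tokenization and uses a tiny illustrative stop word set chosen for this example. It is not taken from the listed resource, and the function name remove_stopwords is hypothetical.

```python
# Illustrative stop words only; the real list is much longer.
STOPWORDS = {"र", "पनि", "छन्"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split on whitespace and drop tokens found in the stop word set."""
    return [token for token in text.split() if token not in stopwords]


print(remove_stopwords("राम र सीता पनि खुसी छन्"))
```

Nepali postpositions are often written attached to the preceding word, so plain whitespace splitting will miss them. A real pipeline would likely need a tokenizer or stemmer before filtering.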
IRIIS-RESEARCH/RoBERTa_Nepali_125M
A 110-million-parameter RoBERTa-based model trained on a 27.5 GB Nepali corpus. Designed for NLU tasks like classification and NER.
By: Unknown
NepBERTa
A BERT-based NLU model trained on an extensive monolingual corpus of 0.8B words. Released with the Nep-gLUE benchmark for evaluation.
By: Unknown
NepaliGPT: A Generative Language Model for the Nepali Language
A generative large language model (GPT) for Nepali, trained on a large custom corpus called the Devanagari Corpus.
By: Unknown
patrakar (Nepali News Classifier)
A DistilBERT model fine-tuned for classifying Nepali news into 9 categories.
By: sahajrajmalla
Nepali-DistilBERT
A DistilBERT language model trained on the OSCAR Nepali corpus and fine-tuned for sentiment analysis.
By: dexhrestha
Transformer-Based Nepali Language Model
A text generation model for Nepali, trained on the OSCAR corpus, with objectives including spelling correction and feature extraction.
By: Unknown
fastText Embeddings
300-dimensional word vectors for 157 languages, including Nepali, trained on Common Crawl and Wikipedia using the CBOW method.
By: Unknown
NPVec1
A suite of 25 state-of-the-art word embeddings for Nepali, derived from a large corpus using GloVe, Word2Vec, fastText, and BERT.
By: Unknown
300-D Word Embeddings (Word2Vec) for Nepali Language
A pre-trained Word2Vec model with 300-dimensional vectors for over 0.5 million Nepali words, trained on a 90M-word news corpus.
By: rabindralamsal
ELMo Embeddings
Contextualized word embeddings for many South Asian languages, including Nepali.
By: Unknown
Byte Pair Embeddings (BPEmb)
Subword embeddings for 275 languages, including Nepali, trained on Wikipedia.
By: Community
wav2vec2-nepali
A fine-tuned wav2vec2 model for Nepali Automatic Speech Recognition.
By: Unknown
Nepali NLP Toolkit
A comprehensive Python library for various NLP tasks including embeddings, tokenization, stemming, summarization, OCR, and translation.
By: sushil79g
Indic NLP Library
A library providing common NLP utilities for various Indic languages, including Nepali.
By: anoopkunchukuttan
Nepali Lemmatizer
A tool specifically for lemmatization of Nepali words.
By: dpakpdl
nepali-spell
A spell corrector for Nepali that uses edit distance to suggest the correct word.
By: nepali-bhasa
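As a rough illustration of edit-distance spelling correction in general, the sketch below computes Levenshtein distance and returns the closest word from a vocabulary. It is not the implementation used by nepali-spell. The function names and the three-word vocabulary are assumptions made for this example, and distance is counted over Unicode code points, so a Devanagari vowel sign counts as one character.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def correct(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


# Hypothetical vocabulary for illustration only.
vocabulary = ["नेपाल", "नेपाली", "भाषा"]
print(correct("नेपल", vocabulary))
```

A practical corrector would usually also weight candidates by word frequency and limit the search to words within a small distance, since scanning a full dictionary for every token is slow.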
NepaliLipi
An application for text prediction and transliteration from Roman script to Devanagari.
By: AchillesKarki
Nepdict
An English-Nepali dictionary application built in Python for the terminal.
By: Unknown
Improving Nepali Document Classification by Neural Network
This paper compares different text classification methods for Nepali and demonstrates that using word2vec with a neural network improves performance.
By: Unknown
A Deep Learning Approach for Part-of-Speech Tagging in Nepali Language
This paper proposes a deep learning-based Part-of-Speech (POS) tagger for Nepali text, achieving over 99% accuracy.
By: Unknown
A Computational Analysis of Nepali Morphology: A Model For Natural Language Processing
A dissertation on the computational analysis of Nepali morphology using a finite-state approach to create a morphological analyzer.
By: Unknown
A Morphological Analyzer and a Stemmer for Nepali
This paper discusses the design, implementation, and linguistic aspects of a Morphological Analyzer and a stemmer for Nepali.
By: Unknown