Article: PashtoXNX 2013 — Overview and Impact
PashtoXNX 2013 was a regional linguistic and cultural initiative focused on the Pashto language and its digital presence. Launched in 2013, the project aimed to improve Pashto-language resources, increase online accessibility, and foster community contributions to Pashto computing and digital content.
Evaluation
- Evaluate machine translation (MT) output with BLEU and chrF; for morphological analysis, report F1 over morphological tags.
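As a rough illustration of the morphology metric, the sketch below computes micro-averaged precision, recall, and F1 over per-token tag sets. The function name `tag_f1` and the tag labels in the usage note are hypothetical, and the source does not say whether the project used micro or macro averaging, so micro averaging is an assumption here.

```python
def tag_f1(gold, pred):
    """Micro-averaged precision, recall, and F1 over morphological tags.

    gold and pred are parallel lists with one set of tags per token.
    (Assumption: tags are compared as unordered sets per token.)
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")

    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_tags, pred_tags = set(gold_tags), set(pred_tags)
        tp += len(gold_tags & pred_tags)   # tags predicted and correct
        fp += len(pred_tags - gold_tags)   # tags predicted but wrong
        fn += len(gold_tags - pred_tags)   # gold tags that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, `tag_f1([{"N", "PL"}, {"V", "PST"}], [{"N", "SG"}, {"V", "PST"}])` counts three correct tags, one spurious tag, and one missed tag.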
Preprocessing steps
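- Normalize Unicode (NFC). iconv -c only drops byte sequences that are not valid UTF-8; it does not normalize, so NFC needs a separate Unicode-aware pass (for example Python's unicodedata.normalize):
iconv -f utf-8 -t utf-8 -c corpus.txt > corpus_clean.txt
- Remove non-printable control characters, keeping tab, newline, carriage return, and every multi-byte UTF-8 sequence (a printable-ASCII whitelist would delete all Pashto text):
LC_ALL=C tr -d '\000-\010\013\014\016-\037\177' < corpus_clean.txt > corpus_printable.txt
- Tokenize (Python example using regex):
import re
def tokenize(s):
    # Match runs of Arabic-block characters (Pashto script), runs of ASCII
    # alphanumerics, or any other single non-space character.
    return re.findall(r'[\u0600-\u06FF]+|[A-Za-z0-9]+|[^\s]', s)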
- Split train/dev/test: 80/10/10 shuffled by line.
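The normalization and split steps above can be sketched in Python as follows. The helper names `normalize_nfc` and `split_corpus` and the fixed seed are illustrative assumptions rather than part of the original project; the fixed seed is there only so the shuffle is reproducible.

```python
import random
import unicodedata


def normalize_nfc(line):
    """Return the line in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", line)


def split_corpus(lines, seed=13):
    """Shuffle lines and split them 80/10/10 into train, dev, and test.

    The seed value is arbitrary (an assumption); any fixed value makes
    the split repeatable across runs.
    """
    shuffled = [normalize_nfc(line) for line in lines]
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = n * 80 // 100   # integer arithmetic avoids float rounding
    n_dev = n * 10 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]   # remainder goes to test
    return train, dev, test
```

Because the remainder after the train and dev slices goes to test, corpora whose size is not a multiple of ten give the test set the extra lines.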
Inspecting files
- List files:
unzip pashtoxnx2013.zip
ls -lah
- Check sample:
head -n 20 corpus.txt
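The same inspection can be done without unpacking the archive. The sketch below uses Python's standard zipfile module; the function names `list_members` and `head` are hypothetical, and it assumes corpus.txt is a UTF-8 text member of the archive, which the source does not state explicitly.

```python
import zipfile


def list_members(archive_path):
    """Return (name, uncompressed size in bytes) for each archive member."""
    with zipfile.ZipFile(archive_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def head(archive_path, member, n=20):
    """Return the first n lines of a text member, decoded as UTF-8."""
    lines = []
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(member) as fh:
            for raw in fh:
                if len(lines) >= n:
                    break
                # Replace undecodable bytes instead of failing on bad input.
                lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
    return lines
```

Calling `list_members("pashtoxnx2013.zip")` and then `head("pashtoxnx2013.zip", "corpus.txt")` mirrors the two shell steps above.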
Challenges
- Fragmentation in orthography and dialectal variation made standardization difficult.
- Limited funding and technical expertise slowed large-scale software localization.
- Low digital literacy in some rural Pashto-speaking areas constrained adoption.
- Political instability in parts of the Pashto-speaking region hampered sustained, in-person community work.