PashtoXNX 2013 Online

Article: PashtoXNX 2013 — Overview and Impact

PashtoXNX 2013 was a regional linguistic and cultural initiative focused on the Pashto language and its digital presence. Launched in 2013, the project aimed to improve Pashto-language resources, increase online accessibility, and foster community contributions to Pashto computing and digital content.

Evaluation

Evaluate machine translation (MT) output with BLEU and chrF; for morphological analysis, report F1 over the predicted morphological tags.
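As a rough illustration of the morphology metric, the sketch below computes micro-averaged precision, recall, and F1 over per-token tag sets. The function name `morph_f1`, the tag labels, and the choice to represent each token's analysis as a set of tag strings are assumptions made for this example, not something the project specifies. BLEU and chrF are usually computed with an existing tool such as sacreBLEU rather than reimplemented.

```python
from typing import Iterable, Set, Tuple

def morph_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-token tag sets.

    Assumes gold and pred are aligned token by token (hypothetical format).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # tags predicted and present in gold
        fp += len(p - g)   # tags predicted but absent from gold
        fn += len(g - p)   # gold tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two tokens: the first has one wrong tag, the second is fully correct.
gold = [{"N", "PL"}, {"V", "PST"}]
pred = [{"N", "SG"}, {"V", "PST"}]
print(morph_f1(gold, pred))  # (0.75, 0.75, 0.75)
```

Micro-averaging weights every tag equally; a per-tag (macro) average may be more informative when rare tags matter.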

Preprocessing steps

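Clean invalid UTF-8 before normalizing to NFC. The iconv command below only discards byte sequences that are not valid UTF-8; it does not itself apply NFC normalization: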

iconv -f utf-8 -t utf-8 -c corpus.txt > corpus_clean.txt
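
For the NFC composition step itself, which `iconv` does not perform, a minimal sketch using Python's standard `unicodedata` module could look like the following. The function names and file paths are placeholders chosen for this example.

```python
import unicodedata

def normalize_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)

def normalize_file(src: str, dst: str) -> None:
    """Write an NFC-normalized copy of a UTF-8 text file, line by line."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(normalize_nfc(line))

# ALEF (U+0627) followed by MADDAH ABOVE (U+0653) composes to U+0622.
print(normalize_nfc("\u0627\u0653") == "\u0622")  # True
```

Calling `normalize_file("corpus_clean.txt", "corpus_nfc.txt")` would apply this to the cleaned corpus; the output filename is hypothetical.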

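Remove ASCII control characters, keeping tab, newline, and carriage return. Delete only the control bytes; a filter that keeps only printable ASCII would also strip every byte of the UTF-8 sequences that encode Pashto text:

tr -d '\000-\010\013\014\016-\037\177' < corpus_clean.txt > corpus_printable.txt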
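
A byte-level filter cannot see multi-byte control characters such as the C1 range (U+0080 to U+009F). As a sketch of a Unicode-aware alternative, the function below drops characters in the general category `Cc` while keeping tab, newline, and carriage return. The name `strip_control_chars` is an assumption for this example. Format characters such as the zero-width non-joiner (U+200C, category `Cf`) are deliberately left alone, since they can be meaningful in Arabic-script text.

```python
import unicodedata

# Whitespace controls that should survive the cleanup.
KEEP = {"\t", "\n", "\r"}

def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc) except tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )

sample = "\u067E\u069A\u062A\u0648\x00\x07\n"  # a Pashto word plus NUL and BEL
print(strip_control_chars(sample) == "\u067E\u069A\u062A\u0648\n")  # True
```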

Tokenize (Python example using regex):

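import re
# Runs of Arabic-script characters (the block that contains the Pashto letters),
# runs of ASCII alphanumerics, or any other single non-space character.
TOKEN_RE = re.compile(r'[\u0600-\u06FF]+|[A-Za-z0-9]+|\S')
def tokenize(s):
    return TOKEN_RE.findall(s)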
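
One caveat with matching the whole U+0600 to U+06FF block is that it also contains Arabic-script punctuation and digits, so a comma or question mark stays attached to the preceding word. The variant below is a sketch of one way to split them off; the function name `tokenize_arabic_script` and the particular punctuation set are assumptions for this example, not part of the original pipeline.

```python
import re

# Arabic-script punctuation and digits that should not be glued to words.
AR_PUNCT = "\u060C\u061B\u061F\u06D4"      # comma, semicolon, question mark, full stop
AR_DIGITS = "\u0660-\u0669\u06F0-\u06F9"   # Arabic-Indic and Extended Arabic-Indic digits

TOKEN_RE = re.compile(
    rf"(?:(?![{AR_PUNCT}{AR_DIGITS}])[\u0600-\u06FF])+"  # Arabic-script word
    rf"|[{AR_DIGITS}]+"                                   # Arabic-script number
    r"|[A-Za-z0-9]+"                                      # ASCII word or number
    r"|\S"                                                # any other single symbol
)

def tokenize_arabic_script(s: str) -> list:
    """Tokenize text, emitting Arabic-script punctuation as separate tokens."""
    return TOKEN_RE.findall(s)

# A word followed by an Arabic comma, a second word, and "!" yields four tokens.
tokens = tokenize_arabic_script("\u0633\u0644\u0627\u0645\u060C \u0646\u0693\u06CD!")
print(len(tokens))  # 4
```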

Split into train/dev/test sets (80/10/10) after shuffling the corpus by line.
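
A minimal sketch of such a split is shown below. The function name `split_lines` and the fixed seed are assumptions for this example; the seed only makes the shuffle reproducible.

```python
import random
from typing import List, Tuple

def split_lines(lines: List[str], seed: int = 13) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle lines reproducibly and split them 80/10/10 into train/dev/test."""
    shuffled = list(lines)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatable splits
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remainder, so no line is dropped
    return train, dev, test

lines = [f"sentence {i}" for i in range(100)]
train, dev, test = split_lines(lines)
print(len(train), len(dev), len(test))  # 80 10 10
```

For a parallel corpus, the source and target lines would need to be shuffled as pairs so that alignment is preserved.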

Inspecting files

Extract the archive and list the files:
```
unzip pashtoxnx2013.zip
ls -lah
```
Preview the first 20 lines of the corpus:
```
head -n 20 corpus.txt
```

Challenges

Fragmentation in orthography and dialectal variation made standardization difficult.
Limited funding and technical expertise slowed large-scale software localization.
Low digital literacy in some rural Pashto-speaking areas constrained adoption.
Political instability in parts of the Pashto-speaking region hampered sustained, in-person community work.