PashtoXNX 2013 Online

Article: PashtoXNX 2013 — Overview and Impact

PashtoXNX 2013 was a regional linguistic and cultural initiative focused on the Pashto language and its digital presence. Launched in 2013, the project aimed to improve Pashto-language resources, increase online accessibility, and foster community contributions to Pashto computing and digital content.

Evaluation

Preprocessing steps

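  1. Drop invalid UTF-8 bytes, then normalize to Unicode NFC (the second command assumes ICU's uconv is installed; iconv alone does not normalize):
    iconv -f utf-8 -t utf-8 -c corpus.txt | uconv -f utf-8 -t utf-8 -x Any-NFC > corpus_clean.txt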
    
  2. Remove ASCII control characters (keeping tab, LF, and CR) while leaving the multibyte Pashto text intact:
    tr -d '\000-\010\013\014\016-\037\177' < corpus_clean.txt > corpus_printable.txt
    
  3. Tokenize (Python example using regex):
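    import re

    def tokenize(s):
        # Arabic-block runs (covers Pashto letters), ASCII alphanumeric runs, else single symbols.
        # Note: Arabic-block punctuation (e.g. U+060C) stays attached to adjacent words.
        return re.findall(r'[\u0600-\u06FF]+|[A-Za-z0-9]+|[^\s]', s)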
  4. Split into train/dev/test sets (80/10/10) after shuffling by line.
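
The normalization, control-character removal, and split steps above can also be done in one place in Python. The following is a minimal sketch, not the project's own tooling: the function names, the seed value, and the integer-percentage parameters are assumptions made for illustration.

```python
import random
import unicodedata

def normalize_line(line):
    """Return the line in NFC with ASCII control characters removed."""
    line = unicodedata.normalize("NFC", line)
    # Keep tab; drop the other C0 controls and DEL. Non-ASCII text is untouched.
    return "".join(
        ch for ch in line
        if ch == "\t" or not (ord(ch) < 32 or ord(ch) == 127)
    )

def split_lines(lines, train_pct=80, dev_pct=10, seed=2013):
    """Shuffle lines with a fixed seed and cut them into train/dev/test."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )
```

A fixed seed keeps the split reproducible across runs, and whatever remains after the train and dev cuts goes to the test set, so no line is lost to rounding.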

Inspecting files

  1. Extract the archive and list its files:
    unzip pashtoxnx2013.zip
    ls -lah
    
  2. Preview the first 20 lines:
    head -n 20 corpus.txt
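
Beyond eyeballing the first lines, a rough character profile can show whether a sample is mostly Pashto text or mixed with other material. The sketch below is an illustrative assumption rather than part of the original release: the function name and the four character classes are hypothetical, and the Arabic block (U+0600 to U+06FF) is used as a coarse proxy for Pashto letters.

```python
from collections import Counter

def script_profile(text):
    """Count characters by rough class for a quick sanity check."""
    counts = Counter()
    for ch in text:
        if "\u0600" <= ch <= "\u06FF":
            counts["arabic_block"] += 1   # Pashto letters fall in this block
        elif ch.isascii() and ch.isalnum():
            counts["ascii_alnum"] += 1
        elif ch.isspace():
            counts["whitespace"] += 1
        else:
            counts["other"] += 1
    return counts
```

To use it, read the file with encoding="utf-8" and pass a sample of the text to script_profile. A high share of "other" characters may point to encoding damage worth checking before preprocessing.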
    

Challenges