Punjabi is an Indo-Aryan language and the native language of the Punjabi people, who inhabit the historical Punjab region of Pakistan and India. Punjabi developed from Sanskrit through the Prakrit languages and, later, Apabhraṃśa. It emerged as an Apabhraṃśa, a degenerated form of Prakrit, in the 7th century A.D. and became stable by the 10th century. By the 10th century, many Nath poets were associated with earlier Punjabi works. Arabic and Persian influence in the historical Punjab region began with the late first-millennium Muslim conquests on the Indian subcontinent. (Source: Wikipedia)


Use CorpusImporter or browse the CLTK GitHub repository (anything beginning with punjabi_) to discover available Punjabi corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('punjabi')
In [3]: c.list_corpora

Now import any corpus you like from the list of available corpora.


The Punjabi digits, vowels, consonants, and symbols are placed in cltk/corpus/punjabi/alphabet.py. It is fully commented, so look there for more information about the language's phonology.

To use Punjabi's independent vowels, for example:

In [1]: from cltk.corpus.punjabi.alphabet import INDEPENDENT_VOWELS

In [2]: INDEPENDENT_VOWELS
Out[2]: ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']

These are the INDEPENDENT_VOWELS. They do not need to be attached to a consonant and are written exactly as they are. They represent the sounds "aa", "i", "iii", "u", "uuu", "a", "oo", "o", and "ou", respectively.

Similarly there are lists for DIGITS, DEPENDENT_VOWELS, CONSONANTS, BINDI_CONSONANTS (nasal pronunciation) and some OTHER_SYMBOLS (mostly for pronunciation).
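As a rough illustration of how such character lists can be used, the sketch below picks the independent vowels out of a word and checks whether a character belongs to the Gurmukhi Unicode block. The vowel list is copied from the output above so that the snippet runs without CLTK installed; the helper names is_gurmukhi and independent_vowels_in are hypothetical and are not part of CLTK.

```python
# Copied from the output shown above so this sketch runs without CLTK.
INDEPENDENT_VOWELS = ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']


def is_gurmukhi(ch):
    """Return True if ch lies in the Gurmukhi Unicode block (U+0A00..U+0A7F)."""
    return 0x0A00 <= ord(ch) <= 0x0A7F


def independent_vowels_in(text):
    """Return the independent vowels found in text, in order of appearance."""
    return [ch for ch in text if ch in INDEPENDENT_VOWELS]


word = 'ਇਹ'  # begins with the independent vowel 'ਇ'
found = independent_vowels_in(word)
all_gurmukhi = all(is_gurmukhi(ch) for ch in word)
```

The same membership test would work with any of the other lists, for example DIGITS or CONSONANTS, once they are imported from the alphabet module.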


The numerifier functions convert English numbers into Punjabi and vice versa.

In [1]: from cltk.corpus.punjabi.numerifier import punToEnglish_number

In [2]: from cltk.corpus.punjabi.numerifier import englishToPun_number

In [3]: c = punToEnglish_number('੧੨੩੪੫੬੭੮੯੦')

In [4]: print(c)
Out[4]: 1234567890

In [5]: c = englishToPun_number(1234567890)

In [6]: print(c)
Out[6]: ੧੨੩੪੫੬੭੮੯੦
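To make the idea behind the conversion concrete, here is a minimal standalone sketch of one way such a digit mapping could be written with str.translate. It is not CLTK's implementation. The function names gurmukhi_to_int and int_to_gurmukhi are hypothetical, and the only fact it relies on is that the Gurmukhi digits occupy the code points U+0A66 (zero) through U+0A6F (nine).

```python
# Gurmukhi digits run from U+0A66 (zero) to U+0A6F (nine).
GURMUKHI_DIGITS = ''.join(chr(0x0A66 + i) for i in range(10))
ASCII_DIGITS = '0123456789'

# Translation tables for both directions.
_TO_ASCII = str.maketrans(GURMUKHI_DIGITS, ASCII_DIGITS)
_TO_GURMUKHI = str.maketrans(ASCII_DIGITS, GURMUKHI_DIGITS)


def gurmukhi_to_int(text):
    """Convert a string of Gurmukhi digits to a Python int."""
    return int(text.translate(_TO_ASCII))


def int_to_gurmukhi(number):
    """Convert a non-negative int to a string of Gurmukhi digits."""
    return str(number).translate(_TO_GURMUKHI)


as_int = gurmukhi_to_int('੧੨੩੪੫੬੭੮੯੦')
as_text = int_to_gurmukhi(1234567890)
round_trip = gurmukhi_to_int(as_text)
```

A translation table keeps the mapping in one place and leaves any non-digit characters untouched, so the same table could also be applied to running text that mixes words and numbers.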

Stopword Filtering

To use the CLTK's built-in stopwords list:

In [1]: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex

In [2]: from cltk.stop.punjabi.stops import STOPS_LIST

In [3]: sample = "ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।"

In [4]: x = indian_punctuation_tokenize_regex(sample)

In [5]: print(x)
Out[5]: ['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਦੀ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਵਾਲੀ', 'ਭਾਸ਼ਾ', 'ਹੈ', '।']

In [6]: lis = [w for w in x if w not in STOPS_LIST]

In [7]: print(lis)
Out[7]: ['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਭਾਸ਼ਾ', '।']
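Note that the danda ('।') survives the filtering, because it is punctuation rather than a stopword. The sketch below shows the same tokenize-then-filter pattern in a form that runs without CLTK and can optionally drop punctuation as well. Everything in it is an assumption made for illustration: DEMO_STOPS is a tiny hypothetical stop list (only the three words removed in the example above), and tokenize and remove_stops are hypothetical helpers, not CLTK functions.

```python
import re

# Hypothetical mini stop list for illustration only; the STOPS_LIST
# shipped with CLTK contains many more entries.
DEMO_STOPS = {'ਦੀ', 'ਵਾਲੀ', 'ਹੈ'}

# Danda (U+0964) and double danda (U+0965) plus common ASCII punctuation.
PUNCTUATION = '\u0964\u0965,.;:?!'

# A token is either a run of characters that are neither whitespace nor
# punctuation, or a single punctuation mark. The pattern avoids \w because
# \w does not match combining vowel signs and would split words apart.
TOKEN_RE = re.compile('[^\\s{0}]+|[{0}]'.format(re.escape(PUNCTUATION)))


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


def remove_stops(tokens, stops=DEMO_STOPS, drop_punctuation=False):
    """Filter out stopwords and, optionally, punctuation tokens."""
    kept = [t for t in tokens if t not in stops]
    if drop_punctuation:
        kept = [t for t in kept if t not in set(PUNCTUATION)]
    return kept


sample = "ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।"
tokens = tokenize(sample)
filtered = remove_stops(tokens)
words_only = remove_stops(tokens, drop_punctuation=True)
```

Using a set for the stop list makes each membership test constant time, which matters once the list grows beyond a handful of words.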