Importing Corpora

The CLTK stores all data in the local directory cltk_data, which is created in the user’s home directory when the CorpusImporter() class is first initialized. Inside it are an originals directory, which preserves untouched copies of downloaded or copied files, and one directory for every language for which a corpus has been downloaded. It also contains cltk.log, which collects all CLTK logging.
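The layout described above can be inspected with a few lines of standard-library Python. This is only a minimal sketch: the helper name summarize_cltk_data is hypothetical and not part of the CLTK, and the demonstration runs against a throwaway directory that imitates the layout rather than a real installation.

```python
import tempfile
from pathlib import Path


def summarize_cltk_data(data_dir):
    """Report which of the expected entries exist under a cltk_data directory.

    Hypothetical helper for illustration; it is not part of the CLTK.
    """
    data_dir = Path(data_dir)
    if data_dir.is_dir():
        # Every sub-directory other than "originals" is assumed to be a language.
        language_dirs = sorted(
            p.name for p in data_dir.iterdir()
            if p.is_dir() and p.name != "originals"
        )
    else:
        language_dirs = []
    return {
        "exists": data_dir.is_dir(),
        "has_originals": (data_dir / "originals").is_dir(),
        "has_log": (data_dir / "cltk.log").is_file(),
        "language_dirs": language_dirs,
    }


# Build a temporary directory that mimics the described layout, then summarize it.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "cltk_data"
    (demo_root / "originals").mkdir(parents=True)
    (demo_root / "latin").mkdir()
    (demo_root / "cltk.log").touch()
    demo_summary = summarize_cltk_data(demo_root)

print(demo_summary)
```

To check a real installation, the same function could be called with Path.home() / "cltk_data".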

Listing corpora

To see all of the corpora available for importing, use list_corpora. As the session below shows, it is accessed as an attribute, without parentheses.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('greek')  # e.g., or CorpusImporter('latin')

In [3]: corpus_importer.list_corpora

Out[3]:
['greek_software_tlgu',
 'greek_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'greek_models_cltk',
 'greek_treebank_perseus',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius']
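Because list_corpora yields an ordinary Python list of names, it can be filtered like any other list. The sketch below assumes, based only on the names shown above, that most corpus names are underscore-separated and include a word for the kind of resource (text, models, treebank, and so on). The helper corpora_matching is hypothetical, and the list is copied from the output above so that the example runs without the CLTK installed.

```python
def corpora_matching(names, keyword):
    """Return the corpus names that contain `keyword` as an underscore-separated part.

    Hypothetical helper for illustration; it is not part of the CLTK.
    """
    return [name for name in names if keyword in name.split("_")]


# Names copied from the Out[3] listing above.
available = [
    'greek_software_tlgu',
    'greek_text_perseus',
    'phi7',
    'tlg',
    'greek_proper_names_cltk',
    'greek_models_cltk',
    'greek_treebank_perseus',
    'greek_lexica_perseus',
    'greek_training_set_sentence_cltk',
    'greek_word2vec_cltk',
    'greek_text_lacus_curtius',
]

text_corpora = corpora_matching(available, 'text')
print(text_corpora)
```

In a live session, corpus_importer.list_corpora could be passed in place of the hard-coded list.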

Importing a corpus

To download a remote corpus, pass its name to import_corpus(). The following example downloads the Latin Library.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('latin')  # e.g., or CorpusImporter('greek')

In [3]: corpus_importer.import_corpus('latin_text_latin_library')
Downloaded 100%, 35.53 MiB | 3.28 MiB/s

For a local corpus, such as the TLG, you must give the filepath to the corpus as a second argument, e.g.:

In [4]: corpus_importer.import_corpus('tlg', '~/Documents/corpora/TLG_E/')
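When importing from a local path, it can help to confirm that the directory exists before handing it to import_corpus(). The following is a minimal sketch using only the standard library. The helper resolve_local_corpus_path is hypothetical and not part of the CLTK, and whether the CLTK itself expands ~ in the path is not assumed here.

```python
from pathlib import Path


def resolve_local_corpus_path(raw_path):
    """Expand a leading ~ and confirm the path is an existing directory.

    Hypothetical helper for illustration; it is not part of the CLTK.
    """
    path = Path(raw_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Local corpus directory not found: {path}")
    return path.resolve()


# Intended use (commented out because it requires the CLTK and a local TLG copy):
# tlg_path = resolve_local_corpus_path('~/Documents/corpora/TLG_E/')
# corpus_importer.import_corpus('tlg', str(tlg_path))
```

Failing early with a clear message separates a mistyped path from a problem inside the import itself.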

User-defined, distributed corpora

Most users will want to use the CLTK’s publicly available corpora. However, users can also import any repository that is hosted on a Git server. The benefit is that users can work with corpora that the CLTK organization is not able to distribute itself (for example, because they are too specialized or carry license restrictions).

Let’s say a user wants to keep a particular Git-backed corpus at git@github.com:kylepjohnson/latin_corpus_newton_example.git. It can be cloned into the ~/cltk_data/ directory by declaring it in a manually created YAML file at ~/cltk_data/distributed_corpora.yaml like the following:

example_distributed_latin_corpus:
    origin: https://github.com/kylepjohnson/latin_corpus_newton_example.git
    language: latin
    type: text

example_distributed_greek_corpus:
    origin: https://github.com/kylepjohnson/a_nonexistent_repo.git
    language: pali
    type: treebank

Each block defines a separate corpus. The first line of a block (e.g., example_distributed_latin_corpus) gives the custom corpus its unique name. The first example block above would allow a user to fetch the repo and install it at ~/cltk_data/latin/text/latin_corpus_newton_example, that is, under the directories named by the block’s language and type values.
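The relationship between a block and its install location can be sketched as follows. This only mirrors the pattern visible in the example path (language, then type, then repository name) and is not taken from the CLTK's own code. The function distributed_corpus_destination is hypothetical, and the block is written as a plain dictionary so that no YAML library is needed.

```python
from pathlib import Path


def distributed_corpus_destination(entry, data_dir):
    """Build <data_dir>/<language>/<type>/<repo name> for one corpus block.

    Hypothetical helper for illustration; it is not part of the CLTK.
    """
    # Take the last URL segment and drop a trailing ".git", if present.
    repo_name = entry["origin"].rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    return Path(data_dir) / entry["language"] / entry["type"] / repo_name


# The first block from the YAML example above, expressed as a dictionary.
latin_entry = {
    "origin": "https://github.com/kylepjohnson/latin_corpus_newton_example.git",
    "language": "latin",
    "type": "text",
}

destination = distributed_corpus_destination(latin_entry, Path.home() / "cltk_data")
print(destination)
```

Under these assumptions, the printed path ends in cltk_data/latin/text/latin_corpus_newton_example, matching the location given above.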