Middle High German

Middle High German (abbreviated MHG, German: Mittelhochdeutsch, abbr. Mhd.) is the term for the form of German spoken in the High Middle Ages. It is conventionally dated between 1050 and 1350, developing from Old High German and into Early New High German. High German is defined as those varieties of German which were affected by the Second Sound Shift; the Middle Low German and Middle Dutch languages spoken to the North and North West, which did not participate in this sound change, are not part of MHG. (Source: Wikipedia)

ASCII Encoding

Using the Word class, you can easily convert a string to its ASCII encoding, essentially stripping it of its diacritics.

In [1]: from cltk.phonology.middle_high_german.transcription import Word

In [2]: w = Word("vogellîn")

In [3]: w.ASCII_encoding()
Out[3]: 'vogellin'
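To illustrate what stripping diacritics involves, here is a minimal standard-library sketch. It is not CLTK's implementation: the function name strip_diacritics is hypothetical, and letters that have no Unicode decomposition (such as æ or ȥ) pass through unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    # NFD splits a letter such as "î" into "i" plus a combining circumflex.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("vogellîn"))
```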

Stemming

Note

The stemming algorithm is still under development and can sometimes produce inaccurate results.

CLTK’s stemming function attempts to reduce inflected words to their stems by suffix stripping.

In [1]: from cltk.stem.middle_high_german.stem import stemmer_middle_high_german

In [2]: stemmer_middle_high_german("Man lūte dā zem münster nāch gewoneheit")
Out[2]: ['Man', 'lut', 'dâ', 'zem', 'munst', 'nâch', 'gewoneheit']

The stemmer strips umlauts by default. To turn this off, set rem_umlauts = False:

In [3]: stemmer_middle_high_german("Man lūte dā zem münster nāch gewoneheit", rem_umlauts = False)
Out[3]: ['Man', 'lût', 'dâ', 'zem', 'münst', 'nâch', 'gewoneheit']

The stemmer can also take a user-defined dictionary as an optional parameter (exceptions), mapping a word to the form that should be returned for it:

In [4]: stemmer_middle_high_german("swaȥ kriuchet unde fliuget und bein zer erden biuget", rem_umlauts = False)
Out[4]: ['swaȥ', 'kriuchet', 'unde', 'fliuget', 'und', 'bein', 'zer', 'erden', 'biuget']

In [5]: stemmer_middle_high_german("swaȥ kriuchet unde fliuget und bein zer erden biuget", rem_umlauts = False, exceptions = {"biuget" : "biegen"})
Out[5]: ['swaȥ', 'kriuchet', 'unde', 'fliuget', 'und', 'bein', 'zer', 'erden', 'biegen']
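The general idea of suffix stripping with an exceptions dictionary can be sketched as follows. This is a toy illustration, not the CLTK stemmer: the suffix list, the minimum stem length, and the function name toy_stem are all assumptions chosen for the example, so its results will not always match the output shown above.

```python
# Hypothetical suffix inventory, for illustration only.
SUFFIXES = ("en", "er", "es", "et", "e")


def toy_stem(token, exceptions=None, min_stem=3):
    """Return a crude stem: exceptions win, otherwise strip the longest suffix."""
    if exceptions and token in exceptions:
        return exceptions[token]
    # Try longer suffixes first so "er" is preferred over a bare "e".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


print(toy_stem("münster"))
print(toy_stem("biuget", exceptions={"biuget": "biegen"}))
```

Checking the exceptions dictionary before any stripping is what lets a user override the rule-based result for irregular forms.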

Syllabification

A syllabifier is contained in the Word module:

In [1]: from cltk.phonology.middle_high_german.transcription import Word

In [2]: Word('entslâfen').syllabify()
Out[2]: ['ent', 'slâ', 'fen']

Note that the syllabifier is case-insensitive:

In [3]: Word('Fröude').syllabify()
Out[3]: ['fröu', 'de']

You can also load the sonority values of MHG phonemes into the general phonology syllabifier:

In [4]: from cltk.phonology.syllabify import Syllabifier

In [5]: s = Syllabifier(language='middle high german')

In [6]: s.syllabify('lobebæren')
Out[6]: ['lo', 'be', 'bæ', 'ren']

Stopword Filtering

CLTK offers a built-in stop word list for Middle High German.

In [1]: from cltk.stop.middle_high_german.stops import STOPS_LIST

In [2]: from cltk.tokenize.word import WordTokenizer

In [3]: word_tokenizer = WordTokenizer('middle_high_german')

In [4]: sentence = "Wol mich lieber mære diu ich hān vernomen daȥ der winter swære welle ze ende komen"

In [5]: tokens = word_tokenizer.tokenize(sentence.lower())

In [6]: [word for word in tokens if word not in STOPS_LIST]
Out[6]: ['lieber', 'mære', 'hān', 'vernomen', 'winter', 'swære', 'welle', 'komen']

Text Normalization

Text normalization attempts to narrow the discrepancies between various corpora.

Lowercase Conversion

By default, the normalizer converts the whole string to lowercase. However, since uppercase in MHG is used only at the start of a sentence or to mark proper names, you can instead set to_lower_all = False and to_lower_beginning = True to lowercase only the word at the beginning of each sentence.

In [1]: from cltk.corpus.middle_high_german.alphabet import normalize_middle_high_german

In [2]: normalize_middle_high_german("Dô erbiten si der nahte und fuoren über Rîn")
Out[2]: 'dô erbiten si der nahte und fuoren über rîn'

In [3]: normalize_middle_high_german("Dô erbiten si der nahte und fuoren über Rîn", to_lower_all = False, to_lower_beginning = True)
Out[3]: 'dô erbiten si der nahte und fuoren über Rîn'
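Lowercasing only sentence-initial words can be approximated with a regular expression. The sketch below is an assumption about how such a step might work, not CLTK's code: the helper name lower_sentence_starts is hypothetical, and a sentence is taken to begin at the start of the text or after ".", "!" or "?".

```python
import re


def lower_sentence_starts(text):
    """Hypothetical helper: lowercase the first letter of each sentence only."""
    return re.sub(
        r"(^\s*|[.!?]\s+)(\w)",
        # Keep the boundary (group 1) and lowercase the letter after it.
        lambda m: m.group(1) + m.group(2).lower(),
        text,
    )


print(lower_sentence_starts("Dô erbiten si der nahte und fuoren über Rîn"))
```

Capitalized words inside a sentence, such as Rîn, are left untouched.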

Alphabet Conversion

Various online corpora use the characters ā, ō, ū, ē, ī to represent â, ô, û, ê and î respectively. Sometimes, ae and oe are also used instead of æ and œ. By default, the normalizer converts the text to the canonical form.

In [4]: normalize_middle_high_german("Mit ūf erbürten schilden in was ze strīte nōt", alpha_conv = True)
Out[4]: 'mit ûf erbürten schilden in was ze strîte nôt'
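The macron-to-circumflex mapping described above can be sketched with a translation table. This is an illustrative stand-in rather than the normalizer's actual code: the name to_circumflex is hypothetical, only the five lowercase vowels listed above are covered, and the ae/oe digraphs are deliberately left alone.

```python
import unicodedata

# Mapping taken from the description above: macron vowels to circumflex vowels.
MACRON_TO_CIRCUMFLEX = str.maketrans({
    "\u0101": "\u00e2",  # ā -> â
    "\u014d": "\u00f4",  # ō -> ô
    "\u016b": "\u00fb",  # ū -> û
    "\u0113": "\u00ea",  # ē -> ê
    "\u012b": "\u00ee",  # ī -> î
})


def to_circumflex(text):
    """Hypothetical helper: rewrite macron vowels as circumflex vowels."""
    # NFC first, so a vowel followed by a combining macron becomes one code point.
    return unicodedata.normalize("NFC", text).translate(MACRON_TO_CIRCUMFLEX)


print(to_circumflex("Mit ūf erbürten schilden in was ze strīte nōt"))
```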

Punctuation

Punctuation is also handled by the normalizer.

In [5]: normalize_middle_high_german("Si sprach: ‘herre Sigemunt, ir sult iȥ lāȥen stān", punct = True)
Out[5]: 'si sprach herre sigemunt ir sult iȥ lâȥen stân'
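Punctuation removal of this kind can be sketched with Unicode character categories. The helper below is hypothetical and only approximates what a normalizer might do: it drops every character whose Unicode category starts with "P" (which includes typographic quotes such as ‘) and then collapses the remaining whitespace.

```python
import unicodedata


def strip_punctuation(text):
    """Hypothetical helper: remove Unicode punctuation and tidy whitespace."""
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # split()/join collapses any runs of whitespace left behind.
    return " ".join(kept.split())


print(strip_punctuation("Si sprach: ‘herre Sigemunt, ir sult iȥ lāȥen stān"))
```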

Phonetic Indexing

Phonetic indexing helps identify and process homophones.

Soundex

The Word class provides a variant of the Soundex algorithm adapted for MHG.

In [1]: from cltk.phonology.middle_high_german.transcription import Word

In [2]: w1 = Word("krippe")

In [3]: w1.phonetic_indexing(p = "SE")
Out[3]: 'K510'

In [4]: w2 = Word("krîbbe")

In [5]: w2.phonetic_indexing(p = "SE")
Out[5]: 'K510'
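For readers unfamiliar with Soundex, the sketch below implements the classic (American) algorithm using only the standard library. It is not the MHG variant provided by the Word class, so the codes it produces differ from those shown above; it only illustrates how similar-sounding spellings collapse to the same key. Non-ASCII letters are simply skipped, which is a simplification made for this example.

```python
# Classic Soundex consonant classes.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Classic Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("krippe") == soundex("kribbe"))  # → True
```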

Transliteration

CLTK’s transcriber rewrites a word into the International Phonetic Alphabet (IPA). As of this version, the Transcriber class doesn’t support any specific dialects and serves as a superset encompassing various regional accents.

In [1]: from cltk.phonology.middle_high_german.transcription import Transcriber

In [2]: tr = Transcriber()

In [3]: tr.transcribe("Slâfest du, friedel ziere?", punctuation = True)
Out[3]: '[Slɑːfest d̥ʊ, frɪ͡əd̥el t͡sɪ͡əre?]'

In [4]: tr.transcribe("Slâfest du, friedel ziere?", punctuation = False)
Out[4]: '[Slɑːfest d̥ʊ frɪ͡əd̥el t͡sɪ͡əre]'

Word Tokenization

The WordTokenizer class takes a string as input and returns a list of tokens.

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('middle_high_german')

In [3]: text = "Mīn ougen   wurden liebes alsō vol, \n\n\ndō ich die minneclīchen ērst gesach,\ndaȥ eȥ mir hiute und   iemer mē tuot wol."

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['Mīn', 'ougen', 'wurden', 'liebes', 'alsō', 'vol', ',', 'dō', 'ich', 'die', 'minneclīchen', 'ērst', 'gesach', ',', 'daȥ', 'eȥ', 'mir', 'hiute', 'und', 'iemer', 'mē', 'tuot', 'wol', '.']
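A rough standard-library approximation of this behaviour is a regular expression that treats runs of word characters as tokens and each punctuation mark as its own token. This is a sketch under that assumption, not the WordTokenizer implementation; simple_tokenize is a hypothetical name.

```python
import re
import unicodedata

# A token is either a run of word characters or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    # NFC keeps letters such as "ī" as single code points that \w matches.
    return TOKEN_PATTERN.findall(unicodedata.normalize("NFC", text))


print(simple_tokenize("Mīn ougen   wurden liebes alsō vol, \n\n\ndō ich"))
```

Because whitespace is never part of a token, repeated spaces and line breaks are discarded automatically.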

Lemmatization

The CLTK offers a series of lemmatizers that can be combined in a backoff chain: if one lemmatizer is unable to return a headword for a token, the token is passed on to the next lemmatizer until either a headword is returned or the chain ends. The generic backoff Middle High German lemmatizer requires data from the CLTK Middle High German models. The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.

To use the generic version of the backoff Middle High German Lemmatizer:

In [1]: from cltk.lemmatize.middle_high_german.backoff import BackoffMHGLemmatizer

In [2]: lemmatizer = BackoffMHGLemmatizer()

In [3]: tokens = "uns ist in alten mæren".split(" ")

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('uns', {'uns', 'unser', 'unz', 'wir'}), ('ist', {'sîn/wider(e)+', 'ist', 'sîn/inne+', 'sîn/mit(e)<+', 'sîn/vür(e)+', 'sîn/abe+', 'sîn/obe+', 'sîn/vor(e)+', 'sîn/vür(e)>+', 'sîn/ûze+', 'sîn/ûz+', 'sîn/bî<.+', 'sîn/vür(e)<+', 'sîn/innen+', 'sîn/âne+', 'sîn/bî+', 'sîn/ûz<+', 'sîn', 'sîn/ûf<.+'}), ('in', {'ër', 'in/hin(e)+', 'in/>+gân', 'in/+gân', 'în/+gân', 'in/+lâzen', 'în', 'in/<.+wintel(e)n', 'in/>+rinnen', 'in/dar(e)+', 'in/.>+slîzen', 'în/hin(e)+', 'în/+lèiten', 'în/+var(e)n', 'in', 'in/>+tragen', 'in/+tropfen', 'în/+lègen', 'in/>+winten', 'în/+brèngen', 'in/>+büègen', 'ërr', 'în/+zièhen', 'in/<.+gân', 'in/+zièhen', 'in/>+tûchen', 'dër', 'în/dâr+', 'in/war(e).+', 'in/<.+lâzen', 'in/>+rîten', 'în/+lâzen', 'in/>+lâzen', 'in/+stapfen', 'în/+sènten', 'in/>.+lâzen', 'in/>+stân', 'in/+drücken', 'in/>+ligen', 'in/dâr+ ', 'in/+var(e)n', 'in/+vüèren', 'in/<.+vallen', 'in/>+vlièzen', 'in/<.+rîten', 'in/hër(e).+', 'ne', 'in/>+wonen', 'in/<.+sigel(e)n', 'in/+lègen', 'în/+dringen', 'in/>+ge-trîben', 'in/+diènen', 'in/>+ge-stëchen', 'in/>+stècken', 'in/hër(e)+', 'in/>+stëchen', 'in/dâr+', 'in/+blâsen', 'în/dâr.+', 'in/>+wîsen', 'în/+îlen', 'in/>+laden', 'în/+komen', 'în/+ge-lèiten', 'in/<.+vloèzen', 'ër ', 'in/>+sètzen', 'in/hièr+', 'in/>+bûwen', 'in/>+lèiten', 'în/+ge-binten', '[!]', 'în/+trîben', 'in/<.+blâsen', 'in/+komen', 'în/+krièchen', 'in/+trîben', 'in/<.+ligen', 'in/+stëchen', 'in/<+gân', 'in/dâr.+', 'în/hër(e)+', 'in/+kêren', 'in/<.+var(e)n', 'in/+rîten', 'in/>+vallen', 'in/<.+vüèren'}), ('alten', {'alt', 'alter', 'alten'}), ('mæren', {'mæren', 'mære'})]
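The backoff idea itself can be sketched in a few lines of plain Python. The example below is a conceptual illustration only: the function names and the tiny lexicons are hypothetical, and unlike the CLTK lemmatizer, which returns a set of candidate headwords per token, this sketch returns a single lemma.

```python
def make_dict_lemmatizer(lexicon):
    """Build a lemmatizer that returns None when the token is unknown."""
    def lemmatize(token):
        return lexicon.get(token)
    return lemmatize


def identity_lemmatizer(token):
    """Last resort: return the token itself as its lemma."""
    return token


def backoff_lemmatize(tokens, chain):
    """Try each lemmatizer in order; stop at the first that returns a lemma."""
    results = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break
        results.append((token, lemma))
    return results


# Hypothetical toy lexicons standing in for real model data.
primary = make_dict_lemmatizer({"ist": "sîn"})
secondary = make_dict_lemmatizer({"alten": "alt", "mæren": "mære"})
chain = [primary, secondary, identity_lemmatizer]

print(backoff_lemmatize("uns ist in alten mæren".split(), chain))
```

Ending the chain with an identity lemmatizer guarantees that every token receives some value, at the cost of returning unlemmatized forms for unknown words.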

POS Tagging

In [1]: from cltk.tag.pos import POSTag

In [2]: mhg_pos_tagger = POSTag("middle_high_german")

In [3]: mhg_pos_tagger.tag_tnt("uns ist in alten mæren wunders vil geseit")
Out[3]: [('uns', 'PPER'), ('ist', 'VAFIN'), ('in', 'APPR'), ('alten', 'ADJA'), ('mæren', 'ADJA'),
         ('wunders', 'NA'), ('vil', 'AVD'), ('geseit', 'VVPP')]