8.1.1. cltk.alphabet package¶
Modules for accessing the alphabets and character sets of in-scope CLTK languages.
8.1.1.1. Subpackages¶
8.1.1.2. Submodules¶
8.1.1.3. cltk.alphabet.ang module¶
The Old English alphabet.
>>> from cltk.alphabet import ang
>>> ang.DIGITS[:5]
['ān', 'tƿeġen', 'þrēo', 'fēoƿer', 'fīf']
>>> ang.DIPHTHONGS[:5]
['ea', 'eo', 'ie']
8.1.1.4. cltk.alphabet.arb module¶
The Arabic alphabet.
>>> from cltk.alphabet import arb
>>> arb.LETTERS[:5]
('ا', 'ب', 'ت', 'ة', 'ث')
>>> arb.PUNCTUATION_MARKS
['،', '؛', '؟']
>>> arb.ALEF
'ا'
>>> arb.WEAK
('ا', 'و', 'ي', 'ى')
8.1.1.5. cltk.alphabet.arc module¶
The Imperial Aramaic alphabet, plus simple script to transform a Hebrew transcription of an Imperial Aramaic text to its own Unicode block.
TODO: Add Hebrew-to-Aramaic converter
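Such a converter could be sketched as a code-point mapping: the 22 Hebrew base letters map one-to-one, in the shared abjad order, onto the Imperial Aramaic block (U+10840–U+10855), with the Hebrew final forms folded into their base letters. The function name and mapping strategy below are assumptions for illustration, not CLTK API:

```python
# Sketch of a Hebrew-to-Imperial-Aramaic transliterator (not part of CLTK's API).
# The 22 Hebrew base letters map, in shared abjad order, onto U+10840-U+10855.
HEBREW_BASE = "אבגדהוזחטיכלמנסעפצקרשת"
FINAL_TO_BASE = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}

TABLE = {ord(heb): chr(0x10840 + i) for i, heb in enumerate(HEBREW_BASE)}
for final, base in FINAL_TO_BASE.items():
    TABLE[ord(final)] = TABLE[ord(base)]  # fold final forms into their base letters

def hebrew_to_imperial_aramaic(text: str) -> str:
    """Map Hebrew letters into the Imperial Aramaic block; leave other characters as-is."""
    return text.translate(TABLE)
```

Characters outside the mapping (spaces, punctuation, vowel points) pass through unchanged, which keeps the sketch safe on mixed input.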
8.1.1.6. cltk.alphabet.ben module¶
The Bengali alphabet.
>>> from cltk.alphabet import ben
>>> ben.VOWELS[:5]
['অ', 'আ', 'ই', 'ঈ', 'উ']
>>> ben.DEPENDENT_VOWELS[:5]
['◌া', 'ি', '◌ী', '◌ু', '◌ূ']
>>> ben.CONSONANTS[:5]
['ক', 'খ', 'গ', 'ঘ', 'ঙ']
8.1.1.7. cltk.alphabet.egy module¶
Convert MdC transliterated text to Unicode.
-
cltk.alphabet.egy.
mdc_unicode
(string, q_kopf=True)[source]¶ Convert MdC transliterated text to Unicode. The transliterated text is passed to the function as string, and the relevant search-and-replace operations are applied to its characters. If the q_kopf parameter is False, 'q' is replaced with 'ḳ'.
- Parameters
string (str) – the transliterated text
q_kopf (bool) – if False, replace 'q' with 'ḳ'
- Returns
unicode_text (str)
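Of the replacement rules, only the q/ḳ one is spelled out in this entry; in isolation it could look like the following sketch (this is not the real mdc_unicode implementation, which handles many more signs):

```python
def q_kopf_rule(string: str, q_kopf: bool = True) -> str:
    """Illustrate only the documented q/kopf rule of mdc_unicode."""
    if not q_kopf:
        # With q_kopf=False, plain q is rendered as q with a dot below.
        string = string.replace("q", "ḳ")
    return string
```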
8.1.1.8. cltk.alphabet.enm module¶
The Middle English alphabet. Sources:
From Old English to Standard English, Dennis Freeborn
The consonant sounds of Middle English are categorized as follows:
Stops: ⟨/b/, /p/, /d/, /t/, /g/, /k/⟩
Fricatives and affricates: ⟨/ǰ/, /č/, /v/, /f/, /ð/, /θ/, /z/, /s/, /ž/, /š/, /c̹/, /x/, /h/⟩
Nasals: ⟨/m/, /n/, /ŋ/⟩
Lateral Resonants: ⟨/l/⟩
Medial Resonants: ⟨/r/, /y/, /w/⟩
Thorn (þ) was gradually replaced by the digraph “th”, while Eth (ð), which had already fallen out of use by the 14th century, was later replaced by “d”.
Wynn (ƿ) is the predecessor of “w”. Modern transliteration scripts usually replace it with “w” to avoid confusion with the strikingly similar p.
The vowel sounds in Middle English are divided into:
Long Vowels: ⟨/a:/, /e/, /e̜/, /i/, /ɔ:/, /o/, /u/⟩
Short Vowels: ⟨/a/, /ɛ/, /ɪ/, /ɔ/, /ʊ/, /ə/⟩
As established rules for ME orthography were effectively nonexistent, compiling a definite list of diphthongs is non-trivial. The following aims to compile a list of the most commonly-used diphthongs.
>>> from cltk.alphabet import enm
>>> enm.ALPHABET[:5]
['a', 'b', 'c', 'd', 'e']
>>> enm.CONSONANTS[:5]
['b', 'c', 'd', 'f', 'g']
-
cltk.alphabet.enm.
normalize_middle_english
(text, to_lower=True, alpha_conv=True, punct=True)[source]¶ Normalize a Middle English text string and return the normalized string.
- Parameters
text (str) – text to be normalized
to_lower (bool) – convert text to lowercase
alpha_conv (bool) – convert text to canonical form: æ -> ae, þ -> th, ð -> th, ȝ -> y at the beginning of a word, gh otherwise
punct (bool) – remove punctuation
>>> normalize_middle_english('Whan Phebus in the CraBbe had neRe hys cours ronne', to_lower=True)
'whan phebus in the crabbe had nere hys cours ronne'
>>> normalize_middle_english('I pray ȝow þat ȝe woll', alpha_conv=True)
'i pray yow that ye woll'
>>> normalize_middle_english("furst, to begynne:...", punct=True)
'furst to begynne'
- Return type
str
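The alpha_conv rules above amount to a few plain substitutions plus one position-sensitive rule for yogh. A simplified re-implementation, as a sketch rather than CLTK's own code:

```python
import re

def alpha_conv_sketch(text: str) -> str:
    """Apply the canonical-form substitutions described for alpha_conv."""
    text = text.replace("æ", "ae").replace("þ", "th").replace("ð", "th")
    text = re.sub(r"\bȝ", "y", text)   # yogh at the beginning of a word -> y
    return text.replace("ȝ", "gh")     # yogh elsewhere -> gh
```

The word-boundary anchor `\b` distinguishes word-initial yogh (ȝow -> yow) from medial yogh (niȝt -> night).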
8.1.1.9. cltk.alphabet.fro module¶
The normalizer aims to maximally reduce the variation between the orthography of texts written in the Anglo-Norman dialect and bring it in line with the “orthographe commune”. It is heavily inspired by Pope (1956). Spelling variation is not consistent enough to ensure the highest accuracy; the normalizer in its current form should therefore be used as a last resort.
The normalizer, word tokenizer, stemmer, lemmatizer, and list of stopwords for OF/MF were developed as part of Google Summer of Code 2017. A full write-up of this work can be found at: https://gist.github.com/nat1881/6f134617805e2efbe5d275770e26d350
References: Pope, M.K. 1956. From Latin to Modern French with Especial Consideration of Anglo-Norman. Manchester: MUP.
Anglo-French spelling variants normalized to “orthographe commune”, from M. K. Pope (1956):
word-final d - e.g. vertud vs vertu
use of <u> over <ou>
<eaus> for <eus>, <ceaus> for <ceus>
- triphthongs:
<iu> for <ieu>
<u> for <eu>
<ie> for <iee>
<ue> for <uee>
<ure> for <eure>
“epenthetic vowels” - e.g. averai for avrai
<eo> for <o>
<iw>, <ew> for <ieux>
final <a> for <e>
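Rules like these can be encoded as regular-expression substitutions applied in sequence. The rule set and function below are illustrative assumptions covering just two of the variants listed, not the actual CLTK normalizer:

```python
import re

# Two illustrative rules from the variant list (the full normalizer covers many more).
AN_RULES = [
    (re.compile(r"eaus\b"), "eus"),  # <eaus> normalized to <eus>
    (re.compile(r"ud\b"), "u"),      # word-final d, e.g. vertud -> vertu
]

def normalize_spelling(word: str) -> str:
    """Apply each Anglo-Norman variant rule in order."""
    for pattern, replacement in AN_RULES:
        word = pattern.sub(replacement, word)
    return word
```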
8.1.1.10. cltk.alphabet.gmh module¶
The alphabet for Middle High German. Source:
Schreibkonventionen des klassischen Mittelhochdeutschen, Simone Berchtold
The consonants of Middle High German are categorized as:
Stops: ⟨p t k/c/q b d g⟩
Affricates: ⟨pf/ph tz/z⟩
Fricatives: ⟨v f s ȥ sch ch h⟩
Nasals: ⟨m n⟩
Liquids: ⟨l r⟩
Semivowels: ⟨w j⟩
Misc. notes:
c is used only at the beginning of loanwords and is pronounced the same as k (e.g. calant, cappitain)
Double consonants are pronounced the same way as their corresponding letters in Modern Standard German (e.g. pp/p)
schl, schm, schn, schw are written in MHG as sl, sm, sn, sw
æ (also seen as ae), œ (also seen as oe) and iu denote the use of Umlaut over â, ô and û respectively
ȥ or ʒ is used in modern handbooks and grammars to indicate the s or s-like sound which arose from Germanic t in the High German consonant shift.
>>> from cltk.alphabet import gmh
>>> gmh.CONSONANTS[:5]
['b', 'd', 'g', 'h', 'f']
>>> gmh.VOWELS[:5]
['a', 'ë', 'e', 'i', 'o']
-
cltk.alphabet.gmh.
normalize_middle_high_german
(text, to_lower_all=True, to_lower_beginning=False, alpha_conv=True, punct=True, ascii=False)[source]¶ Normalize input string.
>>> from cltk.alphabet import gmh
>>> from cltk.languages.example_texts import get_example_text
>>> gmh.normalize_middle_high_german(get_example_text("gmh"))[:50]
'uns ist in alten\nmæren wunders vil geseit\nvon hele'
- Parameters
text (str) –
to_lower_beginning (bool) –
to_lower_all (bool) – convert whole text to lowercase
alpha_conv (bool) – convert alphabet to canonical form
punct (bool) – remove punctuation
ascii (bool) – return the ASCII form
- Returns
normalized text
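The ascii=True option presumably strips diacritics. One common way to produce such an ASCII form is NFKD decomposition followed by dropping the remaining non-ASCII code points; the following is only a sketch of that assumption, not the actual implementation:

```python
import unicodedata

def ascii_form_sketch(text: str) -> str:
    """Strip combining marks via NFKD, then drop whatever non-ASCII remains."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that letters without a canonical decomposition, such as æ, are dropped entirely by this approach rather than transliterated.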
8.1.1.11. cltk.alphabet.guj module¶
The Gujarati alphabet.
>>> from cltk.alphabet import guj
>>> guj.VOWELS[:5]
['અ', 'આ', 'ઇ', 'ઈ', 'ઉ']
>>> guj.CONSONANTS[:5]
['ક', 'ખ', 'ગ', 'ઘ', 'ચ']
8.1.1.12. cltk.alphabet.hin module¶
The Hindi alphabet.
>>> from cltk.alphabet import hin
>>> hin.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> hin.CONSONANTS[:5]
['क', 'ख', 'ग', 'घ', 'ङ']
>>> hin.SONORANT_CONSONANTS
['य', 'र', 'ल', 'व']
8.1.1.13. cltk.alphabet.kan module¶
The Kannada alphabet. The characters can be divided into 3 categories:
Swaras (Vowels): 13 in modern Kannada and 14 in Classical
Vyanjanas (Consonants): further divided into 2 categories:
Structured Consonants: 25
Unstructured Consonants: 9 in modern Kannada and 11 in Classical
Yogavaahakas (part vowel, part consonant): 2
Corresponding to each Swara and Yogavaahaka there is a symbol. Thus consonant + vowel symbol = Kagunita.
>>> from cltk.alphabet import kan
>>> kan.VOWELS[:5]
['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ']
>>> kan.STRUCTURED_CONSONANTS[:5]
['ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙ']
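The Kagunita composition described above is plain string concatenation in Unicode: a consonant letter followed by a dependent vowel sign renders as one syllable. A minimal illustration using code points from the Unicode Kannada block:

```python
# Consonant + dependent vowel sign = kagunita (rendered as one syllable).
KA = "\u0C95"        # KANNADA LETTER KA, ಕ
AA_SIGN = "\u0CBE"   # KANNADA VOWEL SIGN AA
kagunita = KA + AA_SIGN
print(kagunita)      # ಕಾ
```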
8.1.1.14. cltk.alphabet.lat module¶
Alphabet and text normalization for Latin.
8.1.1.15. cltk.alphabet.non module¶
Old Norse runes, Unicode block: 16A0–16FF. Source: Viking Language 1, Jessie L. Byock
TODO: Document and test better.
-
class
cltk.alphabet.non.
RunicAlphabetName
(value)[source]¶ Bases:
cltk.alphabet.non.AutoName
An enumeration.
-
elder_futhark
= 'elder_futhark'¶
-
younger_futhark
= 'younger_futhark'¶
-
short_twig_younger_futhark
= 'short_twig_younger_futhark'¶
-
-
class
cltk.alphabet.non.
Rune
(runic_alphabet, form, sound, transcription, name)[source]¶ Bases:
object
>>> Rune(RunicAlphabetName.elder_futhark, "ᚺ", "h", "h", "haglaz")
ᚺ
>>> Rune.display_runes(ELDER_FUTHARK)
['ᚠ', 'ᚢ', 'ᚦ', 'ᚨ', 'ᚱ', 'ᚲ', 'ᚷ', 'ᚹ', 'ᚺ', 'ᚾ', 'ᛁ', 'ᛃ', 'ᛇ', 'ᛈ', 'ᛉ', 'ᛊ', 'ᛏ', 'ᛒ', 'ᛖ', 'ᛗ', 'ᛚ', 'ᛜ', 'ᛟ', 'ᛞ']
-
class
cltk.alphabet.non.
Transcriber
[source]¶ Bases:
object
>>> little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"
>>> Transcriber.transcribe(little_jelling_stone, YOUNGER_FUTHARK)
'᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫'
-
static
from_form_to_transcription
(runic_alphabet)[source]¶ Make a dictionary whose keys are the forms of runes and whose values are their transcriptions. Used by the transcribe method.
- Parameters
runic_alphabet (list) –
- Returns
dict
-
static
transcribe
(rune_sentence, runic_alphabet)[source]¶ From a runic inscription, the transcribe method gives a conventional transcription.
- Parameters
rune_sentence (str) – elements of this are from runic_alphabet or are punctuation
runic_alphabet (list) –
- Returns
a conventional transcription
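Under the hood, transcription amounts to a character-for-character dictionary lookup built from (form, transcription) pairs. A minimal sketch of the idea (the names below are illustrative, not the actual methods):

```python
# Sketch of the form -> transcription lookup built by from_form_to_transcription.
ELDER_SAMPLE = [("ᚠ", "f"), ("ᚢ", "u"), ("ᚦ", "þ")]  # (form, transcription) pairs

def transcribe_sketch(inscription: str, pairs) -> str:
    """Replace each known rune with its transcription, one character at a time."""
    mapping = dict(pairs)
    # Characters not in the alphabet (punctuation marks, parentheses) pass through.
    return "".join(mapping.get(ch, ch) for ch in inscription)
```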
8.1.1.16. cltk.alphabet.omr module¶
The alphabet for Marathi.
# Using the International Alphabet of Sanskrit Transliteration (IAST), these vowels are represented thus
>>> from cltk.alphabet import omr
>>> omr.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> omr.IAST_VOWELS[:5]
['a', 'ā', 'i', 'ī', 'u']
>>> list(zip(omr.SEMI_VOWELS, omr.IAST_SEMI_VOWELS))
[('य', 'y'), ('र', 'r'), ('ल', 'l'), ('व', 'w')]
8.1.1.17. cltk.alphabet.ory module¶
The Odia alphabet.
>>> from cltk.alphabet import ory
>>> ory.VOWELS["0B05"]
'ଅ'
>>> ory.STRUCTURED_CONSONANTS["0B15"]
'କ'
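As the examples show, the module's collections are keyed by the hexadecimal Unicode code point of each character, so a key can be checked directly against Python's chr:

```python
# The dict keys above are hex Unicode code points; verify one against chr().
code_point = "0B05"  # ORIYA LETTER A
letter = chr(int(code_point, 16))
print(letter)        # ଅ
```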
8.1.1.18. cltk.alphabet.ota module¶
The Ottoman alphabet.
Misc. notes:
Based on the Persian alphabet transliteration in CLTK by Iman Nazar
Uses UTF-8 Encoding for Ottoman/Persian Letters
When printed to a console, Arabic letters appear left to right and inconsistently linked, but they link and flow right to left correctly when pasted into a word processor. The problem exists only in the terminal.
TODO: Add tests
8.1.1.19. cltk.alphabet.oty module¶
Alphabet for Old Tamil. GRANTHA_CONSONANTS are from the Grantha script, which was used between the 6th and 20th centuries to write Sanskrit and the classical language Manipravalam.
TODO: Add tests
8.1.1.22. cltk.alphabet.processes module¶
This module holds the Process for normalizing text strings, usually before the text is sent to other processes.
-
class
cltk.alphabet.processes.
NormalizeProcess
(language: str = None)[source]¶ Bases:
cltk.core.data_types.Process
Generic process for text normalization.
-
language
: str = None¶
-
algorithm
¶
-
-
class
cltk.alphabet.processes.
GreekNormalizeProcess
(language: str = None)[source]¶ Bases:
cltk.alphabet.processes.NormalizeProcess
Text normalization for Ancient Greek.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "grc"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = GreekNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
-
language
: str = 'grc'¶
-
-
class
cltk.alphabet.processes.
LatinNormalizeProcess
(language: str = None)[source]¶ Bases:
cltk.alphabet.processes.NormalizeProcess
Text normalization for Latin.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "lat"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = LatinNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
True
-
language
: str = 'lat'¶
-
8.1.1.23. cltk.alphabet.san module¶
Data module for the Sanskrit language's alphabet and related characters.
8.1.1.25. cltk.alphabet.text_normalization module¶
Functions for preprocessing texts. Not language-specific.
-
cltk.alphabet.text_normalization.
remove_non_ascii
(input_string)[source]¶ Remove non-ASCII characters. Source: http://stackoverflow.com/a/1342373
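The linked Stack Overflow approach amounts to keeping only code points below 128; a minimal sketch of that idea (not necessarily CLTK's exact implementation):

```python
def remove_non_ascii_sketch(input_string: str) -> str:
    """Keep only ASCII code points (< 128), dropping everything else."""
    return "".join(ch for ch in input_string if ord(ch) < 128)
```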