sign_language_translator.languages.text package

Submodules

Module contents

Module that contains Text Language Processors as classes to clean up, tokenize and tag texts of various languages.

class sign_language_translator.languages.text.English

Bases: TextLanguage

NLP class for English text. Extends slt.languages.text.TextLanguage class.

English is a West Germanic language that is widely used as an international language in the 21st century. English uses the Latin script, which consists of 26 letters and is written from left to right. Each letter has two variants: uppercase (capital) and lowercase. See Unicode details at: https://unicode.org/charts/PDF/U0000.pdf

ALLOWED_CHARACTERS = {'\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}'}
ALPHABET: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
BRACKETS: List[str] = ['(', ')', '[', ']', '{', '}']
CHARACTER_MAP: Dict[str, str] = {'–': '-', '—': '-', '‘': "'", '’': "'", '“': '"', '”': '"', '…': '...'}
CHARACTER_TRANSLATOR = {8211: '-', 8212: '-', 8216: "'", 8217: "'", 8220: '"', 8221: '"', 8230: '...'}
END_OF_SENTENCE_MARKS: List[str] = ['.', '?', '!']
FULL_STOPS: List[str] = ['.']
NUMBER_REGEX = '\\d+(?:[\\.:]\\d+)*'
PUNCTUATION: List[str] = ['.', '?', '!', ',', ';', ':']
QUESTION_MARKS: List[str] = ['?']
QUOTES: List[str] = ['"', "'"]
SYMBOLS: List[str] = ['.', '?', '!', ',', ';', ':', '(', ')', '[', ']', '{', '}', '"', "'", '@', '#', '$', '%', '&', '*', '+', '<', '>', '=', '^', '|', '/', '-', '_']
UNALLOWED_CHARACTERS_REGEX = '[^tr\\\ni8!DOVchq\\+_E\\^KRYyNbS\\}\\(;\\|,\\{zJuf0:C\\[@Ho3e\\?P\\$skmId491/5\\-\\*ZQjw\\&F\\#7LvG\\]gnT6=p\'M\\)>\\ Wa2<l%Ax"U\\.BX]'
UNICODE_RANGE = (32, 126)
WORD_REGEX = '[^\\W_\\d]+'
classmethod allowed_characters() → Set[str]

Returns a set of all allowed characters in the language.
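As a rough illustration of how an allowed-character set can drive deletion, the sketch below builds a negated character class from a small set and strips everything else. The set and the helper name `delete_unallowed` are hypothetical; the library's own `UNALLOWED_CHARACTERS_REGEX` may be built differently.

```python
import re

# Hypothetical, much smaller set than the documented ALLOWED_CHARACTERS.
allowed = set("abcdefghijklmnopqrstuvwxyz .,!?-")

# Negated character class; re.escape guards characters such as '-' and '.'.
unallowed_regex = "[^" + "".join(re.escape(ch) for ch in sorted(allowed)) + "]"


def delete_unallowed(text: str) -> str:
    """Drop every character that is not in the allowed set."""
    return re.sub(unallowed_regex, "", text)


print(delete_unallowed("hello, wörld ✓!"))
# hello, wrld !
```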

delete_unallowed_characters(text: str) → str
detokenize(tokens: Iterable[str]) → str

Joins tokens back into text.

get_tags(tokens: str | Iterable[str]) → List[Any]

Get the classifications of all tokens in the form of a sequence of tags.

get_word_senses(tokens: str | Iterable[str]) → List[List[str]]

Get all known meanings of the ambiguous words.

static name() → str

Returns the name of the language used everywhere else in datasets.

preprocess(text: str) → str

Preprocesses text before tokenization: ensures that the same word is always written with the same Unicode characters and removes unnecessary symbols, spaces, etc.
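A minimal sketch of this kind of normalization, assuming it amounts to a character translation followed by whitespace cleanup. The mapping is copied from `CHARACTER_MAP` above; the helper `normalize` and the whitespace rule are assumptions, not the library's implementation.

```python
import re

# Same entries as the documented CHARACTER_MAP, written with escapes.
CHARACTER_MAP = {
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans turns the keys into code points, which yields the same
# dictionary as the documented CHARACTER_TRANSLATOR.
translator = str.maketrans(CHARACTER_MAP)


def normalize(text: str) -> str:
    """Map typographic variants to ASCII, then collapse runs of spaces/tabs."""
    text = text.translate(translator)
    return re.sub(r"[ \t]+", " ", text).strip()


print(normalize("\u201cHello\u201d  \u2013  world\u2026"))
# "Hello" - world...
```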

romanize(text: str, *args, add_diacritics=True, **kwargs) → str

Map characters to phonetically similar characters of the English language. Transliteration is useful for readability & simple text-to-speech. First maps (n>1)-grams, then unigrams.

ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html

Parameters:
  • text (str) – Non-English text to be mapped to Latin script.

  • add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.

  • character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping the Unicode code points of single characters to their Latin equivalents. Defaults to None.

  • n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams or longer sequences to their Latin equivalents. Keys are expected to be regular expressions. Defaults to None.

sentence_tokenize(text: str) → List[str]

Break text into sentences.
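One simple way to split sentences is to break on whitespace that follows an end-of-sentence mark. The sketch below does that with the `END_OF_SENTENCE_MARKS` values listed above; `split_sentences` is a hypothetical helper, and the library's method may handle abbreviations or other edge cases that this version does not.

```python
import re

END_OF_SENTENCE_MARKS = [".", "?", "!"]  # values documented for English

# Split on whitespace that is immediately preceded by one of the marks.
marks = "".join(re.escape(mark) for mark in END_OF_SENTENCE_MARKS)
SPLIT_REGEX = rf"(?<=[{marks}])\s+"


def split_sentences(text: str) -> list:
    """Return the sentences of `text`, keeping their closing marks."""
    return [part for part in re.split(SPLIT_REGEX, text.strip()) if part]


print(split_sentences("Hello there! How are you? Fine."))
# ['Hello there!', 'How are you?', 'Fine.']
```

Because the split requires whitespace after the mark, a number such as 3.14 stays intact, but an abbreviation such as "Dr. Smith" would still be split.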

tag(tokens: str | Iterable[str]) → List[Tuple[str, Any]]

Classify the tokens and mark them with appropriate tags.

classmethod token_regex() → str

Returns a regular expression that matches words in this language.

tokenize(text: str) → List[str]

Break apart text into words or phrases.
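The documented `WORD_REGEX` and `NUMBER_REGEX` can be combined into a basic tokenizer, as sketched below. The combined `TOKEN_REGEX` and the punctuation fallback are assumptions made for illustration; the library's `tokenize` may also join multi-word phrases or treat spaces differently.

```python
import re

WORD_REGEX = r"[^\W_\d]+"           # as documented: letters only
NUMBER_REGEX = r"\d+(?:[\.:]\d+)*"  # as documented: 12, 3.50, 10:30

# Try numbers first, then words, then any single non-space symbol (assumed).
TOKEN_REGEX = "|".join([NUMBER_REGEX, WORD_REGEX, r"[^\w\s]"])

tokens = re.findall(TOKEN_REGEX, "It costs 3.50 dollars at 10:30, right?")
print(tokens)
# ['It', 'costs', '3.50', 'dollars', 'at', '10:30', ',', 'right', '?']
```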

class sign_language_translator.languages.text.Hindi

Bases: TextLanguage

NLP class for Hindi text. Extends slt.languages.text.TextLanguage class.

Hindi is an Indo-Aryan language spoken mostly in India. Hindi uses the Devanagari script, which consists of 11 vowels and 33 consonants and is written from left to right. See Unicode details at: https://unicode.org/charts/PDF/U0900.pdf

ACRONYM_PERIODS: List[str] = ['॰']
ALLOWED_CHARACTERS: Set[str] = {'\n', ' ', '!', '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', '{', '}', 'ऀ', 'ँ', 'ं', 'ः', 'ऄ', 'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ऌ', 'ऍ', 'ऎ', 'ए', 'ऐ', 'ऑ', 'ऒ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'ऩ', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ऱ', 'ल', 'ळ', 'ऴ', 'व', 'श', 'ष', 'स', 'ह', 'ऺ', 'ऻ', '़', 'ऽ', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'ॄ', 'ॅ', 'ॆ', 'े', 'ै', 'ॉ', 'ॊ', 'ो', 'ौ', '्', 'ॎ', 'ॏ', 'ॐ', '॑', '॒', '॓', '॔', 'ॕ', 'ॖ', 'ॗ', 'क़', 'ख़', 'ग़', 'ज़', 'ड़', 'ढ़', 'फ़', 'य़', 'ॠ', 'ॡ', 'ॢ', 'ॣ', '।', '॥', '०', '१', '२', '३', '४', '५', '६', '७', '८', '९', '॰', 'ॱ', 'ॲ', 'ॳ', 'ॴ', 'ॵ', 'ॶ', 'ॷ', 'ॸ', 'ॹ', 'ॺ', 'ॻ', 'ॼ', 'ॽ', 'ॾ', 'ॿ'}
BRACKETS: List[str] = ['(', ')', '[', ']', '{', '}']
CHARACTERS: List[str] = ['ऀ', 'ँ', 'ं', 'ः', 'ऄ', 'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ऌ', 'ऍ', 'ऎ', 'ए', 'ऐ', 'ऑ', 'ऒ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'ऩ', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ऱ', 'ल', 'ळ', 'ऴ', 'व', 'श', 'ष', 'स', 'ह', 'ऺ', 'ऻ', '़', 'ऽ', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'ॄ', 'ॅ', 'ॆ', 'े', 'ै', 'ॉ', 'ॊ', 'ो', 'ौ', '्', 'ॎ', 'ॏ', 'ॐ', '॑', '॒', '॓', '॔', 'ॕ', 'ॖ', 'ॗ', 'क़', 'ख़', 'ग़', 'ज़', 'ड़', 'ढ़', 'फ़', 'य़', 'ॠ', 'ॡ', 'ॢ', 'ॣ', '।', '॥', '०', '१', '२', '३', '४', '५', '६', '७', '८', '९', '॰', 'ॱ', 'ॲ', 'ॳ', 'ॴ', 'ॵ', 'ॶ', 'ॷ', 'ॸ', 'ॹ', 'ॺ', 'ॻ', 'ॼ', 'ॽ', 'ॾ', 'ॿ']
CHARACTER_TO_DECOMPOSED: Dict[str, str] = {'क़': 'क़', 'ख़': 'ख़', 'ग़': 'ग़', 'ज़': 'ज़', 'ड़': 'ड़', 'ढ़': 'ढ़', 'फ़': 'फ़', 'य़': 'य़'}
CHARACTER_TRANSLATOR = {2392: 'क़', 2393: 'ख़', 2394: 'ग़', 2395: 'ज़', 2396: 'ड़', 2397: 'ढ़', 2398: 'फ़', 2399: 'य़'}
DIACRITICS = ['ऀ', 'ँ', 'ं', 'ः', 'ॄ', 'ॅ', '़', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'े', 'ै', 'ॉ', 'ो', 'ौ', '्']
END_OF_SENTENCE_MARKS: List[str] = ['.', '।', '॥', '?', '!']
FULL_STOPS: List[str] = ['.', '।', '॥']
NGRAM_ROMANIZATION_MAP = {'(?<=.क|.ख|.ग|.घ|घ़|.ङ|.च|.छ|.ज|.झ|.ञ|.ट|ट़|.ठ|.ड|.ढ|.ण|.त|.थ|.द|.ध|.न|.क़|.ख़|.ग़|.ज़|.ड़|.ढ़)ँ': 'n', '(?<=प|फ|फ़|ब|भ|म)ं': 'm', 'घ़': 'g̲̲h̲̲', 'ट़': 't̤', 'स़': 's̤', 'ह़': 'h̤'}[source]
NUMBER_REGEX = '\\d+(?:[\\.:]\\d+)*'
PUNCTUATION: List[str] = ['.', '।', '॥', '?', '!', '॰', ',', ';', ':']
QUESTION_MARKS: List[str] = ['?']
ROMANIZATION_CHARACTER_TRANSLATOR = {2305: 'm̐', 2306: 'n', 2307: 'ḥ', 2308: 'ĕ', 2309: 'a', 2310: 'ā', 2311: 'i', 2312: 'ī', 2313: 'u', 2314: 'ū', 2315: 'r', 2316: 'l', 2318: 'ĕ', 2319: 'e', 2320: 'ai', 2321: 'ô', 2322: 'ŏ', 2323: 'o', 2324: 'au', 2325: 'k', 2326: 'kh', 2327: 'g', 2328: 'gh', 2329: 'ngh', 2330: 'ch', 2331: 'chh', 2332: 'j', 2333: 'jh', 2334: 'ñ', 2335: 'ṭ', 2336: 'ṭh', 2337: 'ḍ', 2338: 'ḍh', 2339: 'ṇ', 2340: 't', 2341: 'th', 2342: 'd', 2343: 'dh', 2344: 'n', 2346: 'p', 2347: 'ph', 2348: 'b', 2349: 'bh', 2350: 'm', 2351: 'y', 2352: 'r', 2354: 'l', 2357: 'v', 2358: 'sh', 2359: 's', 2360: 's', 2361: 'h', 2365: "'", 2366: 'a', 2367: 'i', 2368: 'ī', 2369: 'u', 2370: 'ū', 2371: 'r', 2372: 'r̄', 2373: 'ê', 2374: 'ĕ', 2375: 'e', 2376: 'ai', 2377: 'ô', 2378: 'ŏ', 2379: 'o', 2380: 'au', 2381: '', 2392: 'q', 2393: 'k̲h̲', 2394: 'g̲h̲', 2395: 'z', 2396: 'ṛ', 2397: 'ṛh', 2398: 'f', 2400: 'r̄', 2404: '.', 2405: '.', 2406: '0', 2407: '1', 2408: '2', 2409: '3', 2410: '4', 2411: '5', 2412: '6', 2413: '7', 2414: '8', 2415: '9', 2416: '.', 2418: 'ê'}[source]
ROMANIZATION_MAP = {'ँ': 'm̐', 'ं': 'n', 'ः': 'ḥ', 'ऄ': 'ĕ', 'अ': 'a', 'आ': 'ā', 'इ': 'i', 'ई': 'ī', 'उ': 'u', 'ऊ': 'ū', 'ऋ': 'r', 'ऌ': 'l', 'ऎ': 'ĕ', 'ए': 'e', 'ऐ': 'ai', 'ऑ': 'ô', 'ऒ': 'ŏ', 'ओ': 'o', 'औ': 'au', 'क': 'k', 'ख': 'kh', 'ग': 'g', 'घ': 'gh', 'घ़': 'g̲̲h̲̲', 'ङ': 'ngh', 'च': 'ch', 'छ': 'chh', 'ज': 'j', 'झ': 'jh', 'ञ': 'ñ', 'ट': 'ṭ', 'ट़': 't̤', 'ठ': 'ṭh', 'ड': 'ḍ', 'ढ': 'ḍh', 'ण': 'ṇ', 'त': 't', 'थ': 'th', 'द': 'd', 'ध': 'dh', 'न': 'n', 'प': 'p', 'फ': 'ph', 'ब': 'b', 'भ': 'bh', 'म': 'm', 'य': 'y', 'र': 'r', 'ल': 'l', 'व': 'v', 'श': 'sh', 'ष': 's', 'स': 's', 'स़': 's̤', 'ह': 'h', 'ह़': 'h̤', 'ऽ': "'", 'ा': 'a', 'ि': 'i', 'ी': 'ī', 'ु': 'u', 'ू': 'ū', 'ृ': 'r', 'ॄ': 'r̄', 'ॅ': 'ê', 'ॆ': 'ĕ', 'े': 'e', 'ै': 'ai', 'ॉ': 'ô', 'ॊ': 'ŏ', 'ो': 'o', 'ौ': 'au', '्': '', 'क़': 'q', 'ख़': 'k̲h̲', 'ग़': 'g̲h̲', 'ज़': 'z', 'ड़': 'ṛ', 'ढ़': 'ṛh', 'फ़': 'f', 'ॠ': 'r̄', '।': '.', '॥': '.', '०': '0', '१': '1', '२': '2', '३': '3', '४': '4', '५': '5', '६': '6', '७': '7', '८': '8', '९': '9', '॰': '.', 'ॲ': 'ê'}[source]
ROMANIZATION_MAP_CONSONANTS_ASPIRATE = {'ह': 'h', 'ह़': 'h̤'}[source]
ROMANIZATION_MAP_CONSONANTS_CEREBRALS = {'ट': 'ṭ', 'ट़': 't̤', 'ठ': 'ṭh', 'ड': 'ḍ', 'ढ': 'ḍh', 'ण': 'ṇ', 'ड़': 'ṛ', 'ढ़': 'ṛh'}[source]
ROMANIZATION_MAP_CONSONANTS_DENTALS = {'त': 't', 'थ': 'th', 'द': 'd', 'ध': 'dh', 'न': 'n'}
ROMANIZATION_MAP_CONSONANTS_GUTTURALS = {'क': 'k', 'ख': 'kh', 'ग': 'g', 'घ': 'gh', 'घ़': 'g̲̲h̲̲', 'ङ': 'ngh', 'क़': 'q', 'ख़': 'k̲h̲', 'ग़': 'g̲h̲'}[source]
ROMANIZATION_MAP_CONSONANTS_LABIALS = {'प': 'p', 'फ': 'ph', 'ब': 'b', 'भ': 'bh', 'म': 'm', 'फ़': 'f'}[source]
ROMANIZATION_MAP_CONSONANTS_PALATAS = {'च': 'ch', 'छ': 'chh', 'ज': 'j', 'झ': 'jh', 'ञ': 'ñ', 'ज़': 'z'}[source]
ROMANIZATION_MAP_CONSONANTS_SEMIVOWELS = {'य': 'y', 'र': 'r', 'ल': 'l', 'व': 'v'}
ROMANIZATION_MAP_CONSONANTS_SIBILANTS = {'श': 'sh', 'ष': 's', 'स': 's', 'स़': 's̤'}[source]
ROMANIZATION_MAP_VOWELS_AND_DIPHTHONGS = {'ऄ': 'ĕ', 'अ': 'a', 'आ': 'ā', 'इ': 'i', 'ई': 'ī', 'उ': 'u', 'ऊ': 'ū', 'ऋ': 'r', 'ऌ': 'l', 'ऎ': 'ĕ', 'ए': 'e', 'ऐ': 'ai', 'ऑ': 'ô', 'ऒ': 'ŏ', 'ओ': 'o', 'औ': 'au', 'ा': 'a', 'ि': 'i', 'ी': 'ī', 'ु': 'u', 'ू': 'ū', 'ृ': 'r', 'ॄ': 'r̄', 'ॅ': 'ê', 'ॆ': 'ĕ', 'े': 'e', 'ै': 'ai', 'ॉ': 'ô', 'ॊ': 'ŏ', 'ो': 'o', 'ौ': 'au', 'ॠ': 'r̄', 'ॲ': 'ê'}[source]
SYMBOLS: List[str] = ['.', '।', '॥', '?', '!', '॰', ',', ';', ':', '(', ')', '[', ']', '{', '}', '-', '_', '/']
UNALLOWED_CHARACTERS_REGEX = '[^खड़भऒ8॥O_ॱEॻॢऽऎॎ॔;फदॼॸा:डब॰इ3ऺऻरओऴI9षळग़ढफ़औZठॗीॅऊॉङॿFँ१धॹजॽG०श\\]ंT6़ै॓ऩख़M>2अवॡ५य़<A३ऋूृुUBकॏझॳ\\\nॾ!४DVतेKRYन्ईॆढ़Nॊौ८S\\}\\(ॲ,ए\\{क़छJ0णऍ\\[Cऑऌॣ।ॶॖगH\\?ॵP९लॕऐॠॺ/4आ15\\-६ऀQघ॒ॴसटमॷ7Lःॐञ२य॑चज़\\ ऱ७ोWॄउथपहऄि\\)\\.X]'[source]
UNICODE_RANGE: Tuple[int, int] = (2304, 2431)
WORD_REGEX = '[^\\W_\\d]([^\\W_\\d]|[ऀँंःॄॅ़ािीुूृेैॉोौ्])*'[source]
classmethod allowed_characters() → Set[str]

Returns a set of all allowed characters in the language.

delete_unallowed_characters(text: str) → str
detokenize(tokens: Iterable[str]) → str

Joins tokens back into text.

get_tags(tokens: str | Iterable[str]) → List[Any]

Get the classifications of all tokens in the form of a sequence of tags.

get_word_senses(tokens: str | Iterable[str]) → List[List[str]]

Get all known meanings of the ambiguous words.

static name() → str

Returns the name of the language used everywhere else in datasets.

normalize_characters(text: str) → str
preprocess(text: str) → str

Preprocesses text before tokenization: ensures that the same word is always written with the same Unicode characters and removes unnecessary symbols, spaces, etc.
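The `CHARACTER_TRANSLATOR` table above maps the eight precomposed nukta letters (code points 2392 to 2399) to a base letter followed by the nukta sign. The sketch below rebuilds that table with explicit escapes and applies it with `str.translate`; the helper name `decompose_nukta` is hypothetical, and whether `normalize_characters` does anything beyond this is not stated in the documentation.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA

# Precomposed letter (code point) -> base letter + nukta.
CHARACTER_TRANSLATOR = {
    0x0958: "\u0915" + NUKTA,  # qa
    0x0959: "\u0916" + NUKTA,  # khha
    0x095A: "\u0917" + NUKTA,  # ghha
    0x095B: "\u091c" + NUKTA,  # za
    0x095C: "\u0921" + NUKTA,  # dddha
    0x095D: "\u0922" + NUKTA,  # rha
    0x095E: "\u092b" + NUKTA,  # fa
    0x095F: "\u092f" + NUKTA,  # yya
}


def decompose_nukta(text: str) -> str:
    """Replace precomposed nukta letters with their two-code-point form."""
    return text.translate(CHARACTER_TRANSLATOR)


precomposed = "\u095c"
decomposed = decompose_nukta(precomposed)
print(len(precomposed), len(decomposed))
# 1 2

# For these letters the result matches Unicode canonical decomposition.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
# True
```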

romanize(text: str, *args, add_diacritics=True, **kwargs) → str

Map Hindi characters to phonetically similar characters of the English language. Transliteration is useful for readability.

ALA-LC Romanization Table: https://www.loc.gov/catdir/cpso/romanization/hindi.pdf

Parameters:
  • text (str) – Hindi text to be mapped to Latin script.

  • add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. Defaults to True.

Examples:

import sign_language_translator as slt

nlp = slt.languages.text.Hindi()

text = "मैंने किताब खरीदी है।"
romanized_text = nlp.romanize(text)
print(romanized_text)
# 'mainne kitab khrīdī hai.'

text = "ईशांत शर्मा को उनकी शानदार गेंदबाजी के लिए १ प्लेयर ऑफ द मैच का अवॉर्ड दिया गया।"
text = nlp.preprocess(text)
romanized_text = nlp.romanize(text)
print(romanized_text)
# 'īshant shrma ko unkī shandar gendbajī ke lie 1 pleyr ôph d maich ka avôrḍ diya gya.'

sentence_tokenize(text: str) → List[str]

Break text into sentences.

tag(tokens: str | Iterable[str]) → List[Tuple[str, Any]]

Classify the tokens and mark them with appropriate tags.

classmethod token_regex() → str

Returns a regular expression that matches words in this language.

tokenize(text: str) → List[str]

Break apart text into words or phrases.

class sign_language_translator.languages.text.Tags(value)

Bases: Enum

Enumeration of token tags used in NLP processing.

ACRONYM = 'ACRONYM'
AMBIGUOUS = 'AMBIGUOUS'
DATE = 'DATE'
DEFAULT = ''
END_OF_SEQUENCE = 'EOS'
NAME = 'NAME'
NUMBER = 'NUMBER'
PUNCTUATION = 'PUNCTUATION'
SPACE = 'SPACE'
START_OF_SEQUENCE = 'SOS'
SUPPORTED_WORD = 'SUPPORTED_WORD'
TIME = 'TIME'
WORD = 'WORD'
WORDLESS = 'WORDLESS'
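
To show how such an enumeration might be paired with tokens, the sketch below defines a stand-in enum with four of the members listed above and a toy rule-based tagger. `SketchTags` and `tag_tokens` are hypothetical names; the rules are assumptions and do not reflect how the library assigns tags.

```python
import re
from enum import Enum


class SketchTags(Enum):
    """Stand-in that mirrors a few of the documented members and values."""

    NUMBER = "NUMBER"
    PUNCTUATION = "PUNCTUATION"
    WORD = "WORD"
    DEFAULT = ""


PUNCTUATION = {".", "?", "!", ",", ";", ":"}  # documented English values


def tag_tokens(tokens):
    """Pair each token with a tag chosen by simple regex rules (assumed)."""
    tagged = []
    for token in tokens:
        if re.fullmatch(r"\d+(?:[\.:]\d+)*", token):
            label = SketchTags.NUMBER
        elif token in PUNCTUATION:
            label = SketchTags.PUNCTUATION
        elif re.fullmatch(r"[^\W_\d]+", token):
            label = SketchTags.WORD
        else:
            label = SketchTags.DEFAULT
        tagged.append((token, label))
    return tagged


for token, label in tag_tokens(["It", "costs", "3.50", "."]):
    print(token, label.name)
```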

class sign_language_translator.languages.text.TextLanguage

Bases: ABC

Base NLP class for a language.

Subclass it and provide the functionality to tokenize text and classify & disambiguate tokens. Each token should correspond to a sign language clip.
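The subclassing pattern can be sketched without the library installed by using a stand-in abstract base class. `TextLanguageSketch` below declares only three of the abstract methods documented for `TextLanguage`, and `ToyLanguage` is a hypothetical subclass; a real subclass would need to implement every abstract method listed in this section.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, List


class TextLanguageSketch(ABC):
    """Stand-in for the documented base class (subset of its interface)."""

    @staticmethod
    @abstractmethod
    def name() -> str:
        """Name of the language."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Break apart text into words or phrases."""

    @abstractmethod
    def detokenize(self, tokens: Iterable[str]) -> str:
        """Join tokens back into text."""


class ToyLanguage(TextLanguageSketch):
    """Hypothetical language with whitespace-and-symbol tokenization."""

    @staticmethod
    def name() -> str:
        return "toy"

    def tokenize(self, text: str) -> List[str]:
        return re.findall(r"\w+|[^\w\s]", text)

    def detokenize(self, tokens: Iterable[str]) -> str:
        return " ".join(tokens)


toy = ToyLanguage()
print(toy.name(), toy.tokenize("hello world!"))
# toy ['hello', 'world', '!']
```

Instantiating the base class directly raises `TypeError`, which is how the `ABC` machinery enforces that subclasses fill in the missing methods.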

abstract classmethod allowed_characters() → Set[str]

Returns a set of all allowed characters in the language.

abstract detokenize(tokens: Iterable[str]) → str

Joins tokens back into text.

abstract get_tags(tokens: str | Iterable[str]) → List[Any]

Get the classifications of all tokens in the form of a sequence of tags.

abstract get_word_senses(tokens: str | Iterable[str]) → List[List[str]]

Get all known meanings of the ambiguous words.

abstract static name() → str

Returns the name of the language used everywhere else in datasets.

abstract preprocess(text: str) → str

Preprocesses text before tokenization: ensures that the same word is always written with the same Unicode characters and removes unnecessary symbols, spaces, etc.

static romanize(text: str, *args, add_diacritics=True, character_translation_table: Dict[int, str] | None = None, n_gram_map: Dict[str, str] | None = None, **kwargs) → str

Map characters to phonetically similar characters of the English language. Transliteration is useful for readability & simple text-to-speech. First maps (n>1)-grams, then unigrams.

ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html

Parameters:
  • text (str) – Non-English text to be mapped to Latin script.

  • add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.

  • character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping the Unicode code points of single characters to their Latin equivalents. Defaults to None.

  • n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams or longer sequences to their Latin equivalents. Keys are expected to be regular expressions. Defaults to None.
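
The two-stage order described above (n-grams first, then single characters) can be sketched with `re.sub` followed by `str.translate`. The function `romanize_sketch` is hypothetical and ignores `add_diacritics`; the three table entries reuse code points from the Hindi `ROMANIZATION_CHARACTER_TRANSLATOR`, while the bigram rule is made up for illustration.

```python
import re
from typing import Dict, Optional


def romanize_sketch(
    text: str,
    character_translation_table: Optional[Dict[int, str]] = None,
    n_gram_map: Optional[Dict[str, str]] = None,
) -> str:
    """Apply regex-keyed n-gram rules first, then single-character rules."""
    for pattern, replacement in (n_gram_map or {}).items():
        text = re.sub(pattern, replacement, text)
    if character_translation_table:
        text = text.translate(character_translation_table)
    return text


# Code point -> Latin string, taken from the Hindi table in this section.
table = {2325: "k", 2350: "m", 2354: "l"}

# Hypothetical bigram rule: KA followed by the nukta sign becomes 'q'.
bigrams = {"\u0915\u093c": "q"}

word = "\u0915\u093c\u0932\u092e"  # KA + nukta, LA, MA
print(romanize_sketch(word, table, bigrams))
# qlm
```

Running the n-gram pass first matters here: if the single-character table ran first, the base letter would become 'k' and the nukta would be left unmapped.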

abstract sentence_tokenize(text: str) → List[str]

Break text into sentences.

abstract tag(tokens: str | Iterable[str]) → List[Tuple[str, Any]]

Classify the tokens and mark them with appropriate tags.

abstract classmethod token_regex() → str

Returns a regular expression that matches words in this language.

abstract tokenize(text: str) → List[str]

Break apart text into words or phrases.

class sign_language_translator.languages.text.Urdu

Bases: TextLanguage

NLP class for Urdu text. Extends slt.languages.text.TextLanguage class.

Urdu is an Indo-Aryan language spoken mostly in Pakistan. Urdu uses the Perso-Arabic script, which consists of 46 letters, 10 digits, 6 punctuation marks and 6 diacritics, and is written from right to left. See Unicode details at: https://unicode.org/charts/PDF/U0600.pdf

ALLOWED_CHARACTERS = {'\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '<', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '{', '}', '،', 'ؐ', 'ؑ', 'ؒ', 'ؓ', '؛', '؟', 'ء', 'آ', 'أ', 'ؤ', 'ئ', 'ا', 'ب', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ل', 'م', 'ن', 'و', 'ً', 'ٍ', 'َ', 'ُ', 'ِ', 'ّ', '٫', 'ٰ', 'ٹ', 'پ', 'چ', 'ڈ', 'ڑ', 'ژ', 'ک', 'گ', 'ں', 'ھ', 'ہ', 'ۂ', 'ۃ', 'ی', 'ے', 'ۓ', '۔', '۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹', '‘', '’', '“', '”'}[source]
BRACKETS: List[str] = ['(', ')']
CHARACTER_TO_WORD = {'–': '-', '—': '-', '−': '-', '⋯': '...', 'ﷲ': 'اللہ', 'ﷺ': ' صلی اللہ علیہ وسلم', '﷽': 'بسم اللہ الرحمن الرحیم'}[source]
CHARACTER_TRANSLATOR = {1577: 'ۃ', 1603: 'ک', 1607: 'ہ', 1609: 'ی', 1610: 'ی', 1632: '۰', 1633: '۱', 1634: '۲', 1635: '۳', 1636: '۴', 1637: '۵', 1638: '۶', 1639: '۷', 1640: '۸', 1641: '۹', 8211: '-', 8212: '-', 8722: '-', 8943: '...', 64342: 'پ', 64344: 'پ', 64345: 'پ', 64358: 'ٹ', 64359: 'ٹ', 64360: 'ٹ', 64361: 'ٹ', 64378: 'چ', 64379: 'چ', 64380: 'چ', 64381: 'چ', 64392: 'ڈ', 64393: 'ڈ', 64395: 'ژ', 64396: 'ڑ', 64397: 'ڑ', 64398: 'ک', 64399: 'ک', 64400: 'ک', 64401: 'ک', 64402: 'گ', 64403: 'گ', 64404: 'گ', 64405: 'گ', 64414: 'ں', 64415: 'ں', 64422: 'ہ', 64423: 'ہ', 64424: 'ہ', 64425: 'ہ', 64426: 'ھ', 64427: 'ھ', 64428: 'ھ', 64429: 'ھ', 64430: 'ے', 64431: 'ے', 64508: 'ی', 64509: 'ی', 64510: 'ی', 64511: 'ی', 65010: 'اللہ', 65018: ' صلی اللہ علیہ وسلم', 65021: 'بسم اللہ الرحمن الرحیم', 65152: 'ء', 65153: 'آ', 65154: 'آ', 65155: 'أ', 65157: 'ؤ', 65163: 'ئ', 65164: 'ئ', 65165: 'ا', 65166: 'ا', 65167: 'ب', 65168: 'ب', 65169: 'ب', 65170: 'ب', 65173: 'ت', 65174: 'ت', 65175: 'ت', 65176: 'ت', 65178: 'ث', 65179: 'ث', 65180: 'ث', 65181: 'ج', 65182: 'ج', 65183: 'ج', 65184: 'ج', 65185: 'ح', 65186: 'ح', 65187: 'ح', 65188: 'ح', 65190: 'خ', 65191: 'خ', 65192: 'خ', 65193: 'د', 65194: 'د', 65195: 'ذ', 65196: 'ذ', 65197: 'ر', 65198: 'ر', 65199: 'ز', 65200: 'ز', 65201: 'س', 65202: 'س', 65203: 'س', 65204: 'س', 65205: 'ش', 65206: 'ش', 65207: 'ش', 65208: 'ش', 65209: 'ص', 65210: 'ص', 65211: 'ص', 65212: 'ص', 65213: 'ض', 65214: 'ض', 65215: 'ض', 65216: 'ض', 65219: 'ط', 65220: 'ط', 65221: 'ظ', 65223: 'ظ', 65224: 'ظ', 65225: 'ع', 65226: 'ع', 65227: 'ع', 65228: 'ع', 65229: 'غ', 65231: 'غ', 65232: 'غ', 65233: 'ف', 65234: 'ف', 65235: 'ف', 65236: 'ف', 65237: 'ق', 65238: 'ق', 65239: 'ق', 65240: 'ق', 65243: 'ک', 65245: 'ل', 65246: 'ل', 65247: 'ل', 65248: 'ل', 65249: 'م', 65250: 'م', 65251: 'م', 65252: 'م', 65253: 'ن', 65254: 'ن', 65255: 'ن', 65256: 'ن', 65257: 'ہ', 65258: 'ہ', 65259: 'ھ', 65260: 'ھ', 65261: 'و', 65262: 'و', 65264: 'ی', 65265: 'ی', 65266: 'ی', 65267: 'ے', 65268: 'ے', 65275: 'لا', 
65276: 'لا'}[source]
COMBINE_CHARACTERS_REGEX = 'آ|أ|ؤ|ۂ|یٔ|ۓ|ََ|ِِ'[source]
CORRECT_URDU_CHARACTERS_TO_INCORRECT: Dict[str, List[str]] = {'،': [], '؟': [], 'ء': ['ﺀ'], 'آ': ['ﺁ', 'ﺂ'], 'أ': ['ﺃ'], 'ؤ': ['ﺅ'], 'ئ': ['ﺋ', 'ﺌ'], 'ا': ['ﺍ', 'ﺎ'], 'ب': ['ﺏ', 'ﺐ', 'ﺑ', 'ﺒ'], 'ت': ['ﺕ', 'ﺖ', 'ﺗ', 'ﺘ'], 'ث': ['ﺛ', 'ﺜ', 'ﺚ'], 'ج': ['ﺝ', 'ﺞ', 'ﺟ', 'ﺠ'], 'ح': ['ﺡ', 'ﺣ', 'ﺤ', 'ﺢ'], 'خ': ['ﺧ', 'ﺨ', 'ﺦ'], 'د': ['ﺩ', 'ﺪ'], 'ذ': ['ﺬ', 'ﺫ'], 'ر': ['ﺭ', 'ﺮ'], 'ز': ['ﺯ', 'ﺰ'], 'س': ['ﺱ', 'ﺲ', 'ﺳ', 'ﺴ'], 'ش': ['ﺵ', 'ﺶ', 'ﺷ', 'ﺸ'], 'ص': ['ﺹ', 'ﺺ', 'ﺻ', 'ﺼ'], 'ض': ['ﺽ', 'ﺾ', 'ﺿ', 'ﻀ'], 'ط': ['ﻃ', 'ﻄ'], 'ظ': ['ﻅ', 'ﻇ', 'ﻈ'], 'ع': ['ﻉ', 'ﻊ', 'ﻋ', 'ﻌ'], 'غ': ['ﻍ', 'ﻏ', 'ﻐ'], 'ف': ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ'], 'ق': ['ﻕ', 'ﻖ', 'ﻗ', 'ﻘ'], 'ل': ['ﻝ', 'ﻞ', 'ﻟ', 'ﻠ'], 'لا': ['ﻻ', 'ﻼ'], 'م': ['ﻡ', 'ﻢ', 'ﻣ', 'ﻤ'], 'ن': ['ﻥ', 'ﻦ', 'ﻧ', 'ﻨ'], 'و': ['ﻮ', 'ﻭ', 'ﻮ'], '٫': [], 'ٹ': ['ﭦ', 'ﭧ', 'ﭨ', 'ﭩ'], 'پ': ['ﭖ', 'ﭘ', 'ﭙ'], 'چ': ['ﭺ', 'ﭻ', 'ﭼ', 'ﭽ'], 'ڈ': ['ﮈ', 'ﮉ'], 'ڑ': ['ﮍ', 'ﮌ'], 'ژ': ['ﮋ'], 'ک': ['ﮎ', 'ﮏ', 'ﮐ', 'ﮑ', 'ﻛ', 'ك'], 'گ': ['ﮒ', 'ﮓ', 'ﮔ', 'ﮕ'], 'ں': ['ﮞ', 'ﮟ'], 'ھ': ['ﮪ', 'ﮬ', 'ﮭ', 'ﻬ', 'ﻫ', 'ﮫ'], 'ہ': ['ﻩ', 'ﮦ', 'ﻪ', 'ﮧ', 'ﮩ', 'ﮨ', 'ه'], 'ۂ': [], 'ۃ': ['ة'], 'ی': ['ﯼ', 'ى', 'ﯽ', 'ﻰ', 'ﻱ', 'ﻲ', 'ﯾ', 'ﯿ', 'ي'], 'ے': ['ﮮ', 'ﮯ', 'ﻳ', 'ﻴ'], 'ۓ': [], '۔': [], '۰': ['٠'], '۱': ['١'], '۲': ['٢'], '۳': ['٣'], '۴': ['٤'], '۵': ['٥'], '۶': ['٦'], '۷': ['٧'], '۸': ['٨'], '۹': ['٩']}[source]
DIACRITICS = ['ٍ', 'ً', 'ٰ', 'َ', 'ُ', 'ِ', 'ّ'][source]
DIACRITICS_REGEX = 'ٍ|ً|ٰ|َ|ُ|ِ|ّ'[source]
END_OF_SENTENCE_MARKS: List[str] = ['.', '۔', '?', '؟', '!']
FULL_STOPS: List[str] = ['.', '۔']
HONORIFICS = ['ؐ', 'ؑ', 'ؒ', 'ؓ'][source]
NGRAM_ROMANIZATION_MAP = {'(?<=[ا])و(?![ں])': 'v', '(?<=[ُ])و(?![ا])': '', '(?<=\\d)\\s*ء': 'CE', '(?<=\\w)آ': "'ā", '(?<=ا)ی': 'y', '(?<=ل)ئ(?=ے)': 'ie', '(?<=ہ)ی(?!\\b)': 'e', '\\bع(?=ی)': 'ei', '\\bو': 'v', '\\bی': 'y', 'اً': 'an', 'اَ': 'a', 'اُ': 'u', 'اِ': 'i', 'و(?=[ؤ])': 'u', 'و(?=[اَےی])': 'v', 'ًا': 'an', 'ِ(?!\\w)': '-e', 'ی(?=[وای])': 'y', 'ی(?=ں)': 'ei', 'یٰ': 'a'}[source]
NUMBER_REGEX = '\\d+(?:[٫\\.:]\\d+)*'
PUNCTUATION: List[str] = ['.', '۔', '?', '؟', '!', ',', '،', '؛']
PUNCTUATION_REGEX = '[\\.۔\\?؟!,،؛]'
QUESTION_MARKS: List[str] = ['?', '؟']
QUOTATION_MARKS = ["'", '"', '”', '“', '’', '‘']
ROMANIZATION_CHARACTER_TRANSLATOR = {1548: ',', 1552: '(PBUH)', 1553: '(AS)', 1554: '(RH)', 1555: '(RA)', 1563: ';', 1567: '?', 1569: "'", 1570: 'aa', 1571: 'a', 1572: 'ow', 1574: 'e', 1575: 'a', 1576: 'b', 1578: 't', 1579: 's', 1580: 'j', 1581: 'h', 1582: 'k̲h̲', 1583: 'd', 1584: 'z', 1585: 'r', 1586: 'z', 1587: 's', 1588: 'sh', 1589: 's', 1590: 'z', 1591: 't', 1592: 'z', 1593: "a'", 1594: 'g̲h̲', 1601: 'f', 1602: 'q', 1604: 'l', 1605: 'm', 1606: 'n', 1608: 'o', 1611: 'an', 1613: 'in', 1614: 'a', 1615: 'u', 1616: 'i', 1643: '.', 1648: 'a', 1657: 'ṭ', 1662: 'p', 1670: 'ch', 1672: 'ḍ', 1681: 'ṛ', 1688: 'zh', 1705: 'k', 1711: 'g', 1722: 'n̲', 1726: 'h', 1729: 'h', 1730: 'h-e', 1731: 't', 1740: 'i', 1746: 'y', 1747: 'ey', 1748: '.', 1776: '0', 1777: '1', 1778: '2', 1779: '3', 1780: '4', 1781: '5', 1782: '6', 1783: '7', 1784: '8', 1785: '9'}[source]
ROMANIZATION_MAP = {'،': ',', 'ؐ': '(PBUH)', 'ؑ': '(AS)', 'ؒ': '(RH)', 'ؓ': '(RA)', '؛': ';', '؟': '?', 'ء': "'", 'آ': 'aa', 'أ': 'a', 'ؤ': 'ow', 'ئ': 'e', 'ا': 'a', 'ب': 'b', 'ت': 't', 'ث': 's', 'ج': 'j', 'ح': 'h', 'خ': 'k̲h̲', 'د': 'd', 'ذ': 'z', 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh', 'ص': 's', 'ض': 'z', 'ط': 't', 'ظ': 'z', 'ع': "a'", 'غ': 'g̲h̲', 'ف': 'f', 'ق': 'q', 'ل': 'l', 'م': 'm', 'ن': 'n', 'و': 'o', 'ً': 'an', 'ٍ': 'in', 'َ': 'a', 'ُ': 'u', 'ِ': 'i', '٫': '.', 'ٰ': 'a', 'ٹ': 'ṭ', 'پ': 'p', 'چ': 'ch', 'ڈ': 'ḍ', 'ڑ': 'ṛ', 'ژ': 'zh', 'ک': 'k', 'گ': 'g', 'ں': 'n̲', 'ھ': 'h', 'ہ': 'h', 'ۂ': 'h-e', 'ۃ': 't', 'ی': 'i', 'ے': 'y', 'ۓ': 'ey', '۔': '.', '۰': '0', '۱': '1', '۲': '2', '۳': '3', '۴': '4', '۵': '5', '۶': '6', '۷': '7', '۸': '8', '۹': '9'}[source]
SPLIT_TO_COMBINED_CHARACTERS: Dict[str, str] = {'آ': 'آ', 'أ': 'أ', 'ؤ': 'ؤ', 'ََ': 'ً', 'ِِ': 'ٍ', 'ۂ': 'ۂ', 'یٔ': 'ئ', 'ۓ': 'ۓ'}[source]
SYMBOLS: List[str] = ['.', '۔', '?', '؟', '!', ',', '،', '؛', "'", '"', '”', '“', '’', '‘', '(', ')', ' ', '-']
UNALLOWED_CHARACTERS_REGEX = '[^ۓگ\\\nؐ۶84!DOخV؟EاۂKؤRYNزأ۳۱ٍS\\}\\(,\\{ٰژ۹ؑJ0قC\\[دؓہےH؛ڑ’3م\\?نPئُحصجںIشعUل\\-51۷ZطQسؒ۸ظ/ّF‘و۔7۰L9َGِ\\]بغآT6ً۲،”ف“پذتھی\'Mڈ>ٹثۃ\\ W2۵Bء<ضAر"چ٫\\)\\.ک۴X]'[source]
UNICODE_RANGE: Tuple[int, int] = (1536, 1791)
WORD_REGEX = '[\\wًٍَُِّٰ]+'[source]
classmethod allowed_characters() → Set[str]

Returns a set of all allowed characters in the language.

static character_normalize(text: str) → str

Replaces characters that render like Urdu characters in common fonts but belong to other Unicode ranges with their Urdu equivalents.

Parameters:
  • text (str) – A piece of Urdu text that may contain foreign symbols.

Returns:
  Normalized Urdu text.

Return type:
  str
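
A small sketch of this kind of replacement, assuming it is a plain code-point translation. The five entries are taken from the `CHARACTER_TRANSLATOR` table above (keys 1603, 1607, 1610, 1632 and 65010); the helper name `normalize_lookalikes` is hypothetical.

```python
# Arabic or presentation-form code point -> preferred Urdu character(s).
CHARACTER_TRANSLATOR_SUBSET = {
    0x0643: "\u06a9",  # ARABIC LETTER KAF -> KEHEH
    0x0647: "\u06c1",  # ARABIC LETTER HEH -> HEH GOAL
    0x064A: "\u06cc",  # ARABIC LETTER YEH -> FARSI YEH
    0x0660: "\u06f0",  # ARABIC-INDIC DIGIT ZERO -> EXTENDED DIGIT ZERO
    0xFDF2: "\u0627\u0644\u0644\u06c1",  # one-glyph ligature -> four letters
}


def normalize_lookalikes(text: str) -> str:
    """Swap look-alike code points for their Urdu counterparts."""
    return text.translate(CHARACTER_TRANSLATOR_SUBSET)


arabic_spelling = "\u0643\u064a"  # KAF + YEH typed with Arabic code points
urdu_spelling = normalize_lookalikes(arabic_spelling)
print([hex(ord(ch)) for ch in urdu_spelling])
# ['0x6a9', '0x6cc']
```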

delete_unallowed_characters(text: str) → str
detokenize(tokens: Iterable[str]) → str

Joins tokens back into text.

get_tags(tokens: str | Iterable[str]) → List[Any]

Get the classifications of all tokens in the form of a sequence of tags.

get_word_senses(tokens: str | Iterable[str]) → List[List[str]]

Get all known meanings of the ambiguous words.

static name() → str

Returns the name of the language used everywhere else in datasets.

static passage_preprocessor(text: str) → str
static poetry_preprocessor(text: str) → str
preprocess(text: str) → str

Preprocesses text before tokenization: ensures that the same word is always written with the same Unicode characters and removes unnecessary symbols, spaces, etc.

static remove_diacritics(text: str) → str
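
The documented `DIACRITICS_REGEX` is an alternation of the seven marks in `DIACRITICS`, so removing diacritics can be sketched as a single substitution. The helper `strip_diacritics` is hypothetical, and the marks are written as escapes to keep the combining characters visible.

```python
import re

# The seven marks listed in DIACRITICS, as code point escapes.
DIACRITICS = [
    "\u064d",  # kasratan
    "\u064b",  # fathatan
    "\u0670",  # superscript alef
    "\u064e",  # fatha
    "\u064f",  # damma
    "\u0650",  # kasra
    "\u0651",  # shadda
]
DIACRITICS_REGEX = "|".join(DIACRITICS)


def strip_diacritics(text: str) -> str:
    """Delete every occurrence of the listed combining marks."""
    return re.sub(DIACRITICS_REGEX, "", text)


with_shadda = "\u0642\u0648\u0651\u062a"  # four code points, one is a shadda
print(len(with_shadda), len(strip_diacritics(with_shadda)))
# 4 3
```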
romanize(text: str, *args, add_diacritics=True, **kwargs) → str

Map Urdu characters to phonetically similar characters of the English language. Transliteration is useful for readability.

ALA-LC Romanization Table: https://www.loc.gov/catdir/cpso/romanization/urdu.pdf

Parameters:
  • text (str) – Urdu text to be mapped to Latin script.

  • add_diacritics (bool, optional) – Whether to use diacritics over English characters to ease pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation. 3. The consecutive underline ‘ ̲ ̲’ means the characters come from a single source letter). Defaults to True.

Examples:

import sign_language_translator as slt

nlp = slt.languages.text.Urdu()

text = "میں نے ۴۷ کتابیں خریدی ہیں۔"
romanized_text = nlp.romanize(text)
print(romanized_text)
# 'mein̲ ny 47 ktabein̲ k̲h̲ridi hen̲.'

text = "مکّهی کا زکریّاؒ کی قابلِ تعریف قوّت سے منہ کهٹّا ہو گیا ہے۔۔۔"
text = nlp.preprocess(text)
romanized_text = nlp.romanize(text, add_diacritics=False)
print(romanized_text)
# "mkkhi ka zkryya(RH) ki qabl-e ta'rif qoot sy mnh khtta ho gya hy..."

sentence_tokenize(text: str) → List[str]

Break text into sentences.

tag(tokens: str | Iterable[str]) → List[Any]

Classify the tokens and mark them with appropriate tags.

classmethod token_regex() → str

Returns a regular expression that matches words in this language.

tokenize(text: str) → List[str]

Break apart text into words or phrases.

static wikipedia_preprocessor(text: str) → str