sign_language_translator.languages.text package
Submodules
- sign_language_translator.languages.text.english module
English, English.ALLOWED_CHARACTERS, English.ALPHABET, English.BRACKETS, English.CHARACTER_MAP, English.CHARACTER_TRANSLATOR, English.END_OF_SENTENCE_MARKS, English.FULL_STOPS, English.NUMBER_REGEX, English.PUNCTUATION, English.QUESTION_MARKS, English.QUOTES, English.SYMBOLS, English.UNALLOWED_CHARACTERS_REGEX, English.UNICODE_RANGE, English.WORD_REGEX, English.allowed_characters(), English.delete_unallowed_characters(), English.detokenize(), English.get_tags(), English.get_word_senses(), English.name(), English.preprocess(), English.romanize(), English.sentence_tokenize(), English.tag(), English.token_regex(), English.tokenize()
- sign_language_translator.languages.text.hindi module
Hindi, Hindi.ACRONYM_PERIODS, Hindi.ALLOWED_CHARACTERS, Hindi.BRACKETS, Hindi.CHARACTERS, Hindi.CHARACTER_TO_DECOMPOSED, Hindi.CHARACTER_TRANSLATOR, Hindi.DIACRITICS, Hindi.END_OF_SENTENCE_MARKS, Hindi.FULL_STOPS, Hindi.NGRAM_ROMANIZATION_MAP, Hindi.NUMBER_REGEX, Hindi.PUNCTUATION, Hindi.QUESTION_MARKS, Hindi.ROMANIZATION_CHARACTER_TRANSLATOR, Hindi.ROMANIZATION_MAP, Hindi.ROMANIZATION_MAP_CONSONANTS_ASPIRATE, Hindi.ROMANIZATION_MAP_CONSONANTS_CEREBRALS, Hindi.ROMANIZATION_MAP_CONSONANTS_DENTALS, Hindi.ROMANIZATION_MAP_CONSONANTS_GUTTURALS, Hindi.ROMANIZATION_MAP_CONSONANTS_LABIALS, Hindi.ROMANIZATION_MAP_CONSONANTS_PALATAS, Hindi.ROMANIZATION_MAP_CONSONANTS_SEMIVOWELS, Hindi.ROMANIZATION_MAP_CONSONANTS_SIBILANTS, Hindi.ROMANIZATION_MAP_VOWELS_AND_DIPHTHONGS, Hindi.SYMBOLS, Hindi.UNALLOWED_CHARACTERS_REGEX, Hindi.UNICODE_RANGE, Hindi.WORD_REGEX, Hindi.allowed_characters(), Hindi.delete_unallowed_characters(), Hindi.detokenize(), Hindi.get_tags(), Hindi.get_word_senses(), Hindi.name(), Hindi.normalize_characters(), Hindi.preprocess(), Hindi.romanize(), Hindi.sentence_tokenize(), Hindi.tag(), Hindi.token_regex(), Hindi.tokenize()
- sign_language_translator.languages.text.text_language module
- sign_language_translator.languages.text.urdu module
Urdu, Urdu.ALLOWED_CHARACTERS, Urdu.BRACKETS, Urdu.CHARACTER_TO_WORD, Urdu.CHARACTER_TRANSLATOR, Urdu.COMBINE_CHARACTERS_REGEX, Urdu.CORRECT_URDU_CHARACTERS_TO_INCORRECT, Urdu.DIACRITICS, Urdu.DIACRITICS_REGEX, Urdu.END_OF_SENTENCE_MARKS, Urdu.FULL_STOPS, Urdu.HONORIFICS, Urdu.NGRAM_ROMANIZATION_MAP, Urdu.NUMBER_REGEX, Urdu.PUNCTUATION, Urdu.PUNCTUATION_REGEX, Urdu.QUESTION_MARKS, Urdu.QUOTATION_MARKS, Urdu.ROMANIZATION_CHARACTER_TRANSLATOR, Urdu.ROMANIZATION_MAP, Urdu.SPLIT_TO_COMBINED_CHARACTERS, Urdu.SYMBOLS, Urdu.UNALLOWED_CHARACTERS_REGEX, Urdu.UNICODE_RANGE, Urdu.WORD_REGEX, Urdu.allowed_characters(), Urdu.character_normalize(), Urdu.delete_unallowed_characters(), Urdu.detokenize(), Urdu.get_tags(), Urdu.get_word_senses(), Urdu.name(), Urdu.passage_preprocessor(), Urdu.poetry_preprocessor(), Urdu.preprocess(), Urdu.remove_diacritics(), Urdu.romanize(), Urdu.sentence_tokenize(), Urdu.tag(), Urdu.token_regex(), Urdu.tokenize(), Urdu.wikipedia_preprocessor()
Module contents
Module that contains Text Language Processors as classes to clean up, tokenize and tag texts of various languages.
- class sign_language_translator.languages.text.English[source]
Bases: TextLanguage
NLP class for English text. Extends the slt.languages.text.TextLanguage class.
English is a West Germanic language that is now widely used as an international language. It uses the Latin script, which consists of 26 letters and is written from left to right. Each letter has two forms: uppercase (capital) and lowercase. See Unicode details at: https://unicode.org/charts/PDF/U0000.pdf
- ALLOWED_CHARACTERS = {'\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}'}
- ALPHABET: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
- BRACKETS: List[str] = ['(', ')', '[', ']', '{', '}']
- CHARACTER_MAP: Dict[str, str] = {'–': '-', '—': '-', '‘': "'", '’': "'", '“': '"', '”': '"', '…': '...'}
- CHARACTER_TRANSLATOR = {8211: '-', 8212: '-', 8216: "'", 8217: "'", 8220: '"', 8221: '"', 8230: '...'}
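The integer keys of CHARACTER_TRANSLATOR are the Unicode code points of the keys of CHARACTER_MAP, which is the form that Python's `str.translate` expects. The following is a minimal sketch of that relationship, assuming the table is applied with `str.translate`; the helper name `normalize_punctuation` is hypothetical and is not part of the package.

```python
# Copy of the documented CHARACTER_MAP, written with escapes for clarity.
CHARACTER_MAP = {
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}

# str.translate needs integer code points as keys.
CHARACTER_TRANSLATOR = {ord(char): repl for char, repl in CHARACTER_MAP.items()}


def normalize_punctuation(text: str) -> str:
    """Hypothetical helper: replace typographic punctuation with ASCII."""
    return text.translate(CHARACTER_TRANSLATOR)


print(CHARACTER_TRANSLATOR[8211])                      # -
print(normalize_punctuation("\u201cHi\u201d\u2026"))  # "Hi"...
```

A translation table handles any number of single-character replacements in one pass, including one-to-many replacements such as the ellipsis.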
- END_OF_SENTENCE_MARKS: List[str] = ['.', '?', '!']
- FULL_STOPS: List[str] = ['.']
- NUMBER_REGEX = '\\d+(?:[\\.:]\\d+)*'
- PUNCTUATION: List[str] = ['.', '?', '!', ',', ';', ':']
- QUESTION_MARKS: List[str] = ['?']
- QUOTES: List[str] = ['"', "'"]
- SYMBOLS: List[str] = ['.', '?', '!', ',', ';', ':', '(', ')', '[', ']', '{', '}', '"', "'", '@', '#', '$', '%', '&', '*', '+', '<', '>', '=', '^', '|', '/', '-', '_']
- UNALLOWED_CHARACTERS_REGEX = '[^tr\\\ni8!DOVchq\\+_E\\^KRYyNbS\\}\\(;\\|,\\{zJuf0:C\\[@Ho3e\\?P\\$skmId491/5\\-\\*ZQjw\\&F\\#7LvG\\]gnT6=p\'M\\)>\\ Wa2<l%Ax"U\\.BX]'
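The value above is a negated character class listing every allowed character, with regex metacharacters escaped. The sketch below shows one way such a pattern could be built from a set and used to drop everything else. It is an illustration only: the small `ALLOWED` set and the function body are assumptions, and the real `delete_unallowed_characters()` signature is not shown on this page.

```python
import re
import string

# Assumed, reduced allowed set for illustration (the real set is larger).
ALLOWED = set(string.ascii_letters + string.digits + " .,!?\n")

# Escape each character so that '-', ']', '^' and '\\' are safe inside [...].
UNALLOWED_REGEX = "[^" + "".join(re.escape(c) for c in sorted(ALLOWED)) + "]"


def delete_unallowed_characters(text: str) -> str:
    """Hypothetical stand-in: remove every character outside ALLOWED."""
    return re.sub(UNALLOWED_REGEX, "", text)


print(delete_unallowed_characters("Hello, w\u00f6rld! \u263a"))  # 'Hello, wrld! '
```

Sorting the set before joining makes the pattern deterministic; the documented value appears to reflect unordered set iteration instead.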
- UNICODE_RANGE = (32, 126)
- WORD_REGEX = '[^\\W_\\d]+'
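To see what NUMBER_REGEX and WORD_REGEX match, the two documented patterns can be tried directly with the standard `re` module. This is a small sketch; the sample sentence is made up for illustration.

```python
import re

NUMBER_REGEX = r"\d+(?:[\.:]\d+)*"  # digits, optionally joined by '.' or ':'
WORD_REGEX = r"[^\W_\d]+"           # word characters minus '_' and digits

text = "Meet at 10:30, version 3.14.1 of snake_case2"

numbers = re.findall(NUMBER_REGEX, text)
words = re.findall(WORD_REGEX, text)

print(numbers)  # ['10:30', '3.14.1', '2']
print(words)    # ['Meet', 'at', 'version', 'of', 'snake', 'case']
```

Times and dotted version numbers stay whole as single number matches, while underscores and digits split words.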
- classmethod allowed_characters() Set[str][source]
Returns a set of all allowed characters in the language.
- get_tags(tokens: str | Iterable[str]) List[Any][source]
Get the classifications of all tokens as a sequence of tags.
- get_word_senses(tokens: str | Iterable[str]) List[List[str]][source]
Get all known meanings of the ambiguous words.
- preprocess(text: str) str[source]
Preprocesses text before tokenization. Ensures that the same word is always written with the same Unicode characters, and removes unnecessary symbols, spaces, etc.
- romanize(text: str, *args, add_diacritics=True, **kwargs) str[source]
Map characters to phonetically similar characters of the English language. Transliteration is useful for readability & simple text-to-speech. First maps (n>1)-grams, then unigrams.
ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html
- Parameters:
text (str) – Non-English text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.
character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping unicode of single characters to their latin equivalent. Defaults to None.
n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams or more to their latin equivalent. Keys are expected to be regular expressions. Defaults to None.
- tag(tokens: str | Iterable[str]) List[Tuple[str, Any]][source]
Classify the tokens and mark them with appropriate tags.
- class sign_language_translator.languages.text.Hindi[source]
Bases: TextLanguage
NLP class for Hindi text. Extends the slt.languages.text.TextLanguage class.
Hindi is an Indo-Aryan language spoken mostly in India. Hindi uses the Devanagari script, which consists of 11 vowels and 33 consonants and is written from left to right. See unicode details at: https://unicode.org/charts/PDF/U0900.pdf
- ACRONYM_PERIODS: List[str] = ['॰']
- ALLOWED_CHARACTERS: Set[str] = {'\n', ' ', '!', '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', '{', '}', 'ऀ', 'ँ', 'ं', 'ः', 'ऄ', 'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ऌ', 'ऍ', 'ऎ', 'ए', 'ऐ', 'ऑ', 'ऒ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'ऩ', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ऱ', 'ल', 'ळ', 'ऴ', 'व', 'श', 'ष', 'स', 'ह', 'ऺ', 'ऻ', '़', 'ऽ', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'ॄ', 'ॅ', 'ॆ', 'े', 'ै', 'ॉ', 'ॊ', 'ो', 'ौ', '्', 'ॎ', 'ॏ', 'ॐ', '॑', '॒', '॓', '॔', 'ॕ', 'ॖ', 'ॗ', 'क़', 'ख़', 'ग़', 'ज़', 'ड़', 'ढ़', 'फ़', 'य़', 'ॠ', 'ॡ', 'ॢ', 'ॣ', '।', '॥', '०', '१', '२', '३', '४', '५', '६', '७', '८', '९', '॰', 'ॱ', 'ॲ', 'ॳ', 'ॴ', 'ॵ', 'ॶ', 'ॷ', 'ॸ', 'ॹ', 'ॺ', 'ॻ', 'ॼ', 'ॽ', 'ॾ', 'ॿ'}
- BRACKETS: List[str] = ['(', ')', '[', ']', '{', '}']
- CHARACTERS: List[str] = ['ऀ', 'ँ', 'ं', 'ः', 'ऄ', 'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ऌ', 'ऍ', 'ऎ', 'ए', 'ऐ', 'ऑ', 'ऒ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'ऩ', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ऱ', 'ल', 'ळ', 'ऴ', 'व', 'श', 'ष', 'स', 'ह', 'ऺ', 'ऻ', '़', 'ऽ', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'ॄ', 'ॅ', 'ॆ', 'े', 'ै', 'ॉ', 'ॊ', 'ो', 'ौ', '्', 'ॎ', 'ॏ', 'ॐ', '॑', '॒', '॓', '॔', 'ॕ', 'ॖ', 'ॗ', 'क़', 'ख़', 'ग़', 'ज़', 'ड़', 'ढ़', 'फ़', 'य़', 'ॠ', 'ॡ', 'ॢ', 'ॣ', '।', '॥', '०', '१', '२', '३', '४', '५', '६', '७', '८', '९', '॰', 'ॱ', 'ॲ', 'ॳ', 'ॴ', 'ॵ', 'ॶ', 'ॷ', 'ॸ', 'ॹ', 'ॺ', 'ॻ', 'ॼ', 'ॽ', 'ॾ', 'ॿ']
- CHARACTER_TO_DECOMPOSED: Dict[str, str] = {'क़': 'क़', 'ख़': 'ख़', 'ग़': 'ग़', 'ज़': 'ज़', 'ड़': 'ड़', 'ढ़': 'ढ़', 'फ़': 'फ़', 'य़': 'य़'}
- CHARACTER_TRANSLATOR = {2392: 'क़', 2393: 'ख़', 2394: 'ग़', 2395: 'ज़', 2396: 'ड़', 2397: 'ढ़', 2398: 'फ़', 2399: 'य़'}
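The keys 2392 to 2399 are the code points U+0958 to U+095F, the precomposed Devanagari letters with a nukta. Each is mapped to its two-code-point form (base consonant followed by the nukta sign U+093C), so equivalent spellings compare equal. A minimal sketch follows, assuming Unicode canonical decomposition (NFD) yields the same pairs. Whether the package derives its table this way is not stated here, and `decompose_nukta_letters` is a hypothetical name.

```python
import unicodedata

# U+0958..U+095F are the precomposed nukta letters (decimal 2392..2399).
CHARACTER_TRANSLATOR = {
    code_point: unicodedata.normalize("NFD", chr(code_point))
    for code_point in range(2392, 2400)
}


def decompose_nukta_letters(text: str) -> str:
    """Hypothetical helper: replace precomposed letters by base + nukta."""
    return text.translate(CHARACTER_TRANSLATOR)


decomposed = decompose_nukta_letters("\u0958")
print([hex(ord(ch)) for ch in decomposed])  # ['0x915', '0x93c']
```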
- DIACRITICS = ['ऀ', 'ँ', 'ं', 'ः', 'ॄ', 'ॅ', '़', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'े', 'ै', 'ॉ', 'ो', 'ौ', '्']
- NGRAM_ROMANIZATION_MAP = {'(?<=.क|.ख|.ग|.घ|घ़|.ङ|.च|.छ|.ज|.झ|.ञ|.ट|ट़|.ठ|.ड|.ढ|.ण|.त|.थ|.द|.ध|.न|.क़|.ख़|.ग़|.ज़|.ड़|.ढ़)ँ': 'n', '(?<=प|फ|फ़|ब|भ|म)ं': 'm', 'घ़': 'g̲̲h̲̲', 'ट़': 't̤', 'स़': 's̤', 'ह़': 'h̤'}[source]
- ROMANIZATION_CHARACTER_TRANSLATOR = {2305: 'm̐', 2306: 'n', 2307: 'ḥ', 2308: 'ĕ', 2309: 'a', 2310: 'ā', 2311: 'i', 2312: 'ī', 2313: 'u', 2314: 'ū', 2315: 'r', 2316: 'l', 2318: 'ĕ', 2319: 'e', 2320: 'ai', 2321: 'ô', 2322: 'ŏ', 2323: 'o', 2324: 'au', 2325: 'k', 2326: 'kh', 2327: 'g', 2328: 'gh', 2329: 'ngh', 2330: 'ch', 2331: 'chh', 2332: 'j', 2333: 'jh', 2334: 'ñ', 2335: 'ṭ', 2336: 'ṭh', 2337: 'ḍ', 2338: 'ḍh', 2339: 'ṇ', 2340: 't', 2341: 'th', 2342: 'd', 2343: 'dh', 2344: 'n', 2346: 'p', 2347: 'ph', 2348: 'b', 2349: 'bh', 2350: 'm', 2351: 'y', 2352: 'r', 2354: 'l', 2357: 'v', 2358: 'sh', 2359: 's', 2360: 's', 2361: 'h', 2365: "'", 2366: 'a', 2367: 'i', 2368: 'ī', 2369: 'u', 2370: 'ū', 2371: 'r', 2372: 'r̄', 2373: 'ê', 2374: 'ĕ', 2375: 'e', 2376: 'ai', 2377: 'ô', 2378: 'ŏ', 2379: 'o', 2380: 'au', 2381: '', 2392: 'q', 2393: 'k̲h̲', 2394: 'g̲h̲', 2395: 'z', 2396: 'ṛ', 2397: 'ṛh', 2398: 'f', 2400: 'r̄', 2404: '.', 2405: '.', 2406: '0', 2407: '1', 2408: '2', 2409: '3', 2410: '4', 2411: '5', 2412: '6', 2413: '7', 2414: '8', 2415: '9', 2416: '.', 2418: 'ê'}[source]
- ROMANIZATION_MAP = {'ँ': 'm̐', 'ं': 'n', 'ः': 'ḥ', 'ऄ': 'ĕ', 'अ': 'a', 'आ': 'ā', 'इ': 'i', 'ई': 'ī', 'उ': 'u', 'ऊ': 'ū', 'ऋ': 'r', 'ऌ': 'l', 'ऎ': 'ĕ', 'ए': 'e', 'ऐ': 'ai', 'ऑ': 'ô', 'ऒ': 'ŏ', 'ओ': 'o', 'औ': 'au', 'क': 'k', 'ख': 'kh', 'ग': 'g', 'घ': 'gh', 'घ़': 'g̲̲h̲̲', 'ङ': 'ngh', 'च': 'ch', 'छ': 'chh', 'ज': 'j', 'झ': 'jh', 'ञ': 'ñ', 'ट': 'ṭ', 'ट़': 't̤', 'ठ': 'ṭh', 'ड': 'ḍ', 'ढ': 'ḍh', 'ण': 'ṇ', 'त': 't', 'थ': 'th', 'द': 'd', 'ध': 'dh', 'न': 'n', 'प': 'p', 'फ': 'ph', 'ब': 'b', 'भ': 'bh', 'म': 'm', 'य': 'y', 'र': 'r', 'ल': 'l', 'व': 'v', 'श': 'sh', 'ष': 's', 'स': 's', 'स़': 's̤', 'ह': 'h', 'ह़': 'h̤', 'ऽ': "'", 'ा': 'a', 'ि': 'i', 'ी': 'ī', 'ु': 'u', 'ू': 'ū', 'ृ': 'r', 'ॄ': 'r̄', 'ॅ': 'ê', 'ॆ': 'ĕ', 'े': 'e', 'ै': 'ai', 'ॉ': 'ô', 'ॊ': 'ŏ', 'ो': 'o', 'ौ': 'au', '्': '', 'क़': 'q', 'ख़': 'k̲h̲', 'ग़': 'g̲h̲', 'ज़': 'z', 'ड़': 'ṛ', 'ढ़': 'ṛh', 'फ़': 'f', 'ॠ': 'r̄', '।': '.', '॥': '.', '०': '0', '१': '1', '२': '2', '३': '3', '४': '4', '५': '5', '६': '6', '७': '7', '८': '8', '९': '9', '॰': '.', 'ॲ': 'ê'}[source]
- ROMANIZATION_MAP_CONSONANTS_CEREBRALS = {'ट': 'ṭ', 'ट़': 't̤', 'ठ': 'ṭh', 'ड': 'ḍ', 'ढ': 'ḍh', 'ण': 'ṇ', 'ड़': 'ṛ', 'ढ़': 'ṛh'}[source]
- ROMANIZATION_MAP_CONSONANTS_GUTTURALS = {'क': 'k', 'ख': 'kh', 'ग': 'g', 'घ': 'gh', 'घ़': 'g̲̲h̲̲', 'ङ': 'ngh', 'क़': 'q', 'ख़': 'k̲h̲', 'ग़': 'g̲h̲'}[source]
- ROMANIZATION_MAP_CONSONANTS_LABIALS = {'प': 'p', 'फ': 'ph', 'ब': 'b', 'भ': 'bh', 'म': 'm', 'फ़': 'f'}[source]
- ROMANIZATION_MAP_CONSONANTS_PALATAS = {'च': 'ch', 'छ': 'chh', 'ज': 'j', 'झ': 'jh', 'ञ': 'ñ', 'ज़': 'z'}[source]
- ROMANIZATION_MAP_VOWELS_AND_DIPHTHONGS = {'ऄ': 'ĕ', 'अ': 'a', 'आ': 'ā', 'इ': 'i', 'ई': 'ī', 'उ': 'u', 'ऊ': 'ū', 'ऋ': 'r', 'ऌ': 'l', 'ऎ': 'ĕ', 'ए': 'e', 'ऐ': 'ai', 'ऑ': 'ô', 'ऒ': 'ŏ', 'ओ': 'o', 'औ': 'au', 'ा': 'a', 'ि': 'i', 'ी': 'ī', 'ु': 'u', 'ू': 'ū', 'ृ': 'r', 'ॄ': 'r̄', 'ॅ': 'ê', 'ॆ': 'ĕ', 'े': 'e', 'ै': 'ai', 'ॉ': 'ô', 'ॊ': 'ŏ', 'ो': 'o', 'ौ': 'au', 'ॠ': 'r̄', 'ॲ': 'ê'}[source]
- SYMBOLS: List[str] = ['.', '।', '॥', '?', '!', '॰', ',', ';', ':', '(', ')', '[', ']', '{', '}', '-', '_', '/']
- UNALLOWED_CHARACTERS_REGEX = '[^खड़भऒ8॥O_ॱEॻॢऽऎॎ॔;फदॼॸा:डब॰इ3ऺऻरओऴI9षळग़ढफ़औZठॗीॅऊॉङॿFँ१धॹजॽG०श\\]ंT6़ै॓ऩख़M>2अवॡ५य़<A३ऋूृुUBकॏझॳ\\\nॾ!४DVतेKRYन्ईॆढ़Nॊौ८S\\}\\(ॲ,ए\\{क़छJ0णऍ\\[Cऑऌॣ।ॶॖगH\\?ॵP९लॕऐॠॺ/4आ15\\-६ऀQघ॒ॴसटमॷ7Lःॐञ२य॑चज़\\ ऱ७ोWॄउथपहऄि\\)\\.X]'[source]
- classmethod allowed_characters() Set[str][source]
Returns a set of all allowed characters in the language.
- get_tags(tokens: str | Iterable[str]) List[Any][source]
Get the classifications of all tokens as a sequence of tags.
- get_word_senses(tokens: str | Iterable[str]) List[List[str]][source]
Get all known meanings of the ambiguous words.
- static name() str[source]
Returns the name of the language used everywhere else in datasets.
- preprocess(text: str) str[source]
Preprocesses text before tokenization. Ensures that the same word is always written with the same Unicode characters, and removes unnecessary symbols, spaces, etc.
- romanize(text: str, *args, add_diacritics=True, **kwargs) str[source]
Map Hindi characters to phonetically similar characters of the English language. Transliteration is useful for readability.
ALA-LC Romanization Table: https://www.loc.gov/catdir/cpso/romanization/hindi.pdf
- Parameters:
text (str) – Hindi text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. Defaults to True.
Examples:
    import sign_language_translator as slt

    nlp = slt.languages.text.Hindi()

    text = "मैंने किताब खरीदी है।"
    romanized_text = nlp.romanize(text)
    print(romanized_text)
    # 'mainne kitab khrīdī hai.'

    text = "ईशांत शर्मा को उनकी शानदार गेंदबाजी के लिए १ प्लेयर ऑफ द मैच का अवॉर्ड दिया गया।"
    text = nlp.preprocess(text)
    romanized_text = nlp.romanize(text)
    print(romanized_text)
    # 'īshant shrma ko unkī shandar gendbajī ke lie 1 pleyr ôph d maich ka avôrḍ diya gya.'
- tag(tokens: str | Iterable[str]) List[Tuple[str, Any]][source]
Classify the tokens and mark them with appropriate tags.
- class sign_language_translator.languages.text.Tags(value)[source]
Bases: Enum
Enumeration of token tags used in NLP processing.
- class sign_language_translator.languages.text.TextLanguage[source]
Bases: ABC
Base NLP class for a language.
Subclass it and provide the functionality to tokenize text and classify & disambiguate tokens. Each token should correspond to a sign language clip.
- abstract classmethod allowed_characters() Set[str][source]
Returns a set of all allowed characters in the language.
- abstract get_tags(tokens: str | Iterable[str]) List[Any][source]
Get the classifications of all tokens as a sequence of tags.
- abstract get_word_senses(tokens: str | Iterable[str]) List[List[str]][source]
Get all known meanings of the ambiguous words.
- abstract static name() str[source]
Returns the name of the language used everywhere else in datasets.
- abstract preprocess(text: str) str[source]
Preprocesses text before tokenization. Ensures that the same word is always written with the same Unicode characters, and removes unnecessary symbols, spaces, etc.
- static romanize(text: str, *args, add_diacritics=True, character_translation_table: Dict[int, str] | None = None, n_gram_map: Dict[str, str] | None = None, **kwargs) str[source]
Map characters to phonetically similar characters of the English language. Transliteration is useful for readability & simple text-to-speech. First maps (n>1)-grams, then unigrams.
ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html
- Parameters:
text (str) – Non-English text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.
character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping unicode of single characters to their latin equivalent. Defaults to None.
n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams or more to their latin equivalent. Keys are expected to be regular expressions. Defaults to None.
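The two-stage order described above (n-grams first, then single characters) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the package's implementation: `romanize_sketch` is a hypothetical name, the `add_diacritics` handling is omitted, and the tables are small excerpts of the Urdu values documented on this page, written as escapes.

```python
import re
from typing import Dict, Optional


def romanize_sketch(
    text: str,
    character_translation_table: Optional[Dict[int, str]] = None,
    n_gram_map: Optional[Dict[str, str]] = None,
) -> str:
    """Hypothetical sketch: apply regex n-gram rules, then per-character rules."""
    for pattern, replacement in (n_gram_map or {}).items():
        text = re.sub(pattern, replacement, text)
    if character_translation_table:
        text = text.translate(character_translation_table)
    return text


# Excerpt of the Urdu tables: feh, wao, reh, alef, fathatan.
table = {1601: "f", 1608: "o", 1585: "r", 1575: "a", 1611: "an"}
n_grams = {"\u0627\u064b": "an"}  # alef + fathatan read together as 'an'

word = "\u0641\u0648\u0631\u0627\u064b"
print(romanize_sketch(word, table, n_grams))  # foran
print(romanize_sketch(word, table))           # foraan
```

Running the n-gram rules first lets a multi-character sequence be read as one unit before its parts are mapped individually, which is why the second call produces a doubled vowel.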
- abstract tag(tokens: str | Iterable[str]) List[Tuple[str, Any]][source]
Classify the tokens and mark them with appropriate tags.
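The subclassing pattern described for TextLanguage can be sketched with the standard `abc` module. The example below is a self-contained stand-in, not the package's class: `TextLanguageSketch` and `ToyLatin` are hypothetical names, and only three of the abstract members listed above are mirrored.

```python
from abc import ABC, abstractmethod
from typing import Set


class TextLanguageSketch(ABC):
    """Hypothetical stand-in for TextLanguage with a reduced interface."""

    @staticmethod
    @abstractmethod
    def name() -> str: ...

    @classmethod
    @abstractmethod
    def allowed_characters(cls) -> Set[str]: ...

    @abstractmethod
    def preprocess(self, text: str) -> str: ...


class ToyLatin(TextLanguageSketch):
    """Toy language: lowercase ASCII letters and spaces only."""

    @staticmethod
    def name() -> str:
        return "toy-latin"

    @classmethod
    def allowed_characters(cls) -> Set[str]:
        return set("abcdefghijklmnopqrstuvwxyz ")

    def preprocess(self, text: str) -> str:
        allowed = self.allowed_characters()
        kept = "".join(ch for ch in text.lower() if ch in allowed)
        return " ".join(kept.split())  # collapse repeated spaces


print(ToyLatin().preprocess("  Hello,   World! "))  # hello world
```

Because the base class declares abstract members, instantiating it directly raises `TypeError`; only subclasses that implement every abstract member can be created.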
- class sign_language_translator.languages.text.Urdu[source]
Bases: TextLanguage
NLP class for Urdu text. Extends the slt.languages.text.TextLanguage class.
Urdu is an Indo-Aryan language spoken mostly in Pakistan. Urdu uses the Perso-Arabic script, which consists of 46 letters, 10 digits, 6 punctuation marks and 6 diacritics, and is written from right to left. See Unicode details at: https://unicode.org/charts/PDF/U0600.pdf
- ALLOWED_CHARACTERS = {'\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '<', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '{', '}', '،', 'ؐ', 'ؑ', 'ؒ', 'ؓ', '؛', '؟', 'ء', 'آ', 'أ', 'ؤ', 'ئ', 'ا', 'ب', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ل', 'م', 'ن', 'و', 'ً', 'ٍ', 'َ', 'ُ', 'ِ', 'ّ', '٫', 'ٰ', 'ٹ', 'پ', 'چ', 'ڈ', 'ڑ', 'ژ', 'ک', 'گ', 'ں', 'ھ', 'ہ', 'ۂ', 'ۃ', 'ی', 'ے', 'ۓ', '۔', '۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹', '‘', '’', '“', '”'}[source]
- CHARACTER_TO_WORD = {'–': '-', '—': '-', '−': '-', '⋯': '...', 'ﷲ': 'اللہ', 'ﷺ': ' صلی اللہ علیہ وسلم', '﷽': 'بسم اللہ الرحمن الرحیم'}
- CHARACTER_TRANSLATOR = {1577: 'ۃ', 1603: 'ک', 1607: 'ہ', 1609: 'ی', 1610: 'ی', 1632: '۰', 1633: '۱', 1634: '۲', 1635: '۳', 1636: '۴', 1637: '۵', 1638: '۶', 1639: '۷', 1640: '۸', 1641: '۹', 8211: '-', 8212: '-', 8722: '-', 8943: '...', 64342: 'پ', 64344: 'پ', 64345: 'پ', 64358: 'ٹ', 64359: 'ٹ', 64360: 'ٹ', 64361: 'ٹ', 64378: 'چ', 64379: 'چ', 64380: 'چ', 64381: 'چ', 64392: 'ڈ', 64393: 'ڈ', 64395: 'ژ', 64396: 'ڑ', 64397: 'ڑ', 64398: 'ک', 64399: 'ک', 64400: 'ک', 64401: 'ک', 64402: 'گ', 64403: 'گ', 64404: 'گ', 64405: 'گ', 64414: 'ں', 64415: 'ں', 64422: 'ہ', 64423: 'ہ', 64424: 'ہ', 64425: 'ہ', 64426: 'ھ', 64427: 'ھ', 64428: 'ھ', 64429: 'ھ', 64430: 'ے', 64431: 'ے', 64508: 'ی', 64509: 'ی', 64510: 'ی', 64511: 'ی', 65010: 'اللہ', 65018: ' صلی اللہ علیہ وسلم', 65021: 'بسم اللہ الرحمن الرحیم', 65152: 'ء', 65153: 'آ', 65154: 'آ', 65155: 'أ', 65157: 'ؤ', 65163: 'ئ', 65164: 'ئ', 65165: 'ا', 65166: 'ا', 65167: 'ب', 65168: 'ب', 65169: 'ب', 65170: 'ب', 65173: 'ت', 65174: 'ت', 65175: 'ت', 65176: 'ت', 65178: 'ث', 65179: 'ث', 65180: 'ث', 65181: 'ج', 65182: 'ج', 65183: 'ج', 65184: 'ج', 65185: 'ح', 65186: 'ح', 65187: 'ح', 65188: 'ح', 65190: 'خ', 65191: 'خ', 65192: 'خ', 65193: 'د', 65194: 'د', 65195: 'ذ', 65196: 'ذ', 65197: 'ر', 65198: 'ر', 65199: 'ز', 65200: 'ز', 65201: 'س', 65202: 'س', 65203: 'س', 65204: 'س', 65205: 'ش', 65206: 'ش', 65207: 'ش', 65208: 'ش', 65209: 'ص', 65210: 'ص', 65211: 'ص', 65212: 'ص', 65213: 'ض', 65214: 'ض', 65215: 'ض', 65216: 'ض', 65219: 'ط', 65220: 'ط', 65221: 'ظ', 65223: 'ظ', 65224: 'ظ', 65225: 'ع', 65226: 'ع', 65227: 'ع', 65228: 'ع', 65229: 'غ', 65231: 'غ', 65232: 'غ', 65233: 'ف', 65234: 'ف', 65235: 'ف', 65236: 'ف', 65237: 'ق', 65238: 'ق', 65239: 'ق', 65240: 'ق', 65243: 'ک', 65245: 'ل', 65246: 'ل', 65247: 'ل', 65248: 'ل', 65249: 'م', 65250: 'م', 65251: 'م', 65252: 'م', 65253: 'ن', 65254: 'ن', 65255: 'ن', 65256: 'ن', 65257: 'ہ', 65258: 'ہ', 65259: 'ھ', 65260: 'ھ', 65261: 'و', 65262: 'و', 65264: 'ی', 65265: 'ی', 65266: 'ی', 65267: 'ے', 65268: 'ے', 65275: 'لا', 
65276: 'لا'}
- CORRECT_URDU_CHARACTERS_TO_INCORRECT: Dict[str, List[str]] = {'،': [], '؟': [], 'ء': ['ﺀ'], 'آ': ['ﺁ', 'ﺂ'], 'أ': ['ﺃ'], 'ؤ': ['ﺅ'], 'ئ': ['ﺋ', 'ﺌ'], 'ا': ['ﺍ', 'ﺎ'], 'ب': ['ﺏ', 'ﺐ', 'ﺑ', 'ﺒ'], 'ت': ['ﺕ', 'ﺖ', 'ﺗ', 'ﺘ'], 'ث': ['ﺛ', 'ﺜ', 'ﺚ'], 'ج': ['ﺝ', 'ﺞ', 'ﺟ', 'ﺠ'], 'ح': ['ﺡ', 'ﺣ', 'ﺤ', 'ﺢ'], 'خ': ['ﺧ', 'ﺨ', 'ﺦ'], 'د': ['ﺩ', 'ﺪ'], 'ذ': ['ﺬ', 'ﺫ'], 'ر': ['ﺭ', 'ﺮ'], 'ز': ['ﺯ', 'ﺰ'], 'س': ['ﺱ', 'ﺲ', 'ﺳ', 'ﺴ'], 'ش': ['ﺵ', 'ﺶ', 'ﺷ', 'ﺸ'], 'ص': ['ﺹ', 'ﺺ', 'ﺻ', 'ﺼ'], 'ض': ['ﺽ', 'ﺾ', 'ﺿ', 'ﻀ'], 'ط': ['ﻃ', 'ﻄ'], 'ظ': ['ﻅ', 'ﻇ', 'ﻈ'], 'ع': ['ﻉ', 'ﻊ', 'ﻋ', 'ﻌ'], 'غ': ['ﻍ', 'ﻏ', 'ﻐ'], 'ف': ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ'], 'ق': ['ﻕ', 'ﻖ', 'ﻗ', 'ﻘ'], 'ل': ['ﻝ', 'ﻞ', 'ﻟ', 'ﻠ'], 'لا': ['ﻻ', 'ﻼ'], 'م': ['ﻡ', 'ﻢ', 'ﻣ', 'ﻤ'], 'ن': ['ﻥ', 'ﻦ', 'ﻧ', 'ﻨ'], 'و': ['ﻮ', 'ﻭ', 'ﻮ'], '٫': [], 'ٹ': ['ﭦ', 'ﭧ', 'ﭨ', 'ﭩ'], 'پ': ['ﭖ', 'ﭘ', 'ﭙ'], 'چ': ['ﭺ', 'ﭻ', 'ﭼ', 'ﭽ'], 'ڈ': ['ﮈ', 'ﮉ'], 'ڑ': ['ﮍ', 'ﮌ'], 'ژ': ['ﮋ'], 'ک': ['ﮎ', 'ﮏ', 'ﮐ', 'ﮑ', 'ﻛ', 'ك'], 'گ': ['ﮒ', 'ﮓ', 'ﮔ', 'ﮕ'], 'ں': ['ﮞ', 'ﮟ'], 'ھ': ['ﮪ', 'ﮬ', 'ﮭ', 'ﻬ', 'ﻫ', 'ﮫ'], 'ہ': ['ﻩ', 'ﮦ', 'ﻪ', 'ﮧ', 'ﮩ', 'ﮨ', 'ه'], 'ۂ': [], 'ۃ': ['ة'], 'ی': ['ﯼ', 'ى', 'ﯽ', 'ﻰ', 'ﻱ', 'ﻲ', 'ﯾ', 'ﯿ', 'ي'], 'ے': ['ﮮ', 'ﮯ', 'ﻳ', 'ﻴ'], 'ۓ': [], '۔': [], '۰': ['٠'], '۱': ['١'], '۲': ['٢'], '۳': ['٣'], '۴': ['٤'], '۵': ['٥'], '۶': ['٦'], '۷': ['٧'], '۸': ['٨'], '۹': ['٩']}[source]
- NGRAM_ROMANIZATION_MAP = {'(?<=[ا])و(?![ں])': 'v', '(?<=[ُ])و(?![ا])': '', '(?<=\\d)\\s*ء': 'CE', '(?<=\\w)آ': "'ā", '(?<=ا)ی': 'y', '(?<=ل)ئ(?=ے)': 'ie', '(?<=ہ)ی(?!\\b)': 'e', '\\bع(?=ی)': 'ei', '\\bو': 'v', '\\bی': 'y', 'اً': 'an', 'اَ': 'a', 'اُ': 'u', 'اِ': 'i', 'و(?=[ؤ])': 'u', 'و(?=[اَےی])': 'v', 'ًا': 'an', 'ِ(?!\\w)': '-e', 'ی(?=[وای])': 'y', 'ی(?=ں)': 'ei', 'یٰ': 'a'}[source]
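The keys of this map are regular expressions, so a rule can depend on context through lookbehind and lookahead. As a small sketch, the documented rule `(?<=\d)\s*ء` turns a hamza that follows a number into the era marker "CE". The digit table below is an excerpt of the documented ROMANIZATION_CHARACTER_TRANSLATOR entries 1776 to 1785, and the variable names are illustrative.

```python
import re

# Documented rule: hamza (U+0621) after a digit, with optional whitespace.
YEAR_RULE = "(?<=\\d)\\s*\u0621"

# Extended Arabic-Indic digits U+06F0..U+06F9 map to ASCII digits.
DIGITS = {1776 + i: str(i) for i in range(10)}

text = "\u06f2\u06f0\u06f2\u06f3 \u0621"  # the year 2023 followed by hamza

after_ngrams = re.sub(YEAR_RULE, "CE", text)
romanized = after_ngrams.translate(DIGITS)

print(romanized)  # 2023CE
```

In Python 3, `\d` matches any Unicode decimal digit, so the lookbehind also fires after Urdu digits and not only after ASCII ones.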
- ROMANIZATION_CHARACTER_TRANSLATOR = {1548: ',', 1552: '(PBUH)', 1553: '(AS)', 1554: '(RH)', 1555: '(RA)', 1563: ';', 1567: '?', 1569: "'", 1570: 'aa', 1571: 'a', 1572: 'ow', 1574: 'e', 1575: 'a', 1576: 'b', 1578: 't', 1579: 's', 1580: 'j', 1581: 'h', 1582: 'k̲h̲', 1583: 'd', 1584: 'z', 1585: 'r', 1586: 'z', 1587: 's', 1588: 'sh', 1589: 's', 1590: 'z', 1591: 't', 1592: 'z', 1593: "a'", 1594: 'g̲h̲', 1601: 'f', 1602: 'q', 1604: 'l', 1605: 'm', 1606: 'n', 1608: 'o', 1611: 'an', 1613: 'in', 1614: 'a', 1615: 'u', 1616: 'i', 1643: '.', 1648: 'a', 1657: 'ṭ', 1662: 'p', 1670: 'ch', 1672: 'ḍ', 1681: 'ṛ', 1688: 'zh', 1705: 'k', 1711: 'g', 1722: 'n̲', 1726: 'h', 1729: 'h', 1730: 'h-e', 1731: 't', 1740: 'i', 1746: 'y', 1747: 'ey', 1748: '.', 1776: '0', 1777: '1', 1778: '2', 1779: '3', 1780: '4', 1781: '5', 1782: '6', 1783: '7', 1784: '8', 1785: '9'}[source]
- ROMANIZATION_MAP = {'،': ',', 'ؐ': '(PBUH)', 'ؑ': '(AS)', 'ؒ': '(RH)', 'ؓ': '(RA)', '؛': ';', '؟': '?', 'ء': "'", 'آ': 'aa', 'أ': 'a', 'ؤ': 'ow', 'ئ': 'e', 'ا': 'a', 'ب': 'b', 'ت': 't', 'ث': 's', 'ج': 'j', 'ح': 'h', 'خ': 'k̲h̲', 'د': 'd', 'ذ': 'z', 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': 'sh', 'ص': 's', 'ض': 'z', 'ط': 't', 'ظ': 'z', 'ع': "a'", 'غ': 'g̲h̲', 'ف': 'f', 'ق': 'q', 'ل': 'l', 'م': 'm', 'ن': 'n', 'و': 'o', 'ً': 'an', 'ٍ': 'in', 'َ': 'a', 'ُ': 'u', 'ِ': 'i', '٫': '.', 'ٰ': 'a', 'ٹ': 'ṭ', 'پ': 'p', 'چ': 'ch', 'ڈ': 'ḍ', 'ڑ': 'ṛ', 'ژ': 'zh', 'ک': 'k', 'گ': 'g', 'ں': 'n̲', 'ھ': 'h', 'ہ': 'h', 'ۂ': 'h-e', 'ۃ': 't', 'ی': 'i', 'ے': 'y', 'ۓ': 'ey', '۔': '.', '۰': '0', '۱': '1', '۲': '2', '۳': '3', '۴': '4', '۵': '5', '۶': '6', '۷': '7', '۸': '8', '۹': '9'}[source]
- SPLIT_TO_COMBINED_CHARACTERS: Dict[str, str] = {'آ': 'آ', 'أ': 'أ', 'ؤ': 'ؤ', 'ََ': 'ً', 'ِِ': 'ٍ', 'ۂ': 'ۂ', 'یٔ': 'ئ', 'ۓ': 'ۓ'}[source]
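Several keys in this map look identical to their values because each key is a sequence of two code points while the value is one precomposed code point. The sketch below shows one way to apply such a map with a single compiled alternation. The three entries use code points that are assumed from the Unicode charts (the rendered page does not reveal them), and `COMBINE_REGEX` and `combine_characters` are hypothetical names; the value of the package's COMBINE_CHARACTERS_REGEX is not shown here.

```python
import re

# Assumed code points for three of the documented pairs.
SPLIT_TO_COMBINED = {
    "\u0627\u0653": "\u0622",  # alef + madda above -> alef with madda above
    "\u06cc\u0654": "\u0626",  # farsi yeh + hamza above -> yeh with hamza above
    "\u06d2\u0654": "\u06d3",  # yeh barree + hamza above -> yeh barree with hamza
}

# Longest keys first, so longer sequences win if keys ever share a prefix.
COMBINE_REGEX = re.compile(
    "|".join(re.escape(k) for k in sorted(SPLIT_TO_COMBINED, key=len, reverse=True))
)


def combine_characters(text: str) -> str:
    """Hypothetical helper: replace split sequences by single code points."""
    return COMBINE_REGEX.sub(lambda m: SPLIT_TO_COMBINED[m.group(0)], text)


word = "\u0627\u0653\u0645"  # alef, madda above, meem
print(len(word), len(combine_characters(word)))  # 3 2
```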
- SYMBOLS: List[str] = ['.', '۔', '?', '؟', '!', ',', '،', '؛', "'", '"', '”', '“', '’', '‘', '(', ')', ' ', '-']
- UNALLOWED_CHARACTERS_REGEX = '[^ۓگ\\\nؐ۶84!DOخV؟EاۂKؤRYNزأ۳۱ٍS\\}\\(,\\{ٰژ۹ؑJ0قC\\[دؓہےH؛ڑ’3م\\?نPئُحصجںIشعUل\\-51۷ZطQسؒ۸ظ/ّF‘و۔7۰L9َGِ\\]بغآT6ً۲،”ف“پذتھی\'Mڈ>ٹثۃ\\ W2۵Bء<ضAر"چ٫\\)\\.ک۴X]'[source]
- classmethod allowed_characters() Set[str][source]
Returns a set of all allowed characters in the language.
- static character_normalize(text: str) str[source]
Replaces characters that render like Urdu characters in common fonts but actually belong to other Unicode ranges with the corresponding Urdu characters.
- Parameters:
text (str) – A piece of Urdu text that may contain foreign symbols.
- Returns:
Normalized Urdu text.
- Return type:
str
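The normalization described above amounts to inverting a correct-to-incorrect mapping into a per-character translation table. This is a minimal sketch with three entries taken from CORRECT_URDU_CHARACTERS_TO_INCORRECT, written as escapes. The function name `character_normalize_sketch` is hypothetical, and the package's actual implementation may differ.

```python
# Excerpt: correct Urdu letter -> look-alike characters from other ranges.
CORRECT_TO_INCORRECT = {
    "\u06a9": ["\ufb8e", "\u0643"],  # keheh <- presentation form, Arabic kaf
    "\u06cc": ["\u0649", "\u064a"],  # farsi yeh <- alef maksura, Arabic yeh
    "\u06c1": ["\u0647"],            # heh goal <- Arabic heh
}

# Invert into {code point of look-alike: correct letter} for str.translate.
TRANSLATOR = {
    ord(wrong): correct
    for correct, wrongs in CORRECT_TO_INCORRECT.items()
    for wrong in wrongs
}


def character_normalize_sketch(text: str) -> str:
    """Hypothetical helper: map look-alike characters to Urdu letters."""
    return text.translate(TRANSLATOR)


word = "\u0643\u062a\u0627\u0628"  # spelled with Arabic kaf (U+0643)
fixed = character_normalize_sketch(word)
print(hex(ord(word[0])), "->", hex(ord(fixed[0])))  # 0x643 -> 0x6a9
```

The inverted entries agree with the documented CHARACTER_TRANSLATOR, for example 1603 and 64398 both map to the Urdu keheh.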
- get_tags(tokens: str | Iterable[str]) List[Any][source]
Get the classifications of all tokens as a sequence of tags.
- get_word_senses(tokens: str | Iterable[str]) List[List[str]][source]
Get all known meanings of the ambiguous words.
- static name() str[source]
Returns the name of the language used everywhere else in datasets.
- preprocess(text: str) str[source]
Preprocesses text before tokenization. Ensures that the same word is always written with the same Unicode characters, and removes unnecessary symbols, spaces, etc.
- romanize(text: str, *args, add_diacritics=True, **kwargs) str[source]
Map Urdu characters to phonetically similar characters of the English language. Transliteration is useful for readability.
ALA-LC Romanization Table: https://www.loc.gov/catdir/cpso/romanization/urdu.pdf
- Parameters:
text (str) – Urdu text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to ease pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation. 3. The consecutive underline ‘ ̲ ̲’ means the characters come from a single source letter). Defaults to True.
Examples:
    import sign_language_translator as slt

    nlp = slt.languages.text.Urdu()

    text = "میں نے ۴۷ کتابیں خریدی ہیں۔"
    romanized_text = nlp.romanize(text)
    print(romanized_text)
    # 'mein̲ ny 47 ktabein̲ k̲h̲ridi hen̲.'

    text = "مکّهی کا زکریّاؒ کی قابلِ تعریف قوّت سے منہ کهٹّا ہو گیا ہے۔۔۔"
    text = nlp.preprocess(text)
    romanized_text = nlp.romanize(text, add_diacritics=False)
    print(romanized_text)
    # "mkkhi ka zkryya(RH) ki qabl-e ta'rif qoot sy mnh khtta ho gya hy..."
- tag(tokens: str | Iterable[str]) List[Any][source]
Classify the tokens and mark them with appropriate tags.