sign_language_translator.languages.text.english module
- class sign_language_translator.languages.text.english.English
Bases: TextLanguage
NLP class for English text. Extends slt.languages.text.TextLanguage class.
English is a West Germanic language by origin and is widely used as an international language in the 21st century. It is written from left to right in the Latin script, which has 26 letters, each with an uppercase (capital) and a lowercase form. See the Unicode chart at: https://unicode.org/charts/PDF/U0000.pdf
- ALLOWED_CHARACTERS = {'\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}'}
- ALPHABET: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
- CHARACTER_MAP: Dict[str, str] = {'–': '-', '—': '-', '‘': "'", '’': "'", '“': '"', '”': '"', '…': '...'}
- CHARACTER_TRANSLATOR = {8211: '-', 8212: '-', 8216: "'", 8217: "'", 8220: '"', 8221: '"', 8230: '...'}
- SYMBOLS: List[str] = ['.', '?', '!', ',', ';', ':', '(', ')', '[', ']', '{', '}', '"', "'", '@', '#', '$', '%', '&', '*', '+', '<', '>', '=', '^', '|', '/', '-', '_']
- UNALLOWED_CHARACTERS_REGEX = '[^tr\\\ni8!DOVchq\\+_E\\^KRYyNbS\\}\\(;\\|,\\{zJuf0:C\\[@Ho3e\\?P\\$skmId491/5\\-\\*ZQjw\\&F\\#7LvG\\]gnT6=p\'M\\)>\\ Wa2<l%Ax"U\\.BX]'
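The value of UNALLOWED_CHARACTERS_REGEX has the shape of a negated character class assembled from the allowed characters, with regex-special characters escaped. The following is a minimal sketch of that construction, assuming escaping is done with re.escape; the shortened `allowed` set and the `strip_unallowed` helper are hypothetical and are not taken from the library's source.

```python
import re

# Hypothetical, shortened stand-in for English.ALLOWED_CHARACTERS.
allowed = {"a", "b", "c", " ", "-", ".", "[", "]", "^"}

# Escape each character so that '-', ']', '^', etc. are matched literally,
# then wrap the result in a negated character class.
unallowed_regex = "[^" + "".join(re.escape(ch) for ch in sorted(allowed)) + "]"
pattern = re.compile(unallowed_regex)


def strip_unallowed(text: str) -> str:
    """Remove every character that is not in the allowed set (illustrative helper)."""
    return pattern.sub("", text)


cleaned = strip_unallowed("a-b [c]^ é✓.")
print(cleaned)  # a-b [c]^ .
```

Sorting is used here only to make the sketch deterministic; the attribute shown above lists its characters in no particular order.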
- classmethod allowed_characters() → Set[str]
Returns a set of all allowed characters in the language.
- get_tags(tokens: str | Iterable[str]) → List[Any]
Get the classifications of all tokens in the form of a sequence of tags.
- get_word_senses(tokens: str | Iterable[str]) → List[List[str]]
Get all known meanings of the ambiguous words.
- static name() → str
Returns the name of the language used everywhere else in datasets.
- preprocess(text: str) → str
Preprocesses text before tokenization. Normalizes Unicode variants so that the same word is never written with different characters, and removes unnecessary symbols, extra spaces, etc.
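As a rough illustration of this kind of normalization, the sketch below applies the CHARACTER_MAP values listed above through str.translate and then cleans up the result. The function name `preprocess_sketch`, the printable-ASCII filter, and the whitespace rule are assumptions made for the example; the actual preprocess method may work differently.

```python
import re

# Values copied from the CHARACTER_MAP attribute listed above.
CHARACTER_MAP = {"–": "-", "—": "-", "‘": "'", "’": "'", "“": '"', "”": '"', "…": "..."}

# str.translate expects code points as keys, which matches the shape of
# CHARACTER_TRANSLATOR (for example 8211 == ord("–")).
CHARACTER_TRANSLATOR = {ord(char): repl for char, repl in CHARACTER_MAP.items()}


def preprocess_sketch(text: str) -> str:
    """Hypothetical normalization pipeline, not the library's implementation."""
    # 1. Replace typographic punctuation with plain ASCII equivalents.
    text = text.translate(CHARACTER_TRANSLATOR)
    # 2. Assumed simplification: drop anything outside printable ASCII and newline.
    text = re.sub(r"[^\n -~]", "", text)
    # 3. Collapse runs of spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()


print(preprocess_sketch("“Hello”  –  world…"))  # "Hello" - world...
```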
- romanize(text: str, *args, add_diacritics=True, **kwargs) → str
Map characters to phonetically similar characters of the English language. Transliteration is useful for readability and simple text-to-speech. Multi-character sequences (n-grams with n > 1) are mapped first, then single characters (unigrams).
ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html
- Parameters:
text (str) – Non-English text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.
character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping the Unicode code points of single characters to their Latin equivalents. Defaults to None.
n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams, or longer sequences to their Latin equivalents. Keys are expected to be regular expressions. Defaults to None.
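The two-stage order described for romanize (n-grams first, then single characters) can be sketched as follows. This is only an illustration under stated assumptions: `romanize_sketch` is a hypothetical standalone function, the Greek mappings are invented for the example and are not taken from the ALA-LC tables, and the add_diacritics option is not modelled.

```python
import re
from typing import Dict, Optional


def romanize_sketch(
    text: str,
    character_translation_table: Optional[Dict[int, str]] = None,
    n_gram_map: Optional[Dict[str, str]] = None,
) -> str:
    """Hypothetical two-stage transliteration, not the library's implementation."""
    # Stage 1: multi-character sequences; keys are treated as regular expressions.
    for pattern, replacement in (n_gram_map or {}).items():
        text = re.sub(pattern, replacement, text)
    # Stage 2: remaining single characters, keyed by Unicode code point.
    return text.translate(character_translation_table or {})


# Made-up mappings for illustration only.
unigrams = {ord("μ"): "m", ord("ο"): "o", ord("υ"): "y", ord("σ"): "s", ord("α"): "a"}
bigrams = {"ου": "ou"}
word = "μουσα"

with_ngrams = romanize_sketch(word, unigrams, bigrams)
unigrams_only = romanize_sketch(word, unigrams)
print(with_ngrams, unigrams_only)  # mousa moysa
```

Running the n-gram stage first matters because the unigram stage would otherwise consume the individual characters of a sequence before the longer pattern could match.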
- tag(tokens: str | Iterable[str]) → List[Tuple[str, Any]]
Classify the tokens and mark them with appropriate tags.