sign_language_translator.languages.text.english module

class sign_language_translator.languages.text.english.English

Bases: TextLanguage

NLP class for English text. Extends the slt.languages.text.TextLanguage class.

English is a West Germanic language that is now widely used as an international language. It is written in the Latin script, whose alphabet has 26 letters and runs from left to right. Each letter has two forms: uppercase (capital) and lowercase. For Unicode details, see https://unicode.org/charts/PDF/U0000.pdf

ALLOWED_CHARACTERS = {'\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}'}
ALPHABET: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
BRACKETS: List[str] = ['(', ')', '[', ']', '{', '}']
CHARACTER_MAP: Dict[str, str] = {'–': '-', '—': '-', '‘': "'", '’': "'", '“': '"', '”': '"', '…': '...'}
CHARACTER_TRANSLATOR = {8211: '-', 8212: '-', 8216: "'", 8217: "'", 8220: '"', 8221: '"', 8230: '...'}
END_OF_SENTENCE_MARKS: List[str] = ['.', '?', '!']
FULL_STOPS: List[str] = ['.']
NUMBER_REGEX = '\\d+(?:[\\.:]\\d+)*'
PUNCTUATION: List[str] = ['.', '?', '!', ',', ';', ':']
QUESTION_MARKS: List[str] = ['?']
QUOTES: List[str] = ['"', "'"]
SYMBOLS: List[str] = ['.', '?', '!', ',', ';', ':', '(', ')', '[', ']', '{', '}', '"', "'", '@', '#', '$', '%', '&', '*', '+', '<', '>', '=', '^', '|', '/', '-', '_']
UNALLOWED_CHARACTERS_REGEX = '[^tr\\\ni8!DOVchq\\+_E\\^KRYyNbS\\}\\(;\\|,\\{zJuf0:C\\[@Ho3e\\?P\\$skmId491/5\\-\\*ZQjw\\&F\\#7LvG\\]gnT6=p\'M\\)>\\ Wa2<l%Ax"U\\.BX]'
UNICODE_RANGE = (32, 126)
WORD_REGEX = '[^\\W_\\d]+'
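The keys of CHARACTER_TRANSLATOR are the Unicode code points of the keys of CHARACTER_MAP, which is the form that Python's built-in str.translate expects. The sketch below copies the documented CHARACTER_MAP value and rebuilds the table from it. How the class itself derives or applies the table is an assumption here, not something this page states.

```python
# Value copied from the CHARACTER_MAP attribute documented above.
CHARACTER_MAP = {"–": "-", "—": "-", "‘": "'", "’": "'", "“": '"', "”": '"', "…": "..."}

# Assumption: the translation table is keyed by code point, as str.translate requires.
CHARACTER_TRANSLATOR = {ord(char): replacement for char, replacement in CHARACTER_MAP.items()}

# Typographic dashes, quotes and ellipses are replaced by plain ASCII equivalents.
normalized = "“Wait… it’s here” — she said".translate(CHARACTER_TRANSLATOR)
print(normalized)
```

A code-point table lets a single pass over the string replace every mapped character, including the one-to-many case of the ellipsis.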
classmethod allowed_characters() -> Set[str]

Returns a set of all allowed characters in the language.
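As a rough illustration of how such a set can be used to filter text, the sketch below rebuilds an equivalent set from ASCII letters, digits, the documented SYMBOLS, space and newline, then drops every other character. The names ALLOWED and drop_unallowed are hypothetical, and the filtering approach is an assumption rather than the library's implementation.

```python
import string

# The 29 symbols listed in the SYMBOLS attribute, written as one string.
SYMBOLS = '.?!,;:()[]{}"\'@#$%&*+<>=^|/-_'

# Assumption: the allowed set is letters + digits + symbols + space + newline.
ALLOWED = set(string.ascii_letters + string.digits + SYMBOLS + " \n")


def drop_unallowed(text: str) -> str:
    """Keep only the characters that appear in ALLOWED (hypothetical helper)."""
    return "".join(ch for ch in text if ch in ALLOWED)


print(drop_unallowed("café: 5€"))
```

Note that a few printable ASCII characters, such as the backslash, backtick and tilde, are not in the documented set even though they fall inside UNICODE_RANGE.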

delete_unallowed_characters(text: str) -> str

detokenize(tokens: Iterable[str]) -> str

Joins tokens back into text.
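One plausible way to join tokens is to insert a space between them except before punctuation and closing brackets, and after opening brackets. The following is a minimal sketch under that assumption; detokenize_sketch and its spacing rules are hypothetical and may differ from what the method actually does.

```python
from typing import Iterable

# Assumed spacing rules, based on the PUNCTUATION and BRACKETS attributes.
NO_SPACE_BEFORE = set(".?!,;:)]}")
NO_SPACE_AFTER = set("([{")


def detokenize_sketch(tokens: Iterable[str]) -> str:
    """Join tokens, omitting spaces around punctuation and brackets (hypothetical)."""
    text = ""
    previous = ""
    for token in tokens:
        needs_space = bool(text) and token not in NO_SPACE_BEFORE and previous not in NO_SPACE_AFTER
        text += (" " if needs_space else "") + token
        previous = token
    return text


print(detokenize_sketch(["Hello", ",", "world", "(", "again", ")", "!"]))
```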

get_tags(tokens: str | Iterable[str]) -> List[Any]

Gets the classifications of all tokens as a sequence of tags.

get_word_senses(tokens: str | Iterable[str]) -> List[List[str]]

Get all known meanings of the ambiguous words.

static name() -> str

Returns the name of the language used everywhere else in datasets.

preprocess(text: str) -> str

Preprocesses text before tokenization. Ensures that the same word is not written with different Unicode characters, and removes unnecessary symbols, spaces, etc.

romanize(text: str, *args, add_diacritics=True, **kwargs) -> str

Maps characters to phonetically similar characters of the English language. Transliteration is useful for readability and simple text-to-speech. N-grams with n > 1 are mapped first, then single characters (unigrams).

ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html

Parameters:
  • text (str) – Non-English text to be mapped to Latin script.

  • add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.

  • character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping the Unicode code points of single characters to their Latin equivalents. Defaults to None.

  • n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams, or longer n-grams to their Latin equivalents. Keys are expected to be regular expressions. Defaults to None.
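The two-stage order described above (n-grams first, then single characters) can be sketched as follows. The function romanize_sketch and the tiny Cyrillic tables are hypothetical toy data for illustration; they are not the library's tables and do not follow any romanization standard, and diacritics handling is left out.

```python
import re
from typing import Dict, Optional


def romanize_sketch(
    text: str,
    character_translation_table: Optional[Dict[int, str]] = None,
    n_gram_map: Optional[Dict[str, str]] = None,
) -> str:
    """Apply regex-keyed n-gram replacements, then a per-character table (hypothetical)."""
    # Stage 1: multi-character sequences, so they are not split by the unigram table.
    for pattern, replacement in (n_gram_map or {}).items():
        text = re.sub(pattern, replacement, text)
    # Stage 2: remaining single characters, keyed by Unicode code point.
    return text.translate(character_translation_table or {})


# Toy tables covering only the letters of one word.
toy_chars = {ord("т"): "t", ord("а"): "a", ord("к"): "k", ord("с"): "s", ord("и"): "i"}
toy_n_grams = {"кс": "x"}

print(romanize_sketch("такси", toy_chars, toy_n_grams))
print(romanize_sketch("такси", toy_chars))
```

Running the n-gram stage first matters: once the unigram table has rewritten the individual letters, the original digraph is no longer present for a pattern to match.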

sentence_tokenize(text: str) -> List[str]

Break text into sentences.
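A simple approximation of sentence splitting is to break after any of the END_OF_SENTENCE_MARKS when it is followed by whitespace. The sketch below does this with a regular expression. It is an assumption about the general technique, not the method's actual logic, and it will wrongly split after abbreviations such as "Dr.".

```python
import re
from typing import List


def sentence_tokenize_sketch(text: str) -> List[str]:
    """Split after '.', '?' or '!' followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


print(sentence_tokenize_sketch("Is it ready? Yes. Ship it!"))
```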

tag(tokens: str | Iterable[str]) -> List[Tuple[str, Any]]

Classify the tokens and mark them with appropriate tags.
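To make the return shape concrete, here is a minimal sketch that pairs each token with a label using the documented NUMBER_REGEX, WORD_REGEX and PUNCTUATION values. The label names, the helper tag_sketch, and the choice to treat a bare string as a single token are all assumptions; the real tag set is not described on this page.

```python
import re
from typing import Iterable, List, Tuple, Union

# Patterns and punctuation copied from the class attributes documented above.
NUMBER_REGEX = r"\d+(?:[\.:]\d+)*"
WORD_REGEX = r"[^\W_\d]+"
PUNCTUATION = [".", "?", "!", ",", ";", ":"]


def tag_sketch(tokens: Union[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Return (token, label) pairs with hypothetical labels."""
    if isinstance(tokens, str):
        tokens = [tokens]  # assumption: a bare string is one token
    tagged = []
    for token in tokens:
        if re.fullmatch(NUMBER_REGEX, token):
            label = "NUMBER"
        elif re.fullmatch(WORD_REGEX, token):
            label = "WORD"
        elif token in PUNCTUATION:
            label = "PUNCTUATION"
        else:
            label = "OTHER"
        tagged.append((token, label))
    return tagged


pairs = tag_sketch(["Room", "42", "?", "#"])
tags_only = [label for _, label in pairs]  # the shape that get_tags returns
print(pairs, tags_only)
```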

classmethod token_regex() -> str

Returns a regular expression that matches words in this language.

tokenize(text: str) -> List[str]

Breaks text apart into words or phrases.
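A regex-based tokenizer can be sketched by trying the documented NUMBER_REGEX first, then WORD_REGEX, and finally falling back to any single non-space character. The combined pattern and the helper tokenize_sketch are assumptions for illustration; the method may combine its patterns differently and may also emit multi-word phrases, which this sketch does not.

```python
import re
from typing import List

# Patterns copied from the class attributes documented above.
NUMBER_REGEX = r"\d+(?:[\.:]\d+)*"
WORD_REGEX = r"[^\W_\d]+"

# Alternation order matters: numbers, then words, then any other visible character.
TOKEN_PATTERN = re.compile(rf"{NUMBER_REGEX}|{WORD_REGEX}|\S")


def tokenize_sketch(text: str) -> List[str]:
    """Split text into number, word and symbol tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_sketch("Call me at 10:30, ok?"))
print(tokenize_sketch("snake_case word2vec"))
```

Because WORD_REGEX excludes digits and underscores, identifiers such as "word2vec" are split into letter runs and numbers.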