sign_language_translator.languages.text.text_language module
text_language.py
This module defines the base NLP class for the text form of a spoken language. It specifies the interface of the text-processing functions that the rule-based translator needs.
- class sign_language_translator.languages.text.text_language.TextLanguage
Bases: ABC
Base NLP class for a language.
Subclass it and provide the functionality to tokenize text and classify & disambiguate tokens. Each token should correspond to a sign language clip.
- abstract classmethod allowed_characters() → Set[str]
Returns a set of all allowed characters in the language.
- abstract get_tags(tokens: str | Iterable[str]) → List[Any]
Get the classifications of all tokens as a sequence of tags.
- abstract get_word_senses(tokens: str | Iterable[str]) → List[List[str]]
Get all known meanings of the ambiguous words.
- abstract static name() → str
Returns the name of the language as it is used throughout the datasets.
- abstract preprocess(text: str) → str
Preprocesses text before tokenization. Ensures that the same word is never written with different Unicode characters, and removes unnecessary symbols, spaces, etc.
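As an illustration of what a preprocess implementation might do, here is a minimal sketch using only the standard library. The function name preprocess_sketch and the choice of NFKC normalization are assumptions made for this example; a real subclass would apply whatever rules its language needs.

```python
import re
import unicodedata


def preprocess_sketch(text: str) -> str:
    """Hypothetical helper: normalize Unicode and tidy whitespace."""
    # NFKC folds compatibility variants (ligatures, no-break spaces) and
    # composes base letters with combining marks, so that one word has
    # a single code-point spelling.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess_sketch("\ufb01ne   text\u00a0here ")
```

Here the "fi" ligature and the no-break space are both folded into plain characters before the whitespace is collapsed.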
- static romanize(text: str, *args, add_diacritics=True, character_translation_table: Dict[int, str] | None = None, n_gram_map: Dict[str, str] | None = None, **kwargs) → str
Map characters to phonetically similar characters of the English language. Transliteration is useful for readability & simple text-to-speech. First maps (n>1)-grams, then unigrams.
ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html
- Parameters:
text (str) – Non-English text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. (Rules: 1. The under-dot ‘ ̣’ indicates alternate soft/hard pronunciation of the letter. 2. The over-bar/macron ‘ ̄’ means long pronunciation). Defaults to True.
character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping the Unicode code points of single characters to their Latin equivalents. Defaults to None.
n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams or more to their latin equivalent. Keys are expected to be regular expressions. Defaults to None.
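The two-stage order described above (n-grams first, then unigrams) can be sketched as follows. This is not the library's implementation: romanize_sketch is a hypothetical stand-in, the toy Cyrillic tables are invented for the example and are not ALA-LC data, and diacritic handling is left out.

```python
import re
from typing import Dict, Optional


def romanize_sketch(
    text: str,
    character_translation_table: Optional[Dict[int, str]] = None,
    n_gram_map: Optional[Dict[str, str]] = None,
) -> str:
    """Hypothetical two-stage mapping: regex n-grams, then single characters."""
    # Stage 1: multi-character patterns; keys are treated as regular expressions.
    for pattern, replacement in (n_gram_map or {}).items():
        text = re.sub(pattern, replacement, text)
    # Stage 2: remaining single characters, keyed by Unicode code point.
    if character_translation_table:
        text = text.translate(character_translation_table)
    return text


# Toy tables (assumptions for illustration only).
unigrams = {
    ord("т"): "t", ord("а"): "a", ord("к"): "k",
    ord("с"): "s", ord("и"): "i", ord("о"): "o",
}
n_grams = {"кс": "x"}

with_n_grams = romanize_sketch("такси кот", unigrams, n_grams)
without_n_grams = romanize_sketch("такси кот", unigrams)
```

Running the n-gram pass first matters: once the unigram table has rewritten the individual letters, a bigram pattern written in the source script can no longer match.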
- abstract tag(tokens: str | Iterable[str]) → List[Tuple[str, Any]]
Classify the tokens and mark them with appropriate tags.
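To show how the abstract methods fit together, here is a self-contained sketch of a toy subclass. So that it runs without the package installed, it subclasses a local stand-in (TextLanguageStandIn) that only mirrors the abstract methods listed on this page; the real TextLanguage may require more. The tag names, the sense labels, and the choice to return an empty list for unambiguous tokens are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, Set, Tuple, Union

Tokens = Union[str, Iterable[str]]


class TextLanguageStandIn(ABC):
    """Local stand-in for the documented interface (not the real class)."""

    @staticmethod
    @abstractmethod
    def name() -> str: ...

    @classmethod
    @abstractmethod
    def allowed_characters(cls) -> Set[str]: ...

    @abstractmethod
    def preprocess(self, text: str) -> str: ...

    @abstractmethod
    def tag(self, tokens: Tokens) -> List[Tuple[str, Any]]: ...

    @abstractmethod
    def get_tags(self, tokens: Tokens) -> List[Any]: ...

    @abstractmethod
    def get_word_senses(self, tokens: Tokens) -> List[List[str]]: ...


class ToyEnglish(TextLanguageStandIn):
    # Hypothetical sense inventory for ambiguous words.
    SENSES = {"spring": ["spring(season)", "spring(coil)"]}

    @staticmethod
    def name() -> str:
        return "toy-english"

    @classmethod
    def allowed_characters(cls) -> Set[str]:
        return set("abcdefghijklmnopqrstuvwxyz0123456789 ")

    def preprocess(self, text: str) -> str:
        # Lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    def _classify(self, token: str) -> str:
        if token in self.SENSES:
            return "AMBIGUOUS"
        return "NUMBER" if token.isdigit() else "WORD"

    def tag(self, tokens: Tokens) -> List[Tuple[str, Any]]:
        # Accept a single token or an iterable of tokens.
        tokens = [tokens] if isinstance(tokens, str) else list(tokens)
        return [(token, self._classify(token)) for token in tokens]

    def get_tags(self, tokens: Tokens) -> List[Any]:
        return [label for _, label in self.tag(tokens)]

    def get_word_senses(self, tokens: Tokens) -> List[List[str]]:
        tokens = [tokens] if isinstance(tokens, str) else list(tokens)
        # Assumption: unambiguous tokens yield an empty list.
        return [self.SENSES.get(token, []) for token in tokens]


lang = ToyEnglish()
tagged = lang.tag(lang.preprocess("Spring  2 cat").split())
```

In this sketch get_tags is derived from tag, so the two cannot disagree; whether the real subclasses do the same is not stated on this page.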