sign_language_translator.languages.text.text_language module

text_language.py

This module defines the base NLP class for the text form of a spoken language. It specifies the interface for the text-processing functions that the rule-based translator needs.

class sign_language_translator.languages.text.text_language.TextLanguage

Bases: ABC

Base NLP class for a language.

Subclass it and implement text tokenization, token classification (tagging), and disambiguation of tokens. Each token should correspond to a sign language clip.
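The subclassing pattern can be sketched as follows. This is a minimal illustration, not the library's code: `TextLanguageStandIn` is a hypothetical stand-in that mirrors only four of the documented abstract methods so the example runs without the package installed, and `ToyEnglish` is an invented language. A real subclass of `TextLanguage` would have to implement every abstract method listed on this page.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, List


class TextLanguageStandIn(ABC):
    """Hypothetical stand-in mirroring part of the documented interface."""

    @staticmethod
    @abstractmethod
    def name() -> str: ...

    @classmethod
    @abstractmethod
    def token_regex(cls) -> str: ...

    @abstractmethod
    def tokenize(self, text: str) -> List[str]: ...

    @abstractmethod
    def detokenize(self, tokens: Iterable[str]) -> str: ...


class ToyEnglish(TextLanguageStandIn):
    """Invented example language: lowercase ASCII words only."""

    @staticmethod
    def name() -> str:
        return "toy-english"

    @classmethod
    def token_regex(cls) -> str:
        # Matches one run of lowercase ASCII letters.
        return r"[a-z]+"

    def tokenize(self, text: str) -> List[str]:
        # Lowercase first so the regex above also finds capitalized words.
        return re.findall(self.token_regex(), text.lower())

    def detokenize(self, tokens: Iterable[str]) -> str:
        return " ".join(tokens)


language = ToyEnglish()
tokens = language.tokenize("Hello, world")
print(tokens)
print(language.detokenize(tokens))
```

Because the base class derives from ABC, instantiating it directly, or instantiating a subclass that leaves any abstract method unimplemented, raises a TypeError.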

abstract classmethod allowed_characters() → Set[str]: Returns the set of all characters allowed in the language.

abstract detokenize(tokens: Iterable[str]) → str: Joins tokens back into text.

abstract get_tags(tokens: str | Iterable[str]) → List[Any]: Gets the classifications of all tokens as a sequence of tags.

abstract get_word_senses(tokens: str | Iterable[str]) → List[List[str]]: Gets all known meanings of the ambiguous words among the tokens.
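A possible shape for the return value of get_word_senses, sketched under several assumptions: the sense inventory `WORD_SENSES`, the "word(sense)" label format, the empty list for unambiguous tokens, and the treatment of a bare string as a single token are all hypothetical choices made for illustration, not behavior confirmed by this page.

```python
from typing import Dict, Iterable, List, Union

# Hypothetical sense inventory; the "word(sense)" label format is an assumption.
WORD_SENSES: Dict[str, List[str]] = {
    "spring": ["spring(season)", "spring(coil)"],
    "bank": ["bank(money)", "bank(river)"],
}


def get_word_senses_sketch(tokens: Union[str, Iterable[str]]) -> List[List[str]]:
    """Return one list of senses per token (empty if no senses are known)."""
    if isinstance(tokens, str):
        # Assumption: a bare string is treated as a single token.
        tokens = [tokens]
    return [WORD_SENSES.get(token.lower(), []) for token in tokens]


print(get_word_senses_sketch(["the", "bank"]))
```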

abstract static name() → str: Returns the name of the language, as used throughout the datasets.

abstract preprocess(text: str) → str: Preprocesses text before tokenization. Implementations should make sure that the same word is never written with different Unicode characters, and should remove unnecessary symbols, spaces, etc.
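One way to approach such preprocessing is sketched below using only the standard library. The function name `preprocess_sketch` is hypothetical, and the choice of NFC normalization and of which invisible characters count as "unnecessary" is an assumption; a concrete language would likely need its own rules (for example, some scripts use zero-width joiners meaningfully, so they are left untouched here).

```python
import re
import unicodedata


def preprocess_sketch(text: str) -> str:
    """Normalize Unicode, drop some invisible marks, and collapse whitespace."""
    # NFC composes sequences such as "e" + combining acute into a single "é",
    # so the same word is always spelled with the same code points.
    text = unicodedata.normalize("NFC", text)
    # Remove zero-width space, left-to-right/right-to-left marks, and the BOM.
    text = re.sub(r"[\u200b\u200e\u200f\ufeff]", "", text)
    # Collapse any run of whitespace into one space and trim the ends.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(preprocess_sketch("cafe\u0301   au\u200b  lait "))
```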

static romanize(text: str, *args, add_diacritics=True, character_translation_table: Dict[int, str] | None = None, n_gram_map: Dict[str, str] | None = None, **kwargs) → str

Maps characters to phonetically similar characters of the English (Latin) alphabet. Transliteration is useful for readability and simple text-to-speech. Multi-character sequences (n-grams with n > 1) are mapped first, then single characters (unigrams).

ALA-LC Standardized Romanization Tables (70 languages): https://www.loc.gov/catdir/cpso/roman.html

Parameters:

text (str) – Non-English text to be mapped to Latin script.
add_diacritics (bool, optional) – Whether to use diacritics over English characters to help pronunciation. Rules: (1) the under-dot ‘ ̣’ indicates an alternate soft/hard pronunciation of the letter; (2) the over-bar/macron ‘ ̄’ indicates a long pronunciation. Defaults to True.
character_translation_table (Optional[Dict[int, str]], optional) – A dictionary mapping the Unicode code points of single characters to their Latin equivalents. Defaults to None.
n_gram_map (Optional[Dict[str, str]], optional) – A dictionary mapping bigrams, trigrams, or longer sequences to their Latin equivalents. Keys are expected to be regular expressions. Defaults to None.
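The n-grams-first ordering described above can be illustrated with a small stand-alone sketch. It is not the library's implementation: `romanize_sketch` is a hypothetical function, the Greek tables are toy data chosen only because the digraph ου ("ou") romanizes differently from its parts (υ alone becomes "y"), and stripping combining marks when add_diacritics is False is an assumption about what that flag might do.

```python
import re
import unicodedata
from typing import Dict, Optional


def romanize_sketch(
    text: str,
    add_diacritics: bool = True,
    character_translation_table: Optional[Dict[int, str]] = None,
    n_gram_map: Optional[Dict[str, str]] = None,
) -> str:
    # 1. Replace multi-character sequences first; keys are regular expressions.
    for pattern, replacement in (n_gram_map or {}).items():
        text = re.sub(pattern, replacement, text)
    # 2. Map the remaining single characters by code point.
    if character_translation_table:
        text = text.translate(character_translation_table)
    # 3. Assumption: without diacritics, decompose and drop combining marks.
    if not add_diacritics:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    return text


# Toy tables (illustrative only): μ -> m, ο -> o, υ -> y, η -> ē
GREEK_CHARACTERS = {0x03BC: "m", 0x03BF: "o", 0x03C5: "y", 0x03B7: "\u0113"}
GREEK_N_GRAMS = {"\u03BF\u03C5": "ou"}  # the digraph ου

word = "\u03BC\u03BF\u03C5"  # μου
print(romanize_sketch(word, True, GREEK_CHARACTERS, GREEK_N_GRAMS))
print(romanize_sketch(word, True, GREEK_CHARACTERS))
```

With the n-gram map the word becomes "mou"; without it the unigram table alone yields "moy", which is why the longer sequences have to be handled before the single characters.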

abstract sentence_tokenize(text: str) → List[str]: Breaks text into sentences.
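A naive sentence splitter, given here only as a sketch of what an implementation might start from. The function name `sentence_tokenize_sketch` and the choice of ".", "!" and "?" as sentence terminators are assumptions; other languages use different end-of-sentence marks.

```python
import re
from typing import List


def sentence_tokenize_sketch(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the terminating punctuation with its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Splitting an empty string yields [""], so filter out empty pieces.
    return [sentence for sentence in sentences if sentence]


print(sentence_tokenize_sketch("Hello there. How are you? Fine!"))
```

A rule this simple also splits after abbreviations such as "Dr.", so a practical implementation would need language-specific exceptions.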

abstract tag(tokens: str | Iterable[str]) → List[Tuple[str, Any]]: Classifies the tokens and marks them with appropriate tags.
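The return types suggest that tag pairs each token with its tag while get_tags returns the tags alone. A minimal sketch of that relationship follows; the function names, the string tag values ("NUMBER", "WORD", "PUNCTUATION", "DEFAULT"), and the handling of a bare string as one token are hypothetical, since the documented tag type is only Any.

```python
import re
from typing import Iterable, List, Tuple, Union


def classify(token: str) -> str:
    """Assign a hypothetical tag to one token."""
    if re.fullmatch(r"\d+", token):
        return "NUMBER"
    if re.fullmatch(r"[^\W\d_]+", token):  # letters only
        return "WORD"
    if re.fullmatch(r"[^\w\s]+", token):  # neither word characters nor spaces
        return "PUNCTUATION"
    return "DEFAULT"


def tag_sketch(tokens: Union[str, Iterable[str]]) -> List[Tuple[str, str]]:
    if isinstance(tokens, str):
        # Assumption: a bare string is treated as a single token.
        tokens = [tokens]
    return [(token, classify(token)) for token in tokens]


def get_tags_sketch(tokens: Union[str, Iterable[str]]) -> List[str]:
    # Same classification as tag_sketch, with the tokens dropped.
    return [tag for _, tag in tag_sketch(tokens)]


print(tag_sketch(["hello", "42", "!"]))
print(get_tags_sketch(["hello", "42", "!"]))
```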

abstract classmethod token_regex() → str: Returns a regular expression that matches words in this language.

abstract tokenize(text: str) → List[str]: Breaks text apart into words or phrases.