sign_language_translator.text package

Submodules

Module contents

class sign_language_translator.text.Rule(matcher: Callable[[str], bool], tag: Any, priority: int)[source]

Bases: object

A rule for token classification based on a matching function.

Parameters:
  • matcher (Callable[[str], bool]) – A function that takes a token (str) as input and returns a boolean indicating whether the token matches the rule.

  • tag (Any) – The tag associated with tokens that match the rule.

  • priority (int) – The priority level of the rule.

is_match(token: str) -> bool: Checks if the given token matches the rule.

get_tag() str[source]

Retrieves the tag associated with the rule.

get_priority() int[source]

Retrieves the priority level of the rule.

from_pattern(pattern: str, tag: Any, priority: int) -> Rule: Creates a rule from a regular expression pattern, tag, and priority. The created rule uses the pattern to match tokens.

Note

  • Rules with higher priority (a smaller priority value) take precedence when classifying tokens.

  • The matcher function should return True if the token matches the rule, and False otherwise.

static from_pattern(pattern: str, tag: Any, priority: int)[source]
get_priority()[source]
get_tag()[source]
is_match(token: str)[source]
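
The Rule interface described above can be illustrated with a small stand-in. This is a minimal sketch, not the library's source: RuleSketch is a hypothetical class that mirrors the documented method names, and it assumes that from_pattern matches the pattern from the start of the token with re.match. The real implementation may anchor the pattern differently.

```python
import re
from typing import Any, Callable


class RuleSketch:
    """Hypothetical stand-in mirroring the documented Rule interface."""

    def __init__(self, matcher: Callable[[str], bool], tag: Any, priority: int):
        self.matcher = matcher
        self.tag = tag
        self.priority = priority

    @staticmethod
    def from_pattern(pattern: str, tag: Any, priority: int) -> "RuleSketch":
        compiled = re.compile(pattern)
        # Assumption: the pattern is matched from the start of the token (re.match).
        return RuleSketch(lambda token: bool(compiled.match(token)), tag, priority)

    def is_match(self, token: str) -> bool:
        return self.matcher(token)

    def get_tag(self) -> Any:
        return self.tag

    def get_priority(self) -> int:
        return self.priority


# A rule built from a regular expression and a rule built from a plain function.
number_rule = RuleSketch.from_pattern(r"\d+$", "NUMBER", priority=1)
short_rule = RuleSketch(lambda token: len(token) <= 2, "SHORT", priority=5)

print(number_rule.is_match("42"), number_rule.get_tag(), number_rule.get_priority())
print(short_rule.is_match("hello"))
```

With the real class, the same calls would be made on sign_language_translator.text.Rule instead of the stand-in.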
class sign_language_translator.text.SignTokenizer(word_regex: str = '\\w+', compound_words: Iterable[str] = (), end_of_sentence_tokens: Iterable[str] = ('.', '?', '!'), acronym_periods=('.',), non_sentence_end_words: Iterable[str] = ('A', 'B', 'C'), tokenized_word_sense_pattern: List | None = None)[source]

Bases: object

detokenize(tokens: Iterable[str]) str[source]
sentence_tokenize(text: str) List[str][source]
tokenize(text: str, join_compound_words: bool = True, join_word_sense: bool = False) List[str][source]
class sign_language_translator.text.SynonymFinder(language: str = 'en')[source]

Bases: object

This class provides methods for finding synonyms of a given text using two different approaches:

  1. Translation and back-translation through the synonyms_by_translation method (requires internet).

  2. Embedding-based similarity search through the synonyms_by_similarity method.

language

The target language for translation. Use 2-letter codes (ISO 639-1).

Type:

str

translator

The translator object for language translation.

Type:

GoogleTranslator

intermediate_languages

List of languages supported by the translator, excluding the current language.

Type:

List[str]

embedding_model

The embedding model for similarity-based synonym finding.

Type:

str

synonyms_by_translation()[source]

Finds synonyms by translating text into an intermediate language and then translating it back.

synonyms_by_similarity()[source]

Finds synonyms based on embedding vector similarity.

translate()[source]

Translates text to the specified target language.

Example

from sign_language_translator.text import SynonymFinder

# Instantiate SynonymFinder with the target language
synonym_finder = SynonymFinder("en")

# Find synonyms using translation and back-translation
text = "happy"
synonyms = synonym_finder.synonyms_by_translation(text)
print(f"Synonyms by Translation: {synonyms}")

# Find synonyms using similarity based on embedding vectors
text = "joyful"
synonyms = synonym_finder.synonyms_by_similarity(text)
print(f"Synonyms by Similarity: {synonyms}")
property embedding_model
property intermediate_languages: List[str]

Returns a list of languages supported by the translator, excluding the current language. They are used to find synonyms by translation and back-translation. These are 2-letter codes (ISO 639-1).

property language: str

The target language for translation. Use 2-letter codes (ISO 639-1).

synonyms_by_similarity(text: str, top_k=10, min_similarity=0.5) List[str][source]

Looks into a vector database and returns the closest matches to the input text.

Parameters:
  • text (str) – The input text to find synonyms for.

  • top_k (int, optional) – The maximum number of synonyms to return. Defaults to 10.

  • min_similarity (float, optional) – Cut-off value for the similarity between embedding vectors. Words whose similarity score is greater than this value are returned as synonyms. Defaults to 0.5.

Returns:

A list of synonyms for the input text.

Return type:

List[str]

Example

from sign_language_translator.text import SynonymFinder

# Instantiate SynonymFinder with the target language
synonym_finder = SynonymFinder("ur")

# Find synonyms using similarity based on embedding vectors
text = "تعلیم"
synonyms = synonym_finder.synonyms_by_similarity(text, 3)
print(synonyms)
# ["تعلیم", "تربیت", "تعلیمی"]
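
The similarity search described above can be sketched with plain cosine similarity over a toy vector table. This is only an illustration under stated assumptions: TOY_VECTORS, cosine_similarity, and nearest_words are hypothetical names, the vectors are made up, and the query is assumed to be present in the table. The library's actual vector database and scoring function are not shown in this reference.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_words(
    query: str,
    vectors: Dict[str, Sequence[float]],
    top_k: int = 10,
    min_similarity: float = 0.5,
) -> List[str]:
    """Return up to top_k words whose similarity to the query exceeds min_similarity."""
    query_vector = vectors[query]  # assumption: the query is in the vocabulary
    scored = [(cosine_similarity(query_vector, vec), word) for word, vec in vectors.items()]
    kept = [(score, word) for score, word in scored if score > min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [word for _, word in kept[:top_k]]


# Made-up three-dimensional embeddings, for illustration only.
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "joyful": [0.8, 0.2, 0.1],
    "glad": [0.7, 0.3, 0.0],
    "table": [0.0, 0.1, 0.9],
}

print(nearest_words("happy", TOY_VECTORS, top_k=3))
```

As in the documented example, the query itself appears in the result because it is its own nearest neighbour.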
synonyms_by_translation(text: str, intermediate_languages: List[str] | None = None, min_frequency: int = 1, time_delay: float = 0.01, timeout: float | None = 10, max_n_threads: int = 132, lower_case: bool = True, progress_bar: bool = True, leave: bool = False, cache: Dict[str, Dict[str, str]] | None = None) List[str][source]

Translates the given text into intermediate languages and performs back-translation to obtain synonyms. Translation is done via the internet using web scraping by the deep_translator library.

Parameters:
  • text (str) – The text to be translated.

  • intermediate_languages (Optional[List[str]]) – List of intermediate languages to translate the text into. Use 2-letter codes (ISO 639-1). If None, all supported languages of the translator will be used. Defaults to None.

  • min_frequency (int) – Minimum occurrence count for synonyms to get considered. Value is inclusive. Defaults to 1.

  • time_delay (float) – Time delay between translation requests (in seconds). Defaults to 1e-2.

  • timeout (float | None) – The maximum amount of time (in seconds) to wait for a thread to finish. None means wait indefinitely. Defaults to 10.

  • max_n_threads (int) – Maximum number of threads to use for parallel translation. Defaults to 132.

  • lower_case (bool) – Whether to convert the synonyms to lowercase. Defaults to True.

  • progress_bar (bool) – Whether to display a progress bar during translation. Defaults to True.

  • leave (bool) – Whether to keep the progress bar on screen after translation finishes. Defaults to False.

  • cache (Optional[Dict[str, Dict[str, str]]]) – A dictionary to save or retrieve the intermediate translations of the text. Structure is {“text”: {“language”: “translation”, …}, …} where each input maps to a dict mapping language code to the text’s translation. Defaults to None.

Returns:

A list of synonyms obtained through back-translation from other languages.

Return type:

List[str]
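
The back-translation procedure and the cache structure described above can be sketched as follows. This is a simplified, single-threaded illustration rather than the library's implementation: back_translation_synonyms, toy_translate, and TOY_TRANSLATIONS are hypothetical, the translations are hard-coded instead of fetched over the internet, and ordering the result by frequency is an assumption.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional

# Hard-coded stand-in for an online translator: (text, target language) -> translation.
TOY_TRANSLATIONS = {
    ("happy", "fr"): "heureux",
    ("happy", "de"): "glücklich",
    ("happy", "es"): "feliz",
    ("heureux", "en"): "Happy",
    ("glücklich", "en"): "lucky",
    ("feliz", "en"): "happy",
}


def toy_translate(text: str, target_language: str) -> str:
    return TOY_TRANSLATIONS[(text, target_language)]


def back_translation_synonyms(
    text: str,
    intermediate_languages: Iterable[str],
    translate: Callable[[str, str], str],
    language: str = "en",
    min_frequency: int = 1,
    lower_case: bool = True,
    cache: Optional[Dict[str, Dict[str, str]]] = None,
) -> List[str]:
    cache = {} if cache is None else cache
    forward = cache.setdefault(text, {})  # {"language": "translation", ...}
    counts: Counter = Counter()
    for intermediate in intermediate_languages:
        if intermediate not in forward:  # reuse cached forward translations
            forward[intermediate] = translate(text, intermediate)
        back = translate(forward[intermediate], language)
        counts[back.lower() if lower_case else back] += 1
    # Keep candidates that occur at least min_frequency times (inclusive).
    return [word for word, count in counts.most_common() if count >= min_frequency]


cache: Dict[str, Dict[str, str]] = {}
print(back_translation_synonyms("happy", ["fr", "de", "es"], toy_translate, cache=cache))
print(cache)
```

Raising min_frequency filters out candidates that only one intermediate language produced, which is the usual way to trade recall for precision with this approach.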

translate(text: str, target_language: str) str[source]

Translates the given text to the specified target language.

Parameters:
  • text (str) – The text to be translated.

  • target_language (str) – The target language for translation. Use 2-letter codes (ISO 639-1).

Returns:

The translated text.

Return type:

str

property translator

The deep_translator.GoogleTranslator object, with the source language set to “auto” and the target language set to the language passed to __init__ or to the current value of the language property.

class sign_language_translator.text.Tagger(rules: List[Rule], default=Tags.DEFAULT)[source]

Bases: object

A tagger that applies a set of rules to classify tokens.

Parameters:
  • rules (List[Rule]) – A list of Rule objects representing the classification rules. Rules with a smaller priority value take precedence over the others.

  • default (Tags, optional) – The default tag to assign when no rule matches a token. Defaults to Tags.DEFAULT.

tag(tokens: List[str]) -> List[Tuple[str, Any]]: Assigns tags to a list of tokens based on the defined rules. Returns a list of tuples containing the token and its corresponding tag.

get_tags(tokens: List[str]) -> List[Any]: Retrieves the tags for a list of tokens based on the defined rules. Returns a list of tags corresponding to the input tokens.

Note

  • The rules are applied in the order they appear in the list, but higher-priority (smaller value) rules take precedence.

  • The default tag is assigned to tokens that do not match any rule.

get_tags(tokens: Iterable[str]) List[Any][source]
tag(tokens: Iterable[str]) List[Tuple[str, Any]][source]
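
The priority behaviour described for Tagger can be sketched with a small stand-in. This is an illustration under stated assumptions, not the library's code: SimpleRule and tag_tokens are hypothetical names, the default tag is the empty string (the value of Tags.DEFAULT), and ties between equal priorities are resolved in favour of the rule that appears first in the list.

```python
from typing import Any, Callable, Iterable, List, NamedTuple, Tuple


class SimpleRule(NamedTuple):
    """Hypothetical rule record: a matcher, its tag, and its priority value."""
    matcher: Callable[[str], bool]
    tag: Any
    priority: int


def tag_tokens(
    tokens: Iterable[str], rules: List[SimpleRule], default: Any = ""
) -> List[Tuple[str, Any]]:
    tagged = []
    for token in tokens:
        matching = [rule for rule in rules if rule.matcher(token)]
        # The smallest priority value wins; min() keeps the earliest rule on ties.
        best = min(matching, key=lambda rule: rule.priority, default=None)
        tagged.append((token, best.tag if best is not None else default))
    return tagged


rules = [
    SimpleRule(lambda t: t.isalpha(), "WORD", 5),
    SimpleRule(lambda t: t.isalpha() and t.isupper(), "ACRONYM", 2),
    SimpleRule(str.isdigit, "NUMBER", 1),
]

print(tag_tokens(["NASA", "launched", "3", "!"], rules))
# [('NASA', 'ACRONYM'), ('launched', 'WORD'), ('3', 'NUMBER'), ('!', '')]
```

"NASA" matches both the WORD and ACRONYM rules, and the ACRONYM rule wins because its priority value is smaller.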
class sign_language_translator.text.Tags(value)[source]

Bases: Enum

Enumeration of token tags used in NLP processing.

ACRONYM = 'ACRONYM'
AMBIGUOUS = 'AMBIGUOUS'
DATE = 'DATE'
DEFAULT = ''
END_OF_SEQUENCE = 'EOS'
NAME = 'NAME'
NUMBER = 'NUMBER'
PUNCTUATION = 'PUNCTUATION'
SPACE = 'SPACE'
START_OF_SEQUENCE = 'SOS'
SUPPORTED_WORD = 'SUPPORTED_WORD'
TIME = 'TIME'
WORD = 'WORD'
WORDLESS = 'WORDLESS'
sign_language_translator.text.remove_space_before_punctuation(text: str, punctuation={'!', ',', '.', '?'})[source]
sign_language_translator.text.replace_words(text: str, word_map: Dict[str, str], word_regex: str = '\\w+') str[source]
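
The two helper functions above are documented only by their signatures. The following sketch shows one plausible reading of those signatures. It is an assumption, not the library's implementation: replace_words_sketch and remove_space_before_punctuation_sketch are hypothetical names, and the behaviour is inferred from the parameter names alone.

```python
import re
from typing import Dict, Iterable


def replace_words_sketch(text: str, word_map: Dict[str, str], word_regex: str = r"\w+") -> str:
    """Replace every whole word found by word_regex that has an entry in word_map."""
    return re.sub(word_regex, lambda m: word_map.get(m.group(0), m.group(0)), text)


def remove_space_before_punctuation_sketch(
    text: str, punctuation: Iterable[str] = (".", ",", "?", "!")
) -> str:
    """Delete whitespace that directly precedes one of the punctuation characters."""
    characters = "".join(punctuation)
    if not characters:
        return text  # nothing to match against
    pattern = r"\s+(?=[" + re.escape(characters) + r"])"
    return re.sub(pattern, "", text)


print(replace_words_sketch("the cat sat", {"cat": "dog"}))
print(remove_space_before_punctuation_sketch("Hello , world !"))
```

Because the default word_regex matches whole runs of word characters, a key such as "cat" does not affect a longer word such as "concatenate" in this sketch.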