sign_language_translator.text package

Submodules

Module contents

class sign_language_translator.text.Rule(matcher: Callable[[str], bool], tag: Any, priority: int)[source]

Bases: object

A rule for token classification based on a matching function.

Parameters:

matcher (Callable[[str], bool]) – A function that takes a token (str) as input and returns a boolean indicating whether the token matches the rule.
tag (Any) – The tag associated with tokens that match the rule.
priority (int) – The priority level of the rule.

is_match(token: str) → bool: Checks if the given token matches the rule.

get_tag() → Any: Retrieves the tag associated with the rule.

get_priority() → int: Retrieves the priority level of the rule.

from_pattern(pattern: str, tag: Any, priority: int) → Rule: Creates a rule from a regular expression pattern, tag, and priority. The created rule uses the pattern to match tokens.

Note

Rules with higher priority are applied first when classifying tokens; a smaller priority value means a higher priority.
The matcher function should return True if the token matches the rule, and False otherwise.

static from_pattern(pattern: str, tag: Any, priority: int)[source]

get_priority()[source]

get_tag()[source]

is_match(token: str)[source]
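
The Rule interface described above can be illustrated with a small self-contained stand-in. This is only a sketch: RuleSketch is a hypothetical class written for this example, not the library's implementation, and it assumes that from_pattern requires the whole token to match the pattern (re.fullmatch), which the documentation does not state.

```python
import re
from typing import Any, Callable


class RuleSketch:
    """Illustrative stand-in for Rule; not the library's implementation."""

    def __init__(self, matcher: Callable[[str], bool], tag: Any, priority: int):
        self.matcher = matcher
        self.tag = tag
        self.priority = priority

    @staticmethod
    def from_pattern(pattern: str, tag: Any, priority: int) -> "RuleSketch":
        # Assumption: the entire token must match the pattern.
        compiled = re.compile(pattern)
        return RuleSketch(
            lambda token: compiled.fullmatch(token) is not None, tag, priority
        )

    def is_match(self, token: str) -> bool:
        return self.matcher(token)

    def get_tag(self) -> Any:
        return self.tag

    def get_priority(self) -> int:
        return self.priority


# A rule built from a regular expression and a rule built from a plain function.
number_rule = RuleSketch.from_pattern(r"\d+", "NUMBER", priority=1)
short_rule = RuleSketch(lambda token: len(token) <= 2, "SHORT", priority=5)

print(number_rule.is_match("2024"))  # True
print(number_rule.is_match("20a"))   # False
print(short_rule.is_match("hi"))     # True
```

With the real class, the equivalent calls would presumably be Rule.from_pattern(...) and Rule(matcher, tag, priority), as given in the signatures above.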

class sign_language_translator.text.SignTokenizer(word_regex: str = '\\w+', compound_words: Iterable[str] = (), end_of_sentence_tokens: Iterable[str] = ('.', '?', '!'), acronym_periods=('.',), non_sentence_end_words: Iterable[str] = ('A', 'B', 'C'), tokenized_word_sense_pattern: List | None = None)[source]

Bases: object

detokenize(tokens: Iterable[str]) → str[source]

sentence_tokenize(text: str) → List[str][source]

tokenize(text: str, join_compound_words: bool = True, join_word_sense: bool = False) → List[str][source]

class sign_language_translator.text.SynonymFinder(language: str = 'en')[source]

Bases: object

This class provides methods for finding synonyms of a given text using two different approaches:

1. Translation and back-translation through the ‘synonyms_by_translation’ method (requires internet).
2. Embedding-based similarity search through the ‘synonyms_by_similarity’ method.

language

The target language for translation. Use 2-letter codes (ISO 639-1).

Type: str

translator

The translator object for language translation.

Type: GoogleTranslator

intermediate_languages

List of languages supported by the translator, excluding the current language.

Type: List[str]

embedding_model

The embedding model for similarity-based synonym finding.

Type: str

synonyms_by_translation(): Finds synonyms by translating the text into intermediate languages and then translating it back.

synonyms_by_similarity(): Finds synonyms based on embedding vector similarity.

translate(): Translates text to the specified target language.

Example

from sign_language_translator.text import SynonymFinder

# Instantiate SynonymFinder with the target language
synonym_finder = SynonymFinder("en")

# Find synonyms using translation and back-translation
text = "happy"
synonyms = synonym_finder.synonyms_by_translation(text)
print(f"Synonyms by Translation: {synonyms}")

# Find synonyms using similarity based on embedding vectors
text = "joyful"
synonyms = synonym_finder.synonyms_by_similarity(text)
print(f"Synonyms by Similarity: {synonyms}")

property embedding_model

property intermediate_languages: List[str]: Returns a list of languages supported by the translator, excluding the current language. They are used to find synonyms by translation and back-translation. These are 2-letter codes (ISO 639-1).

property language: str: The target language for translation. Use 2-letter codes (ISO 639-1).

synonyms_by_similarity(text: str, top_k=10, min_similarity=0.5) → List[str][source]

Looks into a vector database and returns the closest matches to the input text.

Parameters:

text (str) – The input text to find synonyms for.
top_k (int, optional) – The maximum number of synonyms to return. Defaults to 10.
min_similarity (float, optional) – Cutoff value for similarity between embedding vectors. Words with a similarity score greater than this value are returned as synonyms. Defaults to 0.5.

Returns:

A list of synonyms for the input text.

Return type:

List[str]

Example

from sign_language_translator.text import SynonymFinder

# Instantiate SynonymFinder with the target language
synonym_finder = SynonymFinder("ur")

# Find synonyms using similarity based on embedding vectors
text = "تعلیم"
synonyms = synonym_finder.synonyms_by_similarity(text, 3)
print(synonyms)
# ["تعلیم", "تربیت", "تعلیمی"]
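
The roles of top_k and min_similarity can be sketched with a toy nearest-neighbour search. This is an illustration only: the cosine similarity measure, the nearest_words helper, and the two-dimensional toy_vectors are all assumptions made for the example, since the documentation does not say which embedding model, similarity measure, or vector database the library uses.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(
    query: str,
    vectors: Dict[str, Sequence[float]],
    top_k: int = 10,
    min_similarity: float = 0.5,
) -> List[str]:
    """Return up to top_k words whose similarity to the query exceeds min_similarity."""
    query_vector = vectors[query]
    scored = [
        (cosine_similarity(query_vector, vector), word)
        for word, vector in vectors.items()
    ]
    # Keep only scores strictly above the cutoff, best match first.
    kept = sorted(
        (pair for pair in scored if pair[0] > min_similarity),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [word for _, word in kept[:top_k]]


# Hypothetical 2-D embeddings, hand-picked for the example.
toy_vectors = {
    "happy": [1.0, 0.0],
    "glad": [0.9, 0.1],
    "joyful": [0.8, 0.3],
    "table": [0.0, 1.0],
}

print(nearest_words("happy", toy_vectors, top_k=3))
# ['happy', 'glad', 'joyful']
```

As in the example output above, the query word itself appears among the results in this sketch because its similarity to itself is 1.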

synonyms_by_translation(text: str, intermediate_languages: List[str] | None = None, min_frequency: int = 1, time_delay: float = 0.01, timeout: float | None = 10, max_n_threads: int = 132, lower_case: bool = True, progress_bar: bool = True, leave: bool = False, cache: Dict[str, Dict[str, str]] | None = None) → List[str][source]

Translates the given text into intermediate languages and back-translates the results to obtain synonyms. Translation is performed over the internet by the deep_translator library, which relies on web scraping.

Parameters:

text (str) – The text to be translated.
intermediate_languages (Optional[List[str]]) – List of intermediate languages to translate the text into. Use 2-letter codes (ISO 639-1). If None, all supported languages of the translator will be used. Defaults to None.
min_frequency (int) – Minimum number of times a synonym must occur to be considered. The value is inclusive. Defaults to 1.
time_delay (float) – Time delay between translation requests (in seconds). Defaults to 1e-2.
timeout (float | None) – The maximum amount of time (in seconds) to wait for a thread to finish. None means wait indefinitely. Defaults to 10.
max_n_threads (int) – Maximum number of threads to use for parallel translation. Defaults to 132.
lower_case (bool) – Whether to convert the synonyms to lowercase. Defaults to True.
progress_bar (bool) – Whether to display a progress bar during translation. Defaults to True.
leave (bool) – Whether to leave the progress bar on screen after translation. Defaults to False.
cache (Optional[Dict[str, Dict[str, str]]]) – A dictionary to save or retrieve the intermediate translations of the text. Structure is {“text”: {“language”: “translation”, …}, …} where each input maps to a dict mapping language code to the text’s translation. Defaults to None.

Returns:

A list of synonyms obtained through back-translation from other languages.

Return type:

List[str]
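
The counting step behind min_frequency and lower_case can be sketched offline. The helper below and its sample data are hypothetical: the back-translations are hard-coded instead of fetched from a translator, and the real method may post-process results differently (for example, whether it drops the input text is not documented).

```python
from collections import Counter
from typing import Dict, List


def synonyms_from_back_translations(
    back_translations: Dict[str, str],
    min_frequency: int = 1,
    lower_case: bool = True,
) -> List[str]:
    """Count back-translated candidates and keep those seen often enough."""
    candidates = list(back_translations.values())
    if lower_case:
        candidates = [candidate.lower() for candidate in candidates]
    counts = Counter(candidates)
    # min_frequency is inclusive, matching the parameter description.
    return [word for word, count in counts.most_common() if count >= min_frequency]


# Hypothetical results of translating "happy" to each language and back.
sample_back_translations = {
    "fr": "Happy",
    "de": "glad",
    "es": "happy",
    "ur": "pleased",
    "it": "glad",
}

all_candidates = synonyms_from_back_translations(sample_back_translations)
frequent_only = synonyms_from_back_translations(
    sample_back_translations, min_frequency=2
)
print(sorted(all_candidates))
print(sorted(frequent_only))
```

Raising min_frequency filters out candidates that only a single intermediate language produced, which tends to remove noisy translations at the cost of fewer results.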

translate(text: str, target_language: str) → str[source]

Translates the given text to the specified target language.

Parameters:

text (str) – The text to be translated.
target_language (str) – The target language for translation. Use 2-letter codes (ISO 639-1).

Returns:

The translated text.

Return type:

str

property translator: The deep_translator.GoogleTranslator object with the source language as “auto” and the target language as the __init__ argument or according to the current state.

class sign_language_translator.text.Tagger(rules: List[Rule], default=Tags.DEFAULT)[source]

Bases: object

A tagger that applies a set of rules to classify tokens.

Parameters:

rules (List[Rule]) – A list of Rule objects representing the classification rules. Rules with a smaller priority value take precedence over the others.
default (Tags, optional) – The default tag to assign when no rule matches a token. Defaults to Tags.DEFAULT.

tag(tokens: Iterable[str]) → List[Tuple[str, Any]]: Assigns tags to the tokens based on the defined rules. Returns a list of tuples, each containing a token and its corresponding tag.

get_tags(tokens: Iterable[str]) → List[Any]: Retrieves the tags for the tokens based on the defined rules. Returns a list of tags corresponding to the input tokens.

Note

The rules are applied in the order they appear in the list, but when more than one rule matches a token, the rule with the higher priority (smaller value) wins.
The default tag is assigned to tokens that do not match any rule.

get_tags(tokens: Iterable[str]) → List[Any][source]

tag(tokens: Iterable[str]) → List[Tuple[str, Any]][source]
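
The priority behaviour described in the note can be sketched as follows. This is a minimal stand-in, not the library's Tagger: rules are represented as hypothetical (matcher, tag, priority) tuples, and the sketch assumes that among all matching rules the one with the smallest priority value decides the tag, with ties resolved by list order.

```python
from typing import Any, Callable, Iterable, List, Tuple

# Hypothetical rule representation: (matcher, tag, priority).
RuleTuple = Tuple[Callable[[str], bool], Any, int]


def tag_tokens(
    tokens: Iterable[str], rules: List[RuleTuple], default: Any = ""
) -> List[Tuple[str, Any]]:
    """Pair every token with the tag of its best-priority matching rule."""
    # sorted() is stable, so rules with equal priority keep their list order.
    ordered = sorted(rules, key=lambda rule: rule[2])
    tagged = []
    for token in tokens:
        tag = default
        for matcher, rule_tag, _priority in ordered:
            if matcher(token):
                tag = rule_tag
                break
        tagged.append((token, tag))
    return tagged


rules = [
    (str.isalpha, "WORD", 5),
    (lambda token: token.isupper() and len(token) <= 4, "ACRONYM", 2),
    (str.isdigit, "NUMBER", 1),
]

tagged = tag_tokens(["NASA", "launched", "3", "?"], rules)
print(tagged)
# [('NASA', 'ACRONYM'), ('launched', 'WORD'), ('3', 'NUMBER'), ('?', '')]
```

"NASA" matches both the WORD and ACRONYM rules; the ACRONYM rule wins because its priority value is smaller. The "?" token matches nothing and receives the default tag.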

class sign_language_translator.text.Tags(value)[source]

Bases: Enum

Enumeration of token tags used in NLP processing.

ACRONYM = 'ACRONYM'

AMBIGUOUS = 'AMBIGUOUS'

DATE = 'DATE'

DEFAULT = ''

END_OF_SEQUENCE = 'EOS'

NAME = 'NAME'

NUMBER = 'NUMBER'

PUNCTUATION = 'PUNCTUATION'

SPACE = 'SPACE'

START_OF_SEQUENCE = 'SOS'

SUPPORTED_WORD = 'SUPPORTED_WORD'

TIME = 'TIME'

WORD = 'WORD'

WORDLESS = 'WORDLESS'

sign_language_translator.text.remove_space_before_punctuation(text: str, punctuation={'!', ',', '.', '?'})[source]

sign_language_translator.text.replace_words(text: str, word_map: Dict[str, str], word_regex: str = '\\w+') → str[source]
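
The two helper functions above are listed without descriptions. Judging only from their names and signatures, their behaviour might resemble the sketch below. Both functions here are hypothetical re-implementations written for illustration (hence the _sketch suffix); the library's actual handling of edge cases may differ.

```python
import re
from typing import Dict, Iterable


def remove_space_before_punctuation_sketch(
    text: str, punctuation: Iterable[str] = ("!", ",", ".", "?")
) -> str:
    """Delete whitespace that directly precedes a punctuation character."""
    # Build a character class such as [!,\.\?] from the given characters.
    char_class = "[" + re.escape("".join(punctuation)) + "]"
    return re.sub(r"\s+(" + char_class + ")", r"\1", text)


def replace_words_sketch(
    text: str, word_map: Dict[str, str], word_regex: str = r"\w+"
) -> str:
    """Replace each regex-matched word found in word_map; leave others as-is."""
    return re.sub(
        word_regex,
        lambda match: word_map.get(match.group(0), match.group(0)),
        text,
    )


cleaned = remove_space_before_punctuation_sketch("Hello , world !")
replaced = replace_words_sketch("I am gonna go", {"gonna": "going to"})
print(cleaned)   # Hello, world!
print(replaced)  # I am going to go
```

Matching whole words through word_regex, instead of calling str.replace, avoids rewriting substrings inside longer words.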