sign_language_translator.text package
Submodules
- sign_language_translator.text.metrics module
- sign_language_translator.text.preprocess module
- sign_language_translator.text.subtitles module
- sign_language_translator.text.synonyms module
  - SynonymFinder
    - SynonymFinder.embedding_model
    - SynonymFinder.intermediate_languages
    - SynonymFinder.language
    - SynonymFinder.synonyms_by_similarity()
    - SynonymFinder.synonyms_by_translation()
    - SynonymFinder.translate()
    - SynonymFinder.translator
- sign_language_translator.text.tagger module
- sign_language_translator.text.tokenizer module
- sign_language_translator.text.utils module
Module contents
- class sign_language_translator.text.Rule(matcher: Callable[[str], bool], tag: Any, priority: int)[source]
Bases: object
A rule for token classification based on a matching function.
- Parameters:
matcher (Callable[[str], bool]) – A function that takes a token (str) as input and returns a boolean indicating whether the token matches the rule.
tag (Any) – The tag associated with tokens that match the rule.
priority (int) – The priority level of the rule.
- is_match(token: str) -> bool: Checks if the given token matches the rule.
- from_pattern(pattern: str, tag: str, priority: int) -> Rule: Creates a rule from a regular expression pattern, tag, and priority. The created rule uses the pattern to match tokens.
Note
Rules with higher priority are applied first when classifying tokens.
The matcher function should return True if the token matches the rule, and False otherwise.
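The relationship between a matcher function, `is_match`, and `from_pattern` can be illustrated with a small stand-in class. This is only a sketch: `SketchRule` is a hypothetical name, and the use of `re.fullmatch` (the whole token must match the pattern) is an assumption, since the text above does not say how the library applies the pattern.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SketchRule:
    """Hypothetical stand-in for Rule; not the library's implementation."""

    matcher: Callable[[str], bool]
    tag: Any
    priority: int

    def is_match(self, token: str) -> bool:
        # Delegate to the matcher function supplied at construction time.
        return self.matcher(token)

    @classmethod
    def from_pattern(cls, pattern: str, tag: Any, priority: int) -> "SketchRule":
        # Assumption: the whole token has to match the pattern.
        compiled = re.compile(pattern)
        return cls(lambda token: compiled.fullmatch(token) is not None, tag, priority)


# A rule built from a regular expression and a rule built from a plain function.
number_rule = SketchRule.from_pattern(r"\d+", "NUMBER", priority=1)
upper_rule = SketchRule(str.isupper, "ACRONYM", priority=2)

print(number_rule.is_match("2024"))  # True
print(number_rule.is_match("20a4"))  # False
print(upper_rule.is_match("NASA"))   # True
```

Any callable that takes a string and returns a boolean can serve as the matcher, so a regular expression is just one convenient way to build it.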
- class sign_language_translator.text.SignTokenizer(word_regex: str = '\\w+', compound_words: Iterable[str] = (), end_of_sentence_tokens: Iterable[str] = ('.', '?', '!'), acronym_periods=('.',), non_sentence_end_words: Iterable[str] = ('A', 'B', 'C'), tokenized_word_sense_pattern: List | None = None)[source]
Bases: object
- class sign_language_translator.text.SynonymFinder(language: str = 'en')[source]
Bases: object
This class provides methods for finding synonyms of a given text using two different approaches:
1. Translation and back-translation through the 'synonyms_by_translation' method (requires internet).
2. Embedding-based similarity search through the 'synonyms_by_similarity' method.
- language
The target language for translation. Use 2-letter codes (ISO 639-1).
- Type:
str
- translator
The translator object for language translation.
- Type:
GoogleTranslator
- intermediate_languages
List of languages supported by the translator, excluding the current language.
- Type:
List[str]
- embedding_model
The embedding model for similarity-based synonym finding.
- Type:
str
- synonyms_by_translation()[source]
Finds synonyms by translating text into an intermediate language and then translating it back.
Example
from sign_language_translator.text import SynonymFinder

# Instantiate SynonymFinder with the target language
synonym_finder = SynonymFinder("en")

# Find synonyms using translation and back-translation
text = "happy"
synonyms = synonym_finder.synonyms_by_translation(text)
print(f"Synonyms by Translation: {synonyms}")

# Find synonyms using similarity based on embedding vectors
text = "joyful"
synonyms = synonym_finder.synonyms_by_similarity(text)
print(f"Synonyms by Similarity: {synonyms}")
- property embedding_model
- property intermediate_languages: List[str]
Returns a list of languages supported by the translator, excluding the current language. They are used to find synonyms by translation and back-translation. These are 2-letter codes (ISO 639-1).
- property language: str
The target language for translation. Use 2-letter codes (ISO 639-1).
- synonyms_by_similarity(text: str, top_k=10, min_similarity=0.5) -> List[str][source]
Looks into a vector database and returns the closest matches to the input text.
- Parameters:
text (str) – The input text to find synonyms for.
top_k (int, optional) – The maximum number of synonyms to return. Defaults to 10.
min_similarity (float, optional) – Cutoff value for similarity between embedding vectors. Words with a similarity score greater than this value are returned as synonyms. Defaults to 0.5.
- Returns:
A list of synonyms for the input text.
- Return type:
List[str]
Example
from sign_language_translator.text import SynonymFinder

# Instantiate SynonymFinder with the target language
synonym_finder = SynonymFinder("ur")

# Find synonyms using similarity based on embedding vectors
text = "تعلیم"
synonyms = synonym_finder.synonyms_by_similarity(text, 3)
print(synonyms)  # ["تعلیم", "تربیت", "تعلیمی"]
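How `top_k` and `min_similarity` interact can be sketched with a toy in-memory lookup. This is a simplified illustration, not the library's code: the `sketch_similar` function, the `toy_vectors` dictionary, and the choice of cosine similarity as the score are all assumptions made for the example.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sketch_similar(
    text: str,
    vectors: Dict[str, List[float]],
    top_k: int = 10,
    min_similarity: float = 0.5,
) -> List[str]:
    query = vectors[text]
    scored = [(cosine(query, vec), word) for word, vec in vectors.items()]
    # Keep only words scoring strictly above the cutoff, best first.
    kept = sorted((p for p in scored if p[0] > min_similarity), reverse=True)
    return [word for _, word in kept[:top_k]]


# Hypothetical 2-dimensional embeddings, chosen so the scores are easy to check.
toy_vectors = {
    "happy": [1.0, 0.0],
    "joyful": [0.8, 0.6],  # similarity to "happy" is about 0.8
    "table": [0.0, 1.0],   # similarity to "happy" is 0.0
}

print(sketch_similar("happy", toy_vectors, top_k=2))
```

As in the example above, the query word itself scores highest and therefore appears in the result. Raising `min_similarity` shrinks the candidate set before `top_k` truncates it.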
- synonyms_by_translation(text: str, intermediate_languages: List[str] | None = None, min_frequency: int = 1, time_delay: float = 0.01, timeout: float | None = 10, max_n_threads: int = 132, lower_case: bool = True, progress_bar: bool = True, leave: bool = False, cache: Dict[str, Dict[str, str]] | None = None) -> List[str][source]
Translates the given text into intermediate languages and performs back-translation to obtain synonyms. Translation is performed over the internet by the deep_translator library, which relies on web scraping.
- Parameters:
text (str) – The text to be translated.
intermediate_languages (Optional[List[str]]) – List of intermediate languages to translate the text into. Use 2-letter codes (ISO 639-1). If None, all supported languages of the translator will be used. Defaults to None.
min_frequency (int) – Minimum number of occurrences (inclusive) for a synonym to be considered. Defaults to 1.
time_delay (float) – Time delay between translation requests (in seconds). Defaults to 0.01.
timeout (float | None) – The maximum amount of time (in seconds) to wait for a thread to finish. None means wait indefinitely. Defaults to 10.
max_n_threads (int) – Maximum number of threads to use for parallel translation. Defaults to 132.
lower_case (bool) – Whether to convert the synonyms to lowercase. Defaults to True.
progress_bar (bool) – Whether to display a progress bar during translation. Defaults to True.
leave (bool) – Whether to leave the progress bar on screen after translation. Defaults to False.
cache (Optional[Dict[str, Dict[str, str]]]) – A dictionary to save or retrieve the intermediate translations of the text. Structure is {"text": {"language": "translation", ...}, ...} where each input maps to a dict mapping language code to the text's translation. Defaults to None.
- Returns:
A list of synonyms obtained through back-translation from other languages.
- Return type:
List[str]
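The back-translation idea, together with the roles of `min_frequency`, `lower_case`, and `cache`, can be sketched offline with a stub translator. Everything here is hypothetical: `sketch_back_translate`, `fake_translate`, and the `FAKE` lookup table are invented for illustration, the sketch runs sequentially (no threads, delays, or timeouts), and whether the library keeps the original text in its results is not stated above.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def sketch_back_translate(
    text: str,
    source_language: str,
    intermediate_languages: List[str],
    translate: Callable[[str, str], str],
    min_frequency: int = 1,
    lower_case: bool = True,
    cache: Optional[Dict[str, Dict[str, str]]] = None,
) -> List[str]:
    cache = {} if cache is None else cache
    per_text = cache.setdefault(text, {})  # {"language": "translation", ...}
    counts: Counter = Counter()
    for language in intermediate_languages:
        # Reuse a cached intermediate translation when one is available.
        if language not in per_text:
            per_text[language] = translate(text, language)
        back = translate(per_text[language], source_language)
        counts[back.lower() if lower_case else back] += 1
    # Keep candidates seen at least min_frequency times, most frequent first.
    return [word for word, n in counts.most_common() if n >= min_frequency]


# Stub translator backed by a fixed table, so the sketch needs no internet.
FAKE = {
    ("happy", "fr"): "heureux", ("heureux", "en"): "Happy",
    ("happy", "de"): "glücklich", ("glücklich", "en"): "lucky",
    ("happy", "es"): "feliz", ("feliz", "en"): "happy",
}


def fake_translate(text: str, target_language: str) -> str:
    return FAKE[(text, target_language)]


shared_cache: Dict[str, Dict[str, str]] = {}
synonyms = sketch_back_translate(
    "happy", "en", ["fr", "de", "es"], fake_translate, cache=shared_cache
)
print(synonyms)      # ['happy', 'lucky']
print(shared_cache)  # intermediate translations keyed by text, then language
```

Passing the same `cache` dictionary to a later call lets the intermediate translations be reused instead of requested again.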
- translate(text: str, target_language: str) -> str[source]
Translates the given text to the specified target language.
- Parameters:
text (str) – The text to be translated.
target_language (str) – The target language for translation. Use 2-letter codes (ISO 639-1).
- Returns:
The translated text.
- Return type:
str
- property translator
The deep_translator.GoogleTranslator object, with the source language set to "auto" and the target language set to the language given to __init__ or to the current value of the language property.
- class sign_language_translator.text.Tagger(rules: List[Rule], default=Tags.DEFAULT)[source]
Bases: object
A tagger that applies a set of rules to classify tokens.
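- Parameters:
rules (List[Rule]) – The rules used to classify tokens.
default (Any) – The tag assigned to tokens that do not match any rule. Defaults to Tags.DEFAULT.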
- tag(tokens: List[str]) -> List[Tuple[str, Any]]: Assigns tags to a list of tokens based on the defined rules. Returns a list of tuples containing the token and its corresponding tag.
- get_tags(tokens: List[str]) -> List[Any]: Retrieves the tags for a list of tokens based on the defined rules. Returns a list of tags corresponding to the input tokens.
Note
The rules are applied in the order they appear in the list, but rules with higher priority (a smaller priority value) take precedence.
The default tag is assigned to tokens that do not match any rule.
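One way to read the note above is sketched below: rules are sorted by priority value with a stable sort, so list order breaks ties, and the first matching rule supplies the tag. This is an interpretation under stated assumptions, not the library's implementation. `sketch_tag` and the tuple-based rule representation are hypothetical.

```python
from typing import Any, Callable, List, Tuple

# Each hypothetical rule is (matcher, tag, priority); a smaller value wins.
SketchRule = Tuple[Callable[[str], bool], Any, int]


def sketch_tag(
    tokens: List[str], rules: List[SketchRule], default: Any = ""
) -> List[Tuple[str, Any]]:
    # sorted() is stable, so rules with equal priority keep their list order.
    ordered = sorted(rules, key=lambda rule: rule[2])
    tagged = []
    for token in tokens:
        # Use the tag of the first matching rule, else fall back to the default.
        tag = next((t for matcher, t, _ in ordered if matcher(token)), default)
        tagged.append((token, tag))
    return tagged


rules: List[SketchRule] = [
    (str.isalnum, "WORD", 5),    # listed first, but lower priority
    (str.isdigit, "NUMBER", 1),  # higher priority, so it wins for "42"
    (str.isspace, "SPACE", 1),
]

result = sketch_tag(["hello", " ", "42", "?"], rules)
print(result)  # [('hello', 'WORD'), (' ', 'SPACE'), ('42', 'NUMBER'), ('?', '')]
```

The token "42" matches both the WORD and NUMBER rules. It is tagged NUMBER because that rule has the smaller priority value, even though the WORD rule comes first in the list.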
- class sign_language_translator.text.Tags(value)[source]
Bases: Enum
Enumeration of token tags used in NLP processing.
- ACRONYM = 'ACRONYM'
- AMBIGUOUS = 'AMBIGUOUS'
- DATE = 'DATE'
- DEFAULT = ''
- END_OF_SEQUENCE = 'EOS'
- NAME = 'NAME'
- NUMBER = 'NUMBER'
- PUNCTUATION = 'PUNCTUATION'
- SPACE = 'SPACE'
- START_OF_SEQUENCE = 'SOS'
- SUPPORTED_WORD = 'SUPPORTED_WORD'
- TIME = 'TIME'
- WORD = 'WORD'
- WORDLESS = 'WORDLESS'