sign_language_translator.text.utils module
Utility Functions for Text Processing
This module contains utility functions for text processing tasks.
- Functions:
make_ngrams: Creates n-grams from a given sequence.
extract_supported_subsequences_indexes: Extracts the indexes of subsequences based on provided tags and skipped items.
extract_supported_subsequences: Extracts subsequences from a given sequence based on provided tags and skipped items.
concatenate_sentence_terminals: Concatenates start and end of sentence tokens to a list of sentences.
- Classes:
ListRegex: A utility class for finding sub-lists within a list of strings that match specified patterns.
- class sign_language_translator.text.utils.ListRegex[source]
Bases: object
A utility class for finding sub-lists within a list of strings that match specified patterns.
ListRegex provides methods for matching patterns against items in a list, searching for the first occurrence of patterns, finding all occurrences of patterns, and retrieving the starting and ending indices of matches.
- Patterns can be defined using:
regular expressions (str)
lists of patterns (regex (str) or a nested list of patterns)
a tuple containing a pattern and its interval quantifier, e.g. ("\w+", (2, None)).
When using regular expressions, each pattern is matched against an individual item in the list. When using a list of patterns, any of the patterns in the list can match an item. When using a tuple of pattern and counts, items in the specified range can match the pattern.
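The three pattern forms can be illustrated with a small standalone sketch. It does not call ListRegex; the helper names `item_matches` and `greedy_run` are hypothetical, and the use of `re.fullmatch` and of greedy matching are assumptions about how a single pattern is applied to a single item, not confirmed behaviour of the library.

```python
import re
from typing import List, Optional, Union

# A pattern is a regex string or a (possibly nested) list of alternatives.
Pattern = Union[str, list]


def item_matches(item: str, pattern: Pattern) -> bool:
    """Check one item against one pattern (hypothetical helper)."""
    if isinstance(pattern, str):
        # Assumption: the regex must cover the whole item.
        return re.fullmatch(pattern, item) is not None
    # A list of patterns matches if any alternative matches.
    return any(item_matches(item, alternative) for alternative in pattern)


def greedy_run(
    items: List[str], start: int, pattern: Pattern, low: int, high: Optional[int]
) -> Optional[int]:
    """Consume between `low` and `high` consecutive matching items from `start`.

    Returns the index just past the run, or None if fewer than `low` items
    matched. `high=None` means the run has no upper bound.
    """
    end = start
    while (
        end < len(items)
        and (high is None or end - start < high)
        and item_matches(items[end], pattern)
    ):
        end += 1
    return end if end - start >= low else None


items = ["banana", "orange", "orange", "grape"]
print(item_matches("kiwi", ["grape", "kiwi"]))   # list of alternatives
print(greedy_run(items, 1, "orange", 0, 3))      # tuple form ("orange", (0, 3))
print(greedy_run(items, 0, r"\w+", 2, None))     # tuple form (r"\w+", (2, None))
```

Under these assumptions, the tuple form behaves like an interval quantifier such as `{0,3}` in an ordinary regular expression, applied to whole list items rather than to characters.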
Examples:
items = ["apple", "banana", "orange", "orange", "grape", "melon", "orange", "kiwi"]

# Match the patterns against the items
patterns = ["apple", r"\w+"]
result = ListRegex.match(items, patterns)
# Output: (0, 2)

# Search for the first occurrence of the patterns
patterns = [r"ba(na){2}", ("orange", (0, 3))]
result = ListRegex.search(items, patterns)
# Output: (1, 4)

# Find all occurrences of the patterns
patterns = ["orange", ["grape", "kiwi"]]
result = ListRegex.find_all(items, patterns)
# Output: [['orange', 'grape'], ['orange', 'kiwi']]
- static find_all(items: List[str], patterns: List) List[List[str]][source]
Finds all occurrences of the patterns in the list of items.
- Parameters:
items (List[str]) – The list of strings to be searched.
patterns (List[str]) – The patterns to be searched for in the items.
- Returns:
A list of matched subsequences of items.
- Return type:
List[List[str]]
- static find_all_spans(items: List[str], patterns: List) List[Tuple[int, int]][source]
Finds the starting and ending indices of all occurrences of the patterns in the list of items.
- Parameters:
items (List[str]) – The list of strings to be searched.
patterns (List[str]) – The patterns to be searched for in the items.
- Returns:
A list of tuples containing the starting and ending indices of the matched items.
- Return type:
List[Tuple[int,int]]
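The relationship between find_all and find_all_spans can be sketched without the library. The spans below are written by hand and assume half-open `(start, end)` indices, which is what the `(0, 2)` result for a two-item match in the class example suggests; they are not output captured from ListRegex.

```python
items = ["apple", "banana", "orange", "orange", "grape", "melon", "orange", "kiwi"]

# Hand-written spans for the patterns ["orange", ["grape", "kiwi"]],
# assuming the end index is exclusive (an assumption, not library output).
spans = [(3, 5), (6, 8)]

# Slicing the items with each span yields the matched sub-lists.
matches = [items[start:end] for start, end in spans]
print(matches)
```

If the assumption holds, the slices reproduce the sub-lists shown for find_all in the class example.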
- static match(items: List[str], patterns: List[str | List | Tuple]) Tuple[int, int] | None[source]
Matches the given patterns against the items in the list. The patterns are applied at the start of the list of strings.
- Parameters:
items (List[str]) – The sequence of strings to be matched.
patterns (List[str|List|Tuple]) – The patterns to be matched against the items.
- Returns:
A tuple containing the starting and ending indices of the matched items, or None if no match is found.
- Return type:
Tuple[int, int] or None
- static search(items: List[str], patterns) Tuple[int, int] | None[source]
Searches for the first occurrence of the patterns in the list of items.
- Parameters:
items (List[str]) – The list of strings to be searched.
patterns (List[str]) – The patterns to be searched for in the items.
- Returns:
A tuple containing the starting and ending indices of the matched items, or None if no match is found.
- Return type:
Tuple[int, int] or None
- sign_language_translator.text.utils.concatenate_sentence_terminals(sentences: List, start_token, end_token)[source]
Inserts start and end tokens between the sentences in the input list by concatenating them onto the sentences (useful when the input comes from a sentence tokenizer).
This function takes a list of sentences and adds a start token to the beginning of each sentence except the first and an end token to the end of each sentence except the last.
- Parameters:
sentences (List) – A list of sentences to be processed. Sentences can be strings, lists of tokens, or any other type that supports the + operator for concatenation.
start_token – The token to be added at the start of sentences. Must be the same type as a sentence.
end_token – The token to be added at the end of sentences. Must be the same type as a sentence.
- Returns:
A new list of sentences with start and end tokens inserted.
- Return type:
List
Example:
sentences = ["Hello!", "How are you?", "Goodbye."]
start_token = "<start>"
end_token = "<end>"
result = concatenate_sentence_terminals(sentences, start_token, end_token)
# Output: ["Hello!<end>", "<start>How are you?<end>", "<start>Goodbye."]
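The behaviour described above can be approximated with a few lines of plain Python. This is a minimal sketch written from the description, not the library's source; the name `concatenate_terminals_sketch` is hypothetical.

```python
from typing import Any, List


def concatenate_terminals_sketch(
    sentences: List[Any], start_token: Any, end_token: Any
) -> List[Any]:
    """Add start/end tokens on the inner boundaries between sentences."""
    last = len(sentences) - 1
    result = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            # Every sentence except the first gets a start token.
            sentence = start_token + sentence
        if i < last:
            # Every sentence except the last gets an end token.
            sentence = sentence + end_token
        result.append(sentence)
    return result


as_strings = concatenate_terminals_sketch(
    ["Hello!", "How are you?", "Goodbye."], "<start>", "<end>"
)
# Works for any type that supports +, for example lists of tokens.
as_tokens = concatenate_terminals_sketch([["a"], ["b"]], ["<s>"], ["</s>"])
print(as_strings)
print(as_tokens)
```

Because only `+` is used, the tokens must be the same type as the sentences, which matches the requirement stated in the parameter descriptions.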
- sign_language_translator.text.utils.extract_supported_subsequences(sequence: Iterable[Any], tags: Iterable[Any], supported_tags: Set[Any], skipped_items: Set[Any]) List[List[Any]][source]
Extract supported subsequences from a sequence based on tags and skipped items.
- Parameters:
sequence (Iterable[Any]) – The input sequence.
tags (Iterable[Any]) – Tags corresponding to each item in the sequence.
supported_tags (Set[Any]) – Set of tags indicating support for a subsequence.
skipped_items (Set[Any]) – Set of items to be skipped.
- Returns:
A list of supported subsequences, where each inner list represents a subsequence.
- Return type:
List[List[Any]]
Examples:
sequence = [1, 2, 3, 4, 5, 6]
tags = ['A', 'A', 'B', 'A', 'A', 'C']
supported_tags = {'A'}
skipped_items = {2}
extract_supported_subsequences(sequence, tags, supported_tags, skipped_items)
# [[1], [4, 5]]
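One way to read the example above is as a grouping of consecutive supported items. The sketch below is a standalone approximation, not the library's implementation, and the name `supported_runs_sketch` is hypothetical. It assumes that a skipped item is dropped without ending the current run; the documented example does not settle this, because the item after the skipped one has an unsupported tag anyway.

```python
from typing import Any, Iterable, List, Set


def supported_runs_sketch(
    sequence: Iterable[Any],
    tags: Iterable[Any],
    supported_tags: Set[Any],
    skipped_items: Set[Any],
) -> List[List[Any]]:
    """Group consecutive items whose tag is supported (hypothetical helper)."""
    runs: List[List[Any]] = []
    current: List[Any] = []
    for item, tag in zip(sequence, tags):
        if item in skipped_items:
            # Assumption: skipped items are dropped and do not end the run.
            continue
        if tag in supported_tags:
            current.append(item)
        elif current:
            # An unsupported tag closes the run collected so far.
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


runs = supported_runs_sketch(
    [1, 2, 3, 4, 5, 6], ["A", "A", "B", "A", "A", "C"], {"A"}, {2}
)
print(runs)
```

With the inputs from the documented example, the sketch produces the same grouping as shown there.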
- sign_language_translator.text.utils.extract_supported_subsequences_indexes(sequence: Iterable[Any], tags: Iterable[Any], supported_tags: Set[Any], skipped_items: Set[Any]) List[List[int]][source]
Extract indexes of supported subsequences from a sequence based on tags and skipped items.
- Parameters:
sequence (Iterable[Any]) – The input sequence.
tags (Iterable[Any]) – Tags corresponding to each item in the sequence.
supported_tags (Set[Any]) – Set of tags indicating support for a subsequence.
skipped_items (Set[Any]) – Set of items to be skipped.
- Returns:
A list of indices of supported subsequences, where each inner list represents a subsequence.
- Return type:
List[List[int]]
Examples:
sequence = [1, 2, 3, 4, 5, 6]
tags = ['A', 'A', 'B', 'A', 'A', 'C']
supported_tags = {'A'}
skipped_items = {2}
extract_supported_subsequences_indexes(sequence, tags, supported_tags, skipped_items)
# [[0], [3, 4]]
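The index lists and the item lists in the two documented examples describe the same subsequences. The snippet below uses the documented index output as a hand-written literal (it does not call the library) and shows how index lists map back to items.

```python
sequence = [1, 2, 3, 4, 5, 6]

# Index lists copied from the documented example output.
indexes = [[0], [3, 4]]

# Look up each index in the original sequence to recover the items.
subsequences = [[sequence[i] for i in group] for group in indexes]
print(subsequences)
```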
- sign_language_translator.text.utils.make_ngrams(sequence: Iterable, n: int) List[Iterable][source]
Create all possible slices of size n from the given iterable. For example, sequence="1234" and n=2 would create ["12", "23", "34"].
- Parameters:
sequence (Iterable) – The iterable sequence from which the n-grams will be created.
n (int) – The size of the n-grams.
- Returns:
A list of Iterables representing the n-grams created from the sequence. The type of the list items is the same as that of the sequence argument.
- Return type:
List[Iterable]
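The sliding-window idea can be sketched with plain slicing. This is an illustration under the assumption that the sequence supports `len()` and slicing (strings, lists, tuples); the name `make_ngrams_sketch` is hypothetical and the code is not taken from the library.

```python
from typing import List, Sequence


def make_ngrams_sketch(sequence: Sequence, n: int) -> List[Sequence]:
    """Return every contiguous slice of length n, in order."""
    # range() is empty when n exceeds the sequence length, giving [].
    return [sequence[i : i + n] for i in range(len(sequence) - n + 1)]


string_bigrams = make_ngrams_sketch("1234", 2)
list_trigrams = make_ngrams_sketch(["a", "b", "c", "d"], 3)
print(string_bigrams)
print(list_trigrams)
```

Slicing preserves the container type, so a string yields strings and a list yields lists, in line with the note that the list items have the same type as the sequence argument.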