sign_language_translator.text.utils module

Utility Functions for Text Processing

This module contains utility functions for text processing tasks.

Functions:

make_ngrams: Creates n-grams from a given sequence.
extract_supported_subsequences_indexes: Extracts the indexes of subsequences based on provided tags and skipped items.
extract_supported_subsequences: Extracts subsequences from a given sequence based on provided tags and skipped items.
concatenate_sentence_terminals: Concatenates start and end of sentence tokens to a list of sentences.

Classes:

ListRegex: A utility class for finding sub-lists within a list of strings that match specified patterns.

class sign_language_translator.text.utils.ListRegex

Bases: object

A utility class for finding sub-lists within a list of strings that match specified patterns.

ListRegex provides methods for matching patterns against items in a list, searching for the first occurrence of patterns, finding all occurrences of patterns, and retrieving the starting and ending indices of matches.

Patterns can be defined using:
  1. regular expressions (str)

  2. lists of patterns (regex (str) or a nested list of patterns)

  3. tuple containing the pattern and its interval quantifier, e.g. (r"\w+", (2, None)).

When using regular expressions, each pattern is matched against an individual item in the list. When using a list of patterns, an item matches if any of the patterns in the list matches it. When using a tuple of a pattern and an interval quantifier, the pattern may match a run of consecutive items whose count lies within the specified interval.
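
The first two pattern forms can be illustrated with a minimal sketch. The helper `item_matches` is hypothetical and is not part of ListRegex. It assumes that a regular expression must match the whole item, which the examples here are consistent with but do not confirm. The tuple form with an interval quantifier is left out.

```python
import re
from typing import List, Union


def item_matches(item: str, pattern: Union[str, List]) -> bool:
    """Check one item against a string or list pattern (hypothetical helper)."""
    if isinstance(pattern, str):
        # Assumption: the regular expression must match the entire item.
        return re.fullmatch(pattern, item) is not None
    # A list of patterns matches when any of its (possibly nested) patterns matches.
    return any(item_matches(item, alternative) for alternative in pattern)


print(item_matches("banana", r"ba(na){2}"))       # True
print(item_matches("kiwi", ["grape", "kiwi"]))    # True
print(item_matches("melon", ["grape", ["kiwi"]])) # False
```

A full matcher would additionally walk the items and the patterns in parallel and handle the quantified tuples, which this sketch does not attempt.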

Examples:

items = ["apple", "banana", "orange", "orange", "grape", "melon", "orange", "kiwi"]

# Match the patterns against the items
patterns = ["apple", r"\w+"]
result = ListRegex.match(items, patterns)
# Output: (0, 2)

# Search for the first occurrence of the patterns
patterns = [r"ba(na){2}", ("orange", (0,3))]
result = ListRegex.search(items, patterns)
# Output: (1, 4)

# Find all occurrences of the patterns
patterns = ["orange", ["grape", "kiwi"]]
result = ListRegex.find_all(items, patterns)
# Output: [['orange', 'grape'], ['orange', 'kiwi']]

static find_all(items: List[str], patterns: List) -> List[List[str]]

Finds all occurrences of the patterns in the list of items.

Parameters:
  • items (List[str]) – The list of strings to be searched.

  • patterns (List) – The patterns to be searched for in the items.

Returns:

A list of matched subsequences of items.

Return type:

List[List[str]]

static find_all_spans(items: List[str], patterns: List) -> List[Tuple[int, int]]

Finds the starting and ending indices of all occurrences of the patterns in the list of items.

Parameters:
  • items (List[str]) – The list of strings to be searched.

  • patterns (List) – The patterns to be searched for in the items.

Returns:

A list of tuples containing the starting and ending indices of the matched items.

Return type:

List[Tuple[int,int]]
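
The relationship between spans and matched sub-lists can be sketched with plain slicing. The spans below are an assumption: they are inferred from the find_all example in the class description, not produced by running the library. The match example there, which returns (0, 2) for a two-item match, suggests that the end index is exclusive.

```python
from typing import List, Tuple

items = ["apple", "banana", "orange", "orange", "grape", "melon", "orange", "kiwi"]

# Assumed spans for patterns = ["orange", ["grape", "kiwi"]], inferred from the
# documented find_all output rather than returned by find_all_spans itself.
spans: List[Tuple[int, int]] = [(3, 5), (6, 8)]

# Each span is treated as half-open: start is included, end is excluded.
matches = [items[start:end] for start, end in spans]
print(matches)  # [['orange', 'grape'], ['orange', 'kiwi']]
```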

static match(items: List[str], patterns: List[str | List | Tuple]) -> Tuple[int, int] | None

Matches the given patterns against the items in the list. The patterns are applied at the start of the list of strings.

Parameters:
  • items (List[str]) – The sequence of strings to be matched.

  • patterns (List[str|List|Tuple]) – The patterns to be matched against the items.

Returns:

A tuple containing the starting and ending indices of the matched items, or None if no match is found.

Return type:

Tuple[int, int] or None

static search(items: List[str], patterns) -> Tuple[int, int] | None

Searches for the first occurrence of the patterns in the list of items.

Parameters:
  • items (List[str]) – The list of strings to be searched.

  • patterns (List) – The patterns to be searched for in the items.

Returns:

A tuple containing the starting and ending indices of the matched items, or None if no match is found.

Return type:

Tuple[int, int] or None

sign_language_translator.text.utils.concatenate_sentence_terminals(sentences: List, start_token, end_token)

Inserts start and end tokens between the sentences of the input list by concatenating them to the sentences (useful when the input comes from a sentence tokenizer).

This function takes a list of sentences and adds a start token to the beginning of each sentence except the first and an end token to the end of each sentence except the last.

Parameters:
  • sentences (List) – A list of sentences to be processed. Sentences can be strings, lists of tokens, or any other type that supports the + operator for concatenation.

  • start_token – The token to be added at the start of sentences. Must be the same type as a sentence.

  • end_token – The token to be added at the end of sentences. Must be the same type as a sentence.

Returns:

A new list of sentences with start and end tokens inserted.

Return type:

List

Example:

sentences = ["Hello!", "How are you?", "Goodbye."]
start_token = "<start>"
end_token = "<end>"
result = concatenate_sentence_terminals(sentences, start_token, end_token)
# Output: ["Hello!<end>", "<start>How are you?<end>", "<start>Goodbye."]
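
The behaviour described above can be sketched in a few lines. The function `add_sentence_terminals` is a hypothetical stand-in written for illustration, not the library's implementation. It relies only on the + operator, so it works for strings as well as for lists of tokens.

```python
from typing import Any, List


def add_sentence_terminals(sentences: List[Any], start_token: Any, end_token: Any) -> List[Any]:
    """Add start/end tokens at the boundaries between sentences (illustrative sketch)."""
    last = len(sentences) - 1
    result = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            # Every sentence except the first gets a start token in front.
            sentence = start_token + sentence
        if i < last:
            # Every sentence except the last gets an end token appended.
            sentence = sentence + end_token
        result.append(sentence)
    return result


print(add_sentence_terminals(["Hello!", "How are you?", "Goodbye."], "<start>", "<end>"))
# ['Hello!<end>', '<start>How are you?<end>', '<start>Goodbye.']

print(add_sentence_terminals([["hi"], ["bye"]], ["<start>"], ["<end>"]))
# [['hi', '<end>'], ['<start>', 'bye']]
```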

sign_language_translator.text.utils.extract_supported_subsequences(sequence: Iterable[Any], tags: Iterable[Any], supported_tags: Set[Any], skipped_items: Set[Any]) -> List[List[Any]]

Extract supported subsequences from a sequence based on tags and skipped items.

Parameters:
  • sequence (Iterable[Any]) – The input sequence.

  • tags (Iterable[Any]) – Tags corresponding to each item in the sequence.

  • supported_tags (Set[Any]) – Set of tags indicating support for a subsequence.

  • skipped_items (Set[Any]) – Set of items to be skipped.

Returns:

A list of supported subsequences, where each inner list represents a subsequence.

Return type:

List[List[Any]]

Examples:

sequence = [1, 2, 3, 4, 5, 6]
tags = ['A', 'A', 'B', 'A', 'A', 'C']
supported_tags = {'A'}
skipped_items = {2}
extract_supported_subsequences(sequence, tags, supported_tags, skipped_items)
# [[1], [4, 5]]
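
One way to obtain the output above is sketched below. The function `extract_runs` is hypothetical and is not the library's implementation. It assumes that a skipped item is dropped without ending the current subsequence and that only an unsupported tag ends it. The example above is consistent with that reading but does not confirm it.

```python
from typing import Any, Iterable, List, Set


def extract_runs(
    sequence: Iterable[Any],
    tags: Iterable[Any],
    supported_tags: Set[Any],
    skipped_items: Set[Any],
) -> List[List[Any]]:
    """Group consecutive items with supported tags (illustrative sketch)."""
    runs: List[List[Any]] = []
    current: List[Any] = []
    for item, tag in zip(sequence, tags):
        if tag in supported_tags:
            # Assumption: skipped items are left out but do not end the run.
            if item not in skipped_items:
                current.append(item)
        elif current:
            # An unsupported tag closes the run collected so far.
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


print(extract_runs([1, 2, 3, 4, 5, 6], ["A", "A", "B", "A", "A", "C"], {"A"}, {2}))
# [[1], [4, 5]]
```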

sign_language_translator.text.utils.extract_supported_subsequences_indexes(sequence: Iterable[Any], tags: Iterable[Any], supported_tags: Set[Any], skipped_items: Set[Any]) -> List[List[int]]

Extract indexes of supported subsequences from a sequence based on tags and skipped items.

Parameters:
  • sequence (Iterable[Any]) – The input sequence.

  • tags (Iterable[Any]) – Tags corresponding to each item in the sequence.

  • supported_tags (Set[Any]) – Set of tags indicating support for a subsequence.

  • skipped_items (Set[Any]) – Set of items to be skipped.

Returns:

A list of indices of supported subsequences, where each inner list represents a subsequence.

Return type:

List[List[int]]

Examples:

sequence = [1, 2, 3, 4, 5, 6]
tags = ['A', 'A', 'B', 'A', 'A', 'C']
supported_tags = {'A'}
skipped_items = {2}
extract_supported_subsequences_indexes(sequence, tags, supported_tags, skipped_items)
# [[0], [3, 4]]
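
The index form is convenient when the positions are needed later, and it can be mapped back to the items themselves. The sketch below copies the index groups from the example output above instead of calling the library, and it assumes the sequence is indexable, such as a list.

```python
sequence = [1, 2, 3, 4, 5, 6]

# Index groups copied from the documented example output.
index_groups = [[0], [3, 4]]

# Look up each index to recover the corresponding items.
subsequences = [[sequence[i] for i in group] for group in index_groups]
print(subsequences)  # [[1], [4, 5]]
```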

sign_language_translator.text.utils.make_ngrams(sequence: Iterable, n: int) -> List[Iterable]

Creates all possible slices of size n from the given iterable. For example, sequence="1234" and n=2 would create ["12", "23", "34"].

Parameters:
  • sequence (Iterable) – The iterable sequence from which the n-grams will be created.

  • n (int) – The size of the n-grams.

Returns:

A list of Iterables representing the n-grams created from the sequence.

The type of the list items is the same as that of the sequence argument.

Return type:

List[Iterable]
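
A sliding-window slice is one simple way to produce such n-grams. The function `make_ngrams_sketch` is a hypothetical illustration, not the library's code, and it assumes the input supports len() and slicing (for example a str, list, or tuple).

```python
from typing import List, Sequence


def make_ngrams_sketch(sequence: Sequence, n: int) -> List[Sequence]:
    """Return every contiguous slice of length n (illustrative sketch)."""
    # The last valid start index is len(sequence) - n; if n exceeds the
    # length, the range is empty and no n-grams are returned.
    return [sequence[i : i + n] for i in range(len(sequence) - n + 1)]


print(make_ngrams_sketch("1234", 2))     # ['12', '23', '34']
print(make_ngrams_sketch([1, 2, 3], 2))  # [[1, 2], [2, 3]]
```

Because slicing preserves the container type, each n-gram has the same type as the input sequence.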