sign_language_translator.models.text_embedding package
Submodules
Module contents
- class sign_language_translator.models.text_embedding.TextEmbeddingModel
Bases: ABC
Abstract class for text embedding models.
- embed(text: str) -> torch.Tensor: Embeds text into a vector.
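As a minimal sketch of the contract this abstract class describes, the following uses Python's `abc` module, with plain lists of floats standing in for `torch.Tensor`. The names `EmbeddingModelSketch` and `CharCountModel` are hypothetical and are not part of the library.

```python
from abc import ABC, abstractmethod
from typing import List


class EmbeddingModelSketch(ABC):
    """Hypothetical stand-in for TextEmbeddingModel (lists instead of tensors)."""

    @abstractmethod
    def embed(self, text: str) -> List[float]:
        """Embed text into a vector."""


class CharCountModel(EmbeddingModelSketch):
    """Toy subclass: a 2-D vector of (character count, whitespace-token count)."""

    def embed(self, text: str) -> List[float]:
        return [float(len(text)), float(len(text.split()))]


model = CharCountModel()
vector = model.embed("example text")
print(vector)  # [12.0, 2.0]
```

Instantiating `EmbeddingModelSketch` directly raises `TypeError`, because `embed` is abstract; a concrete subclass has to provide it.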
- class sign_language_translator.models.text_embedding.VectorLookupModel(tokens: List[str], vectors: Tensor, alignment_matrix: Tensor | None = None, description: str = '')
Bases: TextEmbeddingModel
VectorLookupModel extends TextEmbeddingModel to provide text embeddings based on predefined token vectors.
- index_to_token
A list containing tokens in the same order as the vectors.
- Type:
List[str]
- known_tokens
A frozenset containing unique known tokens.
- Type:
frozenset
- token_to_index
A dictionary mapping tokens to their corresponding indices.
- Type:
Dict[str, int]
- vectors
A 2D tensor representing the token vectors.
- Type:
torch.Tensor
- update(self, tokens: List[str], vectors: torch.Tensor) -> None: Adds new tokens or updates existing ones, along with the token-to-index hash table.
- embed(self, text: str, pre_normalize=False, post_normalize=False, tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split()) -> torch.Tensor: Returns the pretrained embedding vector for a token, or the average embedding of its sub-tokens.
- __getitem__(self, token: str) -> torch.Tensor: Returns the vector for a specific token.
- save(self, path: str): Saves the model state (tokens and vectors) to a file.
- load(cls, path: str): Loads a saved model state (tokens and vectors) from a file.
Example:
.. code-block:: python

    import torch
    from sign_language_translator.models import VectorLookupModel

    tokens = ["example", "text"]
    vectors = torch.tensor([[1, 2, 3], [4, 5, 6]])
    model = VectorLookupModel(tokens, vectors)

    embedding = model.embed("example text")  # [2.5, 3.5, 4.5]

    model.update(["hello"], torch.tensor([[7, 8, 9]]))

    model.save("model.pt")
    loaded_model = VectorLookupModel.load("model.pt")
- embed(text: str, pre_normalize=False, post_normalize=False, align=False, tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split()) -> Tensor
Embeds the given text into a vector representation by lookup or averaging pre-computed embeddings.
- Parameters:
text (str) – The input text to embed. It can be a single entry in the model vocabulary or a string of several tokens from the vocabulary. If the text is unknown, a zero vector is returned.
pre_normalize (bool, optional) – Whether to normalize the vectors of tokens in the text before averaging. Defaults to False.
post_normalize (bool, optional) – Whether to normalize the vector after embedding. Defaults to False.
align (bool, optional) – Whether to transform the final vector using the alignment matrix. Defaults to False.
tokenizer (Callable[[str], Iterable[str]], optional) – A callable function to tokenize the text. Only used if the text is not present in the model vocabulary. Defaults to splitting on whitespace.
- Returns:
The embedded vector representation of the input text.
- Return type:
torch.Tensor
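The lookup-or-average behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: lists of floats replace tensors, `embed_sketch`, `table`, and `dim` are hypothetical names, the `align` step is omitted, and tokens missing from the table are assumed to be skipped when averaging.

```python
import math
from typing import Callable, Dict, Iterable, List


def _normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm; return an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed_sketch(
    text: str,
    table: Dict[str, List[float]],
    dim: int,
    pre_normalize: bool = False,
    post_normalize: bool = False,
    tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split(),
) -> List[float]:
    """Look the text up directly, else average the vectors of its known tokens."""
    if text in table:
        vecs = [table[text]]  # whole text is in the vocabulary
    else:
        # Assumption: tokens missing from the table are skipped.
        vecs = [table[t] for t in tokenizer(text) if t in table]
    if not vecs:
        return [0.0] * dim  # unknown text -> zero vector
    if pre_normalize:
        vecs = [_normalize(v) for v in vecs]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return _normalize(mean) if post_normalize else mean


table = {"example": [1.0, 2.0, 3.0], "text": [4.0, 5.0, 6.0]}
print(embed_sketch("example text", table, dim=3))  # [2.5, 3.5, 4.5]
print(embed_sketch("unseen", table, dim=3))        # [0.0, 0.0, 0.0]
```

With `pre_normalize=True` each token vector contributes only its direction to the average, so long vectors do not dominate; `post_normalize=True` rescales the final result to unit length.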
- classmethod load(path: str)
Load a VectorLookupModel from a saved checkpoint. If the path ends with ‘.zip’ the file will be decompressed.
- Parameters:
path (str) – The path to the saved checkpoint.
- Returns:
The loaded VectorLookupModel instance.
- Return type:
VectorLookupModel
- property normalized_vectors
- save(path: str)
Serialize the tokens list and corresponding vectors to a file. If the path ends with ‘.zip’ the file will be compressed.
- Parameters:
path (str) – The path to save the model file.
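The extension-based compression that `save` and `load` describe can be illustrated with a round trip through the standard library. This sketch swaps in JSON and `zipfile` for whatever checkpoint format the library actually uses; `save_sketch`, `load_sketch`, and the archive member name `model.json` are hypothetical.

```python
import json
import os
import tempfile
import zipfile
from typing import List, Tuple


def save_sketch(path: str, tokens: List[str], vectors: List[List[float]]) -> None:
    """Write tokens and vectors as JSON; compress when the path ends with '.zip'."""
    payload = json.dumps({"tokens": tokens, "vectors": vectors})
    if path.endswith(".zip"):
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("model.json", payload)
    else:
        with open(path, "w", encoding="utf-8") as f:
            f.write(payload)


def load_sketch(path: str) -> Tuple[List[str], List[List[float]]]:
    """Read the state back, decompressing first when the path ends with '.zip'."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            payload = zf.read("model.json").decode("utf-8")
    else:
        with open(path, encoding="utf-8") as f:
            payload = f.read()
    state = json.loads(payload)
    return state["tokens"], state["vectors"]


# Round trip through a compressed file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "model.zip")
    save_sketch(zip_path, ["example", "text"], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    tokens, vectors = load_sketch(zip_path)
```

Deciding on the file extension keeps the call signature identical for compressed and uncompressed checkpoints, which matches the single `path` parameter documented above.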
- similar(vector: Tensor, k: int = 1) -> Tuple[List[str], List[float]]
Find the k most similar tokens to the given vector.
- Parameters:
vector (torch.Tensor) – The 1D vector for which to find similar tokens.
k (int, optional) – The number of similar tokens to return. Defaults to 1.
- Returns:
A tuple containing the k most similar tokens and their corresponding cosine similarities.
- Return type:
Tuple[List[str], List[float]]
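A top-k search by cosine similarity, as described for `similar`, might look like the following in plain Python. It is a sketch only: `similar_sketch` and `cosine` are hypothetical helpers, lists replace tensors, and a zero-norm vector is assumed to score 0.0.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumed to be 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similar_sketch(
    vector: Sequence[float],
    tokens: List[str],
    vectors: List[List[float]],
    k: int = 1,
) -> Tuple[List[str], List[float]]:
    """Return the k tokens whose vectors are most similar to `vector`."""
    scored = sorted(
        ((cosine(vector, v), t) for t, v in zip(tokens, vectors)),
        key=lambda pair: pair[0],
        reverse=True,  # highest similarity first
    )[:k]
    return [t for _, t in scored], [s for s, _ in scored]


tokens = ["east", "north", "west"]
vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
names, scores = similar_sketch([2.0, 0.0], tokens, vectors, k=2)
print(names)  # ['east', 'north']
```

Because cosine similarity ignores magnitude, the query `[2.0, 0.0]` matches `"east"` with a score of 1.0 even though the two vectors differ in length.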
- property tokens_array
- update(tokens: List[str], vectors: Tensor) -> None
Update the vector lookup model with new tokens and their corresponding vectors.
- Parameters:
tokens (List[str]) – The list of new tokens to be added or updated.
vectors (torch.Tensor) – The tensor of corresponding vectors for the new tokens.
alignment_matrix (Optional[torch.Tensor], optional) – A 2D Tensor to transform the final vectors. (e.g. some orthogonal matrix can be used to align the word vector to an embedding for some other language or model). Defaults to None.
description (str, optional) – A description of the model. Defaults to "".
- Raises:
ValueError – If the dimensions of the new vectors do not match the dimensions of the existing vectors.
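The add-or-overwrite behavior and the dimension check can be sketched as follows. `LookupTableSketch` is a hypothetical class that mirrors the documented attribute names with plain lists; it validates all inputs before changing any state, which is an assumption rather than a confirmed detail of the library.

```python
from typing import Dict, List


class LookupTableSketch:
    """Hypothetical, list-based stand-in for the token/vector bookkeeping."""

    def __init__(self, tokens: List[str], vectors: List[List[float]]) -> None:
        self.index_to_token: List[str] = list(tokens)
        self.vectors: List[List[float]] = [list(v) for v in vectors]
        self.token_to_index: Dict[str, int] = {
            t: i for i, t in enumerate(self.index_to_token)
        }

    def update(self, tokens: List[str], vectors: List[List[float]]) -> None:
        if len(tokens) != len(vectors):
            raise ValueError("tokens and vectors must have the same length")
        # Validate every new vector before mutating any state.
        if self.vectors:
            dim = len(self.vectors[0])
            for vec in vectors:
                if len(vec) != dim:
                    raise ValueError(f"expected dimension {dim}, got {len(vec)}")
        for token, vec in zip(tokens, vectors):
            if token in self.token_to_index:
                # Known token: overwrite its vector in place.
                self.vectors[self.token_to_index[token]] = list(vec)
            else:
                # New token: append it and record its index.
                self.token_to_index[token] = len(self.index_to_token)
                self.index_to_token.append(token)
                self.vectors.append(list(vec))


table = LookupTableSketch(["example", "text"], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
table.update(["hello", "text"], [[7.0, 8.0, 9.0], [0.0, 0.0, 0.0]])
print(table.index_to_token)  # ['example', 'text', 'hello']
```

Checking dimensions up front means a failed call leaves the table untouched instead of half-updated.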