sign_language_translator.models.video_embedding package
Submodules
Module contents
Video Embedding Models
This module provides a collection of deep learning models pretrained on video-based tasks. These models extract features from videos that can be used for applications such as gesture recognition, action analysis, and sign language translation.
Available Models:
- `VideoEmbeddingModel`: An abstract base class representing a video embedding model.
This class defines common attributes and methods (such as embed()) for video embedding models.
- `MediaPipeLandmarksModel`: A model that uses MediaPipe's pose and hand solutions to generate video embeddings.
It detects keypoints representing body joints and estimates their positions both in 3D world coordinates and in image (pixel) coordinates.
Usage:
from sign_language_translator.models import MediaPipeLandmarksModel
model = MediaPipeLandmarksModel()
# Define 'frames' as a list of numpy arrays (Width, Height, Channels)
frames = [...] # Replace with actual video frames
# Generate video embeddings using the MediaPipeLandmarksModel
embeddings = model.embed(frames, landmark_type="world")
print(embeddings.shape) # (n_frames, n_landmarks * 5)
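The output shape above suggests that each landmark contributes five values to a frame's embedding row. The following is a minimal sketch of regrouping one flat row into per-landmark groups. It assumes the five values of a landmark are stored contiguously, which this page does not state, and `split_landmarks` is a hypothetical helper, not part of the library. Plain Python lists stand in for tensors so the snippet runs without extra dependencies.

```python
from typing import List, Sequence


def split_landmarks(
    row: Sequence[float], values_per_landmark: int = 5
) -> List[List[float]]:
    """Group a flat embedding row into one list per landmark.

    Assumption: values belonging to one landmark are adjacent in the row.
    """
    if len(row) % values_per_landmark != 0:
        raise ValueError("row length is not a multiple of values_per_landmark")
    return [
        list(row[i : i + values_per_landmark])
        for i in range(0, len(row), values_per_landmark)
    ]


# A made-up row for a frame with 3 landmarks (3 * 5 = 15 values).
row = [float(i) for i in range(15)]
landmarks = split_landmarks(row)
print(len(landmarks), len(landmarks[0]))
```

With a real tensor, the equivalent regrouping would typically be a reshape to (n_frames, n_landmarks, 5), under the same layout assumption.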
- class sign_language_translator.models.video_embedding.MediaPipeLandmarksModel(pose_model_name='pose_landmarker_heavy.task', hand_model_name='hand_landmarker.task', number_of_persons: int = 1)[source]
Bases: VideoEmbeddingModel
A video embedding model using MediaPipe to extract pose and hand landmarks from video frames.
- Parameters:
pose_model_name (str) – The name of the pose estimation model.
hand_model_name (str) – The name of the hand estimation model.
number_of_persons (int) – The maximum number of persons to detect in each frame.
- n_persons
The maximum number of persons to detect in each frame.
- Type:
int
- embed(frame_sequence: Iterable[Tensor | ndarray[Any, dtype[uint8]]], landmark_type: str = 'world', progress_callback: ProgressStatusCallback | None = None, total_frames: int | None = None, **kwargs) Tensor[source]
Embed a sequence of frames (video) into a sequence of pose & hand landmarks.
- Parameters:
frame_sequence (Iterable[torch.Tensor | NDArray[np.uint8]]) – A sequence of video frames as 3D arrays of shape (W, H, C).
landmark_type (str) – The type of landmarks to include in the embedding: "world", "image", or "all".
- Returns:
A tensor containing the frame embeddings.
- Return type:
torch.Tensor
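Because `frame_sequence` is typed as an `Iterable`, frames can presumably be supplied lazily, for example from a generator. A generator has no length, which is likely why the signature accepts `total_frames`; that reading is an assumption, since the parameter is not described here. The sketch below uses nested lists as a stand-in for uint8 arrays, `generate_frames` is a hypothetical helper, and the call to `embed` is left commented out because it needs the library and its model files.

```python
from typing import Iterator, List

# Stand-in for a uint8 array of shape (W, H, C); real code would use NumPy or torch.
Frame = List[List[List[int]]]


def generate_frames(n_frames: int, width: int = 4, height: int = 3) -> Iterator[Frame]:
    """Yield blank RGB frames one at a time instead of building a full list."""
    for _ in range(n_frames):
        yield [[[0, 0, 0] for _ in range(height)] for _ in range(width)]


n_frames = 10
frames = generate_frames(n_frames)

# Hypothetical call, not executed here:
# embeddings = model.embed(frames, landmark_type="all", total_frames=n_frames)

# Consuming the generator shows that frames are produced on demand.
consumed = sum(1 for _ in frames)
print(consumed)
```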
- class sign_language_translator.models.video_embedding.VideoEmbeddingModel[source]
Bases: ABC
Abstract base class for video embedding models.
This class defines the interface for video embedding models, which transform a sequence of video frames into an embedding tensor.
- abstract embed(frame_sequence: Iterable[Tensor | ndarray[Any, dtype[uint8]]], **kwargs) Tensor[source]
Embed a sequence of video frames into an embedding tensor.
- Parameters:
frame_sequence (Iterable[Union[Tensor, NDArray[uint8]]]) – A sequence of video frames, where each frame can be either a Tensor or a numpy array of uint8 values of shape (W, H, C).
**kwargs – Additional keyword arguments specific to the embedding model.
- Returns:
An embedding tensor representing the sequence of video frames.
- Return type:
Tensor
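The interface described above can be illustrated with a small stand-alone sketch. `VideoEmbeddingModelSketch` is a local stand-in for the library's abstract class, and `MeanColorModel` is a hypothetical subclass invented for illustration; neither is part of the package. To keep the snippet dependency-free, it returns nested lists where a real implementation would return a tensor.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Sequence

# Stand-in for a uint8 frame of shape (W, H, C).
Frame = Sequence[Sequence[Sequence[int]]]


class VideoEmbeddingModelSketch(ABC):
    """Local stand-in mirroring the abstract interface documented above."""

    @abstractmethod
    def embed(self, frame_sequence: Iterable[Frame], **kwargs) -> List[List[float]]:
        """Turn a sequence of frames into one embedding row per frame."""


class MeanColorModel(VideoEmbeddingModelSketch):
    """Hypothetical model: embeds each frame as its per-channel mean value."""

    def embed(self, frame_sequence: Iterable[Frame], **kwargs) -> List[List[float]]:
        embeddings = []
        for frame in frame_sequence:
            # Flatten the two spatial axes into a single list of pixels.
            pixels = [pixel for column in frame for pixel in column]
            n_channels = len(pixels[0])
            embeddings.append(
                [sum(p[c] for p in pixels) / len(pixels) for c in range(n_channels)]
            )
        return embeddings


# Two tiny made-up frames, each 1 x 2 pixels with 3 channels.
frames = [
    [[[0, 0, 0], [255, 255, 255]]],
    [[[10, 20, 30], [30, 40, 50]]],
]
embeddings = MeanColorModel().embed(frames)
print(embeddings)
```

Keeping `embed` abstract means the base class cannot be instantiated directly, so every concrete model has to supply its own embedding logic while callers rely on a single method name.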