sign_language_translator.models.video_embedding package

Module contents

Video Embedding Models

This module provides deep learning models pretrained on video-based tasks. These models extract feature representations (embeddings) from videos, which can be used for applications such as gesture recognition, action analysis, and sign language translation.

Available Models:

`VideoEmbeddingModel`: An abstract base class representing a video embedding model.
It defines the attributes and methods (such as `embed()`) shared by all video embedding models.
`MediaPipeLandmarksModel`: A model that uses MediaPipe's pose and hand solutions to generate video embeddings.
It detects keypoints representing body joints and estimates their positions both in 3D world coordinates and in image (frame) coordinates.

Usage:

from sign_language_translator.models import MediaPipeLandmarksModel

model = MediaPipeLandmarksModel()

# Define 'frames' as a list of numpy arrays (Width, Height, Channels)
frames = [...]  # Replace with actual video frames

# Generate video embeddings using the MediaPipeLandmarksModel
embeddings = model.embed(frames, landmark_type="world")
print(embeddings.shape)  # (n_frames, n_landmarks * 5)
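
The shape comment above implies that each frame's embedding is a flat vector holding five values per landmark. The following is a minimal sketch of regrouping one such row into per-landmark tuples. It uses plain Python lists instead of a tensor so that it runs without the library installed. The helper name `group_landmarks` is hypothetical, and the landmark count of 75 (33 MediaPipe pose landmarks plus 21 for each of two hands) is an assumption for illustration; the meaning of the five values is not stated in this section.

```python
from typing import List, Sequence, Tuple

VALUES_PER_LANDMARK = 5  # taken from the shape comment: n_landmarks * 5


def group_landmarks(
    row: Sequence[float], values_per_landmark: int = VALUES_PER_LANDMARK
) -> List[Tuple[float, ...]]:
    """Split one flat frame embedding into one tuple per landmark."""
    if len(row) % values_per_landmark != 0:
        raise ValueError("row length is not a multiple of values_per_landmark")
    return [
        tuple(row[i : i + values_per_landmark])
        for i in range(0, len(row), values_per_landmark)
    ]


# Assumed landmark count: 33 pose + 2 * 21 hand landmarks = 75.
n_landmarks = 33 + 2 * 21
dummy_row = [float(i) for i in range(n_landmarks * VALUES_PER_LANDMARK)]
landmarks = group_landmarks(dummy_row)
print(len(landmarks), len(landmarks[0]))
```

With a real `torch.Tensor`, the equivalent regrouping would likely be `embeddings.reshape(n_frames, -1, 5)`, assuming the five values of each landmark are stored contiguously.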

class sign_language_translator.models.video_embedding.MediaPipeLandmarksModel(pose_model_name='pose_landmarker_heavy.task', hand_model_name='hand_landmarker.task', number_of_persons: int = 1)

Bases: VideoEmbeddingModel

A video embedding model using MediaPipe to extract pose and hand landmarks from video frames.

Parameters:

pose_model_name (str) – The name of the pose estimation model.
hand_model_name (str) – The name of the hand estimation model.
number_of_persons (int) – The maximum number of persons to detect in each frame.

n_persons

The maximum number of persons to detect in each frame.

Type: int

embed(): Embeds a sequence of frames using pose and hand landmarks.

embed(frame_sequence: Iterable[Tensor | ndarray[Any, dtype[uint8]]], landmark_type: str = 'world', progress_callback: ProgressStatusCallback | None = None, total_frames: int | None = None, **kwargs) → Tensor

Embed a sequence of frames (video) into a sequence of pose & hand landmarks.

Parameters:

frame_sequence (Iterable[torch.Tensor | NDArray[np.uint8]]) – A sequence of video frames as 3D arrays of shape (W, H, C).
landmark_type (str) – The type of landmarks to include in the embedding: "world", "image", or "all". Defaults to "world".

Returns:

A tensor containing the frame embeddings.

Return type:

torch.Tensor

class sign_language_translator.models.video_embedding.VideoEmbeddingModel

Bases: ABC

Abstract base class for video embedding models.

This class defines the interface for video embedding models, which transform a sequence of video frames into an embedding tensor.

embed(frame_sequence, **kwargs): Abstract method to embed a sequence of video frames.

abstract embed(frame_sequence: Iterable[Tensor | ndarray[Any, dtype[uint8]]], **kwargs) → Tensor

Embed a sequence of video frames into an embedding tensor.

Parameters:

frame_sequence (Iterable[Union[Tensor, NDArray[uint8]]]) – A sequence of video frames, where each frame can be either a Tensor or a numpy array of uint8 values of shape (W, H, C).
**kwargs – Additional keyword arguments specific to the embedding model.

Returns:

An embedding tensor representing the sequence of video frames.

Return type:

Tensor
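
To illustrate the interface described above, here is a minimal sketch of a concrete subclass. To stay runnable without the library or torch, it defines a stand-in base class, `VideoEmbeddingModelSketch`, that only mirrors the documented abstract `embed` signature, and it represents frames as nested lists rather than tensors. `MeanIntensityModel` is a hypothetical toy model, not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Sequence

# A frame here is a nested list of pixels, each pixel a list of channel values
# (a stand-in for a uint8 array).
Frame = Sequence[Sequence[Sequence[int]]]


class VideoEmbeddingModelSketch(ABC):
    """Stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def embed(self, frame_sequence: Iterable[Frame], **kwargs) -> List[List[float]]:
        """Map a sequence of frames to one embedding vector per frame."""


class MeanIntensityModel(VideoEmbeddingModelSketch):
    """Hypothetical toy model: embeds each frame as its per-channel mean."""

    def embed(self, frame_sequence: Iterable[Frame], **kwargs) -> List[List[float]]:
        embeddings = []
        for frame in frame_sequence:
            # Flatten the two spatial dimensions into one list of pixels.
            pixels = [pixel for line in frame for pixel in line]
            n_channels = len(pixels[0])
            embeddings.append(
                [sum(p[c] for p in pixels) / len(pixels) for c in range(n_channels)]
            )
        return embeddings


# Two 2x2 three-channel frames: one all zeros, one uniformly (10, 20, 30).
frames = [
    [[[0, 0, 0]] * 2] * 2,
    [[[10, 20, 30]] * 2] * 2,
]
result = MeanIntensityModel().embed(frames)
print(result)
```

Because `embed` is abstract, instantiating the base class directly raises `TypeError`; a real subclass of `VideoEmbeddingModel` would presumably return a `Tensor` with one row per frame instead of nested lists.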