sign_language_translator.models.language_models.transformer_language_model.layers module

Custom Layers for Decoder-only Transformers

This module contains custom layers used in the Transformer Decoder architecture.

Classes:
FeedForward(torch.nn.Module): Implements a simple feedforward neural network module with one hidden layer.

CausalMultiHeadSelfAttention(torch.nn.Module): Implements the causal multi-head self-attention mechanism used in transformer decoders.

DecoderBlock(torch.nn.Module): Implements a single transformer decoder block with multi-head self-attention and feedforward neural network layers but no cross-attention.

Example:

import torch
from sign_language_translator.models.language_models.transformer_language_model.layers import (
    CausalMultiHeadSelfAttention,
    DecoderBlock,
    FeedForward,
)

model = FeedForward(n_embed=256, hidden_size=512, dropout=0.2, activation='relu')
input_tensor = torch.randn(32, 256)
output_tensor = model(input_tensor)

decoder_block = DecoderBlock(n_embed=256, hidden_size=512, n_heads=8, max_seq_len=32, dropout=0.2, activation='relu')
input_tensor = torch.randn(16, 32, 256)
output_tensor = decoder_block(input_tensor)

attention_layer = CausalMultiHeadSelfAttention(n_heads=8, embed_size=256, dropout=0.2)
input_tensor = torch.randn(16, 32, 256)
output_tensor = attention_layer(input_tensor)

class sign_language_translator.models.language_models.transformer_language_model.layers.CausalMultiHeadSelfAttention(n_heads, embed_size, dropout=0.25, max_seq_len: int = 64, attention_bias=False)[source]

Bases: Module

Causal Multi-Head Self-Attention Module.

This class implements the causal multi-head self-attention mechanism. It takes an input tensor of shape (batch_size, seq_len, embed_size) and applies causal attention, where each token can attend only to itself and the previous tokens in the sequence. The input tensor is transformed into queries, keys, and values, and then passed through the scaled dot-product attention mechanism. The final output tensor is obtained by concatenating the heads and applying a linear projection with dropout.
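
The mechanism described above can be illustrated without PyTorch. The following is a minimal, dependency-free sketch of single-head causal scaled dot-product attention on plain Python lists. The function name causal_attention is hypothetical, and the sketch omits the learned query/key/value projections, the multiple heads, the output projection, and dropout, so it should not be read as the library's implementation.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal scaled dot-product attention on plain lists.

    q, k, v: lists of seq_len vectors, each of length head_size.
    Returns (outputs, weights).
    """
    head_size = len(q[0])
    seq_len = len(q)
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        # Position i may only look at positions 0..i (itself and the past).
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(head_size)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (seq_len - i - 1)
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v_j[d] for w, v_j in zip(row, v)) for d in range(len(v[0]))]
        )
    return outputs, weights


# Toy input (hypothetical): 3 tokens with 2 features, used as q, k and v.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(x, x, x)
```

Here attn[0] is [1.0, 0.0, 0.0]: the first token can attend only to itself, so out[0] equals its own value vector. Every row of attn sums to 1 and is zero above the diagonal.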

Parameters:
  • n_heads (int) – The number of attention heads.

  • embed_size (int) – The size of the input embedding dimension. Must be divisible by n_heads.

  • dropout (float, optional) – The dropout probability applied in the attention and projection layers. Default is 0.25.

  • max_seq_len (int, optional) – The maximum input sequence length. Used only by the custom dot-product attention path (pytorch<2.0.0). Default is 64.

  • attention_bias (bool, optional) – If True, enables a trainable bias parameter in the query, key, and value layers. Default is False.
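
The divisibility requirement on embed_size follows from how heads share the embedding. A minimal sketch, assuming the common convention that each head works on a contiguous slice of embed_size // n_heads features (the helper names split_heads and merge_heads are hypothetical, and the library's internal layout may differ):

```python
def split_heads(vector, n_heads):
    """Split one embedding vector into n_heads equal chunks."""
    embed_size = len(vector)
    if embed_size % n_heads != 0:
        raise ValueError("embed_size must be divisible by n_heads")
    head_size = embed_size // n_heads
    return [vector[h * head_size:(h + 1) * head_size] for h in range(n_heads)]


def merge_heads(heads):
    """Concatenate the per-head chunks back into one vector."""
    return [value for head in heads for value in head]


# One token embedding with embed_size=256, split across 8 heads.
token = list(range(256))
heads = split_heads(token, n_heads=8)
```

With embed_size=256 and n_heads=8, each head sees 32 features, and concatenating the heads restores the original 256-dimensional vector.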

Inputs:

x (torch.Tensor): Input tensor of shape (batch_size, seq_len, embed_size).

Returns:

Output tensor of shape (batch_size, seq_len, embed_size).

Return type:

torch.Tensor

Example:

model = CausalMultiHeadSelfAttention(n_heads=8, embed_size=256, dropout=0.2)
input_tensor = torch.randn(16, 32, 256)
output_tensor = model(input_tensor)

forward(x: Tensor)[source]

Forward pass of the Causal Multi-Head Self-Attention.

This method applies the forward pass of the causal multi-head self-attention to the input tensor x. The input tensor is transformed into queries, keys, and values, which are then passed through the scaled dot-product attention mechanism. The final output tensor is obtained by concatenating the attention heads and applying a linear projection with dropout.

Parameters:

x (torch.Tensor) – Input tensor of shape (batch_size, seq_len, embed_size).

Returns:

Output tensor of shape (batch_size, seq_len, embed_size).

Return type:

torch.Tensor

Example:

model = CausalMultiHeadSelfAttention(n_heads=8, embed_size=256, dropout=0.2)
input_tensor = torch.randn(16, 32, 256)
output_tensor = model.forward(input_tensor)

class sign_language_translator.models.language_models.transformer_language_model.layers.DecoderBlock(n_embed, hidden_size, n_heads, max_seq_len, dropout=0.25, activation='gelu', attention_bias=False)[source]

Bases: Module

Transformer Decoder Block Module.

This class implements a single transformer decoder block consisting of causal multi-head self-attention and a feedforward neural network, with no cross-attention. The input tensor x passes through a layer norm and the attention mechanism, and a skip connection adds x back to the result. That sum then passes through a second layer norm and the feedforward neural network, with a second residual connection around these two operations.
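
The data flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: layer_norm here has no learnable scale or shift, the names decoder_block, add, and zero_sublayer are hypothetical, and the attention and feedforward sub-layers are passed in as arbitrary callables.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add(a, b):
    """Element-wise sum of two sequences of vectors (the residual path)."""
    return [[p + q for p, q in zip(u, w)] for u, w in zip(a, b)]


def decoder_block(x, attention, feed_forward):
    """x is a list of seq_len vectors; both sub-layers map such a list
    to a list of the same shape."""
    # Sub-layer 1: normalize, attend, then add the input back.
    x = add(x, attention([layer_norm(t) for t in x]))
    # Sub-layer 2: normalize, feed forward, then add the input back.
    x = add(x, feed_forward([layer_norm(t) for t in x]))
    return x


def zero_sublayer(seq):
    """Stand-in sub-layer (hypothetical) that outputs zeros."""
    return [[0.0] * len(t) for t in seq]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = decoder_block(x, attention=zero_sublayer, feed_forward=zero_sublayer)
```

Because the stand-in sub-layers contribute nothing, y equals x: the two residual paths alone carry the input through the block, which is why the output shape always matches the input shape.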

Parameters:
  • n_embed (int) – The size of the input feature dimension and also the output feature dimension.

  • hidden_size (int) – The number of neurons in the feedforward neural network’s hidden layer.

  • n_heads (int) – The number of attention heads for multi-head self-attention.

  • max_seq_len (int) – The maximum sequence length of the input tensor.

  • dropout (float, optional) – The dropout probability applied in both attention and feedforward layers. Default is 0.25.

  • activation (str, optional) – The activation function to be used in the feedforward neural network. Supported values are 'gelu' for GELU activation and 'relu' for ReLU activation. Default is 'gelu'.

  • attention_bias (bool, optional) – If True, enables a trainable bias parameter in the query, key, and value layers of the self-attention. Default is False.
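
To illustrate why a maximum sequence length is needed, the sketch below assumes that a causal mask of size max_seq_len × max_seq_len is built once and then cropped to the actual input length. That is a common pattern, but it is an assumption about the implementation rather than something this page confirms, and the helper names build_causal_mask and slice_mask are hypothetical.

```python
def build_causal_mask(max_seq_len):
    """Lower-triangular mask: entry [i][j] is True when j <= i."""
    return [[j <= i for j in range(max_seq_len)] for i in range(max_seq_len)]


def slice_mask(mask, seq_len):
    """Take the top-left seq_len x seq_len corner for a shorter input."""
    if seq_len > len(mask):
        raise ValueError("seq_len exceeds max_seq_len")
    return [row[:seq_len] for row in mask[:seq_len]]


full_mask = build_causal_mask(max_seq_len=32)
mask = slice_mask(full_mask, seq_len=4)
```

Under this assumption, inputs shorter than max_seq_len reuse a corner of the precomputed mask, while longer inputs cannot be masked and are rejected.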

Inputs:

x (torch.Tensor): Input tensor of shape (batch_size, seq_len, n_embed).

Returns:

Output tensor of shape (batch_size, seq_len, n_embed).

Return type:

torch.Tensor

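Example:

decoder_block = DecoderBlock(n_embed=256, hidden_size=512, n_heads=8, max_seq_len=32, dropout=0.2, activation='relu')
input_tensor = torch.randn(16, 32, 256)
output_tensor = decoder_block(input_tensor)
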
forward(x)[source]

Forward pass of the Transformer Decoder Block.

Applies the block to the input tensor x of shape (batch_size, seq_len, n_embed) and returns a tensor of the same shape.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class sign_language_translator.models.language_models.transformer_language_model.layers.FeedForward(n_embed, hidden_size, dropout=0.25, activation='gelu')[source]

Bases: Module

FeedForward Neural Network Module.

This class implements a simple feedforward neural network module with one hidden layer. It takes an input tensor of shape (batch_size, n_embed) and applies a linear transformation, followed by an activation function (GELU or ReLU), and then another linear transformation with dropout applied. The final output tensor has the same shape as the input tensor.
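
The linear, activation, linear sequence described above can be written out for a single input vector. This is a dependency-free sketch with hand-picked, hypothetical weights. The helper names are illustrative, and dropout is left out because it acts as the identity at inference time.

```python
import math


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def relu(z):
    """ReLU: clamp negative values to zero."""
    return max(0.0, z)


def linear(vec, weight, bias):
    """Affine map; weight has one row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def feed_forward(vec, w1, b1, w2, b2, activation=gelu):
    """n_embed -> hidden_size -> n_embed for one input vector."""
    hidden = [activation(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


# n_embed=2, hidden_size=3 (toy sizes with hypothetical weights).
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]
y = feed_forward([2.0, -3.0], w1, b1, w2, b2, activation=relu)
```

With these weights the hidden pre-activations are [2.0, -3.0, -1.0], ReLU zeroes the negative entries, and y is [2.0, 0.0], which has the same length as the input.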

Parameters:
  • n_embed (int) – The size of the input feature dimension.

  • hidden_size (int) – The number of neurons in the hidden layer.

  • dropout (float, optional) – The dropout probability applied after the second linear layer. Default is 0.25.

  • activation (str, optional) – The activation function to be used. Supported values are 'gelu' for GELU activation and 'relu' for ReLU activation. Default is 'gelu'.

Inputs:

x (torch.Tensor): Input tensor of shape (batch_size, n_embed).

Returns:

Output tensor of shape (batch_size, n_embed).

Return type:

torch.Tensor

Example:

model = FeedForward(n_embed=256, hidden_size=512, dropout=0.2, activation='relu')
input_tensor = torch.randn(32, 256)
output_tensor = model(input_tensor)

forward(x)[source]

Forward pass of the FeedForward neural network.

This method applies the forward pass of the feedforward neural network to the input tensor x. The forward pass involves passing the input tensor through the hidden layer, followed by an activation function, and then through the output layer with dropout applied.

Parameters:

x (torch.Tensor) – Input tensor of shape (batch_size, n_embed).

Returns:

Output tensor of shape (batch_size, n_embed).

Return type:

torch.Tensor

Example:

model = FeedForward(n_embed=256, hidden_size=512, dropout=0.2, activation='relu')
input_tensor = torch.randn(32, 256)
output_tensor = model.forward(input_tensor)