Welcome to Sign Language Translator’s documentation!#

Sign Language Translator is a Python package built to let developers create and integrate custom, state-of-the-art sign language translation solutions into their applications. It brings you the power of building a translator for any region's sign language.

All you have to do is override the sign_language_translator.languages.SignLanguage class and pass its object to the rule-based text-to-sign translator (sign_language_translator.models.ConcatenativeSynthesis).

This package also enables you to easily train & finetune deep learning models on custom sign language datasets, which can be hand-crafted, scraped, or generated via the rule-based translator. See the datasets page for more details about training.

Installation#

Install the package from pypi.org:

pip install sign-language-translator

or install it in editable mode from GitHub:

git clone https://github.com/sign-language-translator/sign-language-translator.git
cd sign-language-translator
pip install -e .

Note

This package currently supports Python 3.9, 3.10 & 3.11.
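To verify the installation, you can print the package version (assuming the package exposes a standard __version__ attribute, as most Python packages do):

python -c "import sign_language_translator as slt; print(slt.__version__)"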

Usage#

The package can be used as a Python module, via a command line interface (CLI), and through a Gradio-based GUI.

Python#

Translation#

Basics#
import sign_language_translator as slt

# download dataset or models (if you need them for personal use)
# (by default, resources are auto-downloaded within the install directory)
# slt.Assets.set_root_dir("path/to/folder")  # helps prevent duplication across environments and lets you use cloud-synced data
# slt.Assets.download(".*.json")  # downloads into the root dir
# print(slt.Settings.FILE_TO_URL.keys())  # all downloadable resources

print("All available models:")
print(list(slt.ModelCodes))  # slt.ModelCodeGroups
# print(list(slt.TextLanguageCodes))
# print(list(slt.SignLanguageCodes))
# print(list(slt.SignFormatCodes))
Text to Sign Language Translation#
import sign_language_translator as slt

# print(slt.ModelCodes)
# model = slt.get_model("transformer-text-to-sign")
model = slt.models.ConcatenativeSynthesis(
   text_language="urdu",  # or an object of any child of the slt.languages.text.text_language.TextLanguage class
   sign_language="pakistan-sign-language",  # or an object of any child of the slt.languages.sign.sign_language.SignLanguage class
   sign_format="video",  # or an object of any child of the slt.vision.sign.Sign class
)

sign_language_sentence = model.translate("یہ اچھا ہے۔")  # "This is good."
sign_language_sentence.show()
# sign_language_sentence.save("output.mp4")
Sign Language to Text Translation (dummy code until v0.8)#
import sign_language_translator as slt

# load sign
video = slt.Video("video.mp4")
# features = slt.LandmarksSign("landmarks.csv", landmark_type="world")

# embed
embedding_model = slt.get_model("mediapipe_pose_v2_hand_v1")
features = embedding_model.embed(video.iter_frames())

# load the sign-to-text model
deep_s2t_model = slt.get_model(slt.ModelCodes.TRANSFORMER_MP_S2T)  # pytorch

# translate via a single call to the pipeline
# text = deep_s2t_model.translate(video)

# translate via individual steps
encoding = deep_s2t_model.encoder(features)
token_ids = [0]  # start_token
for _ in range(5):
   logits = deep_s2t_model.decoder(encoding, token_ids=token_ids)
   token_ids.append(logits.argmax(dim=-1).item())

tokens = deep_s2t_model.decode(token_ids)
text = "".join(tokens)  # deep_s2t_model.detokenize(tokens)

print(features.shape)
print(logits.shape)
print(text)

Building Custom Translators#

Understanding the problem is the most crucial part of solving it. Translation is a sequence-to-sequence (seq2seq) problem, not classification or segmentation. For translation, we need parallel corpora of text language sentences and the corresponding sign language videos. Since there is no universal sign language, signs vary even within the same city, so no significant datasets are likely to exist for your regional sign language.
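For illustration, a parallel corpus can be as simple as a list of records pairing each video with its gloss and translation. The paths and field names below are hypothetical placeholders, not a format required by the library.

Hypothetical parallel corpus#
# each example pairs a spoken-language sentence with its gloss and a video of the signs
parallel_corpus = [
   {
      "video": "clips/he-school-go.mp4",
      "gloss": "he school go",  # 1:1 correspondence with the signs in the video
      "translation": "he went to school.",  # follows the spoken language's grammar
   },
   # ... more examples
]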

Rule Based Translator#

Note

This approach can only work for text-to-sign language translation over a limited, unambiguous vocabulary.


We start by building a sign language dataset for sign language recognition & sign language production (both require a 1:1 mapping between the text and the signs). First, gather sign language video dictionaries for various regions of the world so that you can eventually train a multilingual model. These can be scraped off the internet or recorded manually against reference clips or images. Label each video with all the text language words that have the same meaning as the sign. If there are multiple signs in a video, make sure to write both the gloss (text tokens in 1:1 correspondence with the signs in the video) and the translation (text that follows the grammar of the spoken language). For example, for a video of the signs HE SCHOOL GO, the gloss is "he school go" while the translation is "he went to school".

Here is the format the library uses to store mappings; to support a new language, you only need to add such a dict to your language processing classes.

Word mappings to signs#
mappings = {
   "hello": [  # token to list of video-file-name sequences
      ["pk-org-1_hello"]  # this sequence contains only one clip
   ],
   "world": [
      ["xx-yyy-#_part-1", "xx-yyy-#_part-2"],  # two clips played consecutively make the right sign
      ["pk-org-1_world"],  # another possible, less commonly used sign for the same word
   ],
}

Place the actual video files in assets/videos or assets/datasets/xx-yyy-#.videos-mp4.zip (or preprocessed files in a similar directory structure, e.g. assets/landmarks). Otherwise, register their URLs with the asset manager as follows:

Fetching signs for tokens#
import sign_language_translator as slt

# help(slt.Assets)
# print(slt.Assets.ROOT_DIR)
# slt.Assets.set_root_dir("path/to/centralized/folder")

slt.Assets.FILE_TO_URL.update({
   "videos/xx-yyy-#_word.mp4": "https://...",
   "datasets/xx-yyy-#.videos-mp4.zip": "https://...",
})

# paths = slt.Assets.download("videos/xx-yyy-#_word.mp4")
# paths = slt.Assets.extract("...mp4", "datasets/...zip")

Now use our rule-based translator (slt.models.ConcatenativeSynthesis) as follows:

Custom rule-based Translator (text to sign)#
import sign_language_translator as slt

class MySignLanguage(slt.languages.SignLanguage):
   # load and save mappings in __init__
   # override the abstract functions
   # (signatures match the usage shown in the Sign Language Processing section)

   def restructure_sentence(self, tokens, tags, contexts):
      # rearrange tokens according to the sign language grammar
      ...

   def tokens_to_sign_dicts(self, tokens, tags):
      # map words to all possible videos
      ...

   # for reference, see the implementation inside slt.languages.sign.pakistan_sign_language

# optionally implement a text language processor as well
# class MyChinese(slt.languages.TextLanguage):
#     ...

model = slt.models.ConcatenativeSynthesis(
   text_language=slt.languages.text.English(),
   sign_format="video",
   sign_language=MySignLanguage(),
)

text = """Some text that will be tokenized, tagged, rearranged, and mapped to video files
          (which will be downloaded, concatenated and returned)."""
video = model.translate(text)
video.show()
# video.save(f"{text}.mp4")

Deep Learning based#

Note

This approach can work for both sign-to-text and text-to-sign language translation.

You can use three types of parallel corpora as training data (more details):

1. Sentences (a group of signs performed consecutively in a single video to form a meaningful message)
2. Synthetic sentences (made by the rule-based translator)
3. Replications (recordings of people performing the signs in the dictionary videos and sentences)

Here is a format that you can use for data labeling. You can use language models to write sentences for synthetic data; the language models can be masked to output only specified words so that the rule-based translator can translate them.
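For illustration only, one labeled example might look like this (the field names are hypothetical, not the exact schema linked above):

Hypothetical labeled example#
example = {
   "video": "datasets/xx-yyy-#_sentence-42.mp4",
   "gloss": ["he", "school", "go"],  # sign-ordered tokens
   "translations": {"en": "he went to school.", "ur": "وہ سکول گیا تھا۔"},
   "type": "replication",  # or "sentence" / "synthetic"
}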

Get the best out of your model by training it for multiple languages and multiple tasks. For example, provide the task as the start-of-sequence token and the target text language as the second token to the decoder of the seq2seq model.
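A minimal sketch of that decoder-input scheme (the token names and ids are hypothetical):

Hypothetical multi-task decoder input#
# special control tokens let one seq2seq model serve multiple tasks & target languages
TASK_TOKENS = {"translate": 0, "gloss": 1}
LANGUAGE_TOKENS = {"en": 10, "ur": 11}

def build_decoder_input(task, target_language, target_ids):
   # prepend [task, language] control tokens to the target token ids
   return [TASK_TOKENS[task], LANGUAGE_TOKENS[target_language], *target_ids]

decoder_input = build_decoder_input("translate", "ur", [42, 43, 44])
# [0, 11, 42, 43, 44]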

Fine-tuning a deep learning model (sign to text)#
import sign_language_translator as slt

pretrained_model = slt.get_model(slt.ModelCodes.Gesture)  # sign landmarks to text

# pytorch training loop to finetune our model on your dataset
for epoch in range(10):
   for sign, text in train_dataset:
      ...
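The elided loop body could look roughly like the following sketch, assuming the pretrained model is a regular PyTorch module that returns logits over text tokens; train_dataset, tokenize, and the exact forward signature are hypothetical placeholders, not part of the slt API.

Hypothetical fine-tuning loop body#
import torch

optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(10):
   for sign, text in train_dataset:  # sign: landmark tensor, text: target string
      target_ids = tokenize(text)  # hypothetical tokenizer returning a 1D tensor of token ids
      logits = pretrained_model(sign, target_ids[:-1])  # teacher forcing (assumed signature)
      loss = loss_fn(logits.reshape(-1, logits.shape[-1]), target_ids[1:].reshape(-1))

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()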

See more in the package README.

Text Language Processing#

Process text strings using language-specific classes:

Urdu Text Processor#
from sign_language_translator.languages.text import Urdu

ur_nlp = Urdu()

text = "hello جاؤں COVID-19."  # جاؤں = "(I) go"

normalized_text = ur_nlp.preprocess(text)
# normalized_text = 'جاؤں COVID-19.'  # replaces/removes unsupported unicode characters

tokens = ur_nlp.tokenize(normalized_text)
# tokens = ['جاؤں', ' ', 'COVID', '-', '19', '.']

# tagged = ur_nlp.tag(tokens)
# tagged = [('جاؤں', Tags.SUPPORTED_WORD), (' ', Tags.SPACE), ...]

tags = ur_nlp.get_tags(tokens)
# tags = [Tags.SUPPORTED_WORD, Tags.SPACE, Tags.ACRONYM, ...]

# word_senses = ur_nlp.get_word_senses("میں")
# word_senses = [["میں(i)", "میں(in)"]]

Sign Language Processing#

This processes a text representation of sign language, which mainly consists of video file names. There are two main parts: 1) mapping words to videos and 2) rearranging words according to the sign language grammar.

For video processing, see Vision section.

Pakistan Sign Language Processor#
from sign_language_translator.languages.sign import PakistanSignLanguage
from sign_language_translator.text.tagger import Tags  # token tags enum (import path may vary across versions)

psl = PakistanSignLanguage()

tokens = ["he", " ", "went", " ", "to", " ", "school", "."]
tags = 3 * [Tags.WORD, Tags.SPACE] + [Tags.WORD, Tags.PUNCTUATION]
tokens, tags, _ = psl.restructure_sentence(tokens, tags, None)  # ["he", "school", "go"]
signs = psl.tokens_to_sign_dicts(tokens, tags)
# signs = [
#   {'signs': [['pk-hfad-1_وہ']], 'weights': [1.0]},  # وہ = "he"
#   {'signs': [['pk-hfad-1_school']], 'weights': [1.0]},
#   {'signs': [['pk-hfad-1_گیا']], 'weights': [1.0]}  # گیا = "went"
# ]

Vision#

This covers the functionality of representing sign language as objects (a sequence of frames, e.g. video, or a sequence of vectors, e.g. pose landmarks). These objects have built-in functions for data augmentation, visualization, etc.

Sign Processing (dummy code until v0.7)#
import sign_language_translator as slt

# load video
video = slt.Video("sign.mp4")
print(video.duration, video.shape)

# extract features
model = slt.models.MediaPipeLandmarksModel()  # default args
embedding = model.embed(video.iter_frames(), landmark_type="world")  # torch.Tensor
print(embedding.shape)  # (n_frames, n_landmarks * 5)

# embed a dataset
# slt.models.utils.VideoEmbeddingPipeline(model).process_videos_parallel(
#     ["dataset/*.mp4"], n_processes=12, save_format="csv", ...
# )

# transform / augment data
sign = slt.MediaPipeSign(embedding, landmark_type="world")
sign = sign.rotate(60, 10, 90, degrees=True)
sign = sign.transform(slt.vision.transformations.ZoomLandmarks(1.1, 0.9, 1.0))

# plot
video_visualization = sign.video()
image_visualization = sign.frames_grid(steps=5)
# overlay_visualization = sign.overlay(video)  # needs landmark_type="image"

# display
video_visualization.show()
image_visualization.show()
# overlay_visualization.show()

Language models#

In order to generate synthetic training data via the rule-based model, we need a lot of sentences consisting of supported words only. (Supported word: a piece of text for which a sign language video is available.) These language models were built to write such sentences. See the datasets page for the training process.

Simple Character-level N-Gram Language Model (uses statistics based hashmaps)#
from sign_language_translator.models.language_models import NgramLanguageModel

names_data = [
   '[abeera]', '[areej]',  '[farida]',  '[hiba]',    '[kinza]',
   '[mishal]', '[nimra]',  '[rabbia]',  '[tehmina]', '[zoya]',
   '[amjad]',  '[atif]',   '[farhan]',  '[huzaifa]', '[mudassar]',
   '[nasir]',  '[rizwan]', '[shahzad]', '[tayyab]',  '[zain]',
]

# train an n-gram model (considers the previous n tokens to predict the next one)
model = NgramLanguageModel(window_size=2, unknown_token="")
model.fit(names_data)

# inference loop
name = '[r'
for _ in range(10):
   # select the next token randomly from the learnt probability distribution
   nxt, prob = model.next(name)

   name += nxt
   if nxt in [']', model.unknown_token]:
      break

print(name)
# '[rabeej]'

# see the ngram model's implementation
# print(model.__dict__)

Mash up multiple language models & complete generation through beam search:

Model Mixer & Beam Search#
from sign_language_translator.models.language_models import MixerLM, BeamSampling, NgramLanguageModel

names_data = [
   '[abeera]', '[areej]',  '[farida]',  '[hiba]',    '[kinza]',
   '[mishal]', '[nimra]',  '[rabbia]',  '[tehmina]', '[zoya]',
   '[amjad]',  '[atif]',   '[farhan]',  '[huzaifa]', '[mudassar]',
   '[nasir]',  '[rizwan]', '[shahzad]', '[tayyab]',  '[zain]',
]  # or slt.languages.English().vocab.person_names (remember to concatenate start/end symbols)

# train models
LMs = [
   NgramLanguageModel(window_size=size, unknown_token="")
   for size in range(1, 4)
]
for lm in LMs:
   lm.fit(names_data)

# combine the models into one object
mixed_model = MixerLM(
   models=LMs,
   selection_probabilities=[1, 2, 4],
   unknown_token="",
   model_selection_strategy="choose",  # or "merge"
)
print(mixed_model)
# Mixer LM: unk_tok=""[3]
# ├── Ngram LM: unk_tok="", window=1, params=85 | prob=14.3%
# ├── Ngram LM: unk_tok="", window=2, params=113 | prob=28.6%
# └── Ngram LM: unk_tok="", window=3, params=96 | prob=57.1%

# randomly select an LM and infer through it
print(mixed_model.next("[m"))

# use Beam Search to find high-likelihood names
sampler = BeamSampling(mixed_model, beam_width=3)  # , scoring_function=...
name = sampler.complete('[')
print(name)
# [rabbia]

Use a pre-trained language model:

Transformer Language Model#
from sign_language_translator.models.language_models import TransformerLanguageModel

# model = slt.get_model("ur-supported-gpt")
model = TransformerLanguageModel.load("models/tlm_14.0M.pt")
# sampler = BeamSampling(model, ...)
# sampler.complete(["<"])

# see the probabilities of all tokens
model.next_all(["میں", " ", "وزیراعظم", " "])  # ["I", " ", "Prime Minister", " "]
# (["سے", "عمران", ...], [0.1415926535, 0.7182818284, ...])

Text Embedding#

Embed text words & phrases into pre-trained vectors using a selected embedding model. This is useful for finding synonyms across languages and for building controllable language models.

Pretrained Text Embedding#
import torch
from sign_language_translator.models import VectorLookupModel, get_model

# custom model
model = VectorLookupModel(["hello", "world"], torch.Tensor([[0, 1], [2, 3]]))
vector = model.embed("hello")  # torch.Tensor([0., 1.])
vector = model.embed("hello world")  # torch.Tensor([1., 2.])  # average of the two token vectors
model.save("vectors.pt")

# pretrained model
model = get_model("lookup-ur-fasttext-cc")
print(model.description)
vector = model.embed("تعلیم", align=True, post_normalize=True)  # تعلیم = "education"
print(vector.shape)  # (300,)

# find similar words, but in a different language
en_model = get_model("lookup-en-fasttext-cc")
en_vectors = en_model.vectors / en_model.vectors.norm(dim=-1, keepdim=True)
similarities = en_vectors @ vector
similar_words = [
   (en_model.index_to_token[i], similarities[i].item())
   for i in similarities.argsort(descending=True)[:5]
]
print(similar_words)  # [('education', 0.5469), ...]

Command line (CLI)#

You can use the following functionalities of the SLT package via the CLI as well. A command entered without any arguments will show the help. The usable model codes are listed in the help.

Note: Objects & models do not persist in memory across commands, so the CLI is a quick but inefficient way to use this package. In production, create a server that uses the Python interface, e.g. the sketch below.
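Here is a minimal sketch of such a server using Flask (any web framework works; the route, payload, and output handling are illustrative assumptions, not part of the slt package):

Minimal translation server (sketch)#
from flask import Flask, jsonify, request

import sign_language_translator as slt

app = Flask(__name__)
# load the model once; it is reused across requests (unlike the CLI)
model = slt.models.ConcatenativeSynthesis(
   text_language="urdu", sign_language="pakistan-sign-language", sign_format="video"
)

@app.route("/translate", methods=["POST"])
def translate():
   text = request.get_json()["text"]
   sign = model.translate(text)
   path = "output.mp4"  # use unique file names per request in real deployments
   sign.save(path)
   return jsonify({"video_path": path})

if __name__ == "__main__":
   app.run(port=8000)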

Assets#

Download or view the resources needed by the sign language translator package using the following commands.

Path#

Print the root path where the assets are stored.

slt assets path

Download#

Download dataset files or models if you need them. The parameters are regular expressions.

slt assets download --overwrite true '.*\.json' '.*\.mp4'
slt assets download --progress-bar true '.*/tlm_14.0M.pt'

By default, auto-download is enabled. The default download directory is /install-directory/sign_language_translator/sign-language-resources/. (See slt.config.settings.Settings.)

Tree#

View the directory structure of the current state of the assets folder.

$ slt assets tree
assets
├── checksum.json
├── pk-dictionary-mapping.json
├── text_preprocessing.json
├── datasets
│   ├── pk-hfad-1.landmarks-mediapipe-world-csv.zip
│   └── pk-hfad-1.videos-mp4.zip
│
├── landmarks
│   ├── pk-hfad-1_1.landmarks-mediapipe-pose-2-hand-1.csv
│   └── pk-hfad-1_10.landmarks-mediapipe-pose-2-hand-1.csv
│
├── models
│   ├── ur-supported-token-unambiguous-mixed-ngram-w1-w6-lm.pkl
│   └── mediapipe
│       └── pose_landmarker_heavy.task
│
└── videos
   ├── pk-hfad-1_1.mp4
   ├── pk-hfad-1_10.mp4
   └── pk-hfad-1_مجھے.mp4
Directories only#
slt assets tree --files false
Omit files matching a regex#
slt assets tree --ignore ".*mp4" -i ".*csv"
Tree of a custom directory#
slt assets tree -d "path/to/your/assets"

Translate#

Translate text to sign language using a rule-based model:

slt translate --model-code "rule-based" \
--text-lang urdu --sign-lang psl --sign-format 'video' \
"وہ سکول گیا تھا۔"

Complete#

Auto-complete a sentence using our language models. This model can write sentences composed of supported words only:

$ slt complete --end-token ">" --model-code urdu-mixed-ngram "<"
('<', 'وہ', ' ', 'یہ', ' ', 'نہیں', ' ', 'چاہتا', ' ', 'تھا', '۔', '>')

These models predict the next characters until a specified token appears, e.g. generating names using a mixture of models:

$ slt complete \
   --model-code unigram-names --model-weight 1 \
   --model-code bigram-names -w 2 \
   -m trigram-names -w 3 \
   --selection-strategy merge --beam-width 2.5 --end-token "]" \
   "[s"
[shazala]

Embed Videos#

Embed videos into a sequence of vectors using a selected embedding model:

slt embed videos/*.mp4 --model-code mediapipe-pose-2-hand-1 --embedding-type world \
   --processes 4 --save-format csv --output-dir ./embeddings
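The saved files can then be loaded like any CSV. A quick sketch, assuming the pipeline writes one file per input video (the exact output file name and naming scheme are assumptions):

Loading a saved video embedding#
import pandas as pd

# hypothetical file name; depends on the input video and the pipeline's naming scheme
embedding = pd.read_csv("embeddings/video.csv").to_numpy()
print(embedding.shape)  # (n_frames, n_landmarks * 5), as in the Vision section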

Embed texts#

Embed texts into a PyTorch state-dict pickle file using a selected embedding model. The target file is a .pt containing {"tokens": ..., "vectors": ...}.

slt embed "hello" "world" "more-tokens.txt" --model-code lookup-en-fasttext-cc
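A quick sketch of reading the saved file back (the output file name is an assumption; the keys follow the layout described above):

Loading saved text embeddings#
import torch

# the file is a plain pickled dict, so torch.load can read it
data = torch.load("embeddings.pt")
print(len(data["tokens"]), data["vectors"].shape)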

GUI#

This functionality is not available yet! (It will probably be a Gradio-based light frontend.)

import sign_language_translator as slt

slt.launch_gui()
slt gui