Welcome to Sign Language Translator’s documentation!#

Sign Language Translator is a Python package built to let developers create and integrate custom, state-of-the-art sign language translation solutions into their applications. It brings you the power of building a translator for any region's sign language.

All you have to do is override the sign_language_translator.languages.SignLanguage class and pass its object to the rule-based text-to-sign translator (sign_language_translator.models.ConcatenativeSynthesis).

This package also enables you to easily train & finetune deep learning models on custom sign language datasets, which can be hand-crafted, scraped, or generated via the rule-based translator. See the datasets page for more details about training.

Installation#

Install the package from pypi.org:

pip install sign-language-translator

or install it in editable mode from GitHub:

git clone https://github.com/sign-language-translator/sign-language-translator.git
cd sign-language-translator
pip install -e .

Note

This package currently supports Python 3.9, 3.10 & 3.11.
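To verify the installation, you can print the package version (assuming the package exposes a standard __version__ attribute, as most Python packages do):

python -c "import sign_language_translator as slt; print(slt.__version__)"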

Usage#

The package can be used as a Python module, via a command line interface (CLI), and through a Gradio-based GUI.

Python#

Translation#

Basics#
import sign_language_translator as slt

# download dataset or models (if you need them for personal use)
# (by default, resources are auto-downloaded within the install directory)
# slt.Assets.set_root_dir("path/to/folder")  # helps prevent duplication across environments and lets you use cloud-synced data
# slt.Assets.download(".*.json")  # downloads into the root dir
# print(slt.Settings.FILE_TO_URL.keys())  # all downloadable resources

print("All available models:")
print(list(slt.ModelCodes))  # slt.ModelCodeGroups
# print(list(slt.TextLanguageCodes))
# print(list(slt.SignLanguageCodes))
# print(list(slt.SignFormatCodes))
Text to Sign Language Translation#
import sign_language_translator as slt

# print(slt.ModelCodes)
# model = slt.get_model("transformer-text-to-sign")
model = slt.models.ConcatenativeSynthesis(
   text_language="urdu",  # or an object of any child of the slt.languages.text.text_language.TextLanguage class
   sign_language="pakistan-sign-language",  # or an object of any child of the slt.languages.sign.sign_language.SignLanguage class
   sign_format="video",  # or an object of any child of the slt.vision.sign.Sign class
)

sign_language_sentence = model.translate("یہ اچھا ہے۔")  # "This is good."
sign_language_sentence.show()
# sign_language_sentence.save("output.mp4")
Sign Language to Text Translation (dummy code until v0.8)#
import sign_language_translator as slt

# load sign
video = slt.Video("video.mp4")
# features = slt.LandmarksSign("landmarks.csv", landmark_type="world")

# embed
embedding_model = slt.get_model("mediapipe_pose_v2_hand_v1")
features = embedding_model.embed(video.iter_frames())

# load the sign-to-text model
deep_s2t_model = slt.get_model(slt.ModelCodes.TRANSFORMER_MP_S2T)  # pytorch

# translate via a single call to the pipeline
# text = deep_s2t_model.translate(video)

# translate via individual steps
encoding = deep_s2t_model.encoder(features)
token_ids = [0]  # start_token
for _ in range(5):
   logits = deep_s2t_model.decoder(encoding, token_ids=token_ids)
   token_ids.append(logits.argmax(dim=-1).item())

tokens = deep_s2t_model.decode(token_ids)
text = "".join(tokens)  # deep_s2t_model.detokenize(tokens)

print(features.shape)
print(logits.shape)
print(text)

Building Custom Translators#

Understanding the problem is the most crucial part of solving it. Translation is a sequence-to-sequence (seq2seq) problem, not classification or segmentation. For translation, we need parallel corpora of text language sentences and the corresponding sign language videos. Since there is no universal sign language, signs vary even within the same city, so no significant datasets are likely to exist for your regional sign language.
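For illustration, a parallel corpus can be as simple as a list of records pairing each video with its gloss and translation. The paths and field names below are hypothetical placeholders, not a format required by the library.

Hypothetical parallel corpus#
# each example pairs a spoken-language sentence with its gloss and a video of the signs
parallel_corpus = [
   {
      "video": "clips/he-school-go.mp4",
      "gloss": "he school go",  # 1:1 correspondence with the signs in the video
      "translation": "he went to school.",  # follows the spoken language's grammar
   },
   # ... more examples
]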

Rule Based Translator#

Note

This approach can only work for text-to-sign language translation over a limited, unambiguous vocabulary.


We start by building a sign language dataset for sign language recognition & sign language production (both require a 1:1 mapping between the text and the signs). First, gather sign language video dictionaries for various regions of the world so that you can eventually train a multilingual model. These can be scraped off the internet or recorded manually against reference clips or images. Label each video with all the text language words that have the same meaning as the sign. If there are multiple signs in a video, make sure to write both the gloss (text tokens in 1:1 correspondence with the signs in the video) and the translation (text that follows the grammar of the spoken language). For example, for a video of the signs HE SCHOOL GO, the gloss is "he school go" while the translation is "he went to school".

Here is the format the library uses to store mappings; to support a new language, you only need to add such a dict to your language processing classes.

Word mappings to signs#
mappings = {
   "hello": [  # token to list of video-file-name sequences
      ["pk-org-1_hello"]  # this sequence contains only one clip
   ],
   "world": [
      ["xx-yyy-#_part-1", "xx-yyy-#_part-2"],  # two clips played consecutively make the right sign
      ["pk-org-1_world"],  # another possible, less commonly used sign for the same word
   ],
}

Place the actual video files in assets/videos or assets/datasets/xx-yyy-#.videos-mp4.zip (or preprocessed files in a similar directory structure, e.g. assets/landmarks). Otherwise, register their URLs with the asset manager as follows:

Fetching signs for tokens#
import sign_language_translator as slt

# help(slt.Assets)
# print(slt.Assets.ROOT_DIR)
# slt.Assets.set_root_dir("path/to/centralized/folder")

slt.Assets.FILE_TO_URL.update({
   "videos/xx-yyy-#_word.mp4": "https://...",
   "datasets/xx-yyy-#.videos-mp4.zip": "https://...",
})

# paths = slt.Assets.download("videos/xx-yyy-#_word.mp4")
# paths = slt.Assets.extract("...mp4", "datasets/...zip")

Now use our rule-based translator (slt.models.ConcatenativeSynthesis) as follows:

Custom rule-based Translator (text to sign)#
import sign_language_translator as slt

class MySignLanguage(slt.languages.SignLanguage):
   # load and save mappings in __init__
   # override the abstract functions
   # (signatures match the usage shown in the Sign Language Processing section)

   def restructure_sentence(self, tokens, tags, contexts):
      # rearrange tokens according to the sign language grammar
      ...

   def tokens_to_sign_dicts(self, tokens, tags):
      # map words to all possible videos
      ...

   # for reference, see the implementation inside slt.languages.sign.pakistan_sign_language

# optionally implement a text language processor as well
# class MyChinese(slt.languages.TextLanguage):
#     ...

model = slt.models.ConcatenativeSynthesis(
   text_language=slt.languages.text.English(),
   sign_format="video",
   sign_language=MySignLanguage(),
)

text = """Some text that will be tokenized, tagged, rearranged, and mapped to video files
          (which will be downloaded, concatenated and returned)."""
video = model.translate(text)
video.show()
# video.save(f"{text}.mp4")

Deep Learning based#

Note

This approach can work for both sign-to-text and text-to-sign language translation.

You can use three types of parallel corpora as training data (more details):

1. Sentences (a group of signs performed consecutively in a single video to form a meaningful message)
2. Synthetic sentences (made by the rule-based translator)
3. Replications (recordings of people performing the signs in the dictionary videos and sentences)

Here is a format that you can use for data labeling. You can use language models to write sentences for synthetic data; the language models can be masked to output only specified words so that the rule-based translator can translate them.
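For illustration only, one labeled example might look like this (the field names are hypothetical, not the exact schema linked above):

Hypothetical labeled example#
example = {
   "video": "datasets/xx-yyy-#_sentence-42.mp4",
   "gloss": ["he", "school", "go"],  # sign-ordered tokens
   "translations": {"en": "he went to school.", "ur": "وہ سکول گیا تھا۔"},
   "type": "replication",  # or "sentence" / "synthetic"
}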

Get the best out of your model by training it for multiple languages and multiple tasks. For example, provide the task as the start-of-sequence token and the target text language as the second token to the decoder of the seq2seq model.
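A minimal sketch of that decoder-input scheme (the token names and ids are hypothetical):

Hypothetical multi-task decoder input#
# special control tokens let one seq2seq model serve multiple tasks & target languages
TASK_TOKENS = {"translate": 0, "gloss": 1}
LANGUAGE_TOKENS = {"en": 10, "ur": 11}

def build_decoder_input(task, target_language, target_ids):
   # prepend [task, language] control tokens to the target token ids
   return [TASK_TOKENS[task], LANGUAGE_TOKENS[target_language], *target_ids]

decoder_input = build_decoder_input("translate", "ur", [42, 43, 44])
# [0, 11, 42, 43, 44]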

Fine-tuning a deep learning model (sign to text)#
import sign_language_translator as slt

pretrained_model = slt.get_model(slt.ModelCodes.Gesture)  # sign landmarks to text

# pytorch training loop to finetune our model on your dataset
for epoch in range(10):
   for sign, text in train_dataset:
      ...
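The elided loop body could look roughly like the following sketch, assuming the pretrained model is a regular PyTorch module that returns logits over text tokens; train_dataset, tokenize, and the exact forward signature are hypothetical placeholders, not part of the slt API.

Hypothetical fine-tuning loop body#
import torch

optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(10):
   for sign, text in train_dataset:  # sign: landmark tensor, text: target string
      target_ids = tokenize(text)  # hypothetical tokenizer returning a 1D tensor of token ids
      logits = pretrained_model(sign, target_ids[:-1])  # teacher forcing (assumed signature)
      loss = loss_fn(logits.reshape(-1, logits.shape[-1]), target_ids[1:].reshape(-1))

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()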

See more in the package README.

Text Language Processing#

Process text strings using language-specific classes:

Urdu Text Processor#
from sign_language_translator.languages.text import Urdu

ur_nlp = Urdu()

text = "hello جاؤں COVID-19."  # جاؤں = "(I) go"

normalized_text = ur_nlp.preprocess(text)
# normalized_text = 'جاؤں COVID-19.'  # replaces/removes unsupported unicode characters

tokens = ur_nlp.tokenize(normalized_text)
# tokens = ['جاؤں', ' ', 'COVID', '-', '19', '.']

# tagged = ur_nlp.tag(tokens)
# tagged = [('جاؤں', Tags.SUPPORTED_WORD), (' ', Tags.SPACE), ...]

tags = ur_nlp.get_tags(tokens)
# tags = [Tags.SUPPORTED_WORD, Tags.SPACE, Tags.ACRONYM, ...]

# word_senses = ur_nlp.get_word_senses("میں")
# word_senses = [["میں(i)", "میں(in)"]]

Sign Language Processing#

This processes a text representation of sign language, which mainly consists of video file names. There are two main parts: 1) mapping words to videos and 2) rearranging words according to the sign language grammar.

For video processing, see Vision section.

Pakistan Sign Language Processor#
from sign_language_translator.languages.sign import PakistanSignLanguage
from sign_language_translator.text.tagger import Tags  # token tags enum (import path may vary across versions)

psl = PakistanSignLanguage()

tokens = ["he", " ", "went", " ", "to", " ", "school", "."]
tags = 3 * [Tags.WORD, Tags.SPACE] + [Tags.WORD, Tags.PUNCTUATION]
tokens, tags, _ = psl.restructure_sentence(tokens, tags, None)  # ["he", "school", "go"]
signs = psl.tokens_to_sign_dicts(tokens, tags)
# signs = [
#   {'signs': [['pk-hfad-1_وہ']], 'weights': [1.0]},  # وہ = "he"
#   {'signs': [['pk-hfad-1_school']], 'weights': [1.0]},
#   {'signs': [['pk-hfad-1_گیا']], 'weights': [1.0]}  # گیا = "went"
# ]

Vision#

This covers the functionality of representing sign language as objects (a sequence of frames, e.g. video, or a sequence of vectors, e.g. pose landmarks). These objects have built-in functions for data augmentation, visualization, etc.

Sign Processing (dummy code until v0.7)#
import sign_language_translator as slt

# load video
video = slt.Video("sign.mp4")
print(video.duration, video.shape)

# extract features
model = slt.models.MediaPipeLandmarksModel()  # default args
embedding = model.embed(video.iter_frames(), landmark_type="world")  # torch.Tensor
print(embedding.shape)  # (n_frames, n_landmarks * 5)

# embed a dataset
# slt.models.utils.VideoEmbeddingPipeline(model).process_videos_parallel(
#     ["dataset/*.mp4"], n_processes=12, save_format="csv", ...
# )

# transform / augment data
sign = slt.MediaPipeSign(embedding, landmark_type="world")
sign = sign.rotate(60, 10, 90, degrees=True)
sign = sign.transform(slt.vision.transformations.ZoomLandmarks(1.1, 0.9, 1.0))

# plot
video_visualization = sign.video()
image_visualization = sign.frames_grid(steps=5)
# overlay_visualization = sign.overlay(video)  # needs landmark_type="image"

# display
video_visualization.show()
image_visualization.show()
# overlay_visualization.show()

Language models#

In order to generate synthetic training data via the rule-based model, we need a lot of sentences consisting of supported words only. (Supported word: a piece of text for which a sign language video is available.) These language models were built to write such sentences. See the datasets page for the training process.

Simple Character-level N-Gram Language Model (uses statistics based hashmaps)#
from sign_language_translator.models.language_models import NgramLanguageModel

names_data = [
   '[abeera]', '[areej]',  '[farida]',  '[hiba]',    '[kinza]',
   '[mishal]', '[nimra]',  '[rabbia]',  '[tehmina]', '[zoya]',
   '[amjad]',  '[atif]',   '[farhan]',  '[huzaifa]', '[mudassar]',
   '[nasir]',  '[rizwan]', '[shahzad]', '[tayyab]',  '[zain]',
]

# train an n-gram model (considers the previous n tokens to predict the next one)
model = NgramLanguageModel(window_size=2, unknown_token="")
model.fit(names_data)

# inference loop
name = '[r'
for _ in range(10):
   # select the next token randomly from the learnt probability distribution
   nxt, prob = model.next(name)

   name += nxt
   if nxt in [']', model.unknown_token]:
      break

print(name)
# '[rabeej]'

# see the ngram model's implementation
# print(model.__dict__)

Mash up multiple language models & complete generation through beam search:

Model Mixer & Beam Search#
from sign_language_translator.models.language_models import MixerLM, BeamSampling, NgramLanguageModel

names_data = [
   '[abeera]', '[areej]',  '[farida]',  '[hiba]',    '[kinza]',
   '[mishal]', '[nimra]',  '[rabbia]',  '[tehmina]', '[zoya]',
   '[amjad]',  '[atif]',   '[farhan]',  '[huzaifa]', '[mudassar]',
   '[nasir]',  '[rizwan]', '[shahzad]', '[tayyab]',  '[zain]',
]  # or slt.languages.English().vocab.person_names (remember to concatenate start/end symbols)

# train models
LMs = [
   NgramLanguageModel(window_size=size, unknown_token="")
   for size in range(1, 4)
]
for lm in LMs:
   lm.fit(names_data)

# combine the models into one object
mixed_model = MixerLM(
   models=LMs,
   selection_probabilities=[1, 2, 4],
   unknown_token="",
   model_selection_strategy="choose",  # or "merge"
)
print(mixed_model)
# Mixer LM: unk_tok=""[3]
# ├── Ngram LM: unk_tok="", window=1, params=85 | prob=14.3%
# ├── Ngram LM: unk_tok="", window=2, params=113 | prob=28.6%
# └── Ngram LM: unk_tok="", window=3, params=96 | prob=57.1%

# randomly select an LM and infer through it
print(mixed_model.next("[m"))

# use Beam Search to find high-likelihood names
sampler = BeamSampling(mixed_model, beam_width=3)  # , scoring_function=...
name = sampler.complete('[')
print(name)
# [rabbia]

Use a pre-trained language model:

Transformer Language Model#
from sign_language_translator.models.language_models import TransformerLanguageModel

# model = slt.get_model("ur-supported-gpt")
model = TransformerLanguageModel.load("models/tlm_14.0M.pt")
# sampler = BeamSampling(model, ...)
# sampler.complete(["<"])

# see the probabilities of all tokens
model.next_all(["میں", " ", "وزیراعظم", " "])  # ["I", " ", "Prime Minister", " "]
# (["سے", "عمران", ...], [0.1415926535, 0.7182818284, ...])

Text Embedding#

Embed text words & phrases into pre-trained vectors using a selected embedding model. This is useful for finding synonyms across languages and for building controllable language models.

Pretrained Text Embedding#
import torch
from sign_language_translator.models import VectorLookupModel, get_model

# custom model
model = VectorLookupModel(["hello", "world"], torch.Tensor([[0, 1], [2, 3]]))
vector = model.embed("hello")  # torch.Tensor([0., 1.])
vector = model.embed("hello world")  # torch.Tensor([1., 2.])  # average of the two token vectors
model.save("vectors.pt")

# pretrained model
model = get_model("lookup-ur-fasttext-cc")
print(model.description)
vector = model.embed("تعلیم", align=True, post_normalize=True)  # تعلیم = "education"
print(vector.shape)  # (300,)

# find similar words, but in a different language
en_model = get_model("lookup-en-fasttext-cc")
en_vectors = en_model.vectors / en_model.vectors.norm(dim=-1, keepdim=True)
similarities = en_vectors @ vector
similar_words = [
   (en_model.index_to_token[i], similarities[i].item())
   for i in similarities.argsort(descending=True)[:5]
]
print(similar_words)  # [('education', 0.5469), ...]

Command line (CLI)#

You can use the following functionalities of the SLT package via the CLI as well. A command entered without any arguments will show the help. The usable model codes are listed in the help.

Note: Objects & models do not persist in memory across commands, so the CLI is a quick but inefficient way to use this package. In production, create a server that uses the Python interface, e.g. the sketch below.
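Here is a minimal sketch of such a server using Flask (any web framework works; the route, payload, and output handling are illustrative assumptions, not part of the slt package):

Minimal translation server (sketch)#
from flask import Flask, jsonify, request

import sign_language_translator as slt

app = Flask(__name__)
# load the model once; it is reused across requests (unlike the CLI)
model = slt.models.ConcatenativeSynthesis(
   text_language="urdu", sign_language="pakistan-sign-language", sign_format="video"
)

@app.route("/translate", methods=["POST"])
def translate():
   text = request.get_json()["text"]
   sign = model.translate(text)
   path = "output.mp4"  # use unique file names per request in real deployments
   sign.save(path)
   return jsonify({"video_path": path})

if __name__ == "__main__":
   app.run(port=8000)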

Assets#

Download or view the resources needed by the sign language translator package using the following commands.

Path#

Print the root path where the assets are stored.

slt assets path

Download#

Download dataset files or models if you need them. The parameters are regular expressions.

slt assets download --overwrite true '.*\.json' '.*\.mp4'
slt assets download --progress-bar true '.*/tlm_14.0M.pt'

By default, auto-download is enabled. The default download directory is /install-directory/sign_language_translator/sign-language-resources/. (See slt.config.settings.Settings.)

Tree#

View the directory structure of the current state of the assets folder.

$ slt assets tree
assets
├── checksum.json
├── pk-dictionary-mapping.json
├── text_preprocessing.json
├── datasets
│   ├── pk-hfad-1.landmarks-mediapipe-world-csv.zip
│   └── pk-hfad-1.videos-mp4.zip
│
├── landmarks
│   ├── pk-hfad-1_1.landmarks-mediapipe-pose-2-hand-1.csv
│   └── pk-hfad-1_10.landmarks-mediapipe-pose-2-hand-1.csv
│
├── models
│   ├── ur-supported-token-unambiguous-mixed-ngram-w1-w6-lm.pkl
│   └── mediapipe
│       └── pose_landmarker_heavy.task
│
└── videos
   ├── pk-hfad-1_1.mp4
   ├── pk-hfad-1_10.mp4
   └── pk-hfad-1_مجھے.mp4
Directories only#
slt assets tree --files false
Omit files matching a regex#
slt assets tree --ignore ".*mp4" -i ".*csv"
Tree of a custom directory#
slt assets tree -d "path/to/your/assets"

Translate#

Translate text to sign language using a rule-based model:

slt translate --model-code "rule-based" \
--text-lang urdu --sign-lang psl --sign-format 'video' \
"وہ سکول گیا تھا۔"

Complete#

Auto-complete a sentence using our language models. This model can write sentences composed of supported words only:

$ slt complete --end-token ">" --model-code urdu-mixed-ngram "<"
('<', 'وہ', ' ', 'یہ', ' ', 'نہیں', ' ', 'چاہتا', ' ', 'تھا', '۔', '>')

These models predict the next characters until a specified token appears, e.g. generating names using a mixture of models:

$ slt complete \
   --model-code unigram-names --model-weight 1 \
   --model-code bigram-names -w 2 \
   -m trigram-names -w 3 \
   --selection-strategy merge --beam-width 2.5 --end-token "]" \
   "[s"
[shazala]

Embed Videos#

Embed videos into a sequence of vectors using a selected embedding model:

slt embed videos/*.mp4 --model-code mediapipe-pose-2-hand-1 --embedding-type world \
   --processes 4 --save-format csv --output-dir ./embeddings
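The saved files can then be loaded like any CSV. A quick sketch, assuming the pipeline writes one file per input video (the exact output file name and naming scheme are assumptions):

Loading a saved video embedding#
import pandas as pd

# hypothetical file name; depends on the input video and the pipeline's naming scheme
embedding = pd.read_csv("embeddings/video.csv").to_numpy()
print(embedding.shape)  # (n_frames, n_landmarks * 5), as in the Vision section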

Embed texts#

Embed texts into a PyTorch state-dict pickle file using a selected embedding model. The target file is a .pt containing {"tokens": ..., "vectors": ...}.

slt embed "hello" "world" "more-tokens.txt" --model-code lookup-en-fasttext-cc
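A quick sketch of reading the saved file back (the output file name is an assumption; the keys follow the layout described above):

Loading saved text embeddings#
import torch

# the file is a plain pickled dict, so torch.load can read it
data = torch.load("embeddings.pt")
print(len(data["tokens"]), data["vectors"].shape)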

GUI#

This functionality is not available yet! (It will probably be a Gradio-based light frontend.)

import sign_language_translator as slt

slt.launch_gui()
slt gui