sign_language_translator.models.language_models.ngram_language_model module

This module provides a simple n-gram-based statistical language model implementation.

Classes:
  • NgramLanguageModel: A simple n-gram-based statistical language model.

class sign_language_translator.models.language_models.ngram_language_model.NgramLanguageModel(window_size=1, unknown_token='<unk>', sampling_temperature=1.0, name=None)

Bases: LanguageModel

NgramLanguageModel is a statistical language model based on n-grams. It supports training on a corpus, fine-tuning on an additional corpus, sampling the next token for a given context, and saving the model to or loading it from a JSON file.

Attributes:
  • window_size (int): The size of the context window for predicting the next token.

  • unknown_token (str): The token representation used for unknown or out-of-vocabulary tokens.

  • sampling_temperature (float): A temperature parameter controlling the sampling probabilities during token generation.

  • name (str): The name of the language model object (optional).

Methods:
  • train(self, training_corpus): Alias for fit(). Trains the language model on the given training corpus.

  • fit(self, training_corpus): Trains the language model on the given training corpus.

  • finetune(self, training_corpus, weightage: float): Fine-tunes the language model on an additional training corpus with a specified weightage.

  • next(self, context: Iterable) -> Tuple[Any, float]: Samples the next token from the learned distribution based on the given context.

  • next_all(self, context: Iterable) -> Tuple[List[Any], List[float]]: Returns a list of possible next tokens and their associated probabilities based on the given context.

  • load(model_path: str) -> NgramLanguageModel: Static method. Deserializes the model from a JSON file.

  • save(self, model_path: str, indent=None, ensure_ascii=False, overwrite=False): Serializes the model to a JSON file.

  • __str__(self) -> str: Returns a string representation of the NgramLanguageModel instance.

Private Methods:
  • _to_key_datatype(self, item: Iterable) -> Tuple: Converts an iterable item to the appropriate datatype for use as a key in the model dictionary.

  • _count_ngrams(self, training_corpus: List[Iterable], n: int) -> Dict[Tuple, int]: Counts the occurrences of n-grams in the training corpus.

  • _group_by_context(self, counts: Dict[Tuple, int]): Groups the n-grams by context and calculates the weights for each next token.

  • _count_parameters(self): Counts the total number of weights/probabilities in the model.
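
The counting and grouping steps described for the private helpers can be sketched in standalone Python. This is a minimal illustration, not the library's implementation: the function names count_ngrams and group_by_context are hypothetical, and normalizing counts into probabilities is an assumption about what "weights" means here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def count_ngrams(corpus: Iterable[Iterable], n: int) -> Dict[Tuple, int]:
    """Count every contiguous n-gram (stored as a tuple) in each sequence."""
    counts: Counter = Counter()
    for sequence in corpus:
        items = tuple(sequence)
        # Sequences shorter than n produce an empty range and contribute nothing.
        for start in range(len(items) - n + 1):
            counts[items[start : start + n]] += 1
    return dict(counts)


def group_by_context(counts: Dict[Tuple, int]) -> Dict[Tuple, Tuple[List, List[float]]]:
    """Map each context (all items but the last) to next tokens and probabilities."""
    grouped: Dict[Tuple, Dict] = defaultdict(dict)
    for ngram, count in counts.items():
        grouped[ngram[:-1]][ngram[-1]] = count

    model = {}
    for context, next_counts in grouped.items():
        total = sum(next_counts.values())
        tokens = list(next_counts)
        # Assumption: weights are relative frequencies within the context.
        model[context] = (tokens, [next_counts[t] / total for t in tokens])
    return model


# Assumption: a context window of 1 token corresponds to counting bigrams.
corpus = ["abab", "abc"]
bigram_counts = count_ngrams(corpus, n=2)
model = group_by_context(bigram_counts)
```

In this sketch, tuples serve as dictionary keys because they are hashable, which is presumably the role _to_key_datatype plays in the real class.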

finetune(training_corpus, weightage: float) -> None

Fine-tunes the language model on an additional training corpus with a specified weightage.

Parameters:
  • training_corpus (Iterable[Iterable]) – The additional training corpus, an iterable of sequences representing the text data.

  • weightage (float) – The weightage for the additional training corpus, a value between 0.0 and 1.0 (inclusive). A weightage of 0.0 means no impact from the additional corpus, while a weightage of 1.0 means the model is completely updated based on the additional corpus.

Returns:

None

Raises:

AssertionError – If the weightage is outside the valid range [0.0, 1.0].
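
One way to picture the weightage parameter is as a linear blend between the existing distribution and the one estimated from the additional corpus. The sketch below assumes linear interpolation, which matches the documented endpoints (0.0 and 1.0) but is not confirmed as the library's actual formula; mix_distributions is a hypothetical helper.

```python
from typing import Dict


def mix_distributions(
    old: Dict[str, float], new: Dict[str, float], weightage: float
) -> Dict[str, float]:
    """Blend two next-token distributions; weightage is the share given to `new`."""
    assert 0.0 <= weightage <= 1.0, "weightage must be within [0.0, 1.0]"
    tokens = set(old) | set(new)
    return {
        # Assumption: linear interpolation between the two distributions.
        token: (1.0 - weightage) * old.get(token, 0.0) + weightage * new.get(token, 0.0)
        for token in tokens
    }


old_probs = {"a": 0.5, "b": 0.5}
new_probs = {"b": 1.0}

unchanged = mix_distributions(old_probs, new_probs, 0.0)  # same as old_probs
replaced = mix_distributions(old_probs, new_probs, 1.0)   # fully follows new_probs
blended = mix_distributions(old_probs, new_probs, 0.5)    # halfway between the two
```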

fit(training_corpus) -> None

Trains the language model on the given training corpus.

Parameters:

training_corpus (Iterable[Iterable]) – The training corpus, an iterable of sequences representing the text data.

Returns:

None

static load(model_path: str) -> NgramLanguageModel

Deserializes the model (from JSON).

Parameters:

model_path (str) – The source file path.

Returns:

The deserialized NgramLanguageModel instance.

Return type:

NgramLanguageModel

next(context: Iterable) -> Tuple[Any, float]

Generates the next token based on the given context and also returns its probability.

Parameters:

context (Iterable) – A partial sequence used as the context for generating the next token.

Returns:

The next token and its associated probability.

Token has the same type as the items in the context iterable.

Return type:

Tuple[Any, float]
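
The interaction between sampling_temperature and next() can be illustrated with a standalone sketch. It assumes the common formulation in which each probability is raised to the power 1/T and the result is renormalized; whether the library uses this exact formula, and whether the returned probability is the temperature-adjusted one, are assumptions. The helpers apply_temperature and sample_next are hypothetical.

```python
import random
from typing import Any, List, Optional, Sequence, Tuple


def apply_temperature(probabilities: Sequence[float], temperature: float) -> List[float]:
    """Rescale probabilities as p ** (1 / T), then renormalize so they sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probabilities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_next(
    tokens: Sequence[Any],
    probabilities: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Any, float]:
    """Sample one token and return it with its (temperature-adjusted) probability."""
    rng = rng or random.Random()
    adjusted = apply_temperature(probabilities, temperature)
    index = rng.choices(range(len(tokens)), weights=adjusted, k=1)[0]
    return tokens[index], adjusted[index]


tokens = ["a", "b"]
probabilities = [0.8, 0.2]

sharper = apply_temperature(probabilities, 0.5)  # T < 1 favors the likeliest token
flatter = apply_temperature(probabilities, 2.0)  # T > 1 moves toward uniform
token, probability = sample_next(tokens, probabilities, rng=random.Random(0))
```

Under this formulation a temperature of 1.0 leaves the learned distribution unchanged.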

next_all(context: Iterable) -> Tuple[List[Any], List[float]]

Computes the probabilities of all possible next tokens for the given context and returns both the tokens and their probabilities.

Parameters:

context (Iterable) – A partial sequence used as the context for generating the next tokens.

Returns:

All next tokens and their probabilities.

The tokens have the same type as the items in the context iterable.

Return type:

Tuple[List[Any], List[float]]

save(model_path: str, indent=None, ensure_ascii=False, overwrite=False) -> None

Serializes the model (as JSON).

Parameters:
  • model_path (str) – The target file path. An existing file at this path is replaced only when overwrite is True.

  • indent (Optional[int]) – The indentation level for formatting the JSON data (optional).

  • ensure_ascii (bool) – If True, non-ASCII characters are escaped in the output. Defaults to False.

  • overwrite (bool) – If False, raises FileExistsError if the model already exists. Defaults to False.
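
The parameters above map closely onto Python's json.dump. The following is a minimal sketch of a save routine with the same parameter names, assuming the model is stored as a plain JSON-serializable dictionary; save_json is a hypothetical helper and the real method's internals may differ.

```python
import json
import os
import tempfile
from typing import Any, Optional


def save_json(
    data: Any,
    model_path: str,
    indent: Optional[int] = None,
    ensure_ascii: bool = False,
    overwrite: bool = False,
) -> None:
    """Write data as JSON, refusing to replace an existing file unless asked to."""
    if os.path.exists(model_path) and not overwrite:
        raise FileExistsError(f"A file already exists at {model_path!r}")
    with open(model_path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=indent, ensure_ascii=ensure_ascii)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.json")
    save_json({"name": "demo"}, path)

    try:
        save_json({"name": "demo"}, path)  # overwrite defaults to False
        blocked = False
    except FileExistsError:
        blocked = True

    # With ensure_ascii=False, non-ASCII characters are written as-is.
    save_json({"name": "café"}, path, overwrite=True)
    with open(path, encoding="utf-8") as file:
        saved_text = file.read()
```

Keeping ensure_ascii=False by default keeps the file human-readable when tokens are written in non-Latin scripts.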

train(training_corpus)

Alias for fit(). Trains the language model on the given training corpus.

Parameters:

training_corpus (Iterable[Iterable]) – The training corpus, an iterable of sequences representing the text data.

Returns:

None