sign_language_translator.models.language_models.ngram_language_model module
This module provides a simple n-gram-based statistical language model implementation.
Classes:
- NgramLanguageModel: A simple n-gram-based statistical language model.
- class sign_language_translator.models.language_models.ngram_language_model.NgramLanguageModel(window_size=1, unknown_token='<unk>', sampling_temperature=1.0, name=None)
Bases: LanguageModel
NgramLanguageModel is a statistical language model based on n-grams. It provides functionality for training the model on a given training corpus, generating the next token based on a context, and saving/loading the model.
Attributes:
- window_size (int): The size of the context window for predicting the next token.
- unknown_token (str): The token representation used for unknown or out-of-vocabulary tokens.
- sampling_temperature (float): A temperature parameter controlling the sampling probabilities during token generation.
- name (str): The name of the language model object (optional).
Methods:
- train(self, training_corpus): Alias for fit(). Trains the language model on the given training corpus.
- fit(self, training_corpus): Trains the language model on the given training corpus.
- finetune(self, training_corpus, weightage: float): Fine-tunes the language model on an additional training corpus with a specified weightage.
- next(self, context: Iterable) -> Tuple[Any, float]: Samples the next token from the learned distribution based on the given context.
- next_all(self, context: Iterable) -> Tuple[List[Any], List[float]]: Returns a list of possible next tokens and their associated probabilities based on the given context.
- load(model_path: str) -> NgramLanguageModel: Deserializes the model from a JSON file.
- save(self, model_path: str, indent=None, ensure_ascii=False, overwrite=False): Serializes the model to a JSON file.
- __str__(self) -> str: Returns a string representation of the NgramLanguageModel instance.
Private Methods:
- _to_key_datatype(self, item: Iterable) -> Tuple: Converts an iterable item to the appropriate datatype for use as a key in the model dictionary.
- _count_ngrams(self, training_corpus: List[Iterable], n: int) -> Dict[Tuple, int]: Counts the occurrences of n-grams in the training corpus.
- _group_by_context(self, counts: Dict[Tuple, int]): Groups the n-grams by context and calculates the weights for each next token.
- _count_parameters(self): Counts the total number of weights/probabilities in the model.
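The private helpers listed above suggest a two-step pipeline: count n-grams, then group them by context. The following is a minimal, self-contained sketch of that idea, assuming each n-gram is split into a context of `window_size` tokens followed by one next token. The function names `count_ngrams` and `group_by_context` and the returned dictionary layout are illustrative assumptions, not the library's actual implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def count_ngrams(corpus: Iterable[Sequence[Hashable]], n: int) -> Dict[Tuple, int]:
    """Count every run of n consecutive tokens in each sequence."""
    counts: Counter = Counter()
    for sequence in corpus:
        tokens = tuple(sequence)
        for start in range(len(tokens) - n + 1):
            counts[tokens[start:start + n]] += 1
    return dict(counts)


def group_by_context(counts: Dict[Tuple, int]) -> Dict[Tuple, Tuple[List, List[float]]]:
    """Map each context to its next tokens and their relative frequencies."""
    grouped: Dict[Tuple, Dict[Hashable, int]] = defaultdict(dict)
    for ngram, count in counts.items():
        # Assumption: the last item is the predicted token, the rest is context.
        context, next_token = ngram[:-1], ngram[-1]
        grouped[context][next_token] = count

    model = {}
    for context, next_counts in grouped.items():
        total = sum(next_counts.values())
        tokens = list(next_counts)
        model[context] = (tokens, [next_counts[t] / total for t in tokens])
    return model


window_size = 1
corpus = ["abab", "abc"]  # each string is a sequence of character tokens
model = group_by_context(count_ngrams(corpus, n=window_size + 1))
print(model[("a",)])  # (['b'], [1.0])
```

In this sketch, a context of `window_size` tokens requires counting n-grams of length `window_size + 1`.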
- finetune(training_corpus, weightage: float) -> None
Fine-tunes the language model on an additional training corpus with a specified weightage.
- Parameters:
training_corpus (Iterable[Iterable]) – The additional training corpus, an iterable of sequences representing the text data.
weightage (float) – The weightage for the additional training corpus, a value between 0.0 and 1.0 (inclusive). A weightage of 0.0 means no impact from the additional corpus, while a weightage of 1.0 means the model is completely updated based on the additional corpus.
- Returns:
None
- Raises:
AssertionError – If the weightage is outside the valid range [0.0, 1.0].
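One way to read the weightage semantics described for finetune is as a linear interpolation between the existing next-token distribution and the one estimated from the additional corpus. The sketch below illustrates that reading only; the helper name `mix_distributions` is hypothetical, and whether the library combines the distributions exactly this way is an assumption.

```python
from typing import Dict, Hashable


def mix_distributions(
    old: Dict[Hashable, float],
    new: Dict[Hashable, float],
    weightage: float,
) -> Dict[Hashable, float]:
    """Blend two next-token distributions for the same context.

    Assumed rule: p = (1 - weightage) * p_old + weightage * p_new.
    """
    assert 0.0 <= weightage <= 1.0, "weightage must be within [0.0, 1.0]"
    # Keep the old token order and append tokens seen only in the new corpus.
    tokens = list(old) + [t for t in new if t not in old]
    return {
        t: (1.0 - weightage) * old.get(t, 0.0) + weightage * new.get(t, 0.0)
        for t in tokens
    }


old = {"a": 0.5, "b": 0.5}
new = {"b": 1.0}
print(mix_distributions(old, new, weightage=0.25))  # {'a': 0.375, 'b': 0.625}
```

With this rule, a weightage of 0.0 returns the old distribution unchanged and a weightage of 1.0 leaves only the probabilities from the additional corpus, which matches the endpoints described above.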
- fit(training_corpus) -> None
Trains the language model on the given training corpus.
- Parameters:
training_corpus (Iterable[Iterable]) – The training corpus, an iterable of sequences representing the text data.
- Returns:
None
- static load(model_path: str) -> NgramLanguageModel
Deserializes the model (from JSON).
- Parameters:
model_path (str) – The source file path.
- Returns:
The deserialized NgramLanguageModel instance.
- Return type:
NgramLanguageModel
- next(context: Iterable) -> Tuple[Any, float]
Samples the next token for the given context and returns it together with its probability.
- Parameters:
context (Iterable) – A sequence of tokens used as the context for generating the next token.
- Returns:
- The next token and its associated probability.
The token has the same type as the items in the context iterable.
- Return type:
Tuple[Any, float]
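To illustrate how a sampling temperature can reshape the distribution before a token is drawn, here is a minimal sketch. It assumes the common rule of raising each probability to the power 1/T and renormalizing, and it returns the probability under the rescaled distribution. Both choices, along with the names `apply_temperature` and `sample_next`, are assumptions for illustration and may differ from what the library does.

```python
import random
from typing import Any, List, Optional, Sequence, Tuple


def apply_temperature(probabilities: Sequence[float], temperature: float) -> List[float]:
    """Rescale probabilities as p ** (1 / T), then renormalize to sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in probabilities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_next(
    tokens: Sequence[Any],
    probabilities: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Any, float]:
    """Draw one token and return it with its (rescaled) probability."""
    rng = rng or random.Random()
    weights = apply_temperature(probabilities, temperature)
    index = rng.choices(range(len(tokens)), weights=weights, k=1)[0]
    return tokens[index], weights[index]


# T < 1 sharpens the distribution, T > 1 flattens it.
print(apply_temperature([0.8, 0.2], temperature=0.5))
print(sample_next(["a", "b"], [0.8, 0.2], temperature=1.0, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible without touching the global random state.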
- next_all(context: Iterable) -> Tuple[List[Any], List[float]]
Computes the probabilities of all possible next tokens for the given context and returns the tokens together with their probabilities.
- Parameters:
context (Iterable) – A sequence of tokens used as the context for generating the next tokens.
- Returns:
- All next tokens and their probabilities.
The tokens have the same type as the items in the context iterable.
- Return type:
Tuple[List[Any], List[float]]
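A lookup of this kind can be sketched as a dictionary access keyed by the tail of the context. The example below assumes that only the last `window_size` items of the context are used and that an unseen context yields two empty lists. Both behaviors, the `Model` layout, and the name `lookup_next_all` are hypothetical; the library may instead fall back to the unknown token or raise an error.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical layout: context tuple -> (next tokens, their probabilities)
Model = Dict[Tuple, Tuple[List[Any], List[float]]]


def lookup_next_all(
    model: Model, context: Iterable, window_size: int
) -> Tuple[List[Any], List[float]]:
    """Return the parallel lists of next tokens and probabilities for a context."""
    items = tuple(context)
    # Assumption: only the most recent `window_size` items form the lookup key.
    key = items[-window_size:] if window_size > 0 else ()
    tokens, probabilities = model.get(key, ([], []))
    return list(tokens), list(probabilities)


model: Model = {("a", "b"): (["c", "d"], [0.75, 0.25])}
print(lookup_next_all(model, "xxab", window_size=2))  # (['c', 'd'], [0.75, 0.25])
```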
- save(model_path: str, indent=None, ensure_ascii=False, overwrite=False) -> None
Serializes the model (as JSON).
- Parameters:
model_path (str) – The target file path. Whether an existing file at this path is replaced is controlled by overwrite.
indent (Optional[int]) – The indentation level for formatting the JSON data (optional).
ensure_ascii (bool) – If True, non-ASCII characters are escaped in the JSON output. Defaults to False.
overwrite (bool) – If False, raises FileExistsError if a file already exists at model_path. Defaults to False.
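The save and load pair can be pictured as a JSON round trip with an overwrite guard. The sketch below is a standalone illustration using only the standard library. The file layout (a list of entries with `context`, `tokens`, and `probabilities` fields) and the function names `save_model` and `load_model` are assumptions, not the library's actual serialization format.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical layout: context tuple -> (next tokens, their probabilities)
Model = Dict[Tuple, Tuple[List[Any], List[float]]]


def save_model(
    model: Model,
    model_path: str,
    indent: Optional[int] = None,
    ensure_ascii: bool = False,
    overwrite: bool = False,
) -> None:
    """Write the model as JSON, refusing to replace a file unless asked to."""
    if os.path.exists(model_path) and not overwrite:
        raise FileExistsError(f"{model_path} already exists")
    # JSON object keys must be strings, so tuple contexts are stored as lists.
    entries = [
        {"context": list(context), "tokens": tokens, "probabilities": probs}
        for context, (tokens, probs) in model.items()
    ]
    with open(model_path, "w", encoding="utf-8") as f:
        json.dump({"entries": entries}, f, indent=indent, ensure_ascii=ensure_ascii)


def load_model(model_path: str) -> Model:
    """Read the JSON file back and restore tuple contexts."""
    with open(model_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return {
        tuple(e["context"]): (e["tokens"], e["probabilities"])
        for e in data["entries"]
    }


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.json")
    original: Model = {("سلام",): (["دنیا"], [1.0])}
    save_model(original, path, indent=2)
    print(load_model(path) == original)  # True
```

Leaving `ensure_ascii` as False keeps non-ASCII tokens readable in the file, which is why the sketch opens the file with an explicit UTF-8 encoding.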