nltk.translate.stack_decoder module¶
A decoder that uses stacks to implement phrase-based translation.
In phrase-based translation, the source sentence is segmented into phrases of one or more words, and translations for those phrases are used to build the target sentence.
Hypothesis data structures are used to keep track of the source words translated so far and the partial output. A hypothesis can be expanded by selecting an untranslated phrase, looking up its translation in a phrase table, and appending that translation to the partial output. Translation is complete when a hypothesis covers all source words.
The search space is huge because the source sentence can be segmented in different ways, the source phrases can be selected in any order, and there could be multiple translations for the same source phrase in the phrase table. To make decoding tractable, stacks are used to limit the number of candidate hypotheses by doing histogram and/or threshold pruning.
Hypotheses with the same number of words translated are placed in the same stack. In histogram pruning, each stack has a size limit, and the hypothesis with the lowest score is removed when the stack is full. In threshold pruning, hypotheses that score below a certain threshold of the best hypothesis in that stack are removed.
Hypothesis scoring can include various factors such as phrase translation probability, language model probability, length of translation, cost of remaining words to be translated, and so on.
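As a rough illustration of the pruning described above (not the decoder's internal stack implementation; the function name and values are made up for this sketch), a stack can be pictured as a list of (log score, hypothesis) pairs:

from math import log

def prune(stack, stack_size=3, beam_threshold=0.25):
    # Threshold pruning: interpret beam_threshold as a factor in probability
    # space, i.e. drop hypotheses whose log score is more than
    # log(beam_threshold) below the best score in the stack.
    best = max(score for score, _ in stack)
    stack = [(s, h) for s, h in stack if s >= best + log(beam_threshold)]
    # Histogram pruning: keep only the stack_size highest-scoring hypotheses.
    stack.sort(key=lambda pair: pair[0], reverse=True)
    return stack[:stack_size]

toy_stack = [(log(0.4), 'hyp A'), (log(0.3), 'hyp B'),
             (log(0.05), 'hyp C'), (log(0.01), 'hyp D')]
prune(toy_stack)  # 'hyp C' and 'hyp D' score below 0.25 * 0.4 and are dropped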
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
- class nltk.translate.stack_decoder.StackDecoder[source]¶
Bases: object
Phrase-based stack decoder for machine translation
>>> from math import log
>>> from nltk.translate import PhraseTable, StackDecoder
>>> phrase_table = PhraseTable()
>>> phrase_table.add(('niemand',), ('nobody',), log(0.8))
>>> phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
>>> phrase_table.add(('erwartet',), ('expects',), log(0.8))
>>> phrase_table.add(('erwartet',), ('expecting',), log(0.2))
>>> phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
>>> phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
>>> phrase_table.add(('!',), ('!',), log(0.8))

>>> # nltk.model should be used here once it is implemented
>>> from collections import defaultdict
>>> language_prob = defaultdict(lambda: -999.0)
>>> language_prob[('nobody',)] = log(0.5)
>>> language_prob[('expects',)] = log(0.4)
>>> language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
>>> language_prob[('!',)] = log(0.1)
>>> language_model = type('', (object,), {'probability_change': lambda self, context, phrase: language_prob[phrase], 'probability': lambda self, phrase: language_prob[phrase]})()

>>> stack_decoder = StackDecoder(phrase_table, language_model)

>>> stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
['nobody', 'expects', 'the', 'spanish', 'inquisition', '!']
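The decoder's search behaviour can be tuned through the attributes documented below; the particular values here are arbitrary and only show how the attributes are set before calling translate:

>>> stack_decoder.stack_size = 50          # histogram pruning limit per stack
>>> stack_decoder.beam_threshold = 0.25    # threshold pruning relative to the best hypothesis
>>> stack_decoder.distortion_factor = 0.5  # amount of source phrase reordering allowed
>>> stack_decoder.word_penalty = 0.0       # no length preference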
- __init__(phrase_table, language_model)[source]¶
- Parameters
phrase_table (PhraseTable) – Table of translations for source language phrases and the log probabilities for those translations.
language_model (object) – Target language model. Must define a probability_change method that calculates the change in log probability of a sentence, if a given string is appended to it. This interface is experimental and will likely be replaced with nltk.model once it is implemented.
- beam_threshold¶
- float: Hypotheses that score below this factor of the best hypothesis in a stack are dropped from consideration. Value between 0.0 and 1.0.
- compute_future_scores(src_sentence)[source]¶
Determines the approximate scores for translating every subsequence in src_sentence
Future scores can be used as a look-ahead to determine the difficulty of translating the remaining parts of src_sentence.
- Returns
Scores of subsequences referenced by their start and end positions. For example, result[2][5] is the score of the subsequence covering positions 2, 3, and 4.
- Return type
dict(int: dict(int: float))
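For instance, with the decoder built in the class example above, the documented indexing convention can be used as follows (assignment only, since the exact log score depends on the phrase table and language model):

>>> scores = stack_decoder.compute_future_scores(
...     ['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
>>> score_2_to_5 = scores[2][5]  # subsequence covering positions 2, 3 and 4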
- property distortion_factor¶
- float: Amount of reordering of source phrases.
Lower values favour monotone translation, suitable when word order is similar for both source and target languages. Value between 0.0 and 1.0. Default 0.5.
- expansion_score(hypothesis, translation_option, src_phrase_span)[source]¶
Calculate the score of expanding hypothesis with translation_option
- Parameters
hypothesis (_Hypothesis) – Hypothesis being expanded
translation_option (PhraseTableEntry) – Information about the proposed expansion
src_phrase_span (tuple(int, int)) – Word position span of the source phrase
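The expansion score combines the factors listed in the module overview. The following is only a hypothetical sketch of how such log-domain terms could be combined, not the method's actual implementation:

def sketch_expansion_score(hypothesis_score, phrase_log_prob,
                           lm_log_prob_change, distortion_term,
                           word_penalty, num_target_words):
    # All terms are in the log domain; higher is better.
    score = hypothesis_score            # score accumulated by the hypothesis so far
    score += phrase_log_prob            # phrase translation probability from the phrase table
    score += lm_log_prob_change         # language model change for the appended target phrase
    score += distortion_term            # reordering cost (see distortion_factor)
    score -= word_penalty * num_target_words  # length preference (see word_penalty)
    return score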
- find_all_src_phrases(src_sentence)[source]¶
Finds all subsequences in src_sentence that have a phrase translation in the translation table
- Returns
Subsequences that have a phrase translation, represented as a table of lists of end positions. For example, if result[2] is [5, 6, 9], then there are three phrases starting from position 2 in src_sentence, ending at positions 5, 6, and 9 exclusive. The end positions are listed in ascending order.
- Return type
list(list(int))
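The snippet below only illustrates the documented return format by expanding such a table of end positions into explicit (start, end) spans; it is not part of the decoder's API:

# result[start] lists the exclusive end positions of phrases starting at `start`.
phrase_ends = [[], [], [5, 6, 9], [], [], []]

spans = [(start, end)
         for start, ends in enumerate(phrase_ends)
         for end in ends]
# phrase_ends[2] == [5, 6, 9] yields the spans (2, 5), (2, 6) and (2, 9)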
- future_score(hypothesis, future_score_table, sentence_length)[source]¶
Determines the approximate score for translating the untranslated words in hypothesis
- stack_size¶
- int: Maximum number of hypotheses to consider in a stack.
Higher values increase the likelihood of a good translation, but increase processing time.
- translate(src_sentence)[source]¶
- Parameters
src_sentence (list(str)) – Sentence to be translated
- Returns
Translated sentence
- Return type
list(str)
- static valid_phrases(all_phrases_from, hypothesis)[source]¶
Extract phrases from all_phrases_from that contain words that have not been translated by hypothesis
- Parameters
all_phrases_from (list(list(int))) – Phrases represented by their spans, in the same format as the return value of find_all_src_phrases
- Returns
A list of phrases, represented by their spans, that cover untranslated positions.
- Return type
list(tuple(int, int))
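The real method works on a hypothesis object and its coverage information, but the filtering can be pictured as keeping only the spans whose positions are all still untranslated; the function below is a hypothetical sketch under that assumption, not the actual implementation:

def sketch_valid_phrases(phrase_ends, untranslated_positions):
    # Keep a span only if every position it covers is still untranslated.
    untranslated = set(untranslated_positions)
    return [(start, end)
            for start, ends in enumerate(phrase_ends)
            for end in ends
            if all(pos in untranslated for pos in range(start, end))]

# With positions 2-4 untranslated, only spans inside that region survive:
sketch_valid_phrases([[1], [2], [4, 5], [4], [5]], untranslated_positions=[2, 3, 4])
# -> [(2, 4), (2, 5), (3, 4), (4, 5)]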
- word_penalty¶
- float: Influences the translation length exponentially.
If positive, shorter translations are preferred. If negative, longer translations are preferred. If zero, no penalty is applied.
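Because hypothesis scores are log probabilities, an additive per-word penalty amounts to a multiplicative factor per output word, which is why the effect on length is exponential. A small illustration of that relationship, assuming the penalty is applied once per target word:

from math import exp

word_penalty = 0.1  # arbitrary positive value for illustration

# Subtracting word_penalty per word from a log-probability score is the
# same as multiplying the probability by exp(-word_penalty) per word,
# so a translation that is n words longer is discounted by exp(-word_penalty * n).
discount_for_3_extra_words = exp(-word_penalty * 3)  # ~0.74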