nltk.translate.ibm_model module
Common methods and classes for all IBM models. See IBMModel1, IBMModel2, IBMModel3, IBMModel4, and IBMModel5 for specific implementations.
The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus.
The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables.
Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention.
Each target word is aligned to exactly one source word or the NULL token.
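For instance, the indexing convention can be pictured with plain Python tuples. This is an illustrative sketch of the convention only, not library code:

    # Source sentence: the NULL token (None) occupies index 0,
    # so the first real word 'das' is at position 1.
    src_sentence = (None, 'das', 'haus')
    # Target sentence: a dummy element fills index 0 so that
    # the first word 'the' starts at position 1.
    trg_sentence = ('UNUSED', 'the', 'house')
    # alignment[j] is the source position aligned to target position j.
    # Here 'the' (j=1) aligns to 'das' (i=1) and 'house' (j=2) to 'haus' (i=2).
    # alignment[0] is a placeholder; position 0 carries no alignment.
    alignment = (0, 1, 2)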
References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263-311.
- class nltk.translate.ibm_model.AlignmentInfo
Bases: object
Helper data object for training IBM Models 3 and up
Read-only. For a source sentence and its counterpart in the target language, this class holds information about the sentence pair’s alignment, cepts, and fertility.
Warning: Alignments are one-indexed here, in contrast to nltk.translate.Alignment and AlignedSent, which are zero-indexed. This class is not meant to be used outside of IBM models.
- alignment
tuple(int): Alignment function. alignment[j] is the position in the source sentence that is aligned to position j in the target sentence.
- center_of_cept(i)
- Returns
The ceiling of the average of the positions of the words in the tablet of cept i, or 0 if i is None
- cepts
list(list(int)): The positions of the target words, in ascending order, aligned to a source word position. For example, cepts[4] = (2, 3, 7) means that words in positions 2, 3, and 7 of the target sentence are aligned to the word in position 4 of the source sentence.
- is_head_word(j)
- Returns
Whether the word in position j of the target sentence is a head word
- previous_in_tablet(j)
- Returns
The position of the previous word that is in the same tablet as j, or None if j is the first word of the tablet
- score
float: Optional. Probability of alignment, as defined by the IBM model that assesses this alignment
- src_sentence
tuple(str): Source sentence referred to by this object. Should include NULL token (None) in index 0.
- trg_sentence
tuple(str): Target sentence referred to by this object. Should have a dummy element in index 0 so that the first word starts from index 1.
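Putting the attributes together, here is a minimal sketch that constructs an AlignmentInfo by hand; the positional argument order (alignment, src_sentence, trg_sentence, cepts) is an assumption for illustration, and in normal use the models build these objects internally:

    from nltk.translate.ibm_model import AlignmentInfo

    src_sentence = (None, 'das', 'haus')       # NULL token at index 0
    trg_sentence = ('UNUSED', 'the', 'house')  # dummy element at index 0
    alignment = (0, 1, 2)    # target word j aligns to source word alignment[j]
    cepts = [[], [1], [2]]   # cepts[i]: target positions aligned to source position i
    info = AlignmentInfo(alignment, src_sentence, trg_sentence, cepts)
    print(info.is_head_word(1))        # True: position 1 is the first word of its tablet
    print(info.center_of_cept(2))      # 2: ceiling of the average of positions in cepts[2]
    print(info.previous_in_tablet(1))  # None: first word of its tablet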
- class nltk.translate.ibm_model.Counts
Bases: object
Data object to store counts of various parameters during training
- class nltk.translate.ibm_model.IBMModel
Bases: object
Abstract base class for all IBM models
- MIN_PROB = 1e-12
- best_model2_alignment(sentence_pair, j_pegged=None, i_pegged=0)
Finds the best alignment according to IBM Model 2
Used as a starting point for hill climbing in Models 3 and above, because it is easier to compute than the best alignments in higher models.
- Parameters
sentence_pair (AlignedSent) – Source and target language sentence pair to be word-aligned
j_pegged (int) – If specified, the alignment point of j_pegged will be fixed to i_pegged
i_pegged (int) – Source sentence position that j_pegged will be aligned to, if j_pegged is specified
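A brief usage sketch, assuming a toy two-sentence corpus (note that the IBM model code treats the first argument of AlignedSent as the target sentence and the second as the source):

    from nltk.translate import AlignedSent, IBMModel2

    bitext = [
        AlignedSent(['klein', 'ist', 'das', 'haus'],
                    ['the', 'house', 'is', 'small']),
        AlignedSent(['das', 'haus', 'ist', 'gross'],
                    ['the', 'house', 'is', 'big']),
    ]
    ibm2 = IBMModel2(bitext, 5)                   # 5 EM iterations
    best = ibm2.best_model2_alignment(bitext[0])  # an AlignmentInfo
    print(best.alignment)  # best.alignment[j]: source position for target word j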
- hillclimb(alignment_info, j_pegged=None)
Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one. There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might get stuck in a local maximum.
- Parameters
j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged remains unchanged
- Returns
The best alignment found from hill climbing
- Return type
AlignmentInfo
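In outline, the climb proceeds as in this simplified sketch; it is not the library's exact implementation, only the loop described above:

    def hillclimb_sketch(model, alignment_info, j_pegged=None):
        # Repeatedly move to the best-scoring neighbor until no
        # neighbor improves on the current alignment (a local maximum).
        best = alignment_info
        best_prob = model.prob_t_a_given_s(best)
        while True:
            improved = False
            for neighbor in model.neighboring(best, j_pegged):
                prob = model.prob_t_a_given_s(neighbor)
                if prob > best_prob:
                    best, best_prob = neighbor, prob
                    improved = True
            if not improved:
                return best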
- neighboring(alignment_info, j_pegged=None)
Determine the neighbors of alignment_info, obtained by moving or swapping one alignment point
- Parameters
j_pegged (int) – If specified, neighbors that have a different alignment point from j_pegged will not be considered
- Returns
A set of neighboring alignments represented by their AlignmentInfo
- Return type
set(AlignmentInfo)
- prob_t_a_given_s(alignment_info)
Probability of target sentence and an alignment given the source sentence
All required information is assumed to be in alignment_info and self. Derived classes should override this method.
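A schematic override might look like the following. ToyModel is a hypothetical class for illustration; it scores only the lexical translation component, whereas the real models also multiply in alignment, fertility, or distortion probabilities, and it assumes the translation_table set up by the base class:

    from nltk.translate.ibm_model import IBMModel

    class ToyModel(IBMModel):
        def prob_t_a_given_s(self, alignment_info):
            # Multiply p(t | s) over all target positions, skipping the
            # dummy element at index 0 and flooring at MIN_PROB.
            probability = 1.0
            for j, t in enumerate(alignment_info.trg_sentence[1:], start=1):
                i = alignment_info.alignment[j]
                s = alignment_info.src_sentence[i]
                probability *= max(self.translation_table[t][s], IBMModel.MIN_PROB)
            return probability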
- sample(sentence_pair)
Sample the most probable alignments from the entire alignment space
First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to a higher IBM Model. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point.
Hill climbing may get stuck in a local maximum, hence the pegging and trying out of different initial alignments.
- Parameters
sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from
- Returns
A set of best alignments represented by their AlignmentInfo, and the best alignment of the set for convenience
- Return type
set(AlignmentInfo), AlignmentInfo
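A usage sketch with IBM Model 3, assuming a toy corpus; sample is normally called internally during EM training, but it can also be invoked on a trained model:

    from nltk.translate import AlignedSent, IBMModel3

    bitext = [
        AlignedSent(['das', 'haus', 'ist', 'klein'],
                    ['the', 'house', 'is', 'small']),
        AlignedSent(['das', 'buch', 'ist', 'klein'],
                    ['the', 'book', 'is', 'small']),
    ]
    ibm3 = IBMModel3(bitext, 5)
    sampled_alignments, best_alignment = ibm3.sample(bitext[0])
    print(best_alignment.alignment)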
- nltk.translate.ibm_model.longest_target_sentence_length(sentence_aligned_corpus)
- Parameters
sentence_aligned_corpus (list(AlignedSent)) – Parallel corpus under consideration
- Returns
Number of words in the longest target language sentence of sentence_aligned_corpus
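Usage sketch (again, the first argument of AlignedSent is the target side):

    from nltk.translate import AlignedSent
    from nltk.translate.ibm_model import longest_target_sentence_length

    corpus = [
        AlignedSent(['das', 'haus'], ['the', 'house']),
        AlignedSent(['das', 'haus', 'ist', 'klein'],
                    ['the', 'house', 'is', 'small']),
    ]
    print(longest_target_sentence_length(corpus))  # 4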