nltk.translate.nist_score module
NIST score implementation.
- nltk.translate.nist_score.corpus_nist(list_of_references, hypotheses, n=5)
Calculate a single corpus-level NIST score (a.k.a. system-level NIST) for all the hypotheses and their respective references.
- Parameters
list_of_references (list(list(list(str)))) – one list of reference sentences per hypothesis, in the same order as hypotheses
hypotheses (list(list(str))) – a list of hypothesis sentences
n (int) – highest n-gram order
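The nesting of the two arguments is easy to get wrong, so here is a minimal sketch of how the inputs line up. The sentences are made up for illustration, and the import is guarded on the assumption that NLTK may not be installed. No score value is shown because it depends on the data.

```python
# Toy data (made up for illustration): two hypotheses, each paired with
# its own list of tokenized reference translations.
hyp_a = ['the', 'cat', 'sat', 'on', 'the', 'mat']
refs_a = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['a', 'cat', 'was', 'sitting', 'on', 'the', 'mat'],
]

hyp_b = ['he', 'reads', 'a', 'book', 'every', 'day']
refs_b = [
    ['he', 'reads', 'a', 'book', 'each', 'day'],
]

# hypotheses is list(list(str)); list_of_references is list(list(list(str))).
# Entry i of list_of_references holds the references for hypotheses[i].
hypotheses = [hyp_a, hyp_b]
list_of_references = [refs_a, refs_b]
assert len(list_of_references) == len(hypotheses)

score = None
try:
    from nltk.translate.nist_score import corpus_nist
except ImportError:
    corpus_nist = None  # NLTK not available; the data layout above still applies

if corpus_nist is not None:
    score = corpus_nist(list_of_references, hypotheses, n=5)
    print(score)
```

Each hypothesis here has at least n tokens, so every n-gram order up to 5 has something to count.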
- nltk.translate.nist_score.nist_length_penalty(ref_len, hyp_len)
Calculates the NIST length penalty, from Eq. 3 in Doddington (2002):
penalty = exp( beta * log( min( len(hyp)/len(ref), 1.0 ) ) ** 2 )
where
beta is chosen to make the brevity penalty factor = 0.5 when the number of words in the system output (hyp) is 2/3 of the average number of words in the reference translation (ref), i.e. beta = log(0.5) / log(2/3) ** 2, approximately -4.22.
The NIST penalty differs from BLEU's in that it minimizes the impact on the score of small variations in the length of a translation. See Fig. 4 in Doddington (2002).
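A minimal sketch of the penalty, assuming the squared-log form of Eq. 3 in Doddington (2002) and the calibration point described above (0.5 at a length ratio of 2/3). The function names are hypothetical and not part of NLTK. BLEU's brevity penalty is included only for comparison.

```python
import math


def length_penalty_sketch(ref_len, hyp_len):
    """Hypothetical helper: NIST-style length penalty as described above."""
    ratio = hyp_len / ref_len
    if ratio >= 1.0:
        return 1.0  # no penalty when the hypothesis is at least as long
    if ratio <= 0.0:
        return 0.0  # empty hypothesis
    # beta is fixed so that the penalty is 0.5 when hyp_len / ref_len == 2/3.
    beta = math.log(0.5) / math.log(2.0 / 3.0) ** 2
    return math.exp(beta * math.log(ratio) ** 2)


def bleu_brevity_penalty_sketch(ref_len, hyp_len):
    """Hypothetical helper: BLEU's brevity penalty, for comparison only."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Compare the two penalties for a slightly short and a much shorter hypothesis.
for ref_len, hyp_len in [(100, 95), (30, 20)]:
    print(ref_len, hyp_len,
          length_penalty_sketch(ref_len, hyp_len),
          bleu_brevity_penalty_sketch(ref_len, hyp_len))
```

For a hypothesis only slightly shorter than the reference, the NIST-style penalty stays closer to 1 than BLEU's, which is the behaviour the text attributes to Fig. 4.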
- nltk.translate.nist_score.sentence_nist(references, hypothesis, n=5)
Calculate the NIST score as described in George Doddington. 2002. “Automatic evaluation of machine translation quality using n-gram co-occurrence statistics.” Proceedings of HLT. Morgan Kaufmann Publishers Inc. https://dl.acm.org/citation.cfm?id=1289189.1289273
DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score. The official script used by NIST to compute both the BLEU and NIST scores is mteval-14.pl. The main differences between the two scores are:
BLEU uses the geometric mean of the n-gram overlaps; NIST uses the arithmetic mean.
NIST has a different brevity penalty.
The NIST score from mteval-14.pl has a self-contained tokenizer.
- Note: mteval-14.pl includes a smoothing function for the BLEU score that is NOT used in the NIST score computation.
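To make the first difference concrete, the sketch below compares the two kinds of mean on made-up per-order precision values. It illustrates only the averaging step. The actual scores involve further ingredients (such as n-gram clipping in BLEU and information weighting in NIST) that are left out here, and the helper names are hypothetical.

```python
import math


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean; zero as soon as any single value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up precisions for n-gram orders 1..4; the 4-gram precision is zero,
# which is common for short sentences.
precisions = [0.8, 0.5, 0.25, 0.0]

print(arithmetic_mean(precisions))
print(geometric_mean(precisions))
```

A single zero precision drives the geometric mean to zero, while the arithmetic mean still reflects the lower-order matches.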
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> sentence_nist([reference1, reference2, reference3], hypothesis1)
3.3709...
>>> sentence_nist([reference1, reference2, reference3], hypothesis2)
1.4619...
- Parameters
references (list(list(str))) – reference sentences
hypothesis (list(str)) – a hypothesis sentence
n (int) – highest n-gram order