nltk.tag.crf module
A module for POS tagging using CRFSuite
- class nltk.tag.crf.CRFTagger
A POS tagger backed by CRFSuite (through the python-crfsuite package).
>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()
>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]
>>> ct.train(train_data,'model.crf.tagger')
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]
>>> gold_sentences = [[('dog','Noun'),('is','Verb'),('good','Adj')] , [('Cat','Noun'),('eat','Verb'), ('meat','Noun')]]
>>> ct.accuracy(gold_sentences)
1.0

Setting learned model file

>>> ct = CRFTagger()  # doctest: +SKIP
>>> ct.set_model_file('model.crf.tagger')  # doctest: +SKIP
>>> ct.accuracy(gold_sentences)  # doctest: +SKIP
1.0
- __init__(feature_func=None, verbose=False, training_opt={})
Initialize the CRFSuite tagger
- Parameters
feature_func – The function that extracts features for each token of a sentence. It should take two parameters, tokens and index, and return the features of the token at position index in the tokens list. See the built-in _get_features function for more detail.
verbose (boolean) – whether to output debugging messages during training.
training_opt (dictionary) – python-crfsuite training options
- Set of possible training options (using LBFGS training algorithm).
- ‘feature.minfreq’
The minimum frequency of features.
- ‘feature.possible_states’
Force the generation of possible state features.
- ‘feature.possible_transitions’
Force the generation of possible transition features.
- ‘c1’
Coefficient for L1 regularization.
- ‘c2’
Coefficient for L2 regularization.
- ‘max_iterations’
The maximum number of iterations for L-BFGS optimization.
- ‘num_memories’
The number of limited memories for approximating the inverse Hessian matrix.
- ‘epsilon’
Epsilon for testing the convergence of the objective.
- ‘period’
The duration of iterations to test the stopping criterion.
- ‘delta’
The threshold for the stopping criterion; an L-BFGS iteration stops when the improvement of the log likelihood over the last ${period} iterations is no greater than this threshold.
- ‘linesearch’
The line search algorithm used in L-BFGS updates:
‘MoreThuente’: More and Thuente’s method,
‘Backtracking’: Backtracking method with regular Wolfe condition,
‘StrongBacktracking’: Backtracking method with strong Wolfe condition
- ‘max_linesearch’
The maximum number of trials for the line search algorithm.
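The feature_func and training_opt parameters described above can be sketched as follows. This is a minimal illustration, not the built-in feature set: suffix_features and its feature names are hypothetical, the option values are arbitrary, and the sketch assumes that a feature function returns a list of string features for the token at the given index, as the built-in _get_features function does.

```python
def suffix_features(tokens, idx):
    """Hypothetical feature function: string features for tokens[idx]."""
    token = tokens[idx]
    features = ["WORD_" + token.lower()]
    if token[:1].isupper():
        features.append("CAPITALIZED")
    if any(ch.isdigit() for ch in token):
        features.append("HAS_DIGIT")
    if len(token) > 2:
        # Two-character suffix, only for tokens longer than two characters.
        features.append("SUF_" + token[-2:].lower())
    # Context feature: the previous token, or a sentence-start marker.
    if idx > 0:
        features.append("PREV_" + tokens[idx - 1].lower())
    else:
        features.append("BOS")
    return features


# Keys are taken from the option list above; the values are arbitrary
# illustrations, not recommended settings.
training_opt = {
    "c1": 0.1,
    "c2": 0.01,
    "max_iterations": 50,
    "feature.minfreq": 1,
}


def build_tagger():
    """Create a CRFTagger with the custom feature function and options."""
    # Imported lazily so the helpers above work without NLTK installed.
    from nltk.tag import CRFTagger

    return CRFTagger(feature_func=suffix_features, training_opt=training_opt)
```

Calling build_tagger() requires NLTK and python-crfsuite to be installed; the returned tagger still has to be trained or given a model file before it can tag.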
- tag(tokens)
Tag a sentence using the Python CRFSuite tagger. NB: before using this function, the user should specify the model file by either:
- training a new model using the train function, or
- using a pre-trained model, which is set via the set_model_file function.
- Parameters
tokens – list of tokens to tag.
- Returns
list of tagged tokens.
- Return type
list(tuple(str, str))
- tag_sents(sents)
Tag a list of sentences. NB: before using this function, the user should specify the model file by either:
- training a new model using the train function, or
- using a pre-trained model, which is set via the set_model_file function.
- Parameters
sents – list of sentences to tag, each given as a list of tokens.
- Returns
list of tagged sentences.
- Return type
list(list(tuple(str, str)))
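To make the input and output shapes of tag_sents concrete, the sketch below strips the tags from gold data, passes the bare tokens to a tagger, and scores the result token by token. The helpers strip_tags, token_accuracy and evaluate are hypothetical names introduced for illustration. The scoring is a plain per-token comparison and is not claimed to match how the accuracy method is implemented.

```python
def strip_tags(tagged_sents):
    """Turn [[(token, tag), ...], ...] into [[token, ...], ...]."""
    return [[token for token, _tag in sent] for sent in tagged_sents]


def token_accuracy(predicted, gold):
    """Fraction of (token, tag) pairs that agree between two taggings."""
    pairs = [
        (p, g)
        for pred_sent, gold_sent in zip(predicted, gold)
        for p, g in zip(pred_sent, gold_sent)
    ]
    if not pairs:
        return 0.0
    return sum(1 for p, g in pairs if p == g) / len(pairs)


def evaluate(tagger, gold_sents):
    """Tag the untagged gold sentences and score them against the gold tags.

    `tagger` is any object with a tag_sents method, for example a CRFTagger
    that has already been trained or given a model file.
    """
    predicted = tagger.tag_sents(strip_tags(gold_sents))
    return token_accuracy(predicted, gold_sents)


gold_sentences = [
    [("dog", "Noun"), ("is", "Verb"), ("good", "Adj")],
    [("Cat", "Noun"), ("eat", "Verb"), ("meat", "Noun")],
]
untagged = strip_tags(gold_sentences)
```

Here untagged has the shape that tag_sents expects: a list of sentences, each a list of token strings.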