nltk.corpus.reader.tagged module
A reader for corpora whose documents contain part-of-speech-tagged words.
- class nltk.corpus.reader.tagged.CategorizedTaggedCorpusReader
Bases: CategorizedCorpusReader, TaggedCorpusReader
A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
- __init__(*args, **kwargs)
Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the TaggedCorpusReader constructor.
- tagged_paras(fileids=None, categories=None, tagset=None)
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word, tag) tuples.
- Return type
list(list(list(tuple(str,str))))
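As a minimal sketch of how categories derived from file identifiers might be used, the example below builds a throwaway corpus in a temporary directory. The file names, the `.pos` extension, the sample text, and the `cat_pattern` regular expression are all assumptions made for illustration; none of them ship with NLTK.

```python
import os
import tempfile

from nltk.corpus.reader.tagged import CategorizedTaggedCorpusReader

# Hypothetical files: the category is the part of the name before "_".
files = {
    "news_01.pos": "Stocks/nns rose/vbd\n",
    "fiction_01.pos": "She/prp smiled/vbd\n",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in files.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as f:
            f.write(text)

    reader = CategorizedTaggedCorpusReader(
        root,
        r".*\.pos",
        cat_pattern=r"([a-z]+)_.*",  # the first group becomes the category
    )
    categories = reader.categories()
    # Materialize the lazy result before the temporary directory is removed.
    news_paras = [
        [list(sent) for sent in para]
        for para in reader.tagged_paras(categories="news")
    ]

print(categories)
print(news_paras)
```

Passing `categories` restricts the result to the files assigned to that category, so only the `news_01.pos` sentence is returned here.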
- class nltk.corpus.reader.tagged.MacMorphoCorpusReader
Bases: TaggedCorpusReader
A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.
- __init__(root, fileids, encoding='utf8', tagset=None)
Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
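The one-word-per-line format described above can be illustrated with `nltk.tag.str2tuple`, the helper that `TaggedCorpusReader` uses to parse words. This is only a sketch of the line format using made-up sample lines; it does not reproduce how `MacMorphoCorpusReader` groups words into sentences.

```python
from nltk.tag import str2tuple

# Made-up sample lines in the word_TAG format; "._." carries the
# end-sentence tag.
lines = ["Jersei_N", "atinge_V", "média_N", "._."]

# Split each line at the last "_" into a (word, tag) pair.
tagged = [str2tuple(line, sep="_") for line in lines]
print(tagged)
```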
- class nltk.corpus.reader.tagged.TaggedCorpusReader
Bases: CorpusReader
Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator, i.e., words should have the form:
word1/tag1 word2/tag2 word3/tag3 ...
Custom separators may be specified as parameters to the constructor. Part-of-speech tags are case-normalized to upper case.
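The word-parsing behavior described above can be sketched by calling `nltk.tag.str2tuple` directly. The sample tokens are invented for illustration.

```python
from nltk.tag import str2tuple

# Default separator "/": the tag is upper-cased.
default_sep = str2tuple("fly/nn")

# The split happens at the last separator, so words may contain "/".
embedded_sep = str2tuple("and/or/cc")

# A custom separator can be supplied explicitly.
custom_sep = str2tuple("fly|nn", sep="|")

# A token without a separator gets None as its tag.
untagged = str2tuple("fly")

print(default_sep, embedded_sep, custom_sep, untagged)
```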
- __init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)
Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
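A minimal, self-contained sketch of constructing a reader with a custom separator follows. The temporary directory, the file name `sample.txt`, and its contents are assumptions for illustration only.

```python
import os
import tempfile

from nltk.corpus.reader.tagged import TaggedCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Hypothetical file that uses "|" instead of the default "/" separator.
    with open(os.path.join(root, "sample.txt"), "w", encoding="utf8") as f:
        f.write("The|dt cat|nn sat|vbd\n")

    # fileids is given as a regexp; sep is passed by keyword.
    reader = TaggedCorpusReader(root, r".*\.txt", sep="|")
    fileids = reader.fileids()
    # Read everything while the temporary directory still exists.
    tagged_sents = list(reader.tagged_sents())

print(fileids)
print(tagged_sents)
```

Passing `sep` by keyword avoids accidentally supplying it as the third positional argument.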
- paras(fileids=None)
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type
list(list(list(str)))
- sents(fileids=None)
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
- Return type
list(list(str))
- tagged_paras(fileids=None, tagset=None)
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word, tag) tuples.
- Return type
list(list(list(tuple(str,str))))
- tagged_sents(fileids=None, tagset=None)
- Returns
the given file(s) as a list of sentences, each encoded as a list of (word, tag) tuples.
- Return type
list(list(tuple(str,str)))
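To compare the nesting of the four return types above, the following sketch reads one small, made-up file with the default tokenizers (one sentence per line, paragraphs separated by a blank line). The file name and text are assumptions for illustration.

```python
import os
import tempfile

from nltk.corpus.reader.tagged import TaggedCorpusReader

# Two paragraphs: the first has two sentences, the second has one.
text = "The/dt cat/nn sat/vbd\nIt/prp purred/vbd\n\nDogs/nns bark/vbp\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf8") as f:
        f.write(text)

    reader = TaggedCorpusReader(root, r".*\.pos")
    # Convert the lazy results to plain lists while the files still exist.
    sents = list(reader.sents())
    paras = list(reader.paras())
    tagged_sents = list(reader.tagged_sents())
    tagged_paras = list(reader.tagged_paras())

print(sents)
print(paras)
print(tagged_sents[0])
print(tagged_paras[1])
```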
- class nltk.corpus.reader.tagged.TaggedCorpusView
Bases: StreamBackedCorpusView
A specialized corpus view for tagged documents. It can be customized via flags to divide the tagged corpus documents up by sentence or paragraph, and to include or omit part-of-speech tags. TaggedCorpusView objects are typically created by TaggedCorpusReader (not directly by nltk users).
- __init__(corpus_file, encoding, tagged, group_by_sent, group_by_para, sep, word_tokenizer, sent_tokenizer, para_block_reader, tag_mapping_function=None)
Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
- Parameters
fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
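Since views are normally obtained from a reader rather than constructed by hand, a hedged sketch of that route is shown below. It assumes a corpus with a single made-up file, in which case the reader is expected to hand back the view object itself; with several files the result would instead be a concatenation of views.

```python
import os
import tempfile

from nltk.corpus.reader.tagged import TaggedCorpusReader, TaggedCorpusView

with tempfile.TemporaryDirectory() as root:
    # Hypothetical single-file corpus with two one-line sentences.
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf8") as f:
        f.write("The/dt cat/nn sat/vbd\nIt/prp purred/vbd\n")

    reader = TaggedCorpusReader(root, r".*\.pos")
    view = reader.tagged_sents()

    is_view = isinstance(view, TaggedCorpusView)
    n_sents = len(view)    # reads through the file to count sentences
    first = list(view[0])  # views support indexing like a list
    view.close()           # release the underlying file stream

print(is_view, n_sents, first)
```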
- class nltk.corpus.reader.tagged.TimitTaggedCorpusReader
Bases: TaggedCorpusReader
A corpus reader for tagged sentences that are included in the TIMIT corpus.
- __init__(*args, **kwargs)
Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.