nltk.corpus.reader.conll module
Read CoNLL-style chunk files.
- class nltk.corpus.reader.conll.ConllChunkCorpusReader
Bases: ConllCorpusReader
A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
- __init__(root, fileids, chunk_types, encoding='utf8', tagset=None, separator=None)
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
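The lookup rules for the encoding parameter can be illustrated with a small standalone function. This is only a sketch of the rules as described above, not NLTK's implementation: the name resolve_encoding is hypothetical, and the use of re.match (anchored at the start of the identifier) to test each regexp is an assumption.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings.

    Hypothetical helper that mirrors the documented rules; it is not part
    of NLTK.
    """
    if encoding is None or isinstance(encoding, str):
        # A string applies to every file; None means no decoding at all.
        return encoding
    if isinstance(encoding, dict):
        # Identifiers missing from the dictionary fall back to None.
        return encoding.get(file_id)
    # Otherwise treat it as a list of (regexp, encoding) tuples.
    # The first tuple whose regexp matches wins (re.match is an assumption).
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None


print(resolve_encoding("utf8", "train.txt"))
print(resolve_encoding({"train.txt": "latin-1"}, "test.txt"))
print(resolve_encoding([(r".*\.txt", "utf8"), (r".*", "ascii")], "train.txt"))
```

Because the list form is checked in order, a catch-all pattern such as `.*` is best placed last so that it only applies when no more specific pattern matches.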
- class nltk.corpus.reader.conll.ConllCorpusReader
Bases: CorpusReader
A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default, columns are split on consecutive whitespace; use the separator argument to set a specific string to split by (e.g. ' ').
- @todo: Add support for reading from corpora where different parallel files contain different columns.
- @todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (e.g. words() and parsed_sents() could both share the same grid corpus view object).
- @todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (e.g. parsed_documents()).
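To make the grid format concrete, the following standalone sketch splits CoNLL-style text into one grid per sentence and pulls out a column by its type name. It is an illustration only and does not use NLTK: read_grids, column, SAMPLE, and COLUMNTYPES are hypothetical names, and the sample sentences are made up.

```python
def read_grids(text, separator=None):
    """Split CoNLL-style text into one grid (list of rows) per sentence.

    With separator=None, str.split() splits on runs of whitespace;
    otherwise each line is split on the given string.
    """
    grids, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if rows:
                grids.append(rows)
                rows = []
            continue
        rows.append(line.split(separator))
    if rows:
        grids.append(rows)
    return grids


def column(grid, columntypes, name):
    """Return the values of the column whose type is `name`."""
    index = columntypes.index(name)
    return [row[index] for row in grid]


# Made-up three-column sample: words, POS tags, chunk tags.
SAMPLE = "He PRP B-NP\nruns VBZ B-VP\n\nShe PRP B-NP\nsleeps VBZ B-VP\n"
COLUMNTYPES = ("words", "pos", "chunk")

grids = read_grids(SAMPLE)
print(column(grids[0], COLUMNTYPES, "words"))
print(column(grids[1], COLUMNTYPES, "chunk"))
```

An explicit separator matters when a column value may itself contain spaces: splitting on a tab keeps such a value in a single cell, whereas the whitespace default would break it apart.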
- CHUNK = 'chunk'
column type for chunk structures
- COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')
A tuple of all column types supported by the CoNLL corpus reader.
- IGNORE = 'ignore'
column type for a column that should be ignored
- NE = 'ne'
column type for named entities
- POS = 'pos'
column type for part-of-speech tags
- SRL = 'srl'
column type for semantic role labels
- TREE = 'tree'
column type for parse trees
- WORDS = 'words'
column type for words
- __init__(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.tree.Tree'>, tagset=None, separator=None)
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- iob_sents(fileids=None, tagset=None)
- Returns
a list of lists of word/tag/IOB tuples
- Return type
list(list)
- Parameters
fileids (None or str or list) – the list of fileids that make up this corpus
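As an illustration of working with this return shape, the sketch below groups one sentence of (word, tag, IOB) tuples into chunks. It runs on hand-written data rather than on a real corpus; chunks_from_iob is a hypothetical helper, and the handling of a stray I- tag (treated as the start of a new chunk) is an assumption.

```python
def chunks_from_iob(sent):
    """Group a sentence of (word, tag, iob) tuples into (type, words) chunks."""
    chunks = []
    current_type, current_words = None, []
    for word, _tag, iob in sent:
        # "B-NP" -> ("B", "NP"); "O" -> ("O", "").
        prefix, _, chunk_type = iob.partition("-")
        continues = prefix == "I" and chunk_type == current_type
        if not continues:
            # Close any open chunk before starting a new one (or none).
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
            if prefix in ("B", "I"):
                current_type = chunk_type
        if current_type is not None:
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


# One made-up sentence in the shape of a single iob_sents() element.
sent = [
    ("He", "PRP", "B-NP"),
    ("saw", "VBD", "B-VP"),
    ("the", "DT", "B-NP"),
    ("dog", "NN", "I-NP"),
    (".", ".", "O"),
]
print(chunks_from_iob(sent))
```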
- class nltk.corpus.reader.conll.ConllSRLInstance
Bases: object
An SRL instance from a CoNLL corpus, which identifies and provides labels for the arguments of a single verb.
- arguments
A list of (argspan, argid) tuples, specifying the location and type for each of the arguments identified by this instance. argspan is a tuple (start, end), indicating that the argument consists of words[start:end].
- tagged_spans
A list of (span, id) tuples, specifying the location and type for each of the arguments, as well as the verb pieces, that make up this instance.
- tree
The parse tree for the sentence containing this instance.
- verb
A list of the word indices of the words that compose the verb whose arguments are identified by this instance. This will contain multiple word indices when multi-word verbs are used (e.g. ‘turn on’).
- verb_head
The word index of the head word of the verb whose arguments are identified by this instance. E.g., for a sentence that uses the verb ‘turn on,’ verb_head will be the word index of the word ‘turn’.
- words
A list of the words in the sentence containing this instance.
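The relationship between words, verb, verb_head, and arguments can be sketched with a plain stand-in object. SRLInstanceSketch below is a hypothetical dataclass that only mirrors the attribute names documented above; it is not ConllSRLInstance, and the sentence and the argument labels "A0" and "A1" are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SRLInstanceSketch:
    """Hypothetical stand-in mirroring the documented attribute names."""
    words: list
    verb: list
    verb_head: int
    arguments: list


instance = SRLInstanceSketch(
    words=["She", "turned", "on", "the", "light"],
    verb=[1, 2],          # indices of 'turned' and 'on'
    verb_head=1,          # index of the head word 'turned'
    arguments=[((0, 1), "A0"), ((3, 5), "A1")],
)


def argument_phrases(inst):
    """Resolve each (argspan, argid) into (argid, words[start:end])."""
    return [(argid, inst.words[start:end]) for (start, end), argid in inst.arguments]


print([instance.words[i] for i in instance.verb])
print(instance.words[instance.verb_head])
print(argument_phrases(instance))
```

Note that each argspan is half-open, like a Python slice: the end index is one past the last word of the argument.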