nltk.corpus.reader.bracket_parse module
Corpus reader for corpora that consist of parenthesis-delineated parse trees.
- class nltk.corpus.reader.bracket_parse.AlpinoCorpusReader
Bases: BracketParseCorpusReader
Reader for the Alpino Dutch Treebank. This corpus embeds a lexical breakdown structure, which is read by _parse. Unfortunately, this places punctuation and some other words out of sentence order in the XML element tree, which is no good for tag_ and word_. To compensate, _tag and _word are overridden to pass a new, non-default parameter, ‘ordered’, to the overridden _normalize function, so the _parse function can remain untouched.
- __init__(root, encoding='ISO-8859-1', tagset=None)
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.
detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
Note: this parameter list appears to be inherited from BracketParseCorpusReader.__init__. The signature of this constructor accepts only root, encoding, and tagset, so fileids, comment_char, and detect_blocks cannot be passed to it.
- class nltk.corpus.reader.bracket_parse.BracketParseCorpusReader
Bases: SyntaxCorpusReader
Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.
- __init__(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.
detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
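The constructor parameters above can be illustrated with a small sketch. It assumes NLTK is installed; the temporary directory, the file name sample.mrg, and the two sample trees are hypothetical and exist only for this illustration. The second tree spans several lines with indented continuation lines, which the default ‘unindented_paren’ block detection should treat as part of the same parse.

```python
import os
import tempfile

from nltk.corpus.reader.bracket_parse import BracketParseCorpusReader

# Hypothetical sample data: two Penn-Treebank-style trees. The second one
# is spread over several lines; only its first line starts with an
# unindented parenthesis.
SAMPLE = """\
(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))
(S
  (NP (DT the) (NN cat))
  (VP (VBD slept)))
"""

# Write the sample into a throwaway corpus root (an assumption for this demo).
root = tempfile.mkdtemp()
with open(os.path.join(root, "sample.mrg"), "w", encoding="utf8") as f:
    f.write(SAMPLE)

# fileids is given as a regexp; detect_blocks is spelled out for clarity,
# although 'unindented_paren' is already the default.
reader = BracketParseCorpusReader(
    root, r".*\.mrg", detect_blocks="unindented_paren"
)

trees = list(reader.parsed_sents())    # one nltk.Tree per block
words = list(reader.words())           # leaf tokens in file order
tagged = list(reader.tagged_words())   # (word, tag) pairs

print(len(trees))
print(trees[0].label(), trees[0].leaves())
print(words)
print(tagged[:3])
```

Because each block is handed to the tree parser as a whole, the choice of detect_blocks mainly matters for how the file is laid out: ‘unindented_paren’ relies on indentation of continuation lines, whereas ‘sexpr’ relies on bracket matching.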
- class nltk.corpus.reader.bracket_parse.CategorizedBracketParseCorpusReader
Bases: CategorizedCorpusReader, BracketParseCorpusReader
A reader for parsed corpora whose documents are divided into categories based on their file identifiers. Author: Nathan Schneider <nschneid@cs.cmu.edu>
- __init__(*args, **kwargs)
Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor (CategorizedCorpusReader.__init__). The remaining arguments are passed to the BracketParseCorpusReader constructor (BracketParseCorpusReader.__init__).
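The argument split described above can be sketched as follows, assuming NLTK is installed. The directory layout (one subdirectory per category), the file names, and the sample trees are hypothetical. Here cat_pattern is consumed by the categorization side, while root and the fileids regexp go to the bracket-parse side.

```python
import os
import tempfile

from nltk.corpus.reader.bracket_parse import CategorizedBracketParseCorpusReader

# Hypothetical layout: the first path component names the category.
FILES = {
    "news/a.mrg": "(S (NP (NN rain)) (VP (VBD fell)))\n",
    "fiction/b.mrg": "(S (NP (NNS dragons)) (VP (VBD flew)))\n",
}

# Build a throwaway corpus root (an assumption for this demo).
root = tempfile.mkdtemp()
for relpath, text in FILES.items():
    path = os.path.join(root, *relpath.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

reader = CategorizedBracketParseCorpusReader(
    root,                      # positional args: BracketParseCorpusReader
    r".*\.mrg",
    cat_pattern=r"(\w+)/.*",   # categorization arg: group 1 is the category
)

categories = reader.categories()
news_files = reader.fileids(categories="news")
news_words = list(reader.words(categories="news"))

print(categories)
print(news_files)
print(news_words)
```

A cat_pattern is convenient when the category is already encoded in the file path; cat_map or cat_file would be the alternatives when it is not.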