nltk.corpus.reader package¶
Submodules¶
- nltk.corpus.reader.aligned module
- nltk.corpus.reader.api module
- nltk.corpus.reader.bcp47 module
- nltk.corpus.reader.bnc module
- nltk.corpus.reader.bracket_parse module
- nltk.corpus.reader.categorized_sents module
- nltk.corpus.reader.chasen module
- nltk.corpus.reader.childes module
- nltk.corpus.reader.chunked module
- nltk.corpus.reader.cmudict module
- nltk.corpus.reader.comparative_sents module
- nltk.corpus.reader.conll module
- nltk.corpus.reader.crubadan module
- nltk.corpus.reader.dependency module
- nltk.corpus.reader.framenet module
- nltk.corpus.reader.ieer module
- nltk.corpus.reader.indian module
- nltk.corpus.reader.ipipan module
- nltk.corpus.reader.knbc module
- nltk.corpus.reader.lin module
- nltk.corpus.reader.markdown module
- nltk.corpus.reader.mte module
- nltk.corpus.reader.nkjp module
- nltk.corpus.reader.nombank module
- nltk.corpus.reader.nps_chat module
- nltk.corpus.reader.opinion_lexicon module
- nltk.corpus.reader.panlex_lite module
- nltk.corpus.reader.panlex_swadesh module
- nltk.corpus.reader.pl196x module
- nltk.corpus.reader.plaintext module
- nltk.corpus.reader.ppattach module
- nltk.corpus.reader.propbank module
- nltk.corpus.reader.pros_cons module
- nltk.corpus.reader.reviews module
- nltk.corpus.reader.rte module
- nltk.corpus.reader.semcor module
- nltk.corpus.reader.senseval module
- nltk.corpus.reader.sentiwordnet module
- nltk.corpus.reader.sinica_treebank module
- nltk.corpus.reader.string_category module
- nltk.corpus.reader.switchboard module
- nltk.corpus.reader.tagged module
- nltk.corpus.reader.timit module
- nltk.corpus.reader.toolbox module
- nltk.corpus.reader.twitter module
- nltk.corpus.reader.udhr module
- nltk.corpus.reader.util module
- nltk.corpus.reader.verbnet module
- nltk.corpus.reader.wordlist module
- nltk.corpus.reader.wordnet module
- nltk.corpus.reader.xmldocs module
- nltk.corpus.reader.ycoe module
Module contents¶
NLTK corpus readers. The modules in this package provide functions that can be used to read corpus fileids in a variety of formats. These functions can be used to read both the corpus fileids that are distributed in the NLTK corpus package, and corpus fileids that are part of external corpora.
Corpus Reader Functions¶
Each corpus module defines one or more “corpus reader functions”,
which can be used to read documents from that corpus. These functions
take an argument, item, which is used to indicate which document should be read from the corpus:
- If item is one of the unique identifiers listed in the corpus module’s items variable, then the corresponding document will be loaded from the NLTK corpus package.
- If item is a fileid, then that file will be read.
Additionally, corpus reader functions can be given lists of item names; in which case, they will return a concatenation of the corresponding documents.
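For example, with the Brown Corpus a single fileid or a list of fileids can be passed; a list yields the concatenation of the corresponding documents (a minimal illustration using the standard Brown fileids 'ca01' and 'ca02'):
>>> from nltk.corpus import brown
>>> one = brown.words('ca01')              # a single document
>>> two = brown.words(['ca01', 'ca02'])    # concatenation of two documents
>>> len(two) == len(one) + len(brown.words('ca02'))
True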
Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:
words(): list of str
sents(): list of (list of str)
paras(): list of (list of (list of str))
tagged_words(): list of (str,str) tuple
tagged_sents(): list of (list of (str,str))
tagged_paras(): list of (list of (list of (str,str)))
chunked_sents(): list of (Tree w/ (str,str) leaves)
parsed_sents(): list of (Tree with str leaves)
parsed_paras(): list of (list of (Tree with str leaves))
xml(): A single xml ElementTree
raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use
nltk.corpus.brown.words()
:
>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()[:6])) # only first 6 words
The, Fulton, County, Grand, Jury, said
- class nltk.corpus.reader.AlignedCorpusReader[source]¶
Bases:
CorpusReader
Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
- __init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')[source]¶
Construct a new Aligned Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = AlignedCorpusReader(root, '.*', '.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
- aligned_sents(fileids=None)[source]¶
- Returns
the given file(s) as a list of AlignedSent objects.
- Return type
list(AlignedSent)
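As an illustrative sketch, assuming the comtrans corpus (distributed as word-aligned text) has been downloaded via nltk.download('comtrans'):
>>> from nltk.corpus import comtrans
>>> als = comtrans.aligned_sents()
>>> als[0].words[:5]     # source-side tokens of the first AlignedSent
>>> als[0].mots[:5]      # target-side tokens
>>> als[0].alignment     # the word alignment between the two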
- class nltk.corpus.reader.AlpinoCorpusReader[source]¶
Bases:
BracketParseCorpusReader
Reader for the Alpino Dutch Treebank. This corpus has an embedded lexical breakdown structure, as read by _parse. Unfortunately this puts punctuation and some other words out of sentence order in the XML element tree, which is no good for tag_ and word_. Therefore _tag and _word are overridden to pass a non-default parameter ‘ordered’ to the overridden _normalize function; the _parse function can then remain untouched.
- __init__(root, encoding='ISO-8859-1', tagset=None)[source]¶
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.
detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- class nltk.corpus.reader.BCP47CorpusReader[source]¶
Bases:
CorpusReader
Parse BCP-47 composite language tags
Supports all the main subtags, and the ‘u-sd’ extension:
>>> from nltk.corpus import bcp47
>>> bcp47.name('oc-gascon-u-sd-fr64')
'Occitan (post 1500): Gascon: Pyrénées-Atlantiques'
Can load a conversion table to Wikidata Q-codes:
>>> bcp47.load_wiki_q()
>>> bcp47.wiki_q['en-GI-spanglis']
'Q79388'
- class nltk.corpus.reader.BNCCorpusReader[source]¶
Bases:
XMLCorpusReader
Corpus reader for the XML version of the British National Corpus.
For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
You can obtain the full version of the BNC corpus at https://www.ota.ox.ac.uk/desc/2554
If you extracted the archive to a directory called BNC, then you can instantiate the reader as:
BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
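A minimal usage sketch, assuming the BNC XML files have been extracted under BNC/Texts/ as above (the BNC itself is not bundled with nltk_data, so the path is local to your own copy):
>>> from nltk.corpus.reader import BNCCorpusReader
>>> bnc = BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
>>> bnc.words()[:10]                  # plain word list
>>> bnc.tagged_words(c5=True)[:10]    # (word, C5 tag) tuples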
- __init__(root, fileids, lazy=True)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- sents(fileids=None, strip_space=True, stem=False)[source]¶
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
- Return type
list(list(str))
- Parameters
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
stem – If true, then use word stems instead of word strings.
- tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
- Return type
list(list(tuple(str,str)))
- Parameters
c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
stem – If true, then use word stems instead of word strings.
- tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]¶
- Returns
the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
- Return type
list(tuple(str,str))
- Parameters
c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
stem – If true, then use word stems instead of word strings.
- words(fileids=None, strip_space=True, stem=False)[source]¶
- Returns
the given file(s) as a list of words and punctuation symbols.
- Return type
list(str)
- Parameters
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
stem – If true, then use word stems instead of word strings.
- class nltk.corpus.reader.BracketParseCorpusReader[source]¶
Bases:
SyntaxCorpusReader
Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.
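For example, the Penn Treebank sample distributed with nltk_data is loaded with a reader of this type, so its parse trees can be inspected directly:
>>> from nltk.corpus import treebank
>>> t = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> t.label()
'S'
>>> t.leaves()[:6]
['Pierre', 'Vinken', ',', '61', 'years', 'old']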
- __init__(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]¶
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.
detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- class nltk.corpus.reader.CHILDESCorpusReader[source]¶
Bases:
XMLCorpusReader
Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at https://childes.talkbank.org/. The XML version of CHILDES is located at https://childes.talkbank.org/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).
For access to the file text use the usual nltk functions: words(), sents(), tagged_words() and tagged_sents().
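A minimal instantiation sketch, assuming a portion of the CHILDES XML data has been copied into nltk_data under the directory mentioned above (the exact layout depends on which parts of CHILDES you copied):
>>> import nltk
>>> from nltk.corpus.reader import CHILDESCorpusReader
>>> corpus_root = nltk.data.find('corpora/CHILDES/')
>>> childes = CHILDESCorpusReader(corpus_root, '.*.xml')
>>> childes.fileids()[:2]
>>> childes.words(childes.fileids()[0])[:10]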
- MLU(fileids=None, speaker='CHI')[source]¶
- Returns
the given file(s) as a floating number
- Return type
list(float)
- __init__(root, fileids, lazy=True)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- age(fileids=None, speaker='CHI', month=False)[source]¶
- Returns
the given file(s) as string or int
- Return type
list or int
- Parameters
month – If true, return months instead of year-month-date
- childes_url_base = 'https://childes.talkbank.org/browser/index.php?url='¶
- corpus(fileids=None)[source]¶
- Returns
the given file(s) as a dict of
(corpus_property_key, value)
- Return type
list(dict)
- participants(fileids=None)[source]¶
- Returns
the given file(s) as a dict of
(participant_property_key, value)
- Return type
list(dict)
- sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]¶
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
- Return type
list(list(str))
- Parameters
speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
stem – If true, then use word stems instead of word strings.
relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
- tagged_sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
- Return type
list(list(tuple(str,str)))
- Parameters
speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
stem – If true, then use word stems instead of word strings.
relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
- tagged_words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]¶
- Returns
the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
- Return type
list(tuple(str,str))
- Parameters
speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
stem – If true, then use word stems instead of word strings.
relation – If true, then return tuples of (stem, index, dependent_index)
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
- webview_file(fileid, urlbase=None)[source]¶
Map a corpus file to its web version on the CHILDES website, and open it in a web browser.
- The complete URL to be used is:
childes.childes_url_base + urlbase + fileid.replace('.xml', '.cha')
If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???
The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.
- words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]¶
- Returns
the given file(s) as a list of words
- Return type
list(str)
- Parameters
speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
stem – If true, then use word stems instead of word strings.
relation – If true, then return tuples of (stem, index, dependent_index)
strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
- class nltk.corpus.reader.CMUDictCorpusReader[source]¶
Bases:
CorpusReader
- dict()[source]¶
- Returns
the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.
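For example, assuming the cmudict corpus has been downloaded with nltk.download('cmudict'):
>>> from nltk.corpus import cmudict
>>> pron = cmudict.dict()
>>> pron['fire']          # two listed pronunciations
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]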
- class nltk.corpus.reader.CategorizedBracketParseCorpusReader[source]¶
Bases:
CategorizedCorpusReader, BracketParseCorpusReader
A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>
- __init__(*args, **kwargs)[source]¶
Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the BracketParseCorpusReader constructor.
- class nltk.corpus.reader.CategorizedCorpusReader[source]¶
Bases:
object
A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.
Subclasses are expected to:
- Call __init__() to set up the mapping.
- Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
- __init__(kwargs)[source]¶
Initialize this mapping based on keyword arguments, as follows:
cat_pattern: A regular expression pattern used to find the category for each file identifier. The pattern will be applied to each file identifier, and the first matching group will be used as the category label for that file.
cat_map: A dictionary, mapping from file identifiers to category labels.
cat_file: The name of a file that contains the mapping from file identifiers to categories. The argument cat_delimiter can be used to specify a delimiter.
The corresponding argument will be deleted from kwargs. If more than one argument is specified, an exception will be raised.
- categories(fileids=None)[source]¶
Return a list of the categories that are defined for this corpus, or for the given file(s) if specified.
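For example, the Brown Corpus is categorized, so categories() and the categories argument of the view methods can be combined (assuming the brown corpus is installed):
>>> from nltk.corpus import brown
>>> brown.categories()[:3]
['adventure', 'belles_lettres', 'editorial']
>>> brown.words(categories='news')[:4]
['The', 'Fulton', 'County', 'Grand']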
- class nltk.corpus.reader.CategorizedPlaintextCorpusReader[source]¶
Bases:
CategorizedCorpusReader, PlaintextCorpusReader
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
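For instance, the movie_reviews corpus in nltk_data is loaded with this reader; a short sketch (assuming the corpus has been downloaded):
>>> from nltk.corpus import movie_reviews
>>> movie_reviews.categories()
['neg', 'pos']
>>> len(movie_reviews.fileids('pos'))
1000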
- class nltk.corpus.reader.CategorizedSentencesCorpusReader[source]¶
Bases:
CategorizedCorpusReader, CorpusReader
A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.
Examples using the Subjectivity Dataset:
>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23]
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits', 'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]
Examples using the Sentence Polarity Dataset:
>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents()
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8', **kwargs)[source]¶
- Parameters
root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer
sent_tokenizer – a tokenizer for breaking paragraphs into sentences.
encoding – the encoding that should be used to read the corpus.
kwargs – additional parameters passed to CategorizedCorpusReader.
- sents(fileids=None, categories=None)[source]¶
Return all sentences in the corpus or in the specified file(s).
- Parameters
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
categories – a list specifying the categories whose sentences have to be returned.
- Returns
the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
- Return type
list(list(str))
- words(fileids=None, categories=None)[source]¶
Return all words and punctuation symbols in the corpus or in the specified file(s).
- Parameters
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
categories – a list specifying the categories whose words have to be returned.
- Returns
the given file(s) as a list of words and punctuation symbols.
- Return type
list(str)
- class nltk.corpus.reader.CategorizedTaggedCorpusReader[source]¶
Bases:
CategorizedCorpusReader, TaggedCorpusReader
A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
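The Brown Corpus, for example, is loaded with this reader in nltk_data; a short example (assuming it is installed):
>>> from nltk.corpus import brown
>>> brown.tagged_sents(categories='news')[0][:4]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL')]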
- __init__(*args, **kwargs)[source]¶
Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the TaggedCorpusReader.
- tagged_paras(fileids=None, categories=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
- Return type
list(list(list(tuple(str,str))))
- class nltk.corpus.reader.ChasenCorpusReader[source]¶
Bases:
CorpusReader
- __init__(root, fileids, encoding='utf8', sent_splitter=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- class nltk.corpus.reader.ChunkedCorpusReader[source]¶
Bases:
CorpusReader
Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using
nltk.chunk.tagstr2tree
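For example, the treebank_chunk corpus in nltk_data is loaded with this reader:
>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]
>>> tree.label()
'S'
>>> tree.leaves()[:3]
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]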
.- __init__(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]¶
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
- chunked_paras(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
- Return type
list(list(Tree))
- chunked_sents(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
- Return type
list(Tree)
- chunked_words(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.
- Return type
list(tuple(str,str) and Tree)
- paras(fileids=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type
list(list(list(str)))
- sents(fileids=None)[source]¶
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
- Return type
list(list(str))
- tagged_paras(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
- Return type
list(list(list(tuple(str,str))))
- tagged_sents(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
- Return type
list(list(tuple(str,str)))
- class nltk.corpus.reader.ComparativeSentencesCorpusReader[source]¶
Bases:
CorpusReader
Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).
>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly', 'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve", 'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8')[source]¶
- Parameters
root – The root directory for this corpus.
fileids – a list or regexp specifying the fileids in this corpus.
word_tokenizer – tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer
sent_tokenizer – tokenizer for breaking paragraphs into sentences.
encoding – the encoding that should be used to read the corpus.
- comparisons(fileids=None)[source]¶
Return all comparisons in the corpus.
- Parameters
fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned.
- Returns
the given file(s) as a list of Comparison objects.
- Return type
list(Comparison)
- keywords(fileids=None)[source]¶
Return a set of all keywords used in the corpus.
- Parameters
fileids – a list or regexp specifying the ids of the files whose keywords have to be returned.
- Returns
the set of keywords and comparative phrases used in the corpus.
- Return type
set(str)
- keywords_readme()[source]¶
Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).
- sents(fileids=None)[source]¶
Return all sentences in the corpus.
- Parameters
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- Returns
all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).
- Return type
list(list(str)) or list(str)
- class nltk.corpus.reader.ConllChunkCorpusReader[source]¶
Bases:
ConllCorpusReader
A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
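For example, the CoNLL-2000 chunking corpus in nltk_data is loaded with this reader (assuming it has been downloaded):
>>> from nltk.corpus import conll2000
>>> sent = conll2000.chunked_sents('train.txt')[0]
>>> sent.label()
'S'
>>> conll2000.tagged_words('train.txt')[:3]
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT')]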
- __init__(root, fileids, chunk_types, encoding='utf8', tagset=None, separator=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- class nltk.corpus.reader.ConllCorpusReader[source]¶
Bases:
CorpusReader
A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default columns are split by consecutive whitespace; with the separator argument you can set a string to split by instead (e.g. ' ').
).- @todo: Add support for reading from corpora where different
parallel files contain different columns.
- @todo: Possibly add caching of the grid corpus view? This would
allow the same grid view to be used by different data access methods (eg words() and parsed_sents() could both share the same grid corpus view object).
- @todo: Better support for -DOCSTART-. Currently, we just ignore
it, but it could be used to define methods that retrieve a document at a time (eg parsed_documents()).
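As a constructor sketch (the directory path and fileid pattern below are placeholders, not data shipped with NLTK):
>>> from nltk.corpus.reader import ConllCorpusReader
>>> reader = ConllCorpusReader('/path/to/corpus', r'.*\.conll',
...                            columntypes=('words', 'pos', 'chunk'),
...                            separator='\t')
>>> reader.tagged_sents()    # (word, pos) tuples built from the first two columns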
- CHUNK = 'chunk'¶
column type for chunk structures
- COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')¶
A list of all column types supported by the conll corpus reader.
- IGNORE = 'ignore'¶
column type for column that should be ignored
- NE = 'ne'¶
column type for named entities
- POS = 'pos'¶
column type for part-of-speech tags
- SRL = 'srl'¶
column type for semantic role labels
- TREE = 'tree'¶
column type for parse trees
- WORDS = 'words'¶
column type for words
- __init__(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.tree.Tree'>, tagset=None, separator=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- iob_sents(fileids=None, tagset=None)[source]¶
- Returns
a list of lists of word/tag/IOB tuples
- Return type
list(list)
- Parameters
fileids (None or str or list) – the list of fileids that make up this corpus
- class nltk.corpus.reader.CorpusReader[source]¶
Bases:
object
A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.
A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.
- __init__(root, fileids, encoding='utf8', tagset=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- abspath(fileid)[source]¶
Return the absolute path for the given file.
- Parameters
fileid (str) – The file identifier for the file whose path should be returned.
- Return type
PathPointer
- abspaths(fileids=None, include_encoding=False, include_fileid=False)[source]¶
Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.
- Parameters
fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
- Return type
list(PathPointer)
- encoding(file)[source]¶
Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.
- ensure_loaded()[source]¶
Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).
- open(file)[source]¶
Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.
- Parameters
file – The file identifier of the file to read.
- raw(fileids=None)[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
- Returns
the given file(s) as a single string.
- Return type
str
- property root¶
The directory where this corpus is stored.
- Type
PathPointer
- class nltk.corpus.reader.CrubadanCorpusReader[source]¶
Bases:
CorpusReader
A corpus reader used to access the An Crubadan language n-gram files.
- __init__(root, fileids, encoding='utf8', tagset=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- class nltk.corpus.reader.DependencyCorpusReader[source]¶
Bases:
SyntaxCorpusReader
- __init__(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- class nltk.corpus.reader.EuroparlCorpusReader[source]¶
Bases:
PlaintextCorpusReader
Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from
PlaintextCorpusReader
except that:Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
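A brief sketch using the europarl_raw sample corpora from nltk_data, which are loaded with this reader (assuming they have been downloaded):
>>> from nltk.corpus import europarl_raw
>>> chaps = europarl_raw.english.chapters()
>>> len(chaps[0])       # number of sentences in the first chapter
>>> chaps[0][0][:5]     # first five tokens of its first sentence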
- class nltk.corpus.reader.FramenetCorpusReader[source]¶
Bases:
XMLCorpusReader
A corpus reader for the Framenet Corpus.
>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
- __init__(root, fileids)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- annotations(luNamePattern=None, exemplars=True, full_text=True)[source]¶
Frame annotation sets matching the specified criteria.
- doc(fn_docid)[source]¶
Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the Documents() function.
The dict that is returned from this function will contain the following keys:
‘_type’ : ‘fulltextannotation’
- ‘sentence’ : a list of sentences in the document
- Each item in the list is a dict containing the following keys:
‘ID’ : the ID number of the sentence
‘_type’ : ‘sentence’
‘text’ : the text of the sentence
‘paragNo’ : the paragraph number
‘sentNo’ : the sentence number
‘docID’ : the document ID number
‘corpID’ : the corpus ID number
‘aPos’ : the annotation position
- ‘annotationSet’ : a list of annotation layers for the sentence
- Each item in the list is a dict containing the following keys:
‘ID’ : the ID number of the annotation set
‘_type’ : ‘annotationset’
‘status’ : either ‘MANUAL’ or ‘UNANN’
‘luName’ : (only if status is ‘MANUAL’)
‘luID’ : (only if status is ‘MANUAL’)
‘frameID’ : (only if status is ‘MANUAL’)
‘frameName’: (only if status is ‘MANUAL’)
- ‘layer’ : a list of labels for the layer
- Each item in the layer is a dict containing the following keys:
‘_type’: ‘layer’
‘rank’
‘name’
- ‘label’ : a list of labels in the layer
- Each item is a dict containing the following keys:
‘start’
‘end’
‘name’
‘feID’ (optional)
- Parameters
fn_docid (int) – The Framenet id number of the document
- Returns
Information about the annotated document
- Return type
dict
- docs(name=None)[source]¶
Return a list of the annotated full-text documents in FrameNet, optionally filtered by a regex to be matched against the document name.
- docs_metadata(name=None)[source]¶
Return an index of the annotated documents in Framenet.
Details for a specific annotated document can be obtained using this class’s doc() function and pass it the value of the ‘ID’ field.
>>> from nltk.corpus import framenet as fn
>>> len(fn.docs()) in (78, 107) # FN 1.5 and 1.7, resp.
True
>>> set([x.corpname for x in fn.docs_metadata()]) >= set(['ANC', 'KBEval', 'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank'])
True
- Parameters
name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.
- Returns
A list of selected (or all) annotated documents
- Return type
list of dicts, where each dict object contains the following keys:
‘name’
‘ID’
‘corpid’
‘corpname’
‘description’
‘filename’
- exemplars(luNamePattern=None, frame=None, fe=None, fe2=None)[source]¶
Lexicographic exemplar sentences, optionally filtered by LU name and/or 1-2 FEs that are realized overtly. ‘frame’ may be a name pattern, frame ID, or frame instance. ‘fe’ may be a name pattern or FE instance; if specified, ‘fe2’ may also be specified to retrieve sentences with both overt FEs (in either order).
- fe_relations()[source]¶
Obtain a list of frame element relations.
>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels) in (10020, 12393) # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642, '_type': 'ferelation', 'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>, 'subFE': <fe ID=11370 name=Degree>, 'subFEName': 'Degree', 'subFrame': <frame ID=1904 name=Lively_place>, 'subID': 11370, 'supID': 2271, 'superFE': <fe ID=2271 name=Degree>, 'superFEName': 'Degree', 'superFrame': <frame ID=262 name=Abounding_with>, 'type': <framerelationtype ID=1 name=Inheritance>}
- Returns
A list of all of the frame element relations in framenet
- Return type
list(dict)
- fes(name=None, frame=None)[source]¶
Lists frame element objects. If ‘name’ is provided, this is treated as a case-insensitive regular expression to filter by frame name. (Case-insensitivity is because casing of frame element names is not always consistent across frames.) Specify ‘frame’ to filter by a frame name pattern, ID, or object.
>>> from nltk.corpus import framenet as fn
>>> fn.fes('Noise_maker')
[<fe ID=6043 name=Noise_maker>]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound')])
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'), ('Make_noise', 'Sound_source'), ('Sound_movement', 'Location_of_sound_source'), ('Sound_movement', 'Sound'), ('Sound_movement', 'Sound_source'), ('Sounds', 'Component_sound'), ('Sounds', 'Location_of_sound_source'), ('Sounds', 'Sound_source'), ('Vocalizations', 'Location_of_sound_source'), ('Vocalizations', 'Sound_source')]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound',r'(?i)make_noise')])
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'), ('Make_noise', 'Sound_source')]
>>> sorted(set(fe.name for fe in fn.fes('^sound')))
['Sound', 'Sound_maker', 'Sound_source']
>>> len(fn.fes('^sound$'))
2
- Parameters
name (str) – A regular expression pattern used to match against frame element names. If ‘name’ is None, then a list of all frame elements will be returned.
- Returns
A list of matching frame elements
- Return type
list(AttrDict)
- frame(fn_fid_or_fname, ignorekeys=[])[source]¶
Get the details for the specified Frame using the frame’s name or id number.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...
The dict that is returned from this function will contain the following information about the Frame:
‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)
‘definition’ : textual definition of the Frame
‘ID’ : the internal ID number of the Frame
- ‘semTypes’ : a list of semantic types for this frame
- Each item in the list is a dict containing the following keys:
‘name’ : can be used with the semtype() function
‘ID’ : can be used with the semtype() function
- ‘lexUnit’ : a dict containing all of the LUs for this frame.
The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)
- ‘FE’ : a dict containing the Frame Elements that are part of this frame
The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys
‘definition’ : The definition of the FE
‘name’ : The name of the FE e.g. ‘Body_system’
‘ID’ : The id number
‘_type’ : ‘fe’
‘abbrev’ : Abbreviation e.g. ‘bod’
‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”
- ‘semType’ : if not None, a dict with the following two keys:
- ‘name’ : name of the semantic type. can be used with the semtype() function
- ‘ID’ : id number of the semantic type. can be used with the semtype() function
- ‘requiresFE’ : if not None, a dict with the following two keys:
‘name’ : the name of another FE in this frame
‘ID’ : the id of the other FE in this frame
- ‘excludesFE’ : if not None, a dict with the following two keys:
‘name’ : the name of another FE in this frame
‘ID’ : the id of the other FE in this frame
‘frameRelation’ : a list of objects describing frame relations
- ‘FEcoreSets’ : a list of Frame Element core sets for this frame
Each item in the list is a list of FE objects
- Parameters
fn_fid_or_fname (int or str) – The Framenet name or id number of the frame
ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
- Returns
Information about a frame
- Return type
dict
- frame_by_id(fn_fid, ignorekeys=[])[source]¶
Get the details for the specified Frame using the frame’s id number.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name medical specialties and is closely related to the Medical_professionals frame. The FE Type characterizing a sub-are in a Specialty may also be expressed. 'Ralph practices paediatric oncology.'"
- Parameters
fn_fid (int) – The Framenet id number of the frame
ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
- Returns
Information about a frame
- Return type
dict
Also see the
frame()
function for details about what is contained in the dict that is returned.
- frame_by_name(fn_fname, ignorekeys=[], check_cache=True)[source]¶
Get the details for the specified Frame using the frame’s name.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name medical specialties and is closely related to the Medical_professionals frame. The FE Type characterizing a sub-are in a Specialty may also be expressed. 'Ralph practices paediatric oncology.'"
- Parameters
fn_fname (str) – The name of the frame
ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
- Returns
Information about a frame
- Return type
dict
Also see the
frame()
function for details about what is contained in the dict that is returned.
- frame_ids_and_names(name=None)[source]¶
Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.
- frame_relation_types()[source]¶
Obtain a list of frame relation types.
>>> from nltk.corpus import framenet as fn
>>> frts = sorted(fn.frame_relation_types(), key=itemgetter('ID'))
>>> isinstance(frts, list)
True
>>> len(frts) in (9, 10) # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1, '_type': 'framerelationtype', 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...], 'name': 'Inheritance', 'subFrameName': 'Child', 'superFrameName': 'Parent'}
- Returns
A list of all of the frame relation types in framenet
- Return type
list(dict)
- frame_relations(frame=None, frame2=None, type=None)[source]¶
- Parameters
frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
type – (optional) frame relation type (name or object); show only relations of this type
- Returns
A list of all of the frame relations in framenet
- Return type
list(dict)
>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels) in (1676, 2070) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>, <Parent=Apply_heat -- Using -> Child=Cooking_creation>, <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(274), breakLines=True)
[<Parent=Avoiding -- Inheritance -> Child=Dodging>, <Parent=Avoiding -- Inheritance -> Child=Evading>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>, <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>, <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
- frames(name=None)[source]¶
Obtain details for a specific frame.
>>> from nltk.corpus import framenet as fn
>>> len(fn.frames()) in (1019, 1221) # FN 1.5 and 1.7, resp.
True
>>> x = PrettyList(fn.frames(r'(?i)crim'), maxReprSize=0, breakLines=True)
>>> x.sort(key=itemgetter('ID'))
>>> x
[<frame ID=200 name=Criminal_process>, <frame ID=500 name=Criminal_investigation>, <frame ID=692 name=Crime_scenario>, <frame ID=700 name=Committing_crime>]
A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):
A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).
FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:
Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
- Parameters
name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned.
- Returns
A list of matching Frames (or all Frames).
- Return type
list(AttrDict)
- frames_by_lemma(pat)[source]¶
Returns a list of all frames that contain LUs in which the
name
attribute of the LU matches the given regular expressionpat
. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).
Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.
>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> PrettyList(sorted(fn.frames_by_lemma(r'(?i)a little'), key=itemgetter('ID')))
[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]
- Returns
A list of frame objects.
- Return type
list(AttrDict)
- ft_sents(docNamePattern=None)[source]¶
Full-text annotation sentences, optionally filtered by document name.
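A brief illustrative sketch (the document-name pattern shown is an assumption, not a guaranteed document name):
>>> from nltk.corpus import framenet as fn
>>> all_sents = fn.ft_sents()        # every full-text annotation sentence
>>> anc_sents = fn.ft_sents(r'ANC')  # only sentences from documents whose name matches the pattern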
- lu(fn_luid, ignorekeys=[], luName=None, frameID=None, frameName=None)[source]¶
Access a lexical unit by its ID. luName, frameID, and frameName are used only in the event that the LU does not have a file in the database (which is the case for LUs with “Problem” status); in this case, a placeholder LU is created which just contains its name, ID, and frame.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> list(map(PrettyDict, fn.lu(256).lexemes))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
>>> fn.lu(227).exemplars[23]
exemplar sentence (352962):
[sentNo] 0
[aPos] 59699508
[LU] (227) guess.v in Coming_to_believe
[frame] (23) Coming_to_believe
[annotationSet] 2 annotation sets
[POS] 18 tags
[POS_tagset] BNC
[GF] 3 relations
[PT] 3 phrases
[Other] 1 entry
[text] + [Target] + [FE]
When he was inside the house , Culley noticed the characteristic
------------------ Content
he would n't have guessed at .
-- ******* --
Co C1 [Evidence:INI] (Co=Cognizer, C1=Content)
The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:
‘name’ : the name of the LU (e.g. ‘merger.n’)
‘definition’ : textual definition of the LU
‘ID’ : the internal ID number of the LU
‘_type’ : ‘lu’
‘status’ : e.g. ‘Created’
‘frame’ : Frame that this LU belongs to
‘POS’ : the part of speech of this LU (e.g. ‘N’)
‘totalAnnotated’ : total number of examples annotated with this LU
‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)
- ‘sentenceCount’ : a dict with the following two keys:
‘annotated’: number of sentences annotated with this LU
‘total’ : total number of sentences with this LU
- ‘lexemes’ : a list of dicts describing the lemma of this LU.
Each dict in the list contains these keys:
‘POS’ : part of speech e.g. ‘N’
- ‘name’ : either single-lexeme e.g. ‘merger’ or
multi-lexeme e.g. ‘a little’
‘order’: the order of the lexeme in the lemma (starting from 1)
‘headword’: a boolean (‘true’ or ‘false’)
- ‘breakBefore’: Can this lexeme be separated from the previous lexeme?
Consider: “take over.v” as in:
Germany took over the Netherlands in 2 days.
Germany took the Netherlands over in 2 days.
In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:
Mary takes after her grandmother.
*Mary takes her grandmother after.
In this case, ‘breakBefore’ would be “false” for the lexeme “after”
‘lemmaID’ : Can be used to connect lemmas in different LUs
‘semTypes’ : a list of semantic type objects for this LU
- ‘subCorpus’ : a list of subcorpora
- Each item in the list is a dict containing the following keys:
‘name’ :
- ‘sentence’ : a list of sentences in the subcorpus
- each item in the list is a dict with the following keys:
‘ID’:
‘sentNo’:
‘text’: the text of the sentence
‘aPos’:
- ‘annotationSet’: a list of annotation sets
- each item in the list is a dict with the following keys:
‘ID’:
‘status’:
- ‘layer’: a list of layers
- each layer is a dict containing the following keys:
‘name’: layer name (e.g. ‘BNC’)
‘rank’:
- ‘label’: a list of labels for the layer
- each label is a dict containing the following keys:
‘start’: start pos of label in sentence ‘text’ (0-based)
‘end’: end pos of label in sentence ‘text’ (0-based)
‘name’: name of label (e.g. ‘NN1’)
Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.
- Parameters
fn_luid (int) – The id number of the lexical unit
ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
- Returns
All information about the lexical unit
- Return type
dict
- lu_basic(fn_luid)[source]¶
Returns basic information about the LU whose id is
fn_luid
. This is basically just a wrapper around thelu()
function with “subCorpus” info excluded.
>>> from nltk.corpus import framenet as fn
>>> lu = PrettyDict(fn.lu_basic(256), breakLines=True)
>>> # ellipses account for differences between FN 1.5 and 1.7
>>> lu
{'ID': 256, 'POS': 'V', 'URL': 'https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu256.xml', '_type': 'lu', 'cBy': ..., 'cDate': '02/08/2001 01:27:50 PST Thu', 'definition': 'COD: be aware of beforehand; predict.', 'definitionMarkup': 'COD: be aware of beforehand; predict.', 'frame': <frame ID=26 name=Expectation>, 'lemmaID': 15082, 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}], 'name': 'foresee.v', 'semTypes': [], 'sentenceCount': {'annotated': ..., 'total': ...}, 'status': 'FN1_Sent'}
- Parameters
fn_luid (int) – The id number of the desired LU
- Returns
Basic information about the lexical unit
- Return type
dict
- lu_ids_and_names(name=None)[source]¶
Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.
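For example, a minimal sketch (assuming the index is returned as a dict mapping LU IDs to LU names; the ID/name pair reuses the lu() example above):
>>> from nltk.corpus import framenet as fn
>>> luid2name = fn.lu_ids_and_names()   # only the index is read, no LU files are loaded
>>> luid2name[256]
'foresee.v'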
- lus(name=None, frame=None)[source]¶
Obtain details for lexical units. Optionally restrict by lexical unit name pattern, and/or to a certain frame or frames whose name matches a pattern.
>>> from nltk.corpus import framenet as fn
>>> len(fn.lus()) in (11829, 13572) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')), maxReprSize=0, breakLines=True)
[<lu ID=14733 name=a little.n>, <lu ID=14743 name=a little.adv>, <lu ID=14744 name=a little bit.adv>]
>>> PrettyList(sorted(fn.lus(r'interest', r'(?i)stimulus'), key=itemgetter('ID')))
[<lu ID=14894 name=interested.a>, <lu ID=14920 name=interesting.a>]
A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):
A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.
We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:
Apply_heat: “Michelle baked the potatoes for 45 minutes.”
Cooking_creation: “Michelle baked her mother a cake for her birthday.”
Absorb_heat: “The potatoes have to bake for more than 30 minutes.”
These constitute three different LUs, with different definitions.
Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.
Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.
Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.
In the simplest case, frame-evoking words are verbs such as “fried” in:
“Matilde fried the catfish in a heavy iron skillet.”
Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:
“…the reduction of debt levels to $665 million from $2.6 billion.”
Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:
“They were asleep for hours.”
Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.
- Parameters
name (str) –
A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.
The valid POSes are:
v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - numbers
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction
- Returns
A list of selected (or all) lexical units
- Return type
list of LU objects (dicts). See the lu() function for info about the specifics of LU objects.
- propagate_semtypes()[source]¶
Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)
>>> from nltk.corpus import framenet as fn
>>> x = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> fn.propagate_semtypes()
>>> y = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> y-x > 1000
True
- semtype(key)[source]¶
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
- Parameters
key (string or int) – The name, abbreviation, or id number of the semantic type
- Returns
Information about a semantic type
- Return type
dict
- semtypes()[source]¶
Obtain a list of semantic types.
>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes) in (73, 109) # FN 1.5 and 1.7, resp.
True
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'definitionMarkup', 'name', 'rootType', 'subTypes', 'superType']
- Returns
A list of all of the semantic types in framenet
- Return type
list(dict)
- warnings(v)[source]¶
Enable or disable warnings of data integrity issues as they are encountered. If v is truthy, warnings will be enabled.
(This is a function rather than just an attribute/property to ensure that if enabling warnings is the first action taken, the corpus reader is instantiated first.)
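For example:
>>> from nltk.corpus import framenet as fn
>>> fn.warnings(True)    # report data integrity issues as they are encountered
>>> fn.warnings(False)   # silence them again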
- class nltk.corpus.reader.IEERCorpusReader[source]¶
Bases:
CorpusReader
- class nltk.corpus.reader.IPIPANCorpusReader[source]¶
Bases:
CorpusReader
Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.
The corpus includes information about text domain, channel and categories. You can access possible values using
domains()
,channels()
andcategories()
. You can also use this metadata to filter files, e.g.:fileids(channel='prasa')
,fileids(categories='publicystyczny')
.The reader supports the methods words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by passing the parameter “simplify_tags=True”, e.g.:
tagged_sents(simplify_tags=True)
.You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g.:
tagged_paras(one_tag=False)
.You can get all tags that were assigned by the morphological analyzer by specifying the parameter “disamb_only=False”, e.g.
tagged_words(disamb_only=False)
.The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, specify the parameter “append_no_space=True”, e.g.
tagged_words(append_no_space=True)
. As a result, wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) will be inserted (for tagged data), and just ‘’ for methods without tags.
The corpus reader can also try to append spaces between words. To enable this option, specify the parameter “append_space=True”, e.g.
words(append_space=True)
. As a result, either ‘ ’ or (‘ ’, ‘space’) will be inserted between tokens.
By default, XML entities such as &quot; and &amp; are replaced by the corresponding characters. You can turn this feature off by specifying the parameter “replace_xmlentities=False”, e.g.
words(replace_xmlentities=False)
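A short sketch tying the options above together, assuming the corpus data is installed and exposed as nltk.corpus.ipipan:
>>> from nltk.corpus import ipipan
>>> chans = ipipan.channels()                        # possible channel values
>>> fids = ipipan.fileids(channel='prasa')           # filter files by metadata
>>> sents = ipipan.tagged_sents(simplify_tags=True)  # part of speech only, instead of the full tag
>>> words = ipipan.tagged_words(disamb_only=False)   # keep every tag from the morphological analyzer
>>> spaced = ipipan.words(append_space=True)         # insert ' ' between tokens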
.- __init__(root, fileids)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a
PathPointer
automatically.fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of
encoding
can be any of the following:A string:
encoding
is the encoding name for all files.A dictionary:
encoding[file_id]
is the encoding name for the file whose identifier isfile_id
. Iffile_id
is not inencoding
, then the file contents will be processed using non-unicode byte strings.A list:
encoding
should be a list of(regexp, encoding)
tuples. The encoding for a file whose identifier isfile_id
will be theencoding
value for the first tuple whoseregexp
matches thefile_id
. If no tuple’sregexp
matches thefile_id
, the file contents will be processed using non-unicode byte strings.None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- class nltk.corpus.reader.IndianCorpusReader[source]¶
Bases:
CorpusReader
List of words, one per line. Blank lines are ignored.
- class nltk.corpus.reader.KNBCorpusReader[source]¶
Bases:
SyntaxCorpusReader
- This class implements:
__init__
, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files._read_block
, which reads a block from the input stream._word
, which takes a block and returns a list of list of words._tag
, which takes a block and returns a list of list of tagged words._parse
, which takes a block and returns a list of parsed sentences.
- The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others …)
Usage example
>>> from nltk.corpus.util import LazyCorpusLoader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9
- class nltk.corpus.reader.LinThesaurusCorpusReader[source]¶
Bases:
CorpusReader
Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.
- __init__(root, badscore=0.0)[source]¶
Initialize the thesaurus.
- Parameters
root (C{string}) – root directory containing thesaurus LISP files
badscore (C{float}) – the score to give to words which do not appear in each other’s sets of synonyms
- scored_synonyms(ngram, fileid=None)[source]¶
Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram
- Parameters
ngram (C{string}) – ngram to lookup
fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
- Returns
If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.
- similarity(ngram1, ngram2, fileid=None)[source]¶
Returns the similarity score for two ngrams.
- Parameters
ngram1 (C{string}) – first ngram to compare
ngram2 (C{string}) – second ngram to compare
fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
- Returns
If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.
- synonyms(ngram, fileid=None)[source]¶
Returns a list of synonyms for the current ngram.
- Parameters
ngram (C{string}) – ngram to lookup
fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
- Returns
If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.
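A small sketch of the three lookups, assuming the thesaurus is installed and exposed as nltk.corpus.lin_thesaurus; the fileid shown is an assumption about how the LISP files are named:
>>> from nltk.corpus import lin_thesaurus as thes
>>> syns = thes.synonyms('car', fileid='simN.lsp')           # synonyms from a single thesaurus file
>>> scored = thes.scored_synonyms('car', fileid='simN.lsp')  # (synonym, score) pairs
>>> sim = thes.similarity('car', 'automobile')               # (fileid, score) pairs when no fileid is given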
- class nltk.corpus.reader.MTECorpusReader[source]¶
Bases:
TaggedCorpusReader
Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset
- __init__(root=None, fileids=None, encoding='utf8')[source]¶
Construct a new MTECorpusreader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = MTECorpusReader(root, 'oana-*.xml', 'utf8')
- Parameters
root – The root directory for this corpus. (default points to location in multext config file)
fileids – A list or regexp specifying the fileids in this corpus. (default is oana-en.xml)
encoding – The encoding of the given files (default is utf8)
- lemma_paras(fileids=None)[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of tuples of the word and the corresponding lemma (word, lemma)
- Return type
list(list(list(tuple(str, str))))
- lemma_sents(fileids=None)[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of tuples of the word and the corresponding lemma (word, lemma)
- Return type
list(list(tuple(str, str)))
- lemma_words(fileids=None)[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
- Returns
the given file(s) as a list of words, the corresponding lemmas and punctuation symbols, encoded as tuples (word, lemma)
- Return type
list(tuple(str,str))
- paras(fileids=None)[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings
- Return type
list(list(list(str)))
- sents(fileids=None)[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings
- Return type
list(list(str))
- tagged_paras(fileids=None, tagset='msd', tags='')[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
tagset – The tagset that should be used in the returned object, either “universal” or “msd” (“msd” is the default)
tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word,tag) tuples
- Return type
list(list(list(tuple(str, str))))
- tagged_sents(fileids=None, tagset='msd', tags='')[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
tagset – The tagset that should be used in the returned object, either “universal” or “msd” (“msd” is the default)
tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of (word,tag) tuples
- Return type
list(list(tuple(str, str)))
- tagged_words(fileids=None, tagset='msd', tags='')[source]¶
- Parameters
fileids – A list specifying the fileids that should be used.
tagset – The tagset that should be used in the returned object, either “universal” or “msd” (“msd” is the default)
tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
- Returns
the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag)
- Return type
list(tuple(str, str))
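A short sketch, assuming the MULTEXT-East data is installed and exposed as nltk.corpus.multext_east:
>>> from nltk.corpus import multext_east
>>> tagged = multext_east.tagged_words(tagset='universal')  # MSD tags mapped to the Universal tagset
>>> lemmas = multext_east.lemma_sents()                     # (word, lemma) pairs, one list per sentence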
- class nltk.corpus.reader.MWAPPDBCorpusReader[source]¶
Bases:
WordListCorpusReader
This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015):
The original source of the full PPDB corpus can be found on https://www.cis.upenn.edu/~ccb/ppdb/
- Returns
a list of tuples of similar lexical terms.
- entries(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]¶
- Returns
a tuple of synonym word pairs.
- mwa_ppdb_xxxl_file = 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'¶
- class nltk.corpus.reader.MacMorphoCorpusReader[source]¶
Bases:
TaggedCorpusReader
A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by
self.paras()
andself.tagged_paras()
contains a single sentence.- __init__(root, fileids, encoding='utf8', tagset=None)[source]¶
Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, '.*', '.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
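A brief usage sketch, assuming the corpus data is installed and exposed as nltk.corpus.mac_morpho:
>>> from nltk.corpus import mac_morpho
>>> tagged = mac_morpho.tagged_words()   # (word, tag) pairs split on the '_' separator
>>> len(mac_morpho.tagged_paras()[0])    # each "paragraph" holds exactly one sentence
1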
- class nltk.corpus.reader.NKJPCorpusReader[source]¶
Bases:
XMLCorpusReader
- HEADER_MODE = 2¶
- RAW_MODE = 3¶
- SENTS_MODE = 1¶
- WORDS_MODE = 0¶
- __init__(root, fileids='.*')[source]¶
Corpus reader designed to work with the National Corpus of Polish. See http://nkjp.pl/ for more details about NKJP.
Usage example:
from nltk.corpus.reader import NKJPCorpusReader
x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='')  # obtain the whole corpus
x.header()
x.raw()
x.words()
x.tagged_words(tags=['subst', 'comp'])  # link to find more tags: nkjp.pl/poliqarp/help/ense2.html
x.sents()
x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='Wilk*')  # obtain particular file(s)
x.header(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'])
x.tagged_words(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'], tags=['subst', 'comp'])
- class nltk.corpus.reader.NPSChatCorpusReader[source]¶
Bases:
XMLCorpusReader
- __init__(root, fileids, wrap_etree=False, tagset=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a
PathPointer
automatically.fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of
encoding
can be any of the following:A string:
encoding
is the encoding name for all files.A dictionary:
encoding[file_id]
is the encoding name for the file whose identifier isfile_id
. Iffile_id
is not inencoding
, then the file contents will be processed using non-unicode byte strings.A list:
encoding
should be a list of(regexp, encoding)
tuples. The encoding for a file whose identifier isfile_id
will be theencoding
value for the first tuple whoseregexp
matches thefile_id
. If no tuple’sregexp
matches thefile_id
, the file contents will be processed using non-unicode byte strings.None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- words(fileids=None)[source]¶
Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.
- Returns
the given file’s text nodes as a list of words and punctuation symbols
- Return type
list(str)
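For example (a sketch; the fileid names one of the chat session files):
>>> from nltk.corpus import nps_chat
>>> tokens = nps_chat.words('10-19-20s_706posts.xml')   # tokens from a single chat session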
- class nltk.corpus.reader.NombankCorpusReader[source]¶
Bases:
CorpusReader
Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as
'turn'
or'turn_on'
, each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.- __init__(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]¶
- Parameters
root – The root directory for this corpus.
nomfile – The name of the file containing the predicate- argument annotations (relative to
root
).framefiles – A list or regexp specifying the frameset fileids for this corpus.
parse_fileid_xform – A transform that should be applied to the fileids in this corpus. This should be a function of one argument (a fileid) that returns a string (the new fileid).
parse_corpus – The corpus containing the parse trees corresponding to this corpus. These parse trees are necessary to resolve the tree pointers used by nombank.
- instances(baseform=None)[source]¶
- Returns
a corpus view that acts as a list of
NombankInstance
objects, one for each noun in the corpus.
- lines()[source]¶
- Returns
a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.
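A minimal sketch, assuming the corpus (and the parse trees it points into) is installed and exposed as nltk.corpus.nombank; the instance attribute names shown are assumptions:
>>> from nltk.corpus import nombank
>>> inst = nombank.instances()[0]          # first annotated noun instance
>>> info = (inst.roleset, inst.arguments)  # roleset id and argument annotations (attribute names assumed)
>>> line = nombank.lines()[0]              # the raw annotation line behind it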
- class nltk.corpus.reader.NonbreakingPrefixesCorpusReader[source]¶
Bases:
WordListCorpusReader
This is a class to read the nonbreaking prefixes text files from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer.
- available_langs = {'ca': 'ca', 'catalan': 'ca', 'cs': 'cs', 'czech': 'cs', 'de': 'de', 'dutch': 'nl', 'el': 'el', 'en': 'en', 'english': 'en', 'es': 'es', 'fi': 'fi', 'finnish': 'fi', 'fr': 'fr', 'french': 'fr', 'german': 'de', 'greek': 'el', 'hu': 'hu', 'hungarian': 'hu', 'icelandic': 'is', 'is': 'is', 'it': 'it', 'italian': 'it', 'latvian': 'lv', 'lv': 'lv', 'nl': 'nl', 'pl': 'pl', 'polish': 'pl', 'portuguese': 'pt', 'pt': 'pt', 'ro': 'ro', 'romanian': 'ro', 'ru': 'ru', 'russian': 'ru', 'sk': 'sk', 'sl': 'sl', 'slovak': 'sk', 'slovenian': 'sl', 'spanish': 'es', 'sv': 'sv', 'swedish': 'sv', 'ta': 'ta', 'tamil': 'ta'}¶
- words(lang=None, fileids=None, ignore_lines_startswith='#')[source]¶
This method returns a list of nonbreaking prefixes for the specified language(s).
>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
- Returns
a list of words for the specified language(s).
- class nltk.corpus.reader.OpinionLexiconCorpusReader[source]¶
Bases:
WordListCorpusReader
Reader for the Liu and Hu opinion lexicon. Blank lines and readme are ignored.
>>> from nltk.corpus import opinion_lexicon
>>> opinion_lexicon.words()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
The OpinionLexiconCorpusReader provides shortcuts to retrieve positive/negative words:
>>> opinion_lexicon.negative()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
Note that words from words() method are sorted by file id, not alphabetically:
>>> opinion_lexicon.words()[0:10]
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
>>> sorted(opinion_lexicon.words())[0:10]
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']
- CorpusView¶
alias of
IgnoreReadmeCorpusView
- negative()[source]¶
Return all negative words in alphabetical order.
- Returns
a list of negative words.
- Return type
list(str)
- positive()[source]¶
Return all positive words in alphabetical order.
- Returns
a list of positive words.
- Return type
list(str)
- words(fileids=None)[source]¶
Return all words in the opinion lexicon. Note that these words are not sorted in alphabetical order.
- Parameters
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- Returns
the given file(s) as a list of words and punctuation symbols.
- Return type
list(str)
- class nltk.corpus.reader.PPAttachmentCorpusReader[source]¶
Bases:
CorpusReader
sentence_id verb noun1 preposition noun2 attachment
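A brief sketch, assuming the corpus is exposed as nltk.corpus.ppattach and that its attachments() method returns objects carrying the six fields above (method and attribute names are assumptions):
>>> from nltk.corpus import ppattach
>>> inst = ppattach.attachments('training')[0]
>>> fields = (inst.verb, inst.noun1, inst.prep, inst.noun2, inst.attachment)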
- class nltk.corpus.reader.PanLexLiteCorpusReader[source]¶
Bases:
CorpusReader
- MEANING_Q = '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '¶
- TRANSLATION_Q = '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '¶
- __init__(root)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a
PathPointer
automatically.fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of
encoding
can be any of the following:A string:
encoding
is the encoding name for all files.A dictionary:
encoding[file_id]
is the encoding name for the file whose identifier isfile_id
. Iffile_id
is not inencoding
, then the file contents will be processed using non-unicode byte strings.A list:
encoding
should be a list of(regexp, encoding)
tuples. The encoding for a file whose identifier isfile_id
will be theencoding
value for the first tuple whoseregexp
matches thefile_id
. If no tuple’sregexp
matches thefile_id
, the file contents will be processed using non-unicode byte strings.None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- language_varieties(lc=None)[source]¶
Return a list of PanLex language varieties.
- Parameters
lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.
- Returns
the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name.
- Return type
list(tuple)
- meanings(expr_uid, expr_tt)[source]¶
Return a list of meanings for an expression.
- Parameters
expr_uid – the expression’s language variety, as a seven-character uniform identifier.
expr_tt – the expression’s text.
- Returns
a list of Meaning objects.
- Return type
list(Meaning)
- translations(from_uid, from_tt, to_uid)[source]¶
Return a list of translations for an expression into a single language variety.
- Parameters
from_uid – the source expression’s language variety, as a seven-character uniform identifier.
from_tt – the source expression’s text.
to_uid – the target language variety, as a seven-character uniform identifier.
- Returns
a list of translation tuples. The first element is the expression text and the second element is the translation quality.
- Return type
list(tuple)
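A rough sketch of the three methods, assuming the database is installed and exposed as nltk.corpus.panlex_lite; the uniform identifiers shown ('eng-000', 'spa-000') are assumptions:
>>> from nltk.corpus import panlex_lite as plx
>>> varieties = plx.language_varieties('eng')                 # (uid, default name) tuples
>>> meanings = plx.meanings('eng-000', 'book')                # Meaning objects for the expression
>>> spanish = plx.translations('eng-000', 'book', 'spa-000')  # (translation text, quality) tuples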
- class nltk.corpus.reader.PanlexSwadeshCorpusReader[source]¶
Bases:
WordListCorpusReader
This is a class to read the PanLex Swadesh list from
David Kamholz, Jonathan Pool, and Susan M. Colowick (2014). PanLex: Building a Resource for Panlingual Lexical Translation. In LREC. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1029_Paper.pdf
License: CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/legalcode
- __init__(*args, **kwargs)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a
PathPointer
automatically.fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of
encoding
can be any of the following:A string:
encoding
is the encoding name for all files.A dictionary:
encoding[file_id]
is the encoding name for the file whose identifier isfile_id
. Iffile_id
is not inencoding
, then the file contents will be processed using non-unicode byte strings.A list:
encoding
should be a list of(regexp, encoding)
tuples. The encoding for a file whose identifier isfile_id
will be theencoding
value for the first tuple whoseregexp
matches thefile_id
. If no tuple’sregexp
matches thefile_id
, the file contents will be processed using non-unicode byte strings.None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- class nltk.corpus.reader.Pl196xCorpusReader[source]¶
Bases:
CategorizedCorpusReader
,XMLCorpusReader
- __init__(*args, **kwargs)[source]¶
Initialize this mapping based on keyword arguments, as follows:
cat_pattern: A regular expression pattern used to find the category for each file identifier. The pattern will be applied to each file identifier, and the first matching group will be used as the category label for that file.
cat_map: A dictionary, mapping from file identifiers to category labels.
cat_file: The name of a file that contains the mapping from file identifiers to categories. The argument
cat_delimiter
can be used to specify a delimiter.
The corresponding argument will be deleted from
kwargs
. If more than one argument is specified, an exception will be raised.
- head_len = 2770¶
- textids(fileids=None, categories=None)[source]¶
In the pl196x corpus each category is stored in a single file, and thus both methods provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.
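A sketch of restricting the standard methods to particular chunks, assuming the corpus is exposed as nltk.corpus.pl196x:
>>> from nltk.corpus import pl196x
>>> tids = pl196x.textids()[:2]        # pick a couple of text chunks
>>> ws = pl196x.words(textids=tids)    # restrict words() to just those chunks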
- words(fileids=None, categories=None, textids=None)[source]¶
Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.
- Returns
the given file’s text nodes as a list of words and punctuation symbols
- Return type
list(str)
- class nltk.corpus.reader.PlaintextCorpusReader[source]¶
Bases:
CorpusReader
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
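For instance, a sketch of passing a custom word tokenizer (the directory path is hypothetical):
>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> from nltk.tokenize import RegexpTokenizer
>>> reader = PlaintextCorpusReader('/path/to/texts', r'.*\.txt',
...                                word_tokenizer=RegexpTokenizer(r'\w+'))
>>> ws = reader.words()    # tokenized with the custom tokenizer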
This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the
CorpusView
class variable.- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]¶
Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, '.*\.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.
sent_tokenizer – Tokenizer for breaking paragraphs into sentences.
para_block_reader – The block reader used to divide the corpus into paragraph blocks.
- paras(fileids=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type
list(list(list(str)))
- class nltk.corpus.reader.PropbankCorpusReader[source]¶
Bases:
CorpusReader
Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as
'turn'
or'turn_on'
, each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.- __init__(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]¶
- Parameters
root – The root directory for this corpus.
propfile – The name of the file containing the predicate- argument annotations (relative to
root
).framefiles – A list or regexp specifying the frameset fileids for this corpus.
parse_fileid_xform – A transform that should be applied to the fileids in this corpus. This should be a function of one argument (a fileid) that returns a string (the new fileid).
parse_corpus – The corpus containing the parse trees corresponding to this corpus. These parse trees are necessary to resolve the tree pointers used by propbank.
- instances(baseform=None)[source]¶
- Returns
a corpus view that acts as a list of
PropBankInstance
objects, one for each verb in the corpus.
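A minimal sketch, assuming the corpus (and its parse trees) is installed and exposed as nltk.corpus.propbank; the instance attribute names shown are assumptions:
>>> from nltk.corpus import propbank
>>> inst = propbank.instances()[0]         # first annotated verb instance
>>> info = (inst.roleset, inst.arguments)  # roleset id and argument annotations (attribute names assumed)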
- class nltk.corpus.reader.ProsConsCorpusReader[source]¶
Bases:
CategorizedCorpusReader
,CorpusReader
Reader for the Pros and Cons sentence dataset.
>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy', 'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'], ...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8', **kwargs)[source]¶
- Parameters
root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer
encoding – the encoding that should be used to read the corpus.
kwargs – additional parameters passed to CategorizedCorpusReader.
- sents(fileids=None, categories=None)[source]¶
Return all sentences in the corpus or in the specified files/categories.
- Parameters
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
categories – a list specifying the categories whose sentences have to be returned.
- Returns
the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
- Return type
list(list(str))
- words(fileids=None, categories=None)[source]¶
Return all words and punctuation symbols in the corpus or in the specified files/categories.
- Parameters
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
categories – a list specifying the categories whose words have to be returned.
- Returns
the given file(s) as a list of words and punctuation symbols.
- Return type
list(str)
- class nltk.corpus.reader.RTECorpusReader[source]¶
Bases:
XMLCorpusReader
Corpus reader for corpora in RTE challenges.
This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.
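A small sketch, assuming the challenge data is installed and exposed as nltk.corpus.rte and that the reader provides a pairs() method yielding text/hypothesis pairs (the fileid and attribute names are assumptions):
>>> from nltk.corpus import rte
>>> pair = rte.pairs('rte1_dev.xml')[0]
>>> example = (pair.text, pair.hyp, pair.value)   # premise, hypothesis, gold entailment label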
- class nltk.corpus.reader.ReviewsCorpusReader[source]¶
Bases:
CorpusReader
Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.
>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am', 'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'), ('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'), ('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'), ('option', '+1')]
We can also reach the same information directly from the stream:
>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]
We can compute stats for specific product features:
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8')[source]¶
- Parameters
root – The root directory for the corpus.
fileids – a list or regexp specifying the fileids in the corpus.
word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer
encoding – the encoding that should be used to read the corpus.
- features(fileids=None)[source]¶
Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.
- Parameters
fileids – a list or regexp specifying the ids of the files whose features have to be returned.
- Returns
all features for the item(s) in the given file(s).
- Return type
list(tuple)
- reviews(fileids=None)[source]¶
Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.
- Parameters
fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.
- Returns
the given file(s) as a list of reviews.
- sents(fileids=None)[source]¶
Return all sentences in the corpus or in the specified files.
- Parameters
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- Returns
the given file(s) as a list of sentences, each encoded as a list of word strings.
- Return type
list(list(str))
- words(fileids=None)[source]¶
Return all words and punctuation symbols in the corpus or in the specified files.
- Parameters
fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- Returns
the given file(s) as a list of words and punctuation symbols.
- Return type
list(str)
- class nltk.corpus.reader.SemcorCorpusReader[source]¶
Bases:
XMLCorpusReader
Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the
xml()
method. For access to simple word lists and tagged word lists, usewords()
,sents()
,tagged_words()
, andtagged_sents()
.- __init__(root, fileids, wordnet, lazy=True)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a
PathPointer
automatically.fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of
encoding
can be any of the following:A string:
encoding
is the encoding name for all files.A dictionary:
encoding[file_id]
is the encoding name for the file whose identifier isfile_id
. Iffile_id
is not inencoding
, then the file contents will be processed using non-unicode byte strings.A list:
encoding
should be a list of(regexp, encoding)
tuples. The encoding for a file whose identifier isfile_id
will be theencoding
value for the first tuple whoseregexp
matches thefile_id
. If no tuple’sregexp
matches thefile_id
, the file contents will be processed using non-unicode byte strings.None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- chunk_sents(fileids=None)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a list of chunks.
- Return type
list(list(list(str)))
- chunks(fileids=None)[source]¶
- Returns
the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.
- Return type
list(list(str))
- sents(fileids=None)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a list of word strings.
- Return type
list(list(str))
- tagged_chunks(fileids=None, tag='pos')[source]¶
- Returns
the given file(s) as a list of tagged chunks, represented in tree form.
- Return type
list(Tree)
- Parameters
tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
- tagged_sents(fileids=None, tag='pos')[source]¶
- Returns
the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).
- Return type
list(list(Tree))
- Parameters
tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
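A short sketch, assuming the corpus data is installed and exposed as nltk.corpus.semcor:
>>> from nltk.corpus import semcor
>>> ws = semcor.words()                        # plain word list
>>> chunks = semcor.tagged_chunks(tag='both')  # Trees carrying both POS and WordNet sense tags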
- class nltk.corpus.reader.SensevalCorpusReader[source]¶
Bases:
CorpusReader
- class nltk.corpus.reader.SentiWordNetCorpusReader[source]¶
Bases:
CorpusReader
- class nltk.corpus.reader.SinicaTreebankCorpusReader[source]¶
Bases:
SyntaxCorpusReader
Reader for the Sinica Treebank.
- class nltk.corpus.reader.StringCategoryCorpusReader[source]¶
Bases:
CorpusReader
- class nltk.corpus.reader.SwadeshCorpusReader[source]¶
Bases:
WordListCorpusReader
- class nltk.corpus.reader.SwitchboardCorpusReader[source]¶
Bases:
CorpusReader
- __init__(root, tagset=None)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a
PathPointer
automatically.fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of
encoding
can be any of the following:A string:
encoding
is the encoding name for all files.A dictionary:
encoding[file_id]
is the encoding name for the file whose identifier isfile_id
. Iffile_id
is not inencoding
, then the file contents will be processed using non-unicode byte strings.A list:
encoding
should be a list of(regexp, encoding)
tuples. The encoding for a file whose identifier isfile_id
will be theencoding
value for the first tuple whoseregexp
matches thefile_id
. If no tuple’sregexp
matches thefile_id
, the file contents will be processed using non-unicode byte strings.None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the
tagged_...()
methods.
- class nltk.corpus.reader.SyntaxCorpusReader[source]¶
Bases:
CorpusReader
An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:
__init__
, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files._read_block
, which reads a block from the input stream._word
, which takes a block and returns a list of list of words._tag
, which takes a block and returns a list of list of tagged words._parse
, which takes a block and returns a list of parsed sentences.
- class nltk.corpus.reader.TEICorpusView[source]¶
Bases:
StreamBackedCorpusView
- __init__(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, head_len=0, textids=None)[source]¶
Create a new corpus view, based on the file
fileid
, and read withblock_reader
. See the class documentation for more information.- Parameters
fileid – The path to the file that is read by this corpus view.
fileid
can either be a string or aPathPointer
.startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
- class nltk.corpus.reader.TaggedCorpusReader[source]¶
Bases:
CorpusReader
Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using
nltk.tag.str2tuple
. By default,'/'
is used as the separator. I.e., words should have the form:word1/tag1 word2/tag2 word3/tag3 ...
But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
- __init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]¶
Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, '.*', '.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
- paras(fileids=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type
list(list(list(str)))
- sents(fileids=None)[source]¶
- Returns
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
- Return type
list(list(str))
- tagged_paras(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of
(word,tag)
tuples.- Return type
list(list(list(tuple(str,str))))
- tagged_sents(fileids=None, tagset=None)[source]¶
- Returns
the given file(s) as a list of sentences, each encoded as a list of
(word,tag)
tuples.- Return type
list(list(tuple(str,str)))
- class nltk.corpus.reader.TimitCorpusReader[source]¶
Bases:
CorpusReader
Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:
timitdic.txt: dictionary of standard transcriptions
spkrinfo.txt: table of speaker information
In addition, the root directory should contain one subdirectory for each speaker, containing three files for each utterance:
<utterance-id>.txt: text content of utterances
<utterance-id>.wrd: tokenized text content of utterances
<utterance-id>.phn: phonetic transcription of utterances
<utterance-id>.wav: utterance sound file
- __init__(root, encoding='utf8')[source]¶
Construct a new TIMIT corpus reader in the given directory.
- Parameters
root – The root directory for this corpus.
- fileids(filetype=None)[source]¶
Return a list of file identifiers for the files that make up this corpus.
- Parameters
filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
- play(utterance, start=0, end=None)[source]¶
Play the given audio sample.
- Parameters
utterance – The utterance id of the sample to play
- spkrutteranceids(speaker)[source]¶
- Returns
A list of all utterances associated with a given speaker.
- transcription_dict()[source]¶
- Returns
A dictionary giving the ‘standard’ transcription for each word.
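As a hedged illustration, the sketch below runs against the TIMIT sample distributed with NLTK (available via nltk.download('timit')); the speaker and utterance ids shown in the comments are illustrative and depend on the files present in your copy of the corpus.
from nltk.corpus import timit

print(timit.fileids(filetype='txt')[:3])   # e.g. ['dr1-fvmh0/sa1.txt', ...]

# Derive a speaker id from a file id, then list that speaker's utterances.
speaker = timit.fileids(filetype='wrd')[0].split('/')[0]
print(timit.spkrutteranceids(speaker)[:3])

# Map words to their 'standard' phonetic transcriptions.
transcriptions = timit.transcription_dict()
print(list(transcriptions.items())[:2])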
- class nltk.corpus.reader.TimitTaggedCorpusReader[source]¶
Bases:
TaggedCorpusReader
A corpus reader for tagged sentences that are included in the TIMIT corpus.
- __init__(*args, **kwargs)[source]¶
Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, '.*', '.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
- class nltk.corpus.reader.ToolboxCorpusReader[source]¶
Bases:
CorpusReader
- class nltk.corpus.reader.TwitterCorpusReader[source]¶
Bases:
CorpusReader
Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.
Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.
Construct a new Tweet corpus reader for a set of documents located at the given root directory.
If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:
from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids='.*\.json')
However, the recommended approach is to set the relevant directory as the value of the environment variable TWITTER, and then invoke the reader as follows:
import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, '.*\.json')
If you want to work directly with the raw Tweets, the json library can be used:
import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))
- CorpusView¶
alias of
StreamBackedCorpusView
- __init__(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')[source]¶
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking the text of Tweets into smaller units, including but not limited to words.
- docs(fileids=None)[source]¶
Returns the full Tweet objects, as specified by Twitter documentation on Tweets
- Returns
the given file(s) as a list of dictionaries deserialised from JSON.
- Return type
list(dict)
- class nltk.corpus.reader.UdhrCorpusReader[source]¶
Bases:
PlaintextCorpusReader
- ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('Polish-Latin2', 'cp1250'), ('Polish_Polski-Latin2', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]¶
- SKIP = {'Amharic-Afenegus6..60375', 'Armenian-DallakHelv', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Bhojpuri-Agra', 'Burmese_Myanmar-UTF8', 'Burmese_Myanmar-WinResearcher', 'Chinese_Mandarin-HZ', 'Chinese_Mandarin-UTF8', 'Czech-Latin2-err', 'Esperanto-T61', 'Gujarati-UTF8', 'Hungarian_Magyar-Unicode', 'Japanese_Nihongo-JIS', 'Lao-UTF8', 'Magahi-Agra', 'Magahi-UTF8', 'Marathi-UTF8', 'Navaho_Dine-Navajo-Navaho-font', 'Russian_Russky-UTF8~', 'Tamil-UTF8', 'Tigrinya_Tigrigna-VG2Main', 'Vietnamese-TCVN', 'Vietnamese-VIQR', 'Vietnamese-VPS'}¶
- __init__(root='udhr')[source]¶
Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, '.*\.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.
sent_tokenizer – Tokenizer for breaking paragraphs into sentences.
para_block_reader – The block reader used to divide the corpus into paragraph blocks.
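A short usage sketch, assuming the udhr corpus has been downloaded (nltk.download('udhr')). 'English-Latin1' is a typical fileid, but check udhr.fileids() for what your installation actually contains.
from nltk.corpus import udhr

print(udhr.fileids()[:5])
print(udhr.words('English-Latin1')[:8])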
- class nltk.corpus.reader.UnicharsCorpusReader[source]¶
Bases:
WordListCorpusReader
This class is used to read lists of characters from the Perl Unicode Properties (see https://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from https://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm
- available_categories = ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']¶
- chars(category=None, fileids=None)[source]¶
This method returns a list of characters from the Perl Unicode Properties. They are very useful when porting Perl tokenizers to Python.
>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
- Returns
a list of characters given the specific unicode character category
- class nltk.corpus.reader.VerbnetCorpusReader[source]¶
Bases:
XMLCorpusReader
An NLTK interface to the VerbNet verb lexicon.
From the VerbNet site: “VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), XTAG (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998).”
For details about VerbNet see: https://verbs.colorado.edu/~mpalmer/projects/verbnet.html
- __init__(root, fileids, wrap_etree=False)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
A string: encoding is the encoding name for all files.
A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- classids(lemma=None, wordnetid=None, fileid=None, classid=None)[source]¶
Return a list of the VerbNet class identifiers. If a file identifier is specified, then return only the VerbNet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only VerbNet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified VerbNet class. If nothing is specified, return all classids within VerbNet
- fileids(vnclass_ids=None)[source]¶
Return a list of fileids that make up this corpus. If
vnclass_ids
is specified, then return the fileids that make up the specified VerbNet class(es).
- frames(vnclass)[source]¶
Given a VerbNet class, this method returns VerbNet frames
The members returned are: 1) Example 2) Description 3) Syntax 4) Semantics
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- Returns
frames - a list of frame dictionaries
- lemmas(vnclass=None)[source]¶
Return a list of all verb lemmas that appear in any class, or in the
classid
if specified.
- longid(shortid)[source]¶
Returns longid of a VerbNet class
Given a short VerbNet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If
shortid
is already a long id, then return it as-is
- pprint(vnclass)[source]¶
Returns pretty printed version of a VerbNet class
Return a string containing a pretty-printed representation of the given VerbNet class.
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- pprint_frames(vnclass, indent='')[source]¶
Returns pretty version of all frames in a VerbNet class
Return a string containing a pretty-printed representation of the list of frames within the VerbNet class.
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- pprint_members(vnclass, indent='')[source]¶
Returns pretty printed version of members in a VerbNet class
Return a string containing a pretty-printed representation of the given VerbNet class’s member verbs.
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- pprint_subclasses(vnclass, indent='')[source]¶
Returns pretty printed version of subclasses of VerbNet class
Return a string containing a pretty-printed representation of the given VerbNet class’s subclasses.
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- pprint_themroles(vnclass, indent='')[source]¶
Returns pretty printed version of thematic roles in a VerbNet class
Return a string containing a pretty-printed representation of the given VerbNet class’s thematic roles.
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- shortid(longid)[source]¶
Returns shortid of a VerbNet class
Given a long VerbNet class identifier (eg ‘confess-37.10’), map it to a short id (eg ‘37.10’). If
longid
is already a short id, then return it as-is.
- subclasses(vnclass)[source]¶
Returns subclass ids, if any exist
Given a VerbNet class, this method returns subclass ids (if they exist) in a list of strings.
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- Returns
list of subclasses
- themroles(vnclass)[source]¶
Returns thematic roles participating in a VerbNet class
The members returned as part of each role are: 1) Type 2) Modifiers
- Parameters
vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
- Returns
themroles: A list of thematic roles in the VerbNet class
- vnclass(fileid_or_classid)[source]¶
Returns VerbNet class ElementTree
Return an ElementTree containing the xml for the specified VerbNet class.
- Parameters
fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as
'put-9.1.xml'
), or a VerbNet class identifier (such as'put-9.1'
) or a short VerbNet class identifier (such as'9.1'
).
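The following sketch ties several of these methods together. It assumes the verbnet corpus is installed (nltk.download('verbnet')), and the class id 'give-13.1' is used only as an illustrative example; query classids() for the identifiers present in your copy of the lexicon.
from nltk.corpus import verbnet as vn

print(vn.classids(lemma='give'))   # class ids whose members include 'give', e.g. ['give-13.1', ...]
print(vn.shortid('give-13.1'))     # '13.1'
print(vn.longid('13.1'))           # 'give-13.1'

print(vn.lemmas('give-13.1')[:5])  # member verb lemmas of the class
print(vn.themroles('give-13.1'))   # thematic roles (Type and Modifiers)

frames = vn.frames('give-13.1')    # list of frame dictionaries
print(sorted(frames[0].keys()))    # keys covering Example, Description, Syntax, Semantics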
- class nltk.corpus.reader.WordListCorpusReader[source]¶
Bases:
CorpusReader
List of words, one per line. Blank lines are ignored.
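Several bundled word lists, for example the words and stopwords corpora, are exposed through this reader. A small sketch, assuming those packages have been downloaded:
from nltk.corpus import stopwords, words

print(words.words()[:5])               # entries from the plain word list
print(stopwords.words('english')[:5])  # one list per language fileid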
- class nltk.corpus.reader.WordNetCorpusReader[source]¶
Bases:
CorpusReader
A corpus reader used to access wordnet or its variants.
- ADJ = 'a'¶
- ADJ_SAT = 's'¶
- ADV = 'r'¶
- MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}¶
- NOUN = 'n'¶
- VERB = 'v'¶
- __init__(root, omw_reader)[source]¶
Construct a new wordnet corpus reader, with the given root directory.
- add_exomw()[source]¶
Add languages from Extended OMW
>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> wn.add_exomw()
>>> print(wn.synset('intrinsically.r.01').lemmas(lang="eng_wikt"))
[Lemma('intrinsically.r.01.per_se'), Lemma('intrinsically.r.01.as_such')]
- all_lemma_names(pos=None, lang='eng')[source]¶
Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.
- all_synsets(pos=None, lang='eng')[source]¶
Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
- citation(lang='eng')[source]¶
Return the contents of the citation.bib file (for OMW). Use lang=lang to get the citation for an individual language.
- custom_lemmas(tab_file, lang)[source]¶
Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.
See the “Tab files” section at https://omwn.org/omw1.html for documentation on the Multilingual WordNet tab file format.
- Parameters
tab_file – Tab file as a file or file-like object
lang (str) – ISO 639-3 code of the language of the tab file
- digraph(inputs, rel=<function WordNetCorpusReader.<lambda>>, pos=None, maxdepth=-1, shapes=None, attr=None, verbose=False)[source]¶
Produce a graphical representation from ‘inputs’ (a list of start nodes, which can be a mix of Synsets, Lemmas and/or words), and a synset relation, for drawing with the ‘dot’ graph visualisation program from the Graphviz package.
Return a string in the DOT graph file language, which can then be converted to an image by nltk.parse.dependencygraph.dot2img(dot_string).
Optional Parameters:
rel: Wordnet synset relation
pos: for words, restricts Part of Speech to ‘n’, ‘v’, ‘a’ or ‘r’
maxdepth: limit the longest path
shapes: dictionary of strings that trigger a specified shape
attr: dictionary with global graph attributes
verbose: warn about cycles
>>> from nltk.corpus import wordnet as wn
>>> print(wn.digraph([wn.synset('dog.n.01')]))
digraph G {
"Synset('animal.n.01')" -> "Synset('organism.n.01')";
"Synset('canine.n.02')" -> "Synset('carnivore.n.01')";
"Synset('carnivore.n.01')" -> "Synset('placental.n.01')";
"Synset('chordate.n.01')" -> "Synset('animal.n.01')";
"Synset('dog.n.01')" -> "Synset('canine.n.02')";
"Synset('dog.n.01')" -> "Synset('domestic_animal.n.01')";
"Synset('domestic_animal.n.01')" -> "Synset('animal.n.01')";
"Synset('living_thing.n.01')" -> "Synset('whole.n.02')";
"Synset('mammal.n.01')" -> "Synset('vertebrate.n.01')";
"Synset('object.n.01')" -> "Synset('physical_entity.n.01')";
"Synset('organism.n.01')" -> "Synset('living_thing.n.01')";
"Synset('physical_entity.n.01')" -> "Synset('entity.n.01')";
"Synset('placental.n.01')" -> "Synset('mammal.n.01')";
"Synset('vertebrate.n.01')" -> "Synset('chordate.n.01')";
"Synset('whole.n.02')" -> "Synset('object.n.01')";
}
- doc(file='README', lang='eng')[source]¶
Return the contents of the README, LICENSE, or citation file. Use lang=lang to get the file for an individual language.
- ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]¶
Creates an information content lookup dictionary from a corpus.
- Parameters
corpus (CorpusReader) – The corpus from which we create an information content dictionary.
weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
smoothing (float) – How much do we smooth synset counts (default is 1.0)
- Returns
An information content dictionary
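A sketch of building an information content dictionary from a corpus reader and feeding it to one of the IC-based similarity measures documented below; it assumes the genesis and wordnet corpora are installed, and the exact score depends on the data.
from nltk.corpus import genesis
from nltk.corpus import wordnet as wn

genesis_ic = wn.ic(genesis, False, 0.0)  # weight_senses_equally=False, smoothing=0.0

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(wn.res_similarity(dog, cat, genesis_ic))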
- index_sense(version=None)[source]¶
Read sense key to synset id mapping from index.sense file in corpus directory
- jcn_similarity(synset1, synset2, ic, verbose=False)[source]¶
Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
- Parameters
other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
- Returns
A float score denoting the similarity of the two
Synset
objects.
- lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶
Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.
- Parameters
other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
- Returns
A score denoting the similarity of the two
Synset
objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
- lemmas(lemma, pos=None, lang='eng')[source]¶
Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.
- license(lang='eng')[source]¶
Return the contents of LICENSE (for OMW). Use lang=lang to get the license for an individual language.
- lin_similarity(synset1, synset2, ic, verbose=False)[source]¶
Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
- Parameters
other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
- Returns
A float score denoting the similarity of the two
Synset
objects, in the range 0 to 1.
- morphy(form, pos=None, check_exceptions=True)[source]¶
Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
- path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶
Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.
- Parameters
other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
- Returns
A score denoting the similarity of the two
Synset
objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
- readme(lang='eng')[source]¶
Return the contents of README (for OMW). Use lang=lang to get the readme for an individual language.
- res_similarity(synset1, synset2, ic, verbose=False)[source]¶
Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
- Parameters
other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
- Returns
A float score denoting the similarity of the two
Synset
objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
- synonyms(word, lang='eng')[source]¶
Return a nested list with the synonyms of the different senses of word in the given language.
- synset_from_pos_and_offset(pos, offset)[source]¶
pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB (‘a’, ‘s’, ‘r’, ‘n’, or ‘v’).
offset: The byte offset of this synset in the WordNet dict file for this pos.
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_pos_and_offset('n', 1740))
Synset('entity.n.01')
- synset_from_sense_key(sense_key)[source]¶
Retrieves synset based on a given sense_key. Sense keys can be obtained from lemma.key()
From https://wordnet.princeton.edu/documentation/senseidx5wn: A sense_key is represented as:
lemma % lex_sense (e.g. 'dog%1:18:01::')
where lex_sense is encoded as:
ss_type:lex_filenum:lex_id:head_word:head_id
- Lemma
ASCII text of word/collocation, in lower case
- Ss_type
synset type for the sense (1 digit int) The synset type is encoded as follows:
1    NOUN
2    VERB
3    ADJECTIVE
4    ADVERB
5    ADJECTIVE SATELLITE
- Lex_filenum
name of lexicographer file containing the synset for the sense (2 digit int)
- Lex_id
when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)
- Head_word
lemma of the first word in satellite’s head synset Only used if sense is in an adjective satellite synset
- Head_id
uniquely identifies sense in a lexicographer file when paired with head_word Only used if head_word is present (2 digit int)
>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_sense_key("drive%1:04:03::"))
Synset('drive.n.06')
>>> print(wn.synset_from_sense_key("driving%1:04:03::"))
Synset('drive.n.06')
- synsets(lemma, pos=None, lang='eng', check_exceptions=True)[source]¶
Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
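A brief lookup sketch; the synsets actually returned depend on the installed WordNet data.
from nltk.corpus import wordnet as wn

print(wn.synsets('dog')[:3])              # e.g. [Synset('dog.n.01'), ...]
print(wn.synsets('dog', pos=wn.VERB))     # restrict the lookup to verb senses
print(wn.lemmas('dog', pos=wn.NOUN)[:2])  # Lemma objects whose name matches 'dog'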
- wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although not always for nouns.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
- Parameters
other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
- Returns
A float score denoting the similarity of the two
Synset
objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
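A small sketch comparing the path-based measures documented above on a pair of noun synsets; the exact scores vary with the WordNet version installed.
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print(wn.path_similarity(dog, cat))  # shortest-path measure, in the range 0..1
print(wn.lch_similarity(dog, cat))   # Leacock-Chodorow
print(wn.wup_similarity(dog, cat))   # Wu-Palmer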
- class nltk.corpus.reader.WordNetICCorpusReader[source]¶
Bases:
CorpusReader
A corpus reader for the WordNet information content corpus.
- __init__(root, fileids)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
A string: encoding is the encoding name for all files.
A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- ic(icfile)[source]¶
Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.
- Parameters
icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)
- Returns
An information content dictionary
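For example, the precomputed Brown-corpus IC file shipped in the wordnet_ic package can be loaded and passed to the IC-based WordNet similarity measures; a sketch assuming nltk.download('wordnet_ic') and nltk.download('wordnet') have been run.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(wn.jcn_similarity(dog, cat, brown_ic))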
- class nltk.corpus.reader.XMLCorpusReader[source]¶
Bases:
CorpusReader
Corpus reader for corpora whose documents are xml files.
Note that the XMLCorpusReader constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.
- __init__(root, fileids, wrap_etree=False)[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
A string: encoding is the encoding name for all files.
A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- words(fileid=None)[source]¶
Returns all of the words and punctuation symbols in the specified file that were in text nodes – ie, tags are ignored. Like the xml() method, fileid can only specify one file.
- Returns
the given file’s text nodes as a list of words and punctuation symbols
- Return type
list(str)
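As an illustration, the Shakespeare corpus bundled with NLTK is served by this reader; the fileid 'dream.xml' and the root tag shown in the comments are indicative of that package (nltk.download('shakespeare')).
from nltk.corpus import shakespeare

print(shakespeare.fileids())               # e.g. ['a_and_c.xml', 'dream.xml', ...]
print(shakespeare.words('dream.xml')[:8])  # tokens from text nodes only

play = shakespeare.xml('dream.xml')        # root element of the parsed document
print(play.tag)                            # e.g. 'PLAY'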
- class nltk.corpus.reader.YCOECorpusReader[source]¶
Bases:
CorpusReader
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
- __init__(root, encoding='utf8')[source]¶
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
A string: encoding is the encoding name for all files.
A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- documents(fileids=None)[source]¶
Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.