nltk.corpus.reader.pros_cons module
CorpusReader for the Pros and Cons dataset.
Pros and Cons dataset information:

- Contact: Bing Liu, liub@cs.uic.edu
  Distributed with permission.
Related papers:
- Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”.
Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.
- Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing
Opinions on the Web”. Proceedings of the 14th international World Wide Web conference (WWW-2005), May 10-14, 2005, in Chiba, Japan.
class nltk.corpus.reader.pros_cons.ProsConsCorpusReader

Bases: CategorizedCorpusReader, CorpusReader
Reader for the Pros and Cons sentence dataset.
>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy', 'to', 'maneuver', '.'],
['Eats', '...', 'no', ',', 'GULPS', 'batteries'], ...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
CorpusView
  alias of StreamBackedCorpusView
__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8', **kwargs)

Parameters
  root – the root directory for the corpus.
  fileids – a list or regexp specifying the fileids in the corpus.
  word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer
  encoding – the encoding that should be used to read the corpus.
  kwargs – additional parameters passed to CategorizedCorpusReader.
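The reader can also be constructed directly over a local directory. A minimal sketch, assuming nltk is installed; the two sample files and their contents are invented here purely to illustrate the dataset's `<Pros>`/`<Cons>` line format, and `cat_pattern` is one of the categorization keyword arguments accepted by CategorizedCorpusReader:

```python
import os
import tempfile

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader

# Build a tiny throwaway corpus in the dataset's line format, where each
# sentence is wrapped in <Pros>...</Pros> or <Cons>...</Cons> tags.
# (These file names and sentences are illustrative, not real data.)
root = tempfile.mkdtemp()
with open(os.path.join(root, 'IntegratedPros.txt'), 'w') as f:
    f.write('<Pros>easy to use, economical!</Pros>\n')
with open(os.path.join(root, 'IntegratedCons.txt'), 'w') as f:
    f.write('<Cons>eats batteries</Cons>\n')

# cat_pattern is forwarded via **kwargs to CategorizedCorpusReader and
# derives each file's category ('Pros' or 'Cons') from its file name.
reader = ProsConsCorpusReader(
    root,
    r'Integrated(Cons|Pros)\.txt',
    cat_pattern=r'Integrated(Cons|Pros)\.txt',
)

print(sorted(reader.categories()))  # ['Cons', 'Pros']
print(sorted(reader.fileids()))
```

The built-in nltk.corpus.pros_cons loader follows the same pattern, pointing at the downloaded corpus directory instead of a temporary one.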
sents(fileids=None, categories=None)

Return all sentences in the corpus or in the specified files/categories.

Parameters
  fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
  categories – a list specifying the categories whose sentences have to be returned.
Returns
  the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
Return type
  list(list(str))
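Filtering by category reads only the files mapped to that category, and each sentence comes back as a token list. A small self-contained sketch, assuming nltk is installed; the one-file-per-category corpus below is invented for illustration:

```python
import os
import tempfile

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader

# Hypothetical two-file corpus in the <Pros>/<Cons> line format.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'IntegratedPros.txt'), 'w') as f:
    f.write('<Pros>easy to use, economical!</Pros>\n')
with open(os.path.join(root, 'IntegratedCons.txt'), 'w') as f:
    f.write('<Cons>eats batteries</Cons>\n')

reader = ProsConsCorpusReader(
    root, r'Integrated(Cons|Pros)\.txt',
    cat_pattern=r'Integrated(Cons|Pros)\.txt')

# categories='Pros' selects only IntegratedPros.txt; each sentence is
# tokenized by the default WordPunctTokenizer, so punctuation is split off.
pros = [list(sent) for sent in reader.sents(categories='Pros')]
print(pros)  # [['easy', 'to', 'use', ',', 'economical', '!']]
```

Passing neither fileids nor categories returns the sentences of every file in the corpus.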
words(fileids=None, categories=None)

Return all words and punctuation symbols in the corpus or in the specified files/categories.

Parameters
  fileids – a list or regexp specifying the ids of the files whose words have to be returned.
  categories – a list specifying the categories whose words have to be returned.
Returns
  the given file(s) as a list of words and punctuation symbols.
Return type
  list(str)
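Unlike sents(), words() flattens the per-sentence token lists into one stream. A sketch under the same assumptions as above (nltk installed, corpus files invented for illustration), selecting a single fileid:

```python
import os
import tempfile

from nltk.corpus.reader.pros_cons import ProsConsCorpusReader

# Illustrative corpus: one <Cons> sentence in its own file.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'IntegratedCons.txt'), 'w') as f:
    f.write('<Cons>eats batteries</Cons>\n')

reader = ProsConsCorpusReader(
    root, r'Integrated(Cons|Pros)\.txt',
    cat_pattern=r'Integrated(Cons|Pros)\.txt')

# Naming a fileid reads only that file; the result is a flat list of
# word and punctuation tokens rather than a list of sentences.
cons_words = list(reader.words('IntegratedCons.txt'))
print(cons_words)  # ['eats', 'batteries']
```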