nltk.corpus.reader.comparative_sents module
CorpusReader for the Comparative Sentence Dataset.
Comparative Sentence Dataset information -
- Annotated by: Nitin Jindal and Bing Liu, 2006.
Department of Computer Science, University of Illinois at Chicago
- Contact: Nitin Jindal, njindal@cs.uic.edu
Bing Liu, liub@cs.uic.edu (https://www.cs.uic.edu/~liub)
Distributed with permission.
Related papers:
- Nitin Jindal and Bing Liu. “Identifying Comparative Sentences in Text Documents”.
Proceedings of the ACM SIGIR International Conference on Information Retrieval (SIGIR-06), 2006.
- Nitin Jindal and Bing Liu. “Mining Comparative Sentences and Relations”.
Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-2006), 2006.
- Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”.
Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.
- class nltk.corpus.reader.comparative_sents.ComparativeSentencesCorpusReader
Bases: CorpusReader
Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).
>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly',
'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve",
'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
- CorpusView
alias of StreamBackedCorpusView
- __init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8')
- Parameters
root – The root directory for this corpus.
fileids – a list or regexp specifying the fileids in this corpus.
word_tokenizer – tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer
sent_tokenizer – tokenizer for breaking paragraphs into sentences.
encoding – the encoding that should be used to read the corpus.
- comparisons(fileids=None)
Return all comparisons in the corpus.
- Parameters
fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned.
- Returns
the given file(s) as a list of Comparison objects.
- Return type
list(Comparison)
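The returned list can be processed like any other Python list. As a minimal sketch, the helper below groups comparison objects by their `comp_type`. The name `group_by_type` is hypothetical and not part of NLTK, and the `SimpleNamespace` stand-ins only mimic the documented attributes; with the corpus data installed, one would presumably pass `comparative_sentences.comparisons()` instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(comparisons):
    """Group comparison-like objects by their integer comp_type."""
    groups = defaultdict(list)
    for comparison in comparisons:
        groups[comparison.comp_type].append(comparison)
    return dict(groups)


# Stand-in objects (assumption): they only mimic the documented attributes.
sample = [
    SimpleNamespace(text=["a", "is", "better", "than", "b"], comp_type=1, keyword="better"),
    SimpleNamespace(text=["a", "is", "as", "good", "as", "b"], comp_type=2, keyword="as"),
    SimpleNamespace(text=["c", "is", "faster", "than", "d"], comp_type=1, keyword="faster"),
]

grouped = group_by_type(sample)
# Number of comparisons found for each comparison type.
print({comp_type: len(items) for comp_type, items in grouped.items()})
```

Grouping relies only on attribute access, so the same helper should work for any object exposing `comp_type`.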
- keywords(fileids=None)
Return a set of all keywords used in the corpus.
- Parameters
fileids – a list or regexp specifying the ids of the files whose keywords have to be returned.
- Returns
the set of keywords and comparative phrases used in the corpus.
- Return type
set(str)
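Because the result is a set, it carries no frequency information. If counts are needed, one option is to tally the `keyword` attribute of the comparison objects directly. The sketch below is an assumption-laden illustration: `keyword_counts` is a hypothetical helper, lower-casing is a choice made here, and the `SimpleNamespace` objects stand in for real `Comparison` instances.

```python
from collections import Counter
from types import SimpleNamespace


def keyword_counts(comparisons):
    """Count how often each lower-cased keyword occurs."""
    return Counter(
        comparison.keyword.lower()
        for comparison in comparisons
        if comparison.keyword  # skip comparisons without a keyword
    )


# Stand-in objects (assumption): only the keyword attribute is mimicked.
sample = [
    SimpleNamespace(keyword="more"),
    SimpleNamespace(keyword="More"),
    SimpleNamespace(keyword="better"),
    SimpleNamespace(keyword=None),
]

counts = keyword_counts(sample)
print(counts.most_common(1))  # [('more', 2)]
```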
- keywords_readme()
Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).
- sents(fileids=None)
Return all sentences in the corpus.
- Parameters
fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- Returns
all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).
- Return type
list(list(str)) or list(str)
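Since a sentence may be either a list of tokens or a plain string, downstream code may want to normalize the two shapes. A minimal sketch, assuming whitespace joining is acceptable; `sentence_to_text` is a hypothetical helper, not an NLTK function:

```python
def sentence_to_text(sentence):
    """Return a sentence as one string, whether it is tokenized or not."""
    if isinstance(sentence, str):
        return sentence  # already a plain string
    return " ".join(sentence)  # join tokens with single spaces


tokenized = ["this", "camera", "is", "better", "."]
plain = "this camera is better ."

print(sentence_to_text(tokenized))
print(sentence_to_text(plain))
```

Joining on a single space does not restore the original spacing around punctuation; it only produces a uniform string form.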
- class nltk.corpus.reader.comparative_sents.Comparison
Bases: object
A Comparison represents a comparative sentence and its constituents.
- __init__(text=None, comp_type=None, entity_1=None, entity_2=None, feature=None, keyword=None)
- Parameters
text – a string (optionally tokenized) containing a comparison.
comp_type – an integer defining the type of comparison expressed. Values can be: 1 (Non-equal gradable), 2 (Equative), 3 (Superlative), 4 (Non-gradable).
entity_1 – the first entity considered in the comparison relation.
entity_2 – the second entity considered in the comparison relation.
feature – the feature considered in the comparison relation.
keyword – the word or phrase which is used for that comparative relation.
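The integer codes for `comp_type` listed above can be turned into readable labels with a small lookup table. This is a sketch only: `COMP_TYPE_LABELS` and `describe_comp_type` are hypothetical names introduced here, and the "Unknown" fallback is an assumption for values outside 1 to 4 (including `None`, the constructor default).

```python
# Labels taken from the comp_type parameter description.
COMP_TYPE_LABELS = {
    1: "Non-equal gradable",
    2: "Equative",
    3: "Superlative",
    4: "Non-gradable",
}


def describe_comp_type(comp_type):
    """Map a comp_type integer to its label, or "Unknown" if unrecognized."""
    return COMP_TYPE_LABELS.get(comp_type, "Unknown")


for code in (1, 2, 3, 4, None):
    print(code, describe_comp_type(code))
```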