nltk.corpus.reader.twitter module
A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.
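In line-delimited JSON, each line of a file holds one complete JSON object, so a collection is simply one serialised Tweet per line. The following is a minimal sketch of producing such a file; the file name and tweet contents are purely illustrative:

    import json

    # Purely illustrative tweets; real data would come from the Twitter API.
    tweets = [
        {"id": 1, "text": "first example tweet"},
        {"id": 2, "text": "second example tweet"},
    ]

    # Write one JSON object per line (line-delimited JSON).
    with open("example-tweets.json", "w", encoding="utf8") as fout:
        for tweet in tweets:
            fout.write(json.dumps(tweet) + "\n")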
- class nltk.corpus.reader.twitter.TwitterCorpusReader
Bases: CorpusReader
Reader for corpora that consist of Tweets, represented as line-delimited JSON.
Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.
Construct a new Tweet corpus reader for a set of documents located at the given root directory.
If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:
    from nltk.corpus import TwitterCorpusReader
    reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids=r'.*\.json')
However, the recommended approach is to set the relevant directory as the value of the environment variable TWITTER, and then invoke the reader as follows:
    import os
    root = os.environ['TWITTER']
    reader = TwitterCorpusReader(root, r'.*\.json')
If you want to work directly with the raw Tweets, the json library can be used:
    import json
    for tweet in reader.docs():
        print(json.dumps(tweet, indent=1, sort_keys=True))
- CorpusView
The corpus view class used by this reader.
alias of StreamBackedCorpusView
- __init__(root, fileids=None, word_tokenizer=TweetTokenizer(), encoding='utf8')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
word_tokenizer – Tokenizer for breaking the text of Tweets into smaller units, including but not limited to words.
encoding – The encoding used to read the corpus files (defaults to 'utf8').
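As a minimal sketch of passing a custom tokenizer to the constructor (the directory path is hypothetical, and WhitespaceTokenizer is just one possible choice):

    from nltk.corpus import TwitterCorpusReader
    from nltk.tokenize import WhitespaceTokenizer

    # Hypothetical corpus location; any directory of line-delimited JSON files works.
    reader = TwitterCorpusReader(
        root='/path/to/twitter-files',
        fileids=r'.*\.json',
        word_tokenizer=WhitespaceTokenizer(),
    )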
- docs(fileids=None)
Returns the full Tweet objects, as specified by Twitter's documentation for the Tweet object.
- Returns
the given file(s) as a list of dictionaries deserialised from JSON.
- Return type
list(dict)
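Because each document is a plain dictionary, individual fields can be read directly. The sketch below assumes a reader built as above; the 'id' and 'text' keys follow Twitter's Tweet JSON, and their availability depends on how the collection was downloaded:

    # Count the Tweets in each file and peek at two common fields.
    for fileid in reader.fileids():
        tweets = reader.docs([fileid])
        print(fileid, len(tweets))
        for tweet in tweets[:3]:
            # .get() is used because field availability varies between collections.
            print(tweet.get('id'), tweet.get('text'))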