nltk.corpus.reader.aligned module
- class nltk.corpus.reader.aligned.AlignedCorpusReader
Bases: CorpusReader
Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
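The layout described above (one sentence per line, tokens separated by whitespace) can be illustrated with a small stdlib-only sketch. The helper name `split_block` and the sample text are hypothetical, and this is not the reader's actual implementation; it only mimics splitting on newlines and then on runs of whitespace.

```python
import re


def split_block(text):
    """Split text into sentences (one per line), then into whitespace-separated tokens.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    # Sentences begin on separate lines; drop empty lines.
    sentences = [line for line in re.split(r"\n", text) if line.strip()]
    # Tokens are separated by runs of whitespace; drop empty strings.
    return [[tok for tok in re.split(r"\s+", line) if tok] for line in sentences]


sample = "the house is small\ndas Haus ist klein\n"
tokens = split_block(sample)
print(tokens)
```

Each inner list holds the tokens of one line, so the sentence boundaries are preserved.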
- __init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')
Construct a new Aligned Corpus reader for a set of documents located at the given root directory. Example usage:
>>> root = '/...path to corpus.../'
>>> reader = AlignedCorpusReader(root, r'.*\.txt')
- Parameters
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
- aligned_sents(fileids=None)
- Returns
the given file(s) as a list of AlignedSent objects.
- Return type
list(AlignedSent)
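To make the idea of an aligned sentence concrete, here is a minimal stdlib-only sketch. It assumes a three-line block layout (source sentence, target sentence, then index pairs such as `0-0 1-1`); that layout, the `AlignedPair` class, and the `parse_aligned_block` helper are assumptions made for illustration, not NLTK's `AlignedSent` or its block reader.

```python
from typing import List, NamedTuple, Tuple


class AlignedPair(NamedTuple):
    """Hypothetical stand-in for an aligned sentence pair (not NLTK's AlignedSent)."""
    source_tokens: List[str]
    target_tokens: List[str]
    alignment: List[Tuple[int, int]]


def parse_aligned_block(block: str) -> AlignedPair:
    """Parse an assumed three-line block: source, target, 'i-j' index pairs."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("expected source, target, and alignment lines")
    source, target, align = lines
    pairs = []
    for item in align.split():
        i, j = item.split("-")  # "2-1" means source token 2 aligns to target token 1
        pairs.append((int(i), int(j)))
    return AlignedPair(source.split(), target.split(), sorted(pairs))


block = "the house is small\ndas Haus ist klein\n0-0 1-1 2-2 3-3\n"
pair = parse_aligned_block(block)
for i, j in pair.alignment:
    print(pair.source_tokens[i], "->", pair.target_tokens[j])
```

With the real reader, the equivalent objects would come from calling `aligned_sents()` on an `AlignedCorpusReader` instance rather than from parsing by hand.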
- class nltk.corpus.reader.aligned.AlignedSentCorpusView
Bases: StreamBackedCorpusView
A specialized corpus view for aligned sentences.
AlignedSentCorpusView objects are typically created by AlignedCorpusReader (not directly by nltk users).
- __init__(corpus_file, encoding, aligned, group_by_sent, word_tokenizer, sent_tokenizer, alignedsent_block_reader)
Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
- Parameters
fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).