nltk.downloader module¶
The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
Downloading Packages¶
If called with no arguments, download()
will display an interactive
interface which can be used to download and install new packages.
If Tkinter is available, then a graphical interface will be shown;
otherwise, a simple text interface will be provided.
Individual packages can be downloaded by calling the download()
function with a single argument, giving the package identifier for the
package that should be downloaded:
>>> download('treebank')
[nltk_data] Downloading package 'treebank'...
[nltk_data] Unzipping corpora/treebank.zip.
NLTK also provides a number of “package collections”, each consisting of
a group of related packages. To download all packages in a
collection, simply call download()
with the collection’s
identifier:
>>> download('all-corpora')
[nltk_data] Downloading package 'abc'...
[nltk_data] Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data] Unzipping corpora/alpino.zip.
...
[nltk_data] Downloading package 'words'...
[nltk_data] Unzipping corpora/words.zip.
Download Directory¶
By default, packages are installed either in a system-wide directory
(if Python has sufficient access to write to it) or in the current
user’s home directory. However, the download_dir
argument may be
used to specify a different installation target, if desired.
See Downloader.default_download_dir()
for a more detailed
description of how the default download directory is chosen.
NLTK Download Server¶
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server to retrieve an index file
describing the available packages. By default, this index file is
loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
.
If necessary, it is possible to create a new Downloader
object,
specifying a different URL for the package index file.
Usage:
python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
or:
python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
- class nltk.downloader.Collection[source]¶
Bases:
object
A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.
- children¶
A list of the Collections or Packages directly contained by this collection.
- id¶
A unique identifier for this collection.
- name¶
A string name for this collection.
- packages¶
A list of
Packages
contained by this collection or any collections it recursively contains.
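The difference between children (direct members) and packages (all packages, found recursively) can be pictured with a small stand-in model. The sketch below is not NLTK's implementation: the `Pkg` and `Coll` classes are hypothetical names used only to illustrate how such a recursive flattening could work.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Pkg:
    """Hypothetical stand-in for a package entry (not nltk.downloader.Package)."""
    id: str


@dataclass
class Coll:
    """Hypothetical stand-in for a collection entry (not nltk.downloader.Collection)."""
    id: str
    children: List[Union["Coll", Pkg]] = field(default_factory=list)

    @property
    def packages(self) -> List[Pkg]:
        # Walk the children depth-first and keep each package id only once.
        found: List[Pkg] = []
        seen = set()
        for child in self.children:
            items = child.packages if isinstance(child, Coll) else [child]
            for pkg in items:
                if pkg.id not in seen:
                    seen.add(pkg.id)
                    found.append(pkg)
        return found


# A nested collection: "outer" directly contains "inner" and one package.
inner = Coll("inner", [Pkg("abc"), Pkg("alpino")])
outer = Coll("outer", [inner, Pkg("words")])
direct_ids = [c.id for c in outer.children]
flattened_ids = [p.id for p in outer.packages]
```

Here `direct_ids` lists only the immediate members, while `flattened_ids` also includes the packages reached through the nested collection.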
- class nltk.downloader.Downloader[source]¶
Bases:
object
A class used to access the NLTK data server, which can be used to download corpora and other data packages.
- DEFAULT_URL = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'¶
The default URL for the NLTK data server’s index. An alternative URL can be specified when creating a new
Downloader
object.
- INDEX_TIMEOUT = 3600¶
The amount of time after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.
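A staleness check of this kind can be sketched as follows. This is an illustration, not NLTK's code: the helper name `index_is_stale` is hypothetical, and the timeout is assumed to be expressed in seconds.

```python
import time

INDEX_TIMEOUT = 3600  # documented value; assumed here to be in seconds


def index_is_stale(last_fetched, now=None, timeout=INDEX_TIMEOUT):
    """Return True if a cached index fetched at `last_fetched` should be re-downloaded."""
    if last_fetched is None:
        return True  # nothing cached yet
    if now is None:
        now = time.time()
    return (now - last_fetched) > timeout


# Explicit timestamps keep the example deterministic.
fresh = index_is_stale(last_fetched=1000.0, now=1000.0 + 60)
stale = index_is_stale(last_fetched=1000.0, now=1000.0 + 3601)
```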
- INSTALLED = 'installed'¶
A status string indicating that a package or collection is installed and up-to-date.
- NOT_INSTALLED = 'not installed'¶
A status string indicating that a package or collection is not installed.
- PARTIAL = 'partial'¶
A status string indicating that a collection is partially installed (i.e., only some of its packages are installed).
- STALE = 'out of date'¶
A status string indicating that a package or collection is corrupt or out-of-date.
- default_download_dir()[source]¶
Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().
On Windows, the default download directory is PYTHONHOME/lib/nltk, where PYTHONHOME is the directory containing Python, e.g. C:\Python25.
On all other platforms, the default directory is the first of the following which exists or which can be created with write permission: /usr/share/nltk_data, /usr/local/share/nltk_data, /usr/lib/nltk_data, /usr/local/lib/nltk_data, ~/nltk_data.
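The "first directory which exists or can be created" rule can be approximated with the standard library. The following is a minimal sketch under one interpretation of that rule, not the code NLTK runs; `first_usable_dir` is a hypothetical helper, and "can be created" is assumed to mean that the parent directory exists and is writable.

```python
import os
import tempfile


def first_usable_dir(candidates):
    """Return the first candidate that is a writable directory, or that
    could be created because its parent is a writable directory."""
    for path in candidates:
        path = os.path.expanduser(path)
        if os.path.isdir(path):
            if os.access(path, os.W_OK):
                return path
            continue  # exists but is not writable: try the next candidate
        parent = os.path.dirname(path.rstrip(os.sep)) or os.curdir
        if os.path.isdir(parent) and os.access(parent, os.W_OK):
            return path
    return None


# Demonstrate with throwaway paths rather than real system directories.
with tempfile.TemporaryDirectory() as tmp:
    missing_parent = os.path.join(tmp, "no_such_parent", "nltk_data")
    creatable = os.path.join(tmp, "nltk_data")
    chosen = first_usable_dir([missing_parent, creatable])
    chosen_is_creatable = chosen == creatable
```

The first candidate is skipped because its parent does not exist, so the second one is selected.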
- download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=<colorama.ansitowin32.StreamWrapper object>)[source]¶
- property download_dir¶
The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().
- index()[source]¶
Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.
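To give a feel for working with an XML index of packages and collections, the sketch below parses a small made-up document with the standard library. The element and attribute names (`package`, `collection`, `item`, `ref`) and the URLs are assumptions for illustration only; the real index may be structured differently.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; not a copy of the real NLTK index.
SAMPLE_INDEX = """\
<nltk_data>
  <packages>
    <package id="treebank" name="Penn Treebank Sample" subdir="corpora"
             url="https://example.invalid/corpora/treebank.zip" />
    <package id="words" name="Word Lists" subdir="corpora"
             url="https://example.invalid/corpora/words.zip" />
  </packages>
  <collections>
    <collection id="demo" name="Demo collection">
      <item ref="treebank" />
      <item ref="words" />
    </collection>
  </collections>
</nltk_data>
"""

root = ET.fromstring(SAMPLE_INDEX)

# Collect package identifiers in document order.
package_ids = [p.get("id") for p in root.iter("package")]

# Map each collection identifier to the package identifiers it references.
collection_items = {
    c.get("id"): [i.get("ref") for i in c.iter("item")]
    for c in root.iter("collection")
}
```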
- list(download_dir=None, show_packages=True, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]¶
- status(info_or_id, download_dir=None)[source]¶
Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
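For a collection, the status has to be derived from the statuses of its packages. One plausible way to combine them is sketched below; `collection_status` is a hypothetical helper and the precedence of the rules is an assumption, so treat it as an illustration rather than the library's exact logic.

```python
# Status strings as documented for Downloader.
INSTALLED = "installed"
NOT_INSTALLED = "not installed"
PARTIAL = "partial"
STALE = "out of date"


def collection_status(package_statuses):
    """Combine per-package statuses into a single status for a collection."""
    statuses = set(package_statuses)
    if STALE in statuses:
        return STALE  # any stale member makes the collection stale
    if PARTIAL in statuses:
        return PARTIAL  # a partially installed sub-collection
    if INSTALLED in statuses and NOT_INSTALLED in statuses:
        return PARTIAL  # a mix of installed and missing packages
    if NOT_INSTALLED in statuses:
        return NOT_INSTALLED
    return INSTALLED


mixed = collection_status([INSTALLED, NOT_INSTALLED])
all_present = collection_status([INSTALLED, INSTALLED])
none_present = collection_status([NOT_INSTALLED])
outdated = collection_status([INSTALLED, STALE])
```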
- property url¶
The URL for the data server’s index file.
- class nltk.downloader.DownloaderGUI[source]¶
Bases:
object
Graphical interface for downloading packages from the NLTK data server.
- COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status', 'Unzipped Size', 'Copyright', 'Contact', 'License', 'Author', 'Subdir', 'Checksum']¶
A list of the names of columns. This controls the order in which the columns will appear. If this is edited, then
_package_to_columns()
may need to be edited to match.
- COLUMN_WEIGHTS = {'': 0, 'Name': 5, 'Size': 0, 'Status': 0}¶
A dictionary specifying how columns should be resized when the table is resized. Columns with weight 0 will not be resized at all; and columns with high weight will be resized more. Default weight (for columns not explicitly listed) is 1.
- COLUMN_WIDTHS = {'': 1, 'Identifier': 20, 'Name': 45, 'Size': 10, 'Status': 12, 'Unzipped Size': 10}¶
A dictionary specifying how wide each column should be, in characters. The default width (for columns not explicitly listed) is specified by
DEFAULT_COLUMN_WIDTH
.
- DEFAULT_COLUMN_WIDTH = 30¶
The default width for columns that are not explicitly listed in
COLUMN_WIDTHS
.
- HELP = 'This tool can be used to download a variety of corpora and models\nthat can be used with NLTK. Each corpus or model is distributed\nin a single zip file, known as a "package file." You can\ndownload packages individually, or you can download pre-defined\ncollections of packages.\n\nWhen you download a package, it will be saved to the "download\ndirectory." A default download directory is chosen when you run\n\nthe downloader; but you may also select a different download\ndirectory. On Windows, the default download directory is\n\n\n"package."\n\nThe NLTK downloader can be used to download a variety of corpora,\nmodels, and other data packages.\n\nKeyboard shortcuts::\n [return]\t Download\n [up]\t Select previous package\n [down]\t Select next package\n [left]\t Select previous tab\n [right]\t Select next tab\n'¶
- INITIAL_COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status']¶
The set of columns that should be displayed by default.
- c = 'Status'¶
- class nltk.downloader.DownloaderMessage[source]¶
Bases:
object
A status message object, used by
incr_download
to communicate its progress.
- class nltk.downloader.ErrorMessage[source]¶
Bases:
DownloaderMessage
Data server encountered an error.
- class nltk.downloader.FinishCollectionMessage[source]¶
Bases:
DownloaderMessage
Data server has finished working on a collection of packages.
- class nltk.downloader.FinishDownloadMessage[source]¶
Bases:
DownloaderMessage
Data server has finished downloading a package.
- class nltk.downloader.FinishPackageMessage[source]¶
Bases:
DownloaderMessage
Data server has finished working on a package.
- class nltk.downloader.FinishUnzipMessage[source]¶
Bases:
DownloaderMessage
Data server has finished unzipping a package.
- class nltk.downloader.Package[source]¶
Bases:
object
A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.
- __init__(id, url, name=None, subdir='', size=None, unzipped_size=None, checksum=None, svn_revision=None, copyright='Unknown', contact='Unknown', license='Unknown', author='Unknown', unzip=True, **kw)[source]¶
- author¶
Author of this package.
- checksum¶
The MD5 checksum of the package file.
- contact¶
Name & email of the person who should be contacted with questions about this package.
- copyright¶
Copyright holder for this package.
- filename¶
The filename that should be used for this package’s file. It is formed by joining self.subdir with self.id, and using the same extension as url.
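The rule for forming the filename can be illustrated with a few lines of standard-library code. This is a sketch of the described behavior, not the library's source; `package_filename` is a hypothetical helper and the URL is a placeholder.

```python
import os
from urllib.parse import urlparse


def package_filename(subdir, pkg_id, url):
    """Join subdir with the package id, reusing the extension found in the URL."""
    # Take the extension from the last path component of the URL.
    basename = os.path.basename(urlparse(url).path)
    ext = os.path.splitext(basename)[1]
    return os.path.join(subdir, pkg_id + ext)


example = package_filename(
    "corpora",
    "treebank",
    "https://example.invalid/packages/corpora/treebank.zip",
)
```

On a POSIX system `example` is the relative path of the zip file inside the `corpora` subdirectory; on Windows the path separator differs.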
- id¶
A unique identifier for this package.
- license¶
License information for this package.
- name¶
A string name for this package.
- size¶
The filesize (in bytes) of the package file.
- subdir¶
The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.
- svn_revision¶
A subversion revision number for this package.
- unzip¶
A flag indicating whether this corpus should be unzipped by default.
- unzipped_size¶
The total filesize of the files contained in the package’s zipfile.
- url¶
A URL that can be used to download this package’s file.
- class nltk.downloader.ProgressMessage[source]¶
Bases:
DownloaderMessage
Indicates how much progress the data server has made.
- class nltk.downloader.SelectDownloadDirMessage[source]¶
Bases:
DownloaderMessage
Indicates what download directory the data server is using.
- class nltk.downloader.StaleMessage[source]¶
Bases:
DownloaderMessage
The package download file is out-of-date or corrupt.
- class nltk.downloader.StartCollectionMessage[source]¶
Bases:
DownloaderMessage
Data server has started working on a collection of packages.
- class nltk.downloader.StartDownloadMessage[source]¶
Bases:
DownloaderMessage
Data server has started downloading a package.
- class nltk.downloader.StartPackageMessage[source]¶
Bases:
DownloaderMessage
Data server has started working on a package.
- class nltk.downloader.StartUnzipMessage[source]¶
Bases:
DownloaderMessage
Data server has started unzipping a package.
- class nltk.downloader.UpToDateMessage[source]¶
Bases:
DownloaderMessage
The package download file is already up-to-date.
- nltk.downloader.build_index(root, base_url)[source]¶
Create a new data.xml index file, by combining the xml description files for various packages and collections.
root should be the path to a directory containing the package xml and zip files; and the collection xml files. The root directory is expected to have the following subdirectories:
root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections
For each package, there should be two files: package.zip (where package is the package name), which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.
For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.
All identifiers (for both packages and collections) must be unique.
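Before building an index it can be useful to confirm that every package has both of its files. The following sketch walks a directory laid out as described above and reports base names that are missing either the zip or the xml file. `unpaired_packages` is a hypothetical helper, not part of NLTK, and the demo files are empty placeholders.

```python
import os
import tempfile


def unpaired_packages(root):
    """Return sorted base names under root/packages lacking a .zip or a .xml file."""
    zips, xmls = set(), set()
    for _dirpath, _dirnames, filenames in os.walk(os.path.join(root, "packages")):
        for fname in filenames:
            base, ext = os.path.splitext(fname)
            if ext == ".zip":
                zips.add(base)
            elif ext == ".xml":
                xmls.add(base)
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(zips ^ xmls)


# Build a tiny throwaway tree: "abc" is complete, "words" has no xml file.
with tempfile.TemporaryDirectory() as tmp:
    corpora = os.path.join(tmp, "packages", "corpora")
    os.makedirs(corpora)
    for name in ("abc.zip", "abc.xml", "words.zip"):
        open(os.path.join(corpora, name), "w").close()
    problems = unpaired_packages(tmp)
```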
- nltk.downloader.download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=<colorama.ansitowin32.StreamWrapper object>)¶
- nltk.downloader.md5_hexdigest(file)[source]¶
Calculate and return the MD5 checksum for a given file. file may either be a filename or an open stream.
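A function with this contract can be written with hashlib. The version below is a sketch of the idea rather than NLTK's own implementation; the name `md5_of` and the chunk size are illustrative choices, and the stream is assumed to be opened in binary mode.

```python
import hashlib
import io
import os


def _digest_stream(stream, chunk_size):
    """Feed a binary stream to MD5 in fixed-size chunks."""
    digest = hashlib.md5()
    for block in iter(lambda: stream.read(chunk_size), b""):
        digest.update(block)
    return digest.hexdigest()


def md5_of(file, chunk_size=64 * 1024):
    """Return the MD5 hex digest of `file`, given a filename or an open binary stream."""
    if isinstance(file, (str, os.PathLike)):
        with open(file, "rb") as stream:
            return _digest_stream(stream, chunk_size)
    return _digest_stream(file, chunk_size)


# Hash an in-memory stream; a path string would be opened and hashed the same way.
checksum = md5_of(io.BytesIO(b"hello"))
```

Reading in chunks keeps memory use flat even for large package files.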