nltk.corpus.reader.panlex_lite module
CorpusReader for PanLex Lite, a stripped-down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.
- class nltk.corpus.reader.panlex_lite.Meaning
Bases: dict
Represents a single PanLex meaning. A meaning is a translation set derived from a single source.
- class nltk.corpus.reader.panlex_lite.PanLexLiteCorpusReader
Bases: CorpusReader
- MEANING_Q = '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '
- TRANSLATION_Q = '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '
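To see what TRANSLATION_Q computes, the query can be run against a small in-memory SQLite database. The sketch below assumes a two-table layout inferred from the column names in the query (ex with columns ex, lv, tt; dnx with columns mn, ex, uq, ap, ui). The column types and all sample rows are made up for illustration and are not taken from the PanLex Lite database.

```python
import sqlite3

# Same SQL as the TRANSLATION_Q attribute, laid out over several lines.
TRANSLATION_Q = """
    SELECT s.tt, sum(s.uq) AS trq FROM (
        SELECT ex2.tt, max(dnx.uq) AS uq
        FROM dnx
        JOIN ex ON (ex.ex = dnx.ex)
        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
        JOIN ex ex2 ON (ex2.ex = dnx2.ex)
        WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?
        GROUP BY ex2.tt, dnx.ui
    ) s
    GROUP BY s.tt
    ORDER BY trq DESC, s.tt
"""

conn = sqlite3.connect(":memory:")

# Assumed schema: only the columns the query mentions; types are guesses.
conn.executescript("""
    CREATE TABLE ex (ex INTEGER PRIMARY KEY, lv INTEGER, tt TEXT);
    CREATE TABLE dnx (mn INTEGER, ex INTEGER, uq INTEGER, ap INTEGER, ui INTEGER);
""")

# Toy expressions: variety 1 is the source, variety 2 is the target.
conn.executemany("INSERT INTO ex VALUES (?, ?, ?)", [
    (1, 1, "dog"),
    (2, 2, "chien"),
    (3, 2, "cabot"),
])

# Toy denotations as (mn, ex, uq, ap, ui); each pair of rows shares a meaning.
conn.executemany("INSERT INTO dnx VALUES (?, ?, ?, ?, ?)", [
    (10, 1, 5, 100, 1000), (10, 2, 5, 100, 1000),  # dog = chien, group 1000
    (13, 1, 4, 103, 1000), (13, 2, 4, 103, 1000),  # dog = chien, group 1000 again
    (11, 1, 7, 101, 1001), (11, 2, 7, 101, 1001),  # dog = chien, group 1001
    (12, 1, 3, 102, 1002), (12, 3, 3, 102, 1002),  # dog = cabot, group 1002
])

# Parameter order follows the placeholders: source lv, source text, target lv.
rows = conn.execute(TRANSLATION_Q, (1, "dog", 2)).fetchall()
print(rows)
```

In this toy data two distinct ui values support "chien". The inner query keeps the highest uq per (text, ui) pair, so the second, weaker row in group 1000 does not add to the total, and the outer query sums the remaining values (5 + 7).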
- __init__(root)
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
  - A string: encoding is the encoding name for all files.
  - A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
  - A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
  - None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
- language_varieties(lc=None)
Return a list of PanLex language varieties.
- Parameters
lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.
- Returns
the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name.
- Return type
list(tuple)
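Because the result is a plain list of (uniform identifier, default name) tuples, it converts directly into lookup tables. The sketch below works on a hand-written stand-in list rather than real query output, and it assumes the usual identifier shape of a three-letter language code, a hyphen, and three digits.

```python
from collections import defaultdict

# Hand-written stand-in for the return value of language_varieties();
# these tuples are illustrative, not output from the corpus.
varieties = [
    ("eng-000", "English"),
    ("eng-001", "Example variety of English"),
    ("fra-000", "français"),
]

# Map each uniform identifier to its default name.
uid_to_name = dict(varieties)

# Group identifiers by language code (the part before the hyphen).
by_language = defaultdict(list)
for uid, _name in varieties:
    lc = uid.split("-")[0]
    by_language[lc].append(uid)

print(uid_to_name["fra-000"])
print(by_language["eng"])
```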
- meanings(expr_uid, expr_tt)
Return a list of meanings for an expression.
- Parameters
expr_uid – the expression’s language variety, as a seven-character uniform identifier.
expr_tt – the expression’s text.
- Returns
a list of Meaning objects.
- Return type
list(Meaning)
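The rows selected by MEANING_Q have the shape (mn, uq, ap, ui, tt, lv), one row per co-occurring expression. One plausible way to fold such rows into one dict per meaning is sketched below. The helper group_meaning_rows, the dict keys, the lv_to_uid mapping, and the sample rows are all hypothetical; this is not presented as the reader's actual implementation.

```python
def group_meaning_rows(rows, expr_uid, expr_tt, lv_to_uid):
    """Fold (mn, uq, ap, ui, tt, lv) rows into one dict per meaning.

    Hypothetical helper: the key names simply reuse the column names.
    """
    meanings = {}
    for mn, uq, ap, ui, tt, lv in rows:
        # Create the entry on first sight of this meaning id; seed its
        # expression table with the expression that was looked up.
        info = meanings.setdefault(
            mn,
            {"mn": mn, "uq": uq, "ap": ap, "ui": ui, "ex": {expr_uid: [expr_tt]}},
        )
        # File the co-occurring expression under its language variety.
        info["ex"].setdefault(lv_to_uid[lv], []).append(tt)
    return list(meanings.values())


# Made-up rows and variety ids, for illustration only.
sample_rows = [
    (10, 5, 100, 1000, "chien", 2),
    (10, 5, 100, 1000, "Hund", 3),
    (11, 7, 101, 1001, "chien", 2),
]
lv_to_uid = {2: "fra-000", 3: "deu-000"}

grouped = group_meaning_rows(sample_rows, "eng-000", "dog", lv_to_uid)
for meaning in grouped:
    print(meaning["mn"], meaning["ex"])
```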
- translations(from_uid, from_tt, to_uid)
Return a list of translations for an expression into a single language variety.
- Parameters
from_uid – the source expression’s language variety, as a seven-character uniform identifier.
from_tt – the source expression’s text.
to_uid – the target language variety, as a seven-character uniform identifier.
- Returns
a list of translation tuples. The first element is the expression text and the second element is the translation quality.
- Return type
list(tuple)
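A small wrapper can reduce the (text, quality) tuples to the top few expression texts. In the sketch below, best_translations and StubReader are hypothetical names introduced for illustration; the stub stands in for a real reader so the example runs without the corpus installed, and its return values are invented.

```python
def best_translations(reader, from_uid, from_tt, to_uid, limit=3):
    """Return up to `limit` translation texts, highest quality first.

    Hypothetical helper: `reader` is any object with a translations()
    method matching the signature documented above.
    """
    pairs = reader.translations(from_uid, from_tt, to_uid)
    # Re-sort defensively: quality descending, then text ascending.
    ranked = sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
    return [tt for tt, _quality in ranked[:limit]]


class StubReader:
    """Stand-in for a PanLexLiteCorpusReader; returns invented data."""

    def translations(self, from_uid, from_tt, to_uid):
        return [("cabot", 3), ("chien", 12)]


top = best_translations(StubReader(), "eng-000", "dog", "fra-000", limit=1)
print(top)
```

With the panlex_lite corpus downloaded, a real reader instance would presumably be passed in place of StubReader(); the identifiers "eng-000" and "fra-000" are used here only as examples of the seven-character format.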