nltk.corpus.reader.nkjp module
- class nltk.corpus.reader.nkjp.NKJPCorpusReader
Bases: XMLCorpusReader
- HEADER_MODE = 2
- RAW_MODE = 3
- SENTS_MODE = 1
- WORDS_MODE = 0
- __init__(root, fileids='.*')
Corpus reader designed to work with the National Corpus of Polish (NKJP). See http://nkjp.pl/ for more details about NKJP.
Usage example:
    import nltk
    import nkjp
    from nkjp import NKJPCorpusReader

    x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='')  # obtain the whole corpus
    x.header()
    x.raw()
    x.words()
    x.tagged_words(tags=['subst', 'comp'])  # Link to find more tags: nkjp.pl/poliqarp/help/ense2.html
    x.sents()

    x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='Wilk*')  # obtain particular file(s)
    x.header(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'])
    x.tagged_words(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'], tags=['subst', 'comp'])
- class nltk.corpus.reader.nkjp.NKJPCorpus_Header_View
Bases: XMLCorpusView
- __init__(filename, **kwargs)
Used in HEADER_MODE. A stream-backed corpus view specialized for use with header.xml files in the NKJP corpus.
- handle_elt(elt, context)
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
- Returns
The view value corresponding to elt.
- Parameters
elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
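The context string described above can be illustrated with the standard library alone. The following is a minimal sketch, not the way XMLCorpusView computes contexts internally (the view streams the file rather than parsing it whole); the helper name iter_with_context is hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_with_context(elt, parents=()):
    """Yield (element, context) pairs; context is the slash-joined tag path."""
    path = parents + (elt.tag,)
    yield elt, "/".join(path)
    for child in elt:
        # Each child inherits the path of its ancestors.
        yield from iter_with_context(child, path)


root = ET.fromstring("<foo><bar><baz>text</baz></bar></foo>")
contexts = [context for _, context in iter_with_context(root)]
print(contexts)
```

The last entry, 'foo/bar/baz', is the context a handler would receive for the innermost baz element.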
- class nltk.corpus.reader.nkjp.NKJPCorpus_Morph_View
Bases: XMLCorpusView
A stream-backed corpus view specialized for use with ann_morphosyntax.xml files in the NKJP corpus.
- __init__(filename, **kwargs)
Create a new corpus view based on a specified XML file.
Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.
- Parameters
tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler – A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is: elt_handler(elt, tagspec) -> value
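How a tag specification and an element handler work together can be sketched without NLTK. The sketch below assumes that the tag specification is a regular expression matched against the full context string; that is an assumption of this example, not a statement about the library's internals. The function select, the handler seg_text, and the sample element names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def select(root, tagspec, elt_handler):
    """Return handler values for the outermost elements whose context matches tagspec."""
    pattern = re.compile(tagspec + r"\Z")  # assumption: tagspec is a regex over the context
    values = []

    def walk(elt, parents):
        context = "/".join(parents + (elt.tag,))
        if pattern.match(context):
            values.append(elt_handler(elt, context))
            return  # non-nested: do not descend into a matched element
        for child in elt:
            walk(child, parents + (elt.tag,))

    walk(root, ())
    return values


def seg_text(elt, context):
    # Handler with the (element, context) -> value shape.
    return (elt.text or "").strip()


doc = ET.fromstring(
    "<text><s><seg>Ala</seg><seg>ma</seg></s><s><seg>kota</seg></s></text>"
)
words = select(doc, r".*/seg", seg_text)
sents = select(doc, r".*/s", lambda elt, ctx: [seg.text for seg in elt.iter("seg")])
print(words)
print(sents)
```

Changing only the tag specification and the handler turns the same document into a flat list of words or a list of sentences, which is the division of labour the two parameters describe.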
- handle_elt(elt, context)
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
- Returns
The view value corresponding to elt.
- Parameters
elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
- class nltk.corpus.reader.nkjp.NKJPCorpus_Segmentation_View
Bases: XMLCorpusView
A stream-backed corpus view specialized for use with ann_segmentation.xml files in the NKJP corpus.
- __init__(filename, **kwargs)
Create a new corpus view based on a specified XML file.
Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.
- Parameters
tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler – A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is: elt_handler(elt, tagspec) -> value
- handle_elt(elt, context)
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
- Returns
The view value corresponding to elt.
- Parameters
elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
- class nltk.corpus.reader.nkjp.NKJPCorpus_Text_View
Bases: XMLCorpusView
A stream-backed corpus view specialized for use with text.xml files in the NKJP corpus.
- RAW_MODE = 1
- SENTS_MODE = 0
- __init__(filename, **kwargs)
Create a new corpus view based on a specified XML file.
Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.
- Parameters
tagspec (str) – A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler – A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is: elt_handler(elt, tagspec) -> value
- handle_elt(elt, context)
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
- Returns
The view value corresponding to elt.
- Parameters
elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
- class nltk.corpus.reader.nkjp.XML_Tool
Bases: object
Helper class that rewrites an XML file into one without references to the nkjp: namespace. This is needed because XMLCorpusView assumes that it can find short substrings of XML that are themselves valid XML, which is not true if a namespace is declared at the top level.
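The namespace problem can be reproduced with the standard library. The following sketch is not the XML_Tool implementation; the function strip_prefix, the sample markup, and the namespace URL are hypothetical. It shows that a fragment carrying an nkjp: attribute fails to parse once it is cut away from the top-level declaration, and that it parses after the prefixed names are removed.

```python
import re
import xml.etree.ElementTree as ET


def strip_prefix(xml_text, prefix="nkjp"):
    """Remove the declaration, attributes and tags that use the given prefix."""
    p = re.escape(prefix)
    # Drop the namespace declaration, e.g. xmlns:nkjp="..."
    text = re.sub(rf'\s+xmlns:{p}="[^"]*"', "", xml_text)
    # Drop prefixed attributes, e.g. nkjp:manner="space"
    text = re.sub(rf'\s+{p}:[\w.-]+="[^"]*"', "", text)
    # Drop prefixed opening and closing tags but keep their content
    text = re.sub(rf"</?{p}:[\w.-]+[^>]*>", "", text)
    return text


sample = (
    '<root xmlns:nkjp="http://example.org/ns">'
    '<seg nkjp:manner="space" id="s1"><nkjp:paren>x</nkjp:paren></seg>'
    "</root>"
)
cleaned = strip_prefix(sample)

# A fragment cut out of the original file has no declaration for the prefix.
fragment = '<seg nkjp:manner="space" id="s1">x</seg>'
try:
    ET.fromstring(fragment)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False  # the parser rejects the unbound prefix

seg = ET.fromstring(strip_prefix(fragment))
print(cleaned, parsed_raw, seg.attrib)
```

A regular-expression rewrite like this is only safe for markup whose attribute values contain no double quotes or angle brackets; a production tool would need to handle those cases.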