nltk.sem.relextract module
Code for extracting relational triples from the ieer and conll2002 corpora.
Relations are stored internally as dictionaries (‘reldicts’).
The two serialization outputs are “rtuple” and “clause”.
An rtuple is a tuple of the form (subj, filler, obj), where subj and obj are pairs of Named Entity mentions, and filler is the string of words occurring between subj and obj (with no intervening NEs). Strings are printed via repr() to circumvent locale variations in rendering utf-8 encoded strings.
A clause is an atom of the form relsym(subjsym, objsym), where the relation, subject and object have been canonicalized to single strings.
- nltk.sem.relextract.class_abbrev(type)
Abbreviate an NE class name.
- Parameters
type (str) – the NE class name
- Return type
str
- nltk.sem.relextract.clause(reldict, relsym)
Print the relation in clausal form.
- Parameters
reldict (defaultdict) – a relation dictionary
relsym (str) – a label for the relation
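To make the clausal form concrete, here is a minimal sketch that renders a hand-built reldict as relsym(subjsym, objsym). The helper name `to_clause` and the reldict values are hypothetical, and quoting the symbols with repr() is an assumption carried over from the rtuple description; this is not presented as the module's own implementation.

```python
from collections import defaultdict

# A hand-built reldict; the values are invented for illustration.
reldict = defaultdict(str)
reldict["subjsym"] = "white_house"
reldict["objsym"] = "washington"


def to_clause(reldict, relsym):
    """Render relsym(subjsym, objsym), quoting both symbols with repr()."""
    return "%s(%r, %r)" % (relsym, reldict["subjsym"], reldict["objsym"])


print(to_clause(reldict, "in"))  # in('white_house', 'washington')
```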
- nltk.sem.relextract.conllned(trace=1)
Find the copula+’van’ relation (‘of’) in the Dutch tagged training corpus from CoNLL 2002.
- nltk.sem.relextract.descape_entity(m, defs={'AElig': 'Æ', 'Aacute': 'Á', 'Acirc': 'Â', 'Agrave': 'À', 'Alpha': 'Α', 'Aring': 'Å', 'Atilde': 'Ã', 'Auml': 'Ä', 'Beta': 'Β', 'Ccedil': 'Ç', 'Chi': 'Χ', 'Dagger': '‡', 'Delta': 'Δ', 'ETH': 'Ð', 'Eacute': 'É', 'Ecirc': 'Ê', 'Egrave': 'È', 'Epsilon': 'Ε', 'Eta': 'Η', 'Euml': 'Ë', 'Gamma': 'Γ', 'Iacute': 'Í', 'Icirc': 'Î', 'Igrave': 'Ì', 'Iota': 'Ι', 'Iuml': 'Ï', 'Kappa': 'Κ', 'Lambda': 'Λ', 'Mu': 'Μ', 'Ntilde': 'Ñ', 'Nu': 'Ν', 'OElig': 'Œ', 'Oacute': 'Ó', 'Ocirc': 'Ô', 'Ograve': 'Ò', 'Omega': 'Ω', 'Omicron': 'Ο', 'Oslash': 'Ø', 'Otilde': 'Õ', 'Ouml': 'Ö', 'Phi': 'Φ', 'Pi': 'Π', 'Prime': '″', 'Psi': 'Ψ', 'Rho': 'Ρ', 'Scaron': 'Š', 'Sigma': 'Σ', 'THORN': 'Þ', 'Tau': 'Τ', 'Theta': 'Θ', 'Uacute': 'Ú', 'Ucirc': 'Û', 'Ugrave': 'Ù', 'Upsilon': 'Υ', 'Uuml': 'Ü', 'Xi': 'Ξ', 'Yacute': 'Ý', 'Yuml': 'Ÿ', 'Zeta': 'Ζ', 'aacute': 'á', 'acirc': 'â', 'acute': '´', 'aelig': 'æ', 'agrave': 'à', 'alefsym': 'ℵ', 'alpha': 'α', 'amp': '&', 'and': '∧', 'ang': '∠', 'aring': 'å', 'asymp': '≈', 'atilde': 'ã', 'auml': 'ä', 'bdquo': '„', 'beta': 'β', 'brvbar': '¦', 'bull': '•', 'cap': '∩', 'ccedil': 'ç', 'cedil': '¸', 'cent': '¢', 'chi': 'χ', 'circ': 'ˆ', 'clubs': '♣', 'cong': '≅', 'copy': '©', 'crarr': '↵', 'cup': '∪', 'curren': '¤', 'dArr': '⇓', 'dagger': '†', 'darr': '↓', 'deg': '°', 'delta': 'δ', 'diams': '♦', 'divide': '÷', 'eacute': 'é', 'ecirc': 'ê', 'egrave': 'è', 'empty': '∅', 'emsp': '\u2003', 'ensp': '\u2002', 'epsilon': 'ε', 'equiv': '≡', 'eta': 'η', 'eth': 'ð', 'euml': 'ë', 'euro': '€', 'exist': '∃', 'fnof': 'ƒ', 'forall': '∀', 'frac12': '½', 'frac14': '¼', 'frac34': '¾', 'frasl': '⁄', 'gamma': 'γ', 'ge': '≥', 'gt': '>', 'hArr': '⇔', 'harr': '↔', 'hearts': '♥', 'hellip': '…', 'iacute': 'í', 'icirc': 'î', 'iexcl': '¡', 'igrave': 'ì', 'image': 'ℑ', 'infin': '∞', 'int': '∫', 'iota': 'ι', 'iquest': '¿', 'isin': '∈', 'iuml': 'ï', 'kappa': 'κ', 'lArr': '⇐', 'lambda': 'λ', 'lang': '〈', 'laquo': '«', 'larr': '←', 'lceil': '⌈', 'ldquo': '“', 
'le': '≤', 'lfloor': '⌊', 'lowast': '∗', 'loz': '◊', 'lrm': '\u200e', 'lsaquo': '‹', 'lsquo': '‘', 'lt': '<', 'macr': '¯', 'mdash': '—', 'micro': 'µ', 'middot': '·', 'minus': '−', 'mu': 'μ', 'nabla': '∇', 'nbsp': '\xa0', 'ndash': '–', 'ne': '≠', 'ni': '∋', 'not': '¬', 'notin': '∉', 'nsub': '⊄', 'ntilde': 'ñ', 'nu': 'ν', 'oacute': 'ó', 'ocirc': 'ô', 'oelig': 'œ', 'ograve': 'ò', 'oline': '‾', 'omega': 'ω', 'omicron': 'ο', 'oplus': '⊕', 'or': '∨', 'ordf': 'ª', 'ordm': 'º', 'oslash': 'ø', 'otilde': 'õ', 'otimes': '⊗', 'ouml': 'ö', 'para': '¶', 'part': '∂', 'permil': '‰', 'perp': '⊥', 'phi': 'φ', 'pi': 'π', 'piv': 'ϖ', 'plusmn': '±', 'pound': '£', 'prime': '′', 'prod': '∏', 'prop': '∝', 'psi': 'ψ', 'quot': '"', 'rArr': '⇒', 'radic': '√', 'rang': '〉', 'raquo': '»', 'rarr': '→', 'rceil': '⌉', 'rdquo': '”', 'real': 'ℜ', 'reg': '®', 'rfloor': '⌋', 'rho': 'ρ', 'rlm': '\u200f', 'rsaquo': '›', 'rsquo': '’', 'sbquo': '‚', 'scaron': 'š', 'sdot': '⋅', 'sect': '§', 'shy': '\xad', 'sigma': 'σ', 'sigmaf': 'ς', 'sim': '∼', 'spades': '♠', 'sub': '⊂', 'sube': '⊆', 'sum': '∑', 'sup': '⊃', 'sup1': '¹', 'sup2': '²', 'sup3': '³', 'supe': '⊇', 'szlig': 'ß', 'tau': 'τ', 'there4': '∴', 'theta': 'θ', 'thetasym': 'ϑ', 'thinsp': '\u2009', 'thorn': 'þ', 'tilde': '˜', 'times': '×', 'trade': '™', 'uArr': '⇑', 'uacute': 'ú', 'uarr': '↑', 'ucirc': 'û', 'ugrave': 'ù', 'uml': '¨', 'upsih': 'ϒ', 'upsilon': 'υ', 'uuml': 'ü', 'weierp': '℘', 'xi': 'ξ', 'yacute': 'ý', 'yen': '¥', 'yuml': 'ÿ', 'zeta': 'ζ', 'zwj': '\u200d', 'zwnj': '\u200c'})[source]¶
Translate one entity to its ISO Latin value. Inspired by example from effbot.org
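The m argument suggests a regular-expression match object, so a function of this shape can be passed as the replacement callback to re.sub. The sketch below illustrates that pattern using only the standard library; the names `ENTITY` and `replace_entity` and the regular expression itself are assumptions made for this example, and leaving unknown entities untouched is one possible design choice.

```python
import re
from html.entities import entitydefs

# Matches named entities such as "&amp;"; the entity name is captured in group 1.
ENTITY = re.compile(r"&(\w+?);")


def replace_entity(match, defs=entitydefs):
    """Return the character for a named entity, or the original text if unknown."""
    return defs.get(match.group(1), match.group(0))


text = "McGlashan &amp; Sarrail, caf&eacute; &unknown;"
print(ENTITY.sub(replace_entity, text))
```

Because the callback falls back to `match.group(0)`, text that merely looks like an entity passes through unchanged instead of raising an error.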
- nltk.sem.relextract.extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10)
Filter the output of semi_rel2reldict according to specified NE classes and a filler pattern.
The parameters subjclass and objclass can be used to restrict the Named Entities to particular types (any of ‘LOCATION’, ‘ORGANIZATION’, ‘PERSON’, ‘DURATION’, ‘DATE’, ‘CARDINAL’, ‘PERCENT’, ‘MONEY’, ‘MEASURE’).
- Parameters
subjclass (str) – the class of the subject Named Entity.
objclass (str) – the class of the object Named Entity.
doc (ieer document or a list of chunk trees) – input document
corpus (str) – name of the corpus to take as input; possible values are ‘ieer’ and ‘conll2002’ (the signature default is ‘ace’)
pattern (compiled regular expression, i.e. re.Pattern) – a regular expression for filtering the fillers of retrieved triples
window (int) – filters out fillers containing more words than this threshold
- Returns
a list of reldicts; see semi_rel2reldict for the keys
- Return type
list(defaultdict)
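The filtering step described above can be sketched without any corpus by applying the same three criteria (NE classes, filler pattern, filler length) to hand-built reldicts. Everything here is illustrative: `filter_reldicts` is a hypothetical helper, the reldict values are invented, and counting filler length in whitespace-separated words is an assumption of this sketch.

```python
import re

# Hand-built reldicts carrying only the keys the filter needs; values are invented.
reldicts = [
    {"subjclass": "ORGANIZATION", "filler": "in", "objclass": "LOCATION"},
    {"subjclass": "PERSON", "filler": "in", "objclass": "LOCATION"},
    {"subjclass": "ORGANIZATION", "filler": ", based in", "objclass": "LOCATION"},
    {"subjclass": "ORGANIZATION", "filler": "said on Monday", "objclass": "LOCATION"},
]


def filter_reldicts(reldicts, subjclass, objclass, pattern, window=10):
    """Keep reldicts with matching NE classes, a short filler, and a pattern match."""
    return [
        r
        for r in reldicts
        if r["subjclass"] == subjclass
        and r["objclass"] == objclass
        and len(r["filler"].split()) <= window  # assumed: length counted in words
        and pattern.match(r["filler"])
    ]


# Accept fillers that contain the standalone word "in".
IN = re.compile(r".*\bin\b")
kept = filter_reldicts(reldicts, "ORGANIZATION", "LOCATION", IN)
print([r["filler"] for r in kept])  # ['in', ', based in']
```

Lowering `window` tightens the filter: with `window=1` only the single-word filler survives.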
- nltk.sem.relextract.in_demo(trace=0, sql=True)
Select pairs of organizations and locations whose mentions occur with an intervening occurrence of the preposition “in”.
If the sql parameter is set to True, then the entity pairs are loaded into an in-memory database, and subsequently pulled out using an SQL “SELECT” query.
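The database round trip can be illustrated with the standard sqlite3 module. This is a sketch under stated assumptions, not the demo's actual code: the table name `Locations`, its column names, and the entity pairs are all invented for the example.

```python
import sqlite3

# Invented (organization, location) pairs standing in for extracted entity pairs.
pairs = [
    ("acme_corp", "boston"),
    ("globex", "springfield"),
    ("initech", "boston"),
]

# ":memory:" creates a database that lives only for the lifetime of the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Locations (OrgName TEXT, LocationName TEXT)")
conn.executemany("INSERT INTO Locations VALUES (?, ?)", pairs)

# Pull the organizations for one location back out with a SELECT query.
rows = conn.execute(
    "SELECT OrgName FROM Locations WHERE LocationName = ? ORDER BY OrgName",
    ("boston",),
).fetchall()
conn.close()

print(rows)  # [('acme_corp',), ('initech',)]
```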
- nltk.sem.relextract.list2sym(lst)
Convert a list of strings into a canonical symbol.
- Parameters
lst (list) – the list of strings to convert
- Returns
a Unicode string without whitespace
- Return type
str
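One way to picture a canonical symbol is sketched below. The specific normalization steps (joining with underscores, lowercasing, dropping periods) are assumptions chosen for illustration, and `to_symbol` is a hypothetical helper rather than the module's function.

```python
def to_symbol(words):
    """Join words with underscores, lowercase the result, and drop periods."""
    return "_".join(words).lower().replace(".", "")


print(to_symbol(["U.S.", "Justice", "Department"]))  # us_justice_department
```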
- nltk.sem.relextract.rtuple(reldict, lcon=False, rcon=False)
Pretty print the reldict as an rtuple.
- Parameters
reldict (defaultdict) – a relation dictionary
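The following sketch shows one way a reldict could be laid out as a (subj, filler, obj) string, with the left and right context included on request. The bracketed layout, the helper name `format_rtuple`, and the reldict values are assumptions for illustration; the strings are quoted with repr(), in line with the module description.

```python
from collections import defaultdict

# A hand-built reldict; all values are invented for illustration.
reldict = defaultdict(str)
reldict.update(
    lcon="according to",
    subjclass="ORG",
    subjtext="Acme Corp",
    filler="in",
    objclass="LOC",
    objtext="Boston",
    rcon="said on Monday",
)


def format_rtuple(reldict, lcon=False, rcon=False):
    """Format subject, filler and object, optionally with surrounding context."""
    items = [
        reldict["subjclass"],
        reldict["subjtext"],
        reldict["filler"],
        reldict["objclass"],
        reldict["objtext"],
    ]
    fmt = "[%s: %r] %r [%s: %r]"
    if lcon:  # prepend the left context
        items.insert(0, reldict["lcon"])
        fmt = "...%r)" + fmt
    if rcon:  # append the right context
        items.append(reldict["rcon"])
        fmt = fmt + "(%r..."
    return fmt % tuple(items)


print(format_rtuple(reldict))  # [ORG: 'Acme Corp'] 'in' [LOC: 'Boston']
```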
- nltk.sem.relextract.semi_rel2reldict(pairs, window=5, trace=False)
Converts the pairs generated by tree2semi_rel into a ‘reldict’: a dictionary which stores information about the subject and object NEs plus the filler between them. Additionally, a left and right context of length <= window are captured (within a given input sentence).
- Parameters
pairs – a list of (list(str), Tree) pairs, as generated by tree2semi_rel
window (int) – a threshold for the number of items to include in the left and right context
- Returns
‘relation’ dictionaries whose keys are ‘lcon’, ‘subjclass’, ‘subjtext’, ‘subjsym’, ‘filler’, ‘objclass’, ‘objtext’, ‘objsym’ and ‘rcon’
- Return type
list(defaultdict)
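The conversion can be pictured as sliding over consecutive pairs: the first supplies the left context and the subject, the second the filler and the object, and the third the right context. The sketch below follows that reading with plain (label, words) tuples standing in for Tree objects. The helper `pairs_to_reldicts` is hypothetical, the requirement of three consecutive pairs is an assumption, and the ‘subjsym’ and ‘objsym’ keys are omitted for brevity.

```python
from collections import defaultdict

# Each pair is (words_before_entity, (ne_label, entity_words)); plain tuples
# stand in for Tree objects here, and the sentence is invented.
pairs = [
    (["Officials", "at"], ("ORGANIZATION", ["Acme", "Corp"])),
    (["in"], ("LOCATION", ["Boston"])),
    (["met", "with"], ("PERSON", ["Jane", "Doe"])),
]


def pairs_to_reldicts(pairs, window=5):
    """Build one reldict per run of three consecutive pairs."""
    reldicts = []
    for i in range(len(pairs) - 2):
        (lcon, subj), (filler, obj), (rcon, _) = pairs[i:i + 3]
        d = defaultdict(str)
        d["lcon"] = " ".join(lcon[-window:])  # last `window` words before the subject
        d["subjclass"], d["subjtext"] = subj[0], " ".join(subj[1])
        d["filler"] = " ".join(filler)  # words between subject and object
        d["objclass"], d["objtext"] = obj[0], " ".join(obj[1])
        d["rcon"] = " ".join(rcon[:window])  # first `window` words after the object
        reldicts.append(d)
    return reldicts


for d in pairs_to_reldicts(pairs):
    print(dict(d))
```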
- nltk.sem.relextract.tree2semi_rel(tree)
Group a chunk structure into a list of ‘semi-relations’ of the form (list(str), Tree).
In order to facilitate the construction of (Tree, string, Tree) triples, this identifies pairs whose first member is a list (possibly empty) of terminal strings, and whose second member is a Tree of the form (NE_label, terminals).
- Parameters
tree – a chunk tree
- Returns
a list of pairs (list(str), Tree)
- Return type
list of tuple
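The grouping described above can be sketched on a flat list in which plain strings are ordinary tokens and (label, words) tuples stand in for NE subtrees. This is an illustrative approximation with a hypothetical helper, `group_semi_rels`, and invented data; it does not use Tree objects.

```python
# A flat chunk structure: strings are ordinary tokens, and (label, words)
# tuples stand in for NE subtrees. The sentence is invented.
chunks = [
    "Officials", "at", ("ORGANIZATION", ["Acme", "Corp"]),
    "in", ("LOCATION", ["Boston"]),
    ("PERSON", ["Jane", "Doe"]),
    "resigned",
]


def group_semi_rels(chunks):
    """Pair each NE with the (possibly empty) list of tokens preceding it."""
    semi_rels = []
    words = []
    for item in chunks:
        if isinstance(item, tuple):  # an NE: close off the current group
            semi_rels.append((words, item))
            words = []
        else:  # an ordinary token: accumulate until the next NE
            words.append(item)
    # Tokens after the last NE are not attached to any pair in this sketch.
    return semi_rels


for words, entity in group_semi_rels(chunks):
    print(words, entity)
```

Two adjacent NEs yield a pair whose first member is the empty list, which is the "possibly empty" case mentioned in the description.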