nltk.tree package¶
Submodules¶
Module contents¶
NLTK Tree Package
This package may be used for representing hierarchical language structures, such as syntax trees and morphological trees.
- class nltk.tree.ImmutableMultiParentedTree[source]¶
Bases: ImmutableTree, MultiParentedTree
- class nltk.tree.ImmutableParentedTree[source]¶
Bases: ImmutableTree, ParentedTree
- class nltk.tree.ImmutableProbabilisticTree[source]¶
Bases: ImmutableTree, ProbabilisticMixIn
- class nltk.tree.ImmutableTree[source]¶
Bases: Tree
- pop(v=None)[source]¶
Remove and return item at index (default last).
Raises IndexError if list is empty or index is out of range.
- set_label(value)[source]¶
Set the node label. This will only succeed the first time the node label is set, which should occur in ImmutableTree.__init__().
- sort()[source]¶
Sort the list in ascending order and return None.
The sort is in-place (i.e. the list itself is modified) and stable (i.e. the order of two equal elements is maintained).
If a key function is given, apply it once to each list item and sort them, ascending or descending, according to their function values.
The reverse flag can be set to sort in descending order.
- class nltk.tree.MultiParentedTree[source]¶
Bases: AbstractParentedTree
A Tree that automatically maintains parent pointers for multi-parented trees. The following are methods for querying the structure of a multi-parented tree: parents(), parent_indices(), left_siblings(), right_siblings(), roots, treepositions.
Each MultiParentedTree may have zero or more parents. In particular, subtrees may be shared. If a single MultiParentedTree is used as multiple children of the same parent, then that parent will appear multiple times in its parents() method.
MultiParentedTrees should never be used in the same tree as Trees or ParentedTrees. Mixing tree implementations may result in incorrect parent pointers and in TypeError exceptions.
- left_siblings()[source]¶
A list of all left siblings of this tree, in any of its parent trees. A tree may be its own left sibling if it is used as multiple contiguous children of the same parent. A tree may appear multiple times in this list if it is the left sibling of this tree with respect to multiple parents.
- Type
list(MultiParentedTree)
- parent_indices(parent)[source]¶
Return a list of the indices where this tree occurs as a child of parent. If this child does not occur as a child of parent, then the empty list is returned. The following is always true:
for parent_index in ptree.parent_indices(parent):
    parent[parent_index] is ptree
- parents()[source]¶
The set of parents of this tree. If this tree has no parents, then parents is the empty set. To check if a tree is used as multiple children of the same parent, use the parent_indices() method.
- Type
list(MultiParentedTree)
- right_siblings()[source]¶
A list of all right siblings of this tree, in any of its parent trees. A tree may be its own right sibling if it is used as multiple contiguous children of the same parent. A tree may appear multiple times in this list if it is the right sibling of this tree with respect to multiple parents.
- Type
list(MultiParentedTree)
- roots()[source]¶
The set of all roots of this tree. This set is formed by tracing all possible parent paths until trees with no parents are found.
- Type
list(MultiParentedTree)
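The idea of tracing every parent path up to a parentless node can be sketched without NLTK. The following is a minimal stand-alone illustration, not NLTK's implementation: the Node class and the roots function are hypothetical stand-ins that only mimic the behaviour described above.

```python
class Node:
    """Hypothetical stand-in for a multi-parented tree node (not an NLTK class)."""

    def __init__(self, label):
        self.label = label
        self.parents = []  # zero or more parent nodes

    def add_child(self, child):
        child.parents.append(self)


def roots(node):
    """Collect the parentless ancestors reached by following every parent path."""
    if not node.parents:
        return [node]
    found = []
    for parent in node.parents:
        for root in roots(parent):
            # Compare by identity so that each root is reported once.
            if not any(root is seen for seen in found):
                found.append(root)
    return found


s1, s2, vp, np = Node("S1"), Node("S2"), Node("VP"), Node("NP")
s1.add_child(np)  # NP is shared: it is a child of S1 ...
vp.add_child(np)  # ... and also a child of VP,
s2.add_child(vp)  # which itself sits under S2

root_labels = [r.label for r in roots(np)]
print(root_labels)  # ['S1', 'S2']
```

A node with no parents is its own root, which is why the recursion bottoms out by returning the node itself.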
- class nltk.tree.ParentedTree[source]¶
Bases: AbstractParentedTree
A Tree that automatically maintains parent pointers for single-parented trees. The following are methods for querying the structure of a parented tree: parent, parent_index, left_sibling, right_sibling, root, treeposition.
Each ParentedTree may have at most one parent. In particular, subtrees may not be shared. Any attempt to reuse a single ParentedTree as a child of more than one parent (or as multiple children of the same parent) will cause a ValueError exception to be raised.
ParentedTrees should never be used in the same tree as Trees or MultiParentedTrees. Mixing tree implementations may result in incorrect parent pointers and in TypeError exceptions.
- parent_index()[source]¶
The index of this tree in its parent. I.e., ptree.parent()[ptree.parent_index()] is ptree. Note that ptree.parent_index() is not necessarily equal to ptree.parent().index(ptree), since the index() method returns the index of the first child that is equal to its argument.
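The difference between an identity-based index and list.index() can be seen with plain Python lists. This is a minimal sketch that uses nested lists in place of ParentedTree objects; the variable names are illustrative only.

```python
first = ["NP", "it"]
second = ["NP", "it"]  # equal to `first`, but a distinct object
parent = [first, second]

# list.index() compares with ==, so it stops at the first equal child.
by_equality = parent.index(second)

# An identity-based lookup finds the slot that actually holds `second`.
by_identity = next(i for i, child in enumerate(parent) if child is second)

print(by_equality, by_identity)  # 0 1
```

Two equal but distinct subtrees are common in practice (for example, two identical noun phrases in one sentence), which is why the two lookups can disagree.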
- class nltk.tree.ProbabilisticTree[source]¶
Bases: Tree, ProbabilisticMixIn
- class nltk.tree.Tree[source]¶
Bases: list
A Tree represents a hierarchical grouping of leaves and subtrees. For example, each constituent in a syntax tree is represented by a single Tree.
A tree’s children are encoded as a list of leaves and subtrees, where a leaf is a basic (non-tree) value; and a subtree is a nested Tree.
>>> from nltk.tree import Tree
>>> print(Tree(1, [2, Tree(3, [4]), 5]))
(1 2 (3 4) 5)
>>> vp = Tree('VP', [Tree('V', ['saw']),
...                  Tree('NP', ['him'])])
>>> s = Tree('S', [Tree('NP', ['I']), vp])
>>> print(s)
(S (NP I) (VP (V saw) (NP him)))
>>> print(s[1])
(VP (V saw) (NP him))
>>> print(s[1,1])
(NP him)
>>> t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")
>>> s == t
True
>>> t[1][1].set_label('X')
>>> t[1][1].label()
'X'
>>> print(t)
(S (NP I) (VP (V saw) (X him)))
>>> t[0], t[1,1] = t[1,1], t[0]
>>> print(t)
(S (X him) (VP (V saw) (NP I)))
The length of a tree is the number of children it has.
>>> len(t)
2
The set_label() and label() methods allow individual constituents to be labeled. For example, syntax trees use this label to specify phrase tags, such as “NP” and “VP”.
Several Tree methods use “tree positions” to specify children or descendants of a tree. Tree positions are defined as follows:
The tree position i specifies a Tree’s ith child.
The tree position () specifies the Tree itself.
If p is the tree position of descendant d, then p+i specifies the ith child of d.
I.e., every tree position is either a single index i, specifying tree[i]; or a sequence i1, i2, …, iN, specifying tree[i1][i2]...[iN].
Construct a new tree. This constructor can be called in one of two ways:
Tree(label, children) constructs a new tree with the specified label and list of children.
Tree.fromstring(s) constructs a new tree by parsing the string s.
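The definition of tree positions amounts to repeated indexing. Here is a minimal sketch, assuming nested Python lists in place of Tree objects, with node labels omitted so that each inner list holds only children. The helper name resolve is hypothetical.

```python
def resolve(tree, position):
    """Follow a tree position (a tuple of child indices) down from `tree`."""
    node = tree
    for index in position:
        node = node[index]
    return node


# Corresponds in shape to (S (NP I) (VP (V saw) (NP him))), labels omitted.
tree = [["I"], [["saw"], ["him"]]]

whole = resolve(tree, ())            # the empty position is the tree itself
second_child = resolve(tree, (1,))   # same as tree[1]
leaf = resolve(tree, (1, 1, 0))      # same as tree[1][1][0]
print(leaf)  # him
```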
- chomsky_normal_form(factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')[source]¶
This method can modify a tree in three ways:
Convert a tree into its Chomsky Normal Form (CNF) equivalent – Every subtree has either two non-terminals or one terminal as its children. This process requires the creation of more “artificial” non-terminal nodes.
Markov (horizontal) smoothing of children in new artificial nodes
Vertical (parent) annotation of nodes
- Parameters
factor (str = [left|right]) – Right or left factoring method (default = “right”)
horzMarkov (int | None) – Markov order for sibling smoothing in artificial nodes (None (default) = include all siblings)
vertMarkov (int | None) – Markov order for parent smoothing (0 (default) = no vertical annotation)
childChar (str) – A string used in construction of the artificial nodes, separating the head of the original subtree from the child nodes that have yet to be expanded (default = “|”)
parentChar (str) – A string used to separate the node representation from its vertical annotation (default = “^”)
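Right factoring can be illustrated on a toy representation. The sketch below is not NLTK's implementation: trees are assumed to be (label, children) tuples with plain strings as leaves, the function names are hypothetical, and the artificial label format (head, childChar, then the remaining siblings in angle brackets) is an assumption made for illustration. Horizontal and vertical Markov options are left out.

```python
def _name(node):
    """Label of a subtree, or the string itself for a leaf."""
    return node if isinstance(node, str) else node[0]


def _factor(label, head, children, child_char):
    """Split off the first child and push the rest under an artificial node."""
    if len(children) <= 2:
        return (label, children)
    rest = children[1:]
    siblings = "-".join(_name(c) for c in rest)
    artificial = f"{head}{child_char}<{siblings}>"
    return (label, [children[0], _factor(artificial, head, rest, child_char)])


def binarize_right(tree, child_char="|"):
    """Right-factor a tree so that no node has more than two children."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize_right(c, child_char) for c in children]
    return _factor(label, label, children, child_char)


np = ("NP", [("D", ["the"]), ("ADJ", ["big"]), ("N", ["dog"])])
binary_np = binarize_right(np)
print(binary_np)
```

In this sketch the three-child NP becomes a two-child NP whose second child is an artificial node labelled NP|<ADJ-N>, which keeps enough information in the label to undo the transformation later.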
- collapse_unary(collapsePOS=False, collapseRoot=False, joinChar='+')[source]¶
Collapse subtrees with a single child (i.e., unary productions) into a new non-terminal (Tree node) joined by ‘joinChar’. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would lose useful information. The Tree is modified in place and no value is returned.
- Parameters
collapsePOS (bool) – ‘False’ (default) will not collapse the parent of leaf nodes (i.e., Part-of-Speech tags) since they are always unary productions
collapseRoot (bool) – ‘False’ (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
joinChar (str) – A string used to connect collapsed node values (default = “+”)
- classmethod convert(tree)[source]¶
Convert a tree between different subtypes of Tree. cls determines which class will be used to encode the new tree.
- Parameters
tree (Tree) – The tree that should be converted.
- Returns
The new Tree.
- flatten()[source]¶
Return a flat version of the tree, with all non-root non-terminals removed.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> print(t.flatten())
(S the dog chased the cat)
- Returns
a tree consisting of this tree’s root connected directly to its leaves, omitting all intervening non-terminal nodes.
- Return type
Tree
- classmethod fromlist(l)[source]¶
Convert nested lists to an NLTK Tree.
- Parameters
l (list) – a tree represented as nested lists
- Returns
A tree corresponding to the list representation l.
- Return type
Tree
- classmethod fromstring(s, brackets='()', read_node=None, read_leaf=None, node_pattern=None, leaf_pattern=None, remove_empty_top_bracketing=False)[source]¶
Read a bracketed tree string and return the resulting tree. Trees are represented as nested bracketings, such as:
(S (NP (NNP John)) (VP (V runs)))
- Parameters
s (str) – The string to read
brackets (str (length=2)) – The bracket characters used to mark the beginning and end of trees and subtrees.
read_node, read_leaf – If specified, these functions are applied to the substrings of s corresponding to nodes and leaves (respectively) to obtain the values for those nodes and leaves. They should have the following signature:
read_node(str) -> value
For example, these functions could be used to process nodes and leaves whose values should be some type other than string (such as FeatStruct). Note that by default, node strings and leaf strings are delimited by whitespace and brackets; to override this default, use the node_pattern and leaf_pattern arguments.
node_pattern, leaf_pattern – Regular expression patterns used to find node and leaf substrings in s. By default, both patterns are defined to match any sequence of non-whitespace non-bracket characters.
remove_empty_top_bracketing (bool) – If the resulting tree has an empty node label, and is length one, then return its single child instead. This is useful for treebank trees, which sometimes contain an extra level of bracketing.
- Returns
A tree corresponding to the string representation s. If this class method is called using a subclass of Tree, then it will return a tree of that type.
- Return type
Tree
- height()[source]¶
Return the height of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.height()
5
>>> print(t[0,0])
(D the)
>>> t[0,0].height()
2
- Returns
The height of this tree. The height of a tree containing no children is 1; the height of a tree containing only leaves is 2; and the height of any other tree is one plus the maximum of its children’s heights.
- Return type
int
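The height rule can be written as a short recursion. The following is a stand-alone sketch rather than NLTK's code: it assumes trees are (label, children) tuples with plain strings as leaves, and counts each leaf as height 1 so that the three cases in the rule fall out of a single expression.

```python
def height(tree):
    """Height of a (label, children) tree; leaves are plain strings."""
    _label, children = tree
    tallest_child = 0
    for child in children:
        # A leaf contributes 1; a subtree contributes its own height.
        child_height = height(child) if isinstance(child, tuple) else 1
        tallest_child = max(tallest_child, child_height)
    return 1 + tallest_child


t = ("S", [
    ("NP", [("D", ["the"]), ("N", ["dog"])]),
    ("VP", [("V", ["chased"]),
            ("NP", [("D", ["the"]), ("N", ["cat"])])]),
])

total = height(t)
print(total)  # 5
```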
- label()[source]¶
Return the node label of the tree.
>>> t = Tree.fromstring('(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))')
>>> t.label()
'S'
- Returns
the node label (typically a string)
- Return type
any
- leaf_treeposition(index)[source]¶
- Returns
The tree position of the index-th leaf in this tree. I.e., if tp=self.leaf_treeposition(i), then self[tp]==self.leaves()[i].
- Raises
IndexError – If this tree contains fewer than index+1 leaves, or if index<0.
- leaves()[source]¶
Return the leaves of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.leaves()
['the', 'dog', 'chased', 'the', 'cat']
- Returns
a list containing this tree’s leaves. The order reflects the order of the leaves in the tree’s hierarchical structure.
- Return type
list
- property node¶
Deprecated property for accessing the node value; use the label() method instead.
- pformat(margin=70, indent=0, nodesep='', parens='()', quotes=False)[source]¶
- Returns
A pretty-printed string representation of this tree.
- Return type
str
- Parameters
margin (int) – The right margin at which to do line-wrapping.
indent (int) – The indentation level at which printing begins. This number is used to decide how far to indent subsequent lines.
nodesep – A string that is used to separate the node from the children. E.g., the value ':' gives trees like (S: (NP: I) (VP: (V: saw) (NP: it))). The default is the empty string.
- pformat_latex_qtree()[source]¶
Returns a representation of the tree compatible with the LaTeX qtree package. This consists of the string \Tree followed by the tree represented in bracketed notation.
For example, the following result was generated from a parse tree of the sentence The announcement astounded us:
\Tree [.I'' [.N'' [.D The ] [.N' [.N announcement ] ] ] [.I' [.V'' [.V' [.V astounded ] [.N'' [.N' [.N us ] ] ] ] ] ] ]
See https://www.ling.upenn.edu/advice/latex.html for the LaTeX style file for the qtree package.
- Returns
A latex qtree representation of this tree.
- Return type
str
- pos()[source]¶
Return a sequence of pos-tagged words extracted from the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.pos()
[('the', 'D'), ('dog', 'N'), ('chased', 'V'), ('the', 'D'), ('cat', 'N')]
- Returns
a list of tuples containing leaves and pre-terminals (part-of-speech tags). The order reflects the order of the leaves in the tree’s hierarchical structure.
- Return type
list(tuple)
- pretty_print(sentence=None, highlight=(), stream=None, **kwargs)[source]¶
Pretty-print this tree as ASCII or Unicode art. For explanation of the arguments, see the documentation for nltk.tree.prettyprinter.TreePrettyPrinter.
- productions()[source]¶
Generate the productions that correspond to the non-terminal nodes of the tree. For each subtree of the form (P: C1 C2 … Cn) this produces a production of the form P -> C1 C2 … Cn.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.productions()
[S -> NP VP, NP -> D N, D -> 'the', N -> 'dog', VP -> V NP, V -> 'chased', NP -> D N, D -> 'the', N -> 'cat']
- Return type
list(Production)
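The mapping from subtrees to productions is a preorder walk in which each node emits its label and the labels (or leaf values) of its children. The sketch below is illustrative only: it assumes (label, children) tuples with string leaves and renders productions as plain strings instead of Production objects. The function name is hypothetical.

```python
def production_strings(tree):
    """Yield 'P -> C1 C2 ... Cn' for every non-terminal node, in preorder."""
    label, children = tree
    # Leaves are quoted to tell them apart from non-terminal labels.
    rhs = [repr(c) if isinstance(c, str) else c[0] for c in children]
    yield f"{label} -> {' '.join(rhs)}"
    for child in children:
        if not isinstance(child, str):
            yield from production_strings(child)


t = ("S", [
    ("NP", [("D", ["the"]), ("N", ["dog"])]),
    ("VP", [("V", ["chased"])]),
])

rules = list(production_strings(t))
for rule in rules:
    print(rule)
```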
- set_label(label)[source]¶
Set the node label of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.set_label("T")
>>> print(t)
(T (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))
- Parameters
label (any) – the node label (typically a string)
- subtrees(filter=None)[source]¶
Generate all the subtrees of this tree, optionally restricted to trees matching the filter function.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> for s in t.subtrees(lambda t: t.height() == 2):
...     print(s)
(D the)
(N dog)
(V chased)
(D the)
(N cat)
- Parameters
filter (function) – the function to filter all local trees
- treeposition_spanning_leaves(start, end)[source]¶
- Returns
The tree position of the lowest descendant of this tree that dominates self.leaves()[start:end].
- Raises
ValueError – if end <= start
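One way to find the lowest node dominating a span of leaves is to take the longest common prefix of the tree positions of the first and last leaf in the span. The sketch below follows that approach under stated assumptions: trees are nested lists with labels omitted, any non-list value is a leaf, and both function names are hypothetical.

```python
def leaf_positions(tree, prefix=()):
    """Return the tree position of every leaf, left to right."""
    positions = []
    for i, child in enumerate(tree):
        if isinstance(child, list):
            positions.extend(leaf_positions(child, prefix + (i,)))
        else:
            positions.append(prefix + (i,))
    return positions


def spanning_position(tree, start, end):
    """Position of the lowest node that dominates leaves[start:end]."""
    if end <= start:
        raise ValueError("end must be greater than start")
    positions = leaf_positions(tree)
    first, last = positions[start], positions[end - 1]
    common = []
    for a, b in zip(first, last):
        if a != b:
            break
        common.append(a)
    return tuple(common)


# Shape of (S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))
tree = [[["the"], ["dog"]], [["chased"], [["the"], ["cat"]]]]

subject = spanning_position(tree, 0, 2)   # "the dog"
obj = spanning_position(tree, 3, 5)       # "the cat"
print(subject, obj)  # (0,) (1, 1)
```

A span that crosses the subject and the verb phrase, such as leaves 1 to 3, is dominated only by the root, so the result is the empty position.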
- treepositions(order='preorder')[source]¶
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.treepositions()
[(), (0,), (0, 0), (0, 0, 0), (0, 1), (0, 1, 0), (1,), (1, 0), (1, 0, 0), ...]
>>> for pos in t.treepositions('leaves'):
...     t[pos] = t[pos][::-1].upper()
>>> print(t)
(S (NP (D EHT) (N GOD)) (VP (V DESAHC) (NP (D EHT) (N TAC))))
- Parameters
order – One of: preorder, postorder, bothorder, leaves.
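The four orderings differ only in when a node's own position is emitted relative to its children. The sketch below is a stand-alone illustration, not NLTK's code: it assumes nested lists with labels omitted, treats any non-list value as a leaf, and uses a hypothetical function name.

```python
def positions(tree, order="preorder", prefix=()):
    """Enumerate tree positions of a nested-list tree in the given order."""
    result = []
    if order in ("preorder", "bothorder"):
        result.append(prefix)          # visit the node before its children
    for i, child in enumerate(tree):
        if isinstance(child, list):
            result.extend(positions(child, order, prefix + (i,)))
        else:
            result.append(prefix + (i,))  # leaves are always included
    if order in ("postorder", "bothorder"):
        result.append(prefix)          # visit the node after its children
    return result


# Shape of (NP (D the) (N dog)), labels omitted.
tree = [["the"], ["dog"]]

pre = positions(tree, "preorder")
post = positions(tree, "postorder")
leaves = positions(tree, "leaves")
print(pre)  # [(), (0,), (0, 0), (1,), (1, 0)]
```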
- un_chomsky_normal_form(expandUnary=True, childChar='|', parentChar='^', unaryChar='+')[source]¶
This method modifies the tree in three ways:
Transforms a tree in Chomsky Normal Form back to its original structure (branching greater than two)
Removes any parent annotation (if it exists)
(optional) expands unary subtrees (if previously collapsed with collapse_unary(…))
- Parameters
expandUnary (bool) – Flag to expand unary or not (default = True)
childChar (str) – A string separating the head node from its children in an artificial node (default = “|”)
parentChar (str) – A string separating the node label from its parent annotation (default = “^”)
unaryChar (str) – A string joining two non-terminals in a unary production (default = “+”)
- class nltk.tree.TreePrettyPrinter[source]¶
Bases: object
Pretty-print a tree in text format, either as ASCII or Unicode. The tree can be a normal tree, or discontinuous.
TreePrettyPrinter(tree, sentence=None, highlight=()) creates an object from which different visualizations can be created.
- Parameters
tree – a Tree object.
sentence – a list of words (strings). If sentence is given, tree must contain integers as leaves, which are taken as indices in sentence. Using this you can display a discontinuous tree.
highlight – Optionally, a sequence of Tree objects in tree which should be highlighted. Has the effect of only applying colors to nodes in this sequence (nodes should be given as Tree objects, terminals as indices).
>>> from nltk.tree import Tree, TreePrettyPrinter
>>> tree = Tree.fromstring('(S (NP Mary) (VP walks))')
>>> print(TreePrettyPrinter(tree).text())
      S
  ____|____
 NP        VP
 |         |
Mary     walks
- static nodecoords(tree, sentence, highlight)[source]¶
Produce coordinates of nodes on a grid.
Objective:
- Produce coordinates for a non-overlapping placement of nodes and horizontal lines.
- Order edges so that crossing edges cross a minimal number of previous horizontal lines (never vertical lines).
Approach:
bottom up level order traversal (start at terminals)
at each level, identify nodes which cannot be on the same row
identify nodes which cannot be in the same column
place nodes into a grid at (row, column)
order child-parent edges with crossing edges last
Coordinates are (row, column); the origin (0, 0) is at the top left; the root node is on row 0. Coordinates do not consider the size of a node (which depends on font, &c), so the width of a column of the grid should be automatically determined by the element with the greatest width in that column. Alternatively, the integer coordinates could be converted to coordinates in which the distances between adjacent nodes are non-uniform.
Produces tuple (nodes, coords, edges, highlighted) where:
nodes[id]: Tree object for the node with this integer id
coords[id]: (n, m) coordinate where to draw node with id in the grid
edges[id]: parent id of node with this id (ordered dictionary)
highlighted: set of ids that should be highlighted
- svg(nodecolor='blue', leafcolor='red', funccolor='green')[source]¶
- Returns
SVG representation of a tree.
- text(nodedist=1, unicodelines=False, html=False, ansi=False, nodecolor='blue', leafcolor='red', funccolor='green', abbreviate=None, maxwidth=16)[source]¶
- Returns
ASCII art for a discontinuous tree.
- Parameters
unicodelines – whether to use Unicode line drawing characters instead of plain (7-bit) ASCII.
html – whether to wrap output in html code (default plain text).
ansi – whether to produce colors with ANSI escape sequences (only effective when html==False).
leafcolor, nodecolor – specify colors of leaves and phrasal nodes; effective when either html or ansi is True.
abbreviate – if True, abbreviate labels longer than 5 characters. If an integer, abbreviate labels longer than that many characters.
maxwidth – maximum number of characters before a label starts to wrap; pass None to disable.
- nltk.tree.chomsky_normal_form(tree, factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')[source]¶
- nltk.tree.collapse_unary(tree, collapsePOS=False, collapseRoot=False, joinChar='+')[source]¶
Collapse subtrees with a single child (i.e., unary productions) into a new non-terminal (Tree node) joined by ‘joinChar’. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would lose useful information. The Tree is modified in place and no value is returned.
- Parameters
tree (Tree) – The Tree to be collapsed
collapsePOS (bool) – ‘False’ (default) will not collapse the parent of leaf nodes (i.e., Part-of-Speech tags) since they are always unary productions
collapseRoot (bool) – ‘False’ (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
joinChar (str) – A string used to connect collapsed node values (default = “+”)
- nltk.tree.sinica_parse(s)[source]¶
Parse a Sinica Treebank string and return a tree. Trees are represented as nested bracketings, as shown in the following example (X represents a Chinese character): S(goal:NP(Head:Nep:XX)|theme:NP(Head:Nhaa:X)|quantity:Dab:X|Head:VL2:X)#0(PERIODCATEGORY)
- Returns
A tree corresponding to the string representation.
- Return type
Tree
- Parameters
s (str) – The string to be converted