nltk.tokenize.api module¶
Tokenizer Interface
- class nltk.tokenize.api.StringTokenizer[source]¶
Bases:
TokenizerI
A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).
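A minimal standalone sketch of this pattern, assuming NLTK's convention that each subclass sets a class-level split string (the `_string` attribute name mirrors NLTK's source, but this simplified version is illustrative, not the library's actual code):

```python
class StringTokenizer:
    """Divide a string into substrings by splitting on a fixed string."""

    _string = None  # subclasses define the string to split on

    def tokenize(self, s):
        # Delegate to str.split using the subclass-defined separator.
        return s.split(self._string)


class TabTokenizer(StringTokenizer):
    # A concrete subclass: split the input on tab characters.
    _string = "\t"


print(TabTokenizer().tokenize("alpha\tbeta\tgamma"))
# ['alpha', 'beta', 'gamma']
```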
- class nltk.tokenize.api.TokenizerI[source]¶
Bases:
ABC
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
- span_tokenize(s: str) → Iterator[Tuple[int, int]][source]¶
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Parameters
s (str) –
- Return type
Iterator[Tuple[int, int]]
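To make the offset convention concrete, here is a self-contained sketch of a span_tokenize() implementation for whitespace-separated tokens (WhitespaceSpanTokenizer is a hypothetical name for illustration, not an NLTK class):

```python
import re


class WhitespaceSpanTokenizer:
    def span_tokenize(self, s):
        # Yield (start_i, end_i) offsets such that s[start_i:end_i]
        # is the corresponding token.
        for m in re.finditer(r"\S+", s):
            yield m.span()

    def tokenize(self, s):
        # The tokens themselves can always be recovered from the spans.
        return [s[start:end] for start, end in self.span_tokenize(s)]


s = "Good muffins cost $3.88"
spans = list(WhitespaceSpanTokenizer().span_tokenize(s))
print(spans)  # [(0, 4), (5, 12), (13, 17), (18, 23)]
print([s[a:b] for a, b in spans])  # ['Good', 'muffins', 'cost', '$3.88']
```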
- span_tokenize_sents(strings: List[str]) Iterator[List[Tuple[int, int]]] [source]¶
Apply
self.span_tokenize()
to each element ofstrings
. I.e.:return [self.span_tokenize(s) for s in strings]
- Yield
List[Tuple[int, int]]
- Parameters
strings (List[str]) –
- Return type
Iterator[List[Tuple[int, int]]]
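The listed equivalence can be sketched with plain functions (span_tokenize below is a hypothetical whitespace-based implementation, used only to show the per-string mapping; it is not NLTK's source):

```python
import re


def span_tokenize(s):
    # Hypothetical span_tokenize: offsets of whitespace-separated tokens.
    return [m.span() for m in re.finditer(r"\S+", s)]


def span_tokenize_sents(strings):
    # Behaves like: return [span_tokenize(s) for s in strings],
    # but yields lazily, one span list per input string.
    for s in strings:
        yield span_tokenize(s)


result = list(span_tokenize_sents(["a bc", "def"]))
print(result)  # [[(0, 1), (2, 4)], [(0, 3)]]
```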