nltk.test.unit.test_tokenize module
Unit tests for nltk.tokenize.
See also nltk/test/tokenize.doctest
-
nltk.test.unit.test_tokenize.load_stanford_segmenter()[source]
-
class nltk.test.unit.test_tokenize.TestTokenize[source]
Bases: object
Test TweetTokenizer using words with special and accented characters.
-
test_tweet_tokenizer_expanded(test_input: str, expecteds: Tuple[List[str], List[str]])[source]
Test match_phone_numbers in TweetTokenizer.
- Note that TweetTokenizer is also passed the following for these tests:
strip_handles=True
reduce_len=True
- Parameters
test_input (str) – The input string to tokenize using TweetTokenizer.
expecteds (Tuple[List[str], List[str]]) – A 2-tuple of tokenized sentences. The first of the two
tokenized lists is the expected output of tokenization with match_phone_numbers=True.
The second of the two tokenized lists is the expected output of tokenization
with match_phone_numbers=False.
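For context, a minimal sketch of the configuration these tests exercise; the sample string is hypothetical, not taken from the test data:
>>> from nltk.tokenize import TweetTokenizer
>>> text = "@handle caaaaalll me at 123-456-7890!"
>>> TweetTokenizer(strip_handles=True, reduce_len=True,
...                match_phone_numbers=True).tokenize(text)
>>> TweetTokenizer(strip_handles=True, reduce_len=True,
...                match_phone_numbers=False).tokenize(text)
With match_phone_numbers=True the number should survive as a single token; with False it is split into smaller pieces.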
-
test_sonority_sequencing_syllable_tokenizer()[source]
Test the sonority sequencing SyllableTokenizer.
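For reference, a doctest-style sketch of the tokenizer under test, adapted from the NLTK documentation:
>>> from nltk.tokenize import SyllableTokenizer
>>> SSP = SyllableTokenizer()
>>> SSP.tokenize("justification")
['jus', 'ti', 'fi', 'ca', 'tion']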
-
test_syllable_tokenizer_numbers()[source]
Test the SyllableTokenizer on tokens that contain numbers.
-
test_legality_principle_syllable_tokenizer()[source]
Test the LegalitySyllableTokenizer.
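A sketch of the tokenizer under test; unlike SyllableTokenizer, it must be initialized with a tokenized source vocabulary (here the NLTK words corpus, which assumes nltk.download('words') has been run):
>>> from nltk.corpus import words
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> LP = LegalitySyllableTokenizer(words.words())
>>> LP.tokenize("wonderful")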
-
test_stanford_segmenter_arabic()[source]
Test the Stanford Word Segmenter for Arabic (default config).
-
test_stanford_segmenter_chinese()[source]
Test the Stanford Word Segmenter for Chinese (default config).
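Both Stanford segmenter tests follow the same pattern; a sketch assuming Java and the Stanford segmenter jars/models are installed and discoverable via CLASSPATH (the sample sentence is illustrative):
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config("zh")  # or "ar" for Arabic
>>> seg.segment("这是斯坦福中文分词器测试".split())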
-
test_phone_tokenizer()[source]
Test a string that resembles a phone number but contains a newline.
-
test_emoji_tokenizer()[source]
Test a string that contains emoji ZWJ sequences and skin tone modifiers.
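An illustrative sketch, assuming the test exercises TweetTokenizer; a ZWJ sequence such as the family emoji consists of several code points joined by U+200D and should come out as a single token:
>>> from nltk.tokenize import TweetTokenizer
>>> TweetTokenizer().tokenize("family: 👨‍👩‍👧‍👦 wave: 👋🏾")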
-
test_pad_asterisk()[source]
Test padding of asterisks for word tokenization.
-
test_pad_dotdot()[source]
Test padding of multi-dot sequences (dotdot, dotdotdot, etc.) for word tokenization.
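This and the preceding asterisk test check that word_tokenize pads such character runs with spaces so they become separate tokens; an illustrative sketch with hypothetical sample strings:
>>> from nltk import word_tokenize
>>> word_tokenize("This is a *weird sentence")
>>> word_tokenize("Why did dotdot.. not get tokenized but dotdotdot... did?")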
-
test_remove_handle()[source]
Test remove_handles() from casual.py with specially crafted edge cases.
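For reference, the helper lives in nltk.tokenize.casual; a minimal sketch with a made-up handle:
>>> from nltk.tokenize.casual import remove_handles
>>> remove_handles("@_username_123 thanks!")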
-
test_treebank_span_tokenizer()[source]
Test the TreebankWordTokenizer.span_tokenize() method.
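A short sketch of the API under test; span_tokenize yields (start, end) character offsets into the original string rather than the tokens themselves:
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = "Good muffins cost $3.88."
>>> spans = list(TreebankWordTokenizer().span_tokenize(s))
>>> [s[start:end] for start, end in spans]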
-
test_word_tokenize()[source]
Test the word_tokenize() function.
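A doctest-style sketch, adapted from the NLTK documentation (assumes the punkt model has been downloaded, e.g. via nltk.download('punkt')):
>>> from nltk import word_tokenize
>>> word_tokenize("They'll save and invest more.")
['They', "'ll", 'save', 'and', 'invest', 'more', '.']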
-
test_punkt_pair_iter()[source]
-
test_punkt_pair_iter_handles_stop_iteration_exception()[source]
-
test_punkt_tokenize_words_handles_stop_iteration_exception()[source]
-
test_punkt_tokenize_custom_lang_vars()[source]
-
test_punkt_tokenize_no_custom_lang_vars()[source]
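For context, a sketch of the custom-language-variables mechanism these punkt tests exercise; the class name and the extra sentence terminator are illustrative:
>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class CustomLanguageVars(PunktLanguageVars):
...     sent_end_chars = (".", "?", "!", "।")  # add the Devanagari danda
>>> tokenizer = PunktSentenceTokenizer(lang_vars=CustomLanguageVars())
>>> tokenizer.tokenize("First sentence। Second sentence।")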
-
punkt_debug_decisions(input_text, n_sents, n_splits, lang_vars=None)[source]
-
test_punkt_debug_decisions_custom_end()[source]
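A sketch of the debug API exercised here; debug_decisions yields one dict per candidate sentence break (field names such as 'text' and 'break_decision' reflect recent NLTK versions):
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()
>>> for decision in tokenizer.debug_decisions("One sentence. Two sentences."):
...     print(decision["text"], decision["break_decision"])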
-
test_sent_tokenize(sentences: str, expected: List[str])[source]
- Parameters
sentences (str) – The input text to split into sentences.
expected (List[str]) – The expected list of tokenized sentences.
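A minimal sketch of the function under test (assumes the punkt model has been downloaded):
>>> from nltk import sent_tokenize
>>> sent_tokenize("Good muffins cost $3.88 in New York. Please buy me two of them.")
['Good muffins cost $3.88 in New York.', 'Please buy me two of them.']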