fastNLP.modules.tokenizer package

class fastNLP.modules.tokenizer.BertTokenizer(vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, never_split=('[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'))[source]

Bases: object

Aliases: fastNLP.modules.BertTokenizer, fastNLP.modules.tokenizer.bert_tokenizer.BertTokenizer

Runs end-to-end tokenization: punctuation splitting followed by wordpiece.

__init__(vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, never_split=('[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'))[source]

Constructs a BertTokenizer.

Args:

  • vocab_file: Path to a one-wordpiece-per-line vocabulary file.

  • do_lower_case: Whether to lower case the input. Only has an effect when do_wordpiece_only=False.

  • do_basic_tokenize: Whether to do basic tokenization before wordpiece.

  • max_len: An artificial maximum length to truncate tokenized sequences to; the effective maximum length is always the minimum of this value (if specified) and the underlying BERT model's sequence length.

  • never_split: List of tokens which will never be split during tokenization. Only has an effect when do_wordpiece_only=False.
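
A minimal construction sketch, assuming a local one-wordpiece-per-line vocabulary at the hypothetical path ./bert_vocab.txt; in practice the vocabulary usually comes from a pretrained model via from_pretrained (documented below):

>>> from fastNLP.modules import BertTokenizer
>>> tokenizer = BertTokenizer('./bert_vocab.txt', do_lower_case=True)  # hypothetical local vocab file
>>> print(tokenizer.convert_tokens_to_ids(['[CLS]', 'hello', '[SEP]']))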

convert_tokens_to_ids(tokens)[source]

Converts a sequence of tokens into ids using the vocab.

convert_ids_to_tokens(ids)[source]

Converts a sequence of token ids back into tokens (a sentence).
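
A round-trip sketch using the two conversion methods above; the 'en' model name follows the encode Example below, and the actual ids depend on the loaded vocabulary:

>>> from fastNLP.modules import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('en')
>>> ids = tokenizer.convert_tokens_to_ids(['this', 'is', 'a', 'demo'])
>>> print(tokenizer.convert_ids_to_tokens(ids))  # expected to recover the original tokens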

save_vocabulary(vocab_path)[source]

Save the tokenizer vocabulary to a directory or file.
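
A sketch of persisting the vocabulary, assuming a writable hypothetical directory ./saved_vocab; the exact file name written there is an implementation detail:

>>> import os
>>> from fastNLP.modules import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('en')
>>> os.makedirs('./saved_vocab', exist_ok=True)  # hypothetical target directory
>>> tokenizer.save_vocabulary('./saved_vocab')   # writes the one-wordpiece-per-line vocab file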

classmethod from_pretrained(model_dir_or_name, *inputs, **kwargs)[source]

Given a model name or path, load the vocabulary directly.

encode(text, add_special_tokens=True)[source]

Encodes the given text into a sequence of indices.

Example:

>>> from fastNLP.modules import BertTokenizer
>>> bert_tokenizer = BertTokenizer.from_pretrained('en')
>>> print(bert_tokenizer.encode('from'))
>>> print(bert_tokenizer.encode("This is a demo sentence"))
>>> print(bert_tokenizer.encode(["This", "is", 'a']))
Parameters
  • text (List[str], str) – a single input, treated as one sentence.

  • add_special_tokens (bool) – whether to ensure the sequence starts with [CLS] and ends with [SEP].
Returns – a list of token indices.
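
Continuing the Example above, a hedged sketch of the add_special_tokens flag; with the default True the returned ids are expected to start and end with the [CLS] and [SEP] ids (exact values depend on the pretrained vocabulary):

>>> ids_with = bert_tokenizer.encode("This is a demo sentence", add_special_tokens=True)
>>> ids_without = bert_tokenizer.encode("This is a demo sentence", add_special_tokens=False)
>>> print(len(ids_with) - len(ids_without))            # expected 2: one [CLS] plus one [SEP]
>>> print(bert_tokenizer.convert_ids_to_tokens(ids_with))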

class fastNLP.modules.tokenizer.GPT2Tokenizer(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', **kwargs)[source]

Bases: object

Aliases: fastNLP.modules.GPT2Tokenizer, fastNLP.modules.tokenizer.GPT2Tokenizer

GPT-2 BPE tokenizer. Peculiarities:
  • Byte-level Byte-Pair-Encoding

  • Requires a space to start the input string => the encode and tokenize methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer's encode, decode, and tokenize methods will not preserve spaces at the beginning of a string: tokenizer.decode(tokenizer.encode(" Hello")) = "Hello"
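
A hedged illustration of the leading-space caveat, using the tokenize method documented below; the exact sub-word strings depend on the BPE merges of the loaded model (the 'en' name follows the encode Example at the end of this class):

>>> from fastNLP.modules import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> print(tokenizer.tokenize("Hello world", add_prefix_space=True))   # first word carries the byte-level space marker
>>> print(tokenizer.tokenize("Hello world", add_prefix_space=False))  # first word tokenized without it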

SPECIAL_TOKENS_ATTRIBUTES = ['bos_token', 'eos_token', 'unk_token', 'pad_token', 'cls_token', 'mask_token', 'sep_token']
padding_side = 'right'
property bos_token

Beginning of sentence token (string). Log an error if used while not having been set.

property eos_token

End of sentence token (string). Log an error if used while not having been set.

property unk_token

Unknown token (string). Log an error if used while not having been set.

property pad_token

Padding token (string). Log an error if used while not having been set.

property cls_token

Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set.

property mask_token

Mask token (string). E.g. when training a model with masked-language modeling. Log an error if used while not having been set.

property bos_index

Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set.

property eos_index

Id of the end of sentence token in the vocabulary. Log an error if used while not having been set.

property unk_index

Id of the unknown token in the vocabulary. Log an error if used while not having been set.

property pad_index

Id of the padding token in the vocabulary. Log an error if used while not having been set.

property pad_token_type_id

Id of the padding token type in the vocabulary.

property cls_index

Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set.

property mask_index

Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set.

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (string) into a single string.

save_vocabulary(save_directory)[source]

Save the tokenizer vocabulary and merge files to a directory.

classmethod from_pretrained(model_dir_or_name)[source]

tokenize(text, add_prefix_space=True)[source]

Converts a string into a sequence of tokens (string), using the tokenizer. Splits into words for a word-based vocabulary or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece).

Takes care of added tokens.

Args:

  • text: The sequence to be encoded.

  • add_prefix_space (boolean, default True):

    Begin the text with at least one space so that the first word is tokenized the same way as a word in the middle of a sentence, matching how the GPT-2 (and RoBERTa) byte-level BPE treats space-prefixed tokens.
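
A small sketch pairing tokenize with convert_tokens_to_string (documented above); the reconstructed string should match the input up to the prefix-space handling noted in the class description:

>>> from fastNLP.modules import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> tokens = tokenizer.tokenize("This is a demo sentence", add_prefix_space=True)
>>> print(tokens)
>>> print(tokenizer.convert_tokens_to_string(tokens))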

convert_tokens_to_ids(tokens)[source]

Converts a single token, or a sequence of tokens (str), into a single integer id (resp. a sequence of ids), using the vocabulary.

convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts a single index, or a sequence of indices (integers), into a token (resp. a sequence of tokens) (str), using the vocabulary and added tokens.

Args:

skip_special_tokens: Don’t decode special tokens (self.all_special_tokens). Default: False
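
A round-trip sketch between the two conversion methods; actual ids depend on the loaded BPE vocabulary:

>>> from fastNLP.modules import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> tokens = tokenizer.tokenize("A demo sentence", add_prefix_space=True)
>>> ids = tokenizer.convert_tokens_to_ids(tokens)
>>> print(tokenizer.convert_ids_to_tokens(ids, skip_special_tokens=False))  # expected to reproduce tokens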

convert_id_to_tokens(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True)[source]

Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces. Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

Args:

  • token_ids: list of tokenized input ids. Can be obtained using the encode or encode_plus methods.

  • skip_special_tokens: if set to True, special tokens are removed from the output.

  • clean_up_tokenization_spaces: if set to True, will clean up the tokenization spaces.
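
A sketch of decoding ids produced by encode back into a string with this method; whether a leading space reappears depends on add_prefix_space, as noted in the class description:

>>> from fastNLP.modules import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> ids = tokenizer.encode("This is a demo sentence", add_prefix_space=True)
>>> print(tokenizer.convert_id_to_tokens(ids, skip_special_tokens=False, clean_up_tokenization_spaces=True))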

property special_tokens_map

A dictionary mapping special token class attribute (cls_token, unk_token…) to their values (‘<unk>’, ‘<cls>’…)

property all_special_tokens

List all the special tokens (‘<unk>’, ‘<cls>’…) mapped to class attributes (cls_token, unk_token…).

property all_special_ids

List the vocabulary indices of the special tokens (‘<unk>’, ‘<cls>’…) mapped to class attributes (cls_token, unk_token…).
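
A quick sketch of the three special-token views above; for GPT-2 the bos/eos/unk tokens all default to '<|endoftext|>' per the constructor signature:

>>> from fastNLP.modules import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> print(tokenizer.special_tokens_map)    # e.g. maps 'unk_token' to '<|endoftext|>'
>>> print(tokenizer.all_special_tokens)
>>> print(tokenizer.all_special_ids)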

static clean_up_tokenization(out_string)[source]

Clean up a list of simple English tokenization artifacts like spaces before punctuation and abbreviated forms.
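
A hedged sketch of the static cleanup helper on a detokenized string containing the artifacts described above (spaces before punctuation and contractions); the exact set of rewrites is an implementation detail:

>>> from fastNLP.modules import GPT2Tokenizer
>>> print(GPT2Tokenizer.clean_up_tokenization("he 's here , is n't he ?"))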

encode(text, add_special_tokens=False, add_prefix_space=True)[source]

Encodes the given text into a sequence of indices.

Example:

>>> from fastNLP.modules import GPT2Tokenizer
>>> gpt2_tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> print(gpt2_tokenizer.encode('from'))
>>> print(gpt2_tokenizer.encode("This is a demo sentence"))
>>> print(gpt2_tokenizer.encode(["This", "is", 'a']))
Parameters
  • text (List[str], str) – a single input, treated as one sentence.

  • add_special_tokens (bool) – whether to ensure the sequence starts with cls and ends with sep. GPT-2 has no notion of cls and sep tokens.

Returns – a list of token indices.
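
As a follow-up to the Example above, a hedged sketch contrasting add_prefix_space on encode and mapping the ids back to tokens; actual ids depend on the BPE vocabulary:

>>> with_space = gpt2_tokenizer.encode("Hello world", add_prefix_space=True)
>>> without_space = gpt2_tokenizer.encode("Hello world", add_prefix_space=False)
>>> print(with_space == without_space)  # expected False: the first token differs
>>> print(gpt2_tokenizer.convert_ids_to_tokens(with_space))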