# fastNLP.modules.tokenizer.gpt2_tokenizer module¶

undocumented. The code on this page is largely based on (copy-pasted from) https://github.com/huggingface/pytorch-pretrained-BERT. If you find that this code…

class fastNLP.modules.tokenizer.gpt2_tokenizer.GPT2Tokenizer(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', **kwargs)[source]

GPT-2 BPE tokenizer. Peculiarities:
• Byte-level Byte-Pair-Encoding

• Requires a space to start the input string => the encode and tokenize methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer's encode, decode, and tokenize methods will not preserve spaces at the beginning of a string: tokenizer.decode(tokenizer.encode(" Hello")) == "Hello"
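The byte-level encoding can be illustrated with the byte-to-unicode mapping that GPT-2 BPE is known to use (a sketch mirroring the `bytes_to_unicode` helper in the huggingface code this module is adapted from; an illustration, not fastNLP's exact implementation):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character, so BPE can
    operate on arbitrary UTF-8 text without needing an <unk> byte."""
    # printable bytes keep their own character
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            # non-printable bytes are shifted into an unused unicode range
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
# a leading space (byte 0x20) becomes the visible marker 'Ġ'
encoded = "".join(byte_encoder[b] for b in " Hello".encode("utf-8"))
```

This is why GPT-2 token strings show `Ġ` where a space preceded the word.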

SPECIAL_TOKENS_ATTRIBUTES = ['bos_token', 'eos_token', 'unk_token', 'pad_token', 'cls_token', 'mask_token', 'sep_token']
padding_side = 'right'
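`padding_side = 'right'` means shorter sequences in a batch get pad ids appended at the end. A minimal sketch of the idea (the helper name `pad_batch` is hypothetical, not part of fastNLP's API):

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad lists of token ids to a common length (hypothetical helper)."""
    max_len = max(len(s) for s in sequences)
    out = []
    for s in sequences:
        padding = [pad_id] * (max_len - len(s))
        # 'right' appends padding; 'left' would prepend it
        out.append(s + padding if padding_side == "right" else padding + s)
    return out
```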
property bos_token

Beginning of sentence token (string). Log an error if used while not having been set.

property eos_token

End of sentence token (string). Log an error if used while not having been set.

property unk_token

Unknown token (string). Log an error if used while not having been set.

property pad_token

Padding token (string). Log an error if used while not having been set.

property cls_token

Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set.

property mask_token

Mask token (string). E.g. when training a model with masked-language modeling. Log an error if used while not having been set.

property bos_index

Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set.

property eos_index

Id of the end of sentence token in the vocabulary. Log an error if used while not having been set.

property unk_index

Id of the unknown token in the vocabulary. Log an error if used while not having been set.

property pad_index

Id of the padding token in the vocabulary. Log an error if used while not having been set.

property pad_token_type_id

Id of the padding token type in the vocabulary.

property cls_index

Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set.

property mask_index

Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set.

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings) into a single string.
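Because GPT-2 tokens are strings of byte-to-unicode characters, joining them back into text amounts to concatenating the tokens and inverting the byte map. A hedged sketch of that idea (not fastNLP's exact code):

```python
def bytes_to_unicode():
    """Standard GPT-2 byte-to-printable-character map (reproduced for
    self-containment)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

def convert_tokens_to_string(tokens):
    """Concatenate tokens, then map each character back to its byte."""
    byte_decoder = {c: b for b, c in bytes_to_unicode().items()}
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors="replace")
```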

save_vocabulary(save_directory)[source]

Save the tokenizer vocabulary and merge files to a directory.
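A GPT-2 vocabulary is conventionally stored as a `vocab.json` token-to-id map plus a `merges.txt` list of BPE merge rules. A sketch of saving such a pair (the file names follow the standard GPT-2 format and are an assumption about this implementation):

```python
import json
import os
import tempfile

def save_vocabulary(save_directory, encoder, bpe_ranks):
    """Write the token->id map and the rank-ordered merge rules to disk."""
    vocab_file = os.path.join(save_directory, "vocab.json")
    merges_file = os.path.join(save_directory, "merges.txt")
    with open(vocab_file, "w", encoding="utf-8") as f:
        json.dump(encoder, f, ensure_ascii=False)
    with open(merges_file, "w", encoding="utf-8") as f:
        f.write("#version: 0.2\n")
        # merges.txt lists pairs in merge-priority order
        for pair, _rank in sorted(bpe_ranks.items(), key=lambda kv: kv[1]):
            f.write(" ".join(pair) + "\n")
    return vocab_file, merges_file

with tempfile.TemporaryDirectory() as d:
    vf, mf = save_vocabulary(d, {"he": 0, "llo": 1}, {("h", "e"): 0})
    with open(vf, encoding="utf-8") as f:
        reloaded = json.load(f)
```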

classmethod from_pretrained(model_dir_or_name)[source]
tokenize(text, add_prefix_space=True)[source]

Converts a string into a sequence of tokens (strings), using the tokenizer. Splits into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece).

Takes care of added tokens. Args:

• text: The sequence to be encoded.

• add_prefix_space (boolean, default True):

Begin the input with at least one space, so that the first word is tokenized the same way as a word in the middle of a sentence (GPT-2 and RoBERTa byte-level BPE treat a leading space as part of the token).
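At the heart of tokenization is the BPE merge loop: start from characters and repeatedly merge the adjacent pair with the best (lowest) rank. A toy sketch of that algorithm, with a made-up merge table standing in for merges.txt (an illustration, not fastNLP's exact code):

```python
def bpe(token, bpe_ranks):
    """Greedy BPE: repeatedly merge the highest-priority adjacent pair."""
    word = list(token)
    while len(word) > 1:
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        # lowest rank = learned earliest = merged first
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if best not in bpe_ranks:
            break  # no mergeable pair left
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = merged
    return word

# toy merge table: "he" first, then "he"+"l", then "l"+"o"
toy_ranks = {("h", "e"): 0, ("he", "l"): 1, ("l", "o"): 2}
```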

convert_tokens_to_ids(tokens)[source]

Converts a single token (str) into a single integer id, or a sequence of tokens into a sequence of ids, using the vocabulary.
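The str-vs-list behavior can be sketched with a toy vocabulary (the real mapping comes from vocab.json; the `unk_id` default here is an assumption for illustration):

```python
def convert_tokens_to_ids(tokens, vocab, unk_id=0):
    """Vocabulary lookup: a lone str returns a lone id, a list returns a list."""
    if isinstance(tokens, str):
        return vocab.get(tokens, unk_id)
    return [vocab.get(t, unk_id) for t in tokens]

toy_vocab = {"Ġhello": 1, "Ġworld": 2}
```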

convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts a single index (integer) into a token (str), or a sequence of indices into a sequence of tokens, using the vocabulary and added tokens.

Args:

skip_special_tokens: Don’t decode special tokens (self.all_special_tokens). Default: False
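The inverse lookup, including the skip_special_tokens filter, can be sketched the same way (toy vocab; an illustration, not fastNLP's exact code):

```python
def convert_ids_to_tokens(ids, vocab, special_ids=(), skip_special_tokens=False):
    """Inverse vocabulary lookup; optionally drop special-token ids."""
    inv = {i: t for t, i in vocab.items()}
    if isinstance(ids, int):
        return inv[ids]
    return [inv[i] for i in ids
            if not (skip_special_tokens and i in special_ids)]

toy_vocab = {"<|endoftext|>": 0, "Ġhi": 1}
```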

convert_id_to_tokens(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True)[source]

Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. Similar to self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

Args:

• token_ids: list of tokenized input ids. Can be obtained using the encode or encode_plus methods.

• skip_special_tokens: if set to True, removes special tokens from the output.

• clean_up_tokenization_spaces: if set to True, cleans up the tokenization spaces.

property special_tokens_map

A dictionary mapping special token class attributes (cls_token, unk_token…) to their values ('<unk>', '<cls>'…).

property all_special_tokens

List all the special tokens (‘<unk>’, ‘<cls>’…) mapped to class attributes (cls_token, unk_token…).

property all_special_ids

List the vocabulary indices of the special tokens (‘<unk>’, ‘<cls>’…) mapped to class attributes (cls_token, unk_token…).

static clean_up_tokenization(out_string)[source]

Clean up a list of simple English tokenization artifacts, like spaces before punctuation and abbreviated forms.
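Typical cleanup rules of this kind can be sketched as a fixed replacement table (the exact rule list used here is an assumption; it follows common practice for English detokenization):

```python
def clean_up_tokenization(out_string):
    """Undo the extra spaces a naive token join leaves before punctuation
    and inside contractions (assumed rule list, for illustration)."""
    for before, after in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
                          (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
                          (" 's", "'s"), (" 've", "'ve"), (" 're", "'re")]:
        out_string = out_string.replace(before, after)
    return out_string
```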

encode(text, add_special_tokens=False, add_prefix_space=True)[source]

Example:

>>> from fastNLP.modules import GPT2Tokenizer
>>> gpt2_tokenizer = GPT2Tokenizer.from_pretrained('en')
>>> print(gpt2_tokenizer.encode('from'))
>>> print(gpt2_tokenizer.encode("This is a demo sentence"))
>>> print(gpt2_tokenizer.encode(["This", "is", 'a']))


• text (List[str], str) – the input, treated as a single sentence.

• add_special_tokens (bool) – whether to ensure the sequence starts with cls and ends with sep. GPT-2 has no notion of cls or sep tokens.
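The effect of add_prefix_space can be illustrated with a toy word-level stand-in for BPE (GPT-2 folds a preceding space into the token, conventionally displayed as 'Ġ'; this sketch only illustrates the flag, not the real byte-level BPE):

```python
def toy_tokenize(text, add_prefix_space=True):
    """Mimic GPT-2's space handling: 'Ġ' marks a token preceded by a space."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    has_lead = text.startswith(" ")
    words = text.split()
    return [("Ġ" + w) if (i > 0 or has_lead) else w
            for i, w in enumerate(words)]
```

Without the prefix space, the first word gets no `Ġ` marker and is therefore a different token than it would be mid-sentence, which is the asymmetry the add_prefix_space flag exists to avoid.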