fastNLP.modules.tokenizer.bert_tokenizer module

class fastNLP.modules.tokenizer.bert_tokenizer.BertTokenizer(vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, never_split=('[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'))[source]

Bases: object

Aliases: fastNLP.modules.BertTokenizer, fastNLP.modules.tokenizer.bert_tokenizer.BertTokenizer

Runs end-to-end tokenization: punctuation splitting + wordpiece.

__init__(vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, never_split=('[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'))[source]

Constructs a BertTokenizer.

Args:

  vocab_file: Path to a one-wordpiece-per-line vocabulary file.

  do_lower_case: Whether to lower-case the input. Only has an effect when do_wordpiece_only=False.

  do_basic_tokenize: Whether to do basic tokenization before wordpiece.

  max_len: An artificial maximum length to truncate tokenized sequences to; the effective maximum length is always the minimum of this value (if specified) and the underlying BERT model's sequence length.

  never_split: List of tokens which will never be split during tokenization. Only has an effect when do_wordpiece_only=False.
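A minimal construction sketch (the vocabulary path is hypothetical; any one-wordpiece-per-line file works):

Example:

>>> from fastNLP.modules import BertTokenizer
>>> # 'my_vocab.txt' is a hypothetical one-wordpiece-per-line vocabulary file
>>> tokenizer = BertTokenizer('my_vocab.txt', do_lower_case=True)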

convert_tokens_to_ids(tokens)[source]

Converts a sequence of tokens into ids using the vocab.

convert_ids_to_tokens(ids)[source]

Converts a sequence of token ids back into a sentence (a sequence of tokens).
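A hedged round-trip sketch combining convert_tokens_to_ids with this method (the 'en' pretrained name is taken from the encode example below; the exact ids depend on the loaded vocab):

Example:

>>> from fastNLP.modules import BertTokenizer
>>> bert_tokenizer = BertTokenizer.from_pretrained('en')
>>> ids = bert_tokenizer.convert_tokens_to_ids(['[CLS]', 'this', 'is', 'a', 'demo', '[SEP]'])
>>> bert_tokenizer.convert_ids_to_tokens(ids)  # should recover the original tokens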

save_vocabulary(vocab_path)[source]

Save the tokenizer vocabulary to a directory or file.
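A brief usage sketch; per the description above, the target may be a directory or an explicit file path (both paths below are hypothetical):

Example:

>>> bert_tokenizer.save_vocabulary('./saved_model_dir')
>>> bert_tokenizer.save_vocabulary('./saved_model_dir/vocab.txt')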

classmethod from_pretrained(model_dir_or_name, *inputs, **kwargs)[source]

Given a model name or a model path, loads the vocab directly.
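Usage mirrors the encode example below; per the model_dir_or_name parameter, a local model directory should work as well as a registered name (the directory path here is hypothetical):

Example:

>>> from fastNLP.modules import BertTokenizer
>>> bert_tokenizer = BertTokenizer.from_pretrained('en')
>>> # or, from a hypothetical local directory containing the vocab
>>> bert_tokenizer = BertTokenizer.from_pretrained('/path/to/bert_dir')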

encode(text, add_special_tokens=True)[source]

Encodes the given text input into indices.

Example:

>>> from fastNLP.modules import BertTokenizer
>>> bert_tokenizer = BertTokenizer.from_pretrained('en')
>>> print(bert_tokenizer.encode('from'))
>>> print(bert_tokenizer.encode("This is a demo sentence"))
>>> print(bert_tokenizer.encode(["This", "is", 'a']))
Parameters
  • text (List[str], str) – the input, treated as a single sentence.

  • add_special_tokens (bool) – whether to ensure the sentence starts with [CLS] and ends with [SEP].

Returns
  The encoded indices.