fastNLP.io.pipe.matching module


class fastNLP.io.pipe.matching.MatchingBertPipe(lower=False, tokenizer: str = 'raw')

Bases: fastNLP.io.pipe.pipe.Pipe

Aliases: fastNLP.io.MatchingBertPipe, fastNLP.io.pipe.MatchingBertPipe

The BERT pipe for matching tasks. The processed DataSet will contain the following fields:

+-----------------------+---------------------------+--------+-----------------+---------+
| raw_words1            | raw_words2                | target | words           | seq_len |
+-----------------------+---------------------------+--------+-----------------+---------+
| The new rights are…   | Everyone really likes..   | 1      | [2, 3, 4, 5, …] | 10      |
| This site includes a… | The Government Executive… | 0      | [11, 12, 13,…]  | 5       |
| …                     | …                         | .      | […]             | .       |
+-----------------------+---------------------------+--------+-----------------+---------+

The words field is built by joining raw_words1 (the premise) and raw_words2 (the hypothesis) with "[SEP]" and converting the result to indices. The words field is set as input; the target field is set as both target and input (it is set as input so that the loss can be computed inside the forward function; if the loss is not computed there, this has no effect, since fastNLP passes arguments according to the parameter names of the forward function).
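The joining step described above can be sketched in plain Python. This is a conceptual illustration only; the word-to-index vocabulary below is a made-up toy, not the vocabulary fastNLP actually builds from the data:

```python
# Conceptual sketch of how the `words` field is formed:
# join premise and hypothesis with "[SEP]", then map tokens to indices.
# The vocabulary here is a toy assumption, not fastNLP's real one.
vocab = {"[SEP]": 1, "the": 2, "new": 3, "rights": 4, "everyone": 5}

def build_words(raw_words1, raw_words2):
    tokens = raw_words1 + ["[SEP]"] + raw_words2
    # unknown tokens map to index 0 here (fastNLP uses its own unk index)
    return [vocab.get(t, 0) for t in tokens]

words = build_words(["the", "new", "rights"], ["everyone"])
seq_len = len(words)  # seq_len is the length of the joined sequence
```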

The input/target settings of each field, as printed by the dataset's print_field_meta() function, are:

+-------------+------------+------------+--------+-------+---------+
| field_names | raw_words1 | raw_words2 | target | words | seq_len |
+-------------+------------+------------+--------+-------+---------+
|   is_input  |   False    |   False    | False  |  True |   True  |
|  is_target  |   False    |   False    |  True  | False |  False  |
| ignore_type |            |            | False  | False |  False  |
|  pad_value  |            |            |   0    |   0   |    0    |
+-------------+------------+------------+--------+-------+---------+
__init__(lower=False, tokenizer: str = 'raw')

Parameters
  • lower (bool) – whether to lowercase the words.

  • tokenizer (str) – which tokenizer to use to split sentences into words. Supports spacy and raw; raw splits on whitespace.

process(data_bundle)

The datasets in the input data_bundle must have the following structure:

+-------------------------------------+---------------------------+----------------+
| raw_words1                          | raw_words2                | target         |
+-------------------------------------+---------------------------+----------------+
| Dana Reeve, the widow of the actor… | Christopher Reeve had an… | not_entailment |
+-------------------------------------+---------------------------+----------------+

Parameters

data_bundle

Returns
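As the table suggests, the incoming target column holds string labels, while the processed DataSet carries integer indices. The conversion can be sketched as follows; the concrete index assigned to each label depends on the label vocabulary fastNLP builds from the data, so the mapping below is a toy assumption:

```python
# Conceptual sketch of the target conversion performed during process():
# string labels become integer indices via a label vocabulary.
# The index assignment below is assumed for illustration; fastNLP
# derives its own mapping from the data.
label_vocab = {"entailment": 0, "not_entailment": 1}

def encode_targets(raw_targets):
    return [label_vocab[t] for t in raw_targets]

targets = encode_targets(["not_entailment", "entailment"])
```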

class fastNLP.io.pipe.matching.MatchingPipe(lower=False, tokenizer: str = 'raw')

Bases: fastNLP.io.pipe.pipe.Pipe

Aliases: fastNLP.io.MatchingPipe, fastNLP.io.pipe.MatchingPipe

The Pipe for matching tasks. The processed DataSet will contain the following fields:

+-----------------------+---------------------------+--------+-----------------+-------------+----------+----------+
| raw_words1            | raw_words2                | target | words1          | words2      | seq_len1 | seq_len2 |
+-----------------------+---------------------------+--------+-----------------+-------------+----------+----------+
| The new rights are…   | Everyone really likes..   | 1      | [2, 3, 4, 5, …] | [10, 20, 6] | 10       | 13       |
| This site includes a… | The Government Executive… | 0      | [11, 12, 13,…]  | [2, 7, …]   | 6        | 7        |
| …                     | …                         | .      | […]             | […]         | .        | .        |
+-----------------------+---------------------------+--------+-----------------+-------------+----------+----------+

words1 is the premise and words2 is the hypothesis. words1, words2, seq_len1, and seq_len2 are set as input; target is set as both target and input (it is set as input so that the loss can be computed inside the forward function; if the loss is not computed there, this has no effect, since fastNLP passes arguments according to the parameter names of the forward function).

The input/target settings of each field, as printed by the dataset's print_field_meta() function, are:

+-------------+------------+------------+--------+--------+--------+----------+----------+
| field_names | raw_words1 | raw_words2 | target | words1 | words2 | seq_len1 | seq_len2 |
+-------------+------------+------------+--------+--------+--------+----------+----------+
|   is_input  |   False    |   False    | False  |  True  |  True  |   True   |   True   |
|  is_target  |   False    |   False    |  True  | False  | False  |  False   |  False   |
| ignore_type |            |            | False  | False  | False  |  False   |  False   |
|  pad_value  |            |            |   0    |   0    |   0    |    0     |    0     |
+-------------+------------+------------+--------+--------+--------+----------+----------+
__init__(lower=False, tokenizer: str = 'raw')

Parameters
  • lower (bool) – whether to lowercase all raw_words.

  • tokenizer (str) – how to tokenize the raw data. Supports spacy and raw; spacy tokenizes with spaCy, raw splits on whitespace.
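The two tokenizer options can be sketched as follows. This is a simplified illustration, not fastNLP's actual implementation; the spacy branch assumes the spacy package and the en_core_web_sm model are installed:

```python
def tokenize(text, tokenizer="raw"):
    # "raw" simply splits on whitespace
    if tokenizer == "raw":
        return text.split()
    # "spacy" uses the spaCy tokenizer (model name is an assumption here)
    if tokenizer == "spacy":
        import spacy
        nlp = spacy.load("en_core_web_sm")
        return [tok.text for tok in nlp(text)]
    raise ValueError(f"unknown tokenizer: {tokenizer}")

tokens = tokenize("The new rights are good")
```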

process(data_bundle)

The DataSets in the accepted DataBundle must have the following fields (the target column may be absent):

+-----------------------+---------------------------+----------------+
| raw_words1            | raw_words2                | target         |
+-----------------------+---------------------------+----------------+
| The new rights are…   | Everyone really likes..   | entailment     |
| This site includes a… | The Government Executive… | not_entailment |
+-----------------------+---------------------------+----------------+

Parameters

data_bundle (DataBundle) – the data_bundle obtained from a loader, containing the raw data of the dataset.

Returns

data_bundle
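Unlike MatchingBertPipe, this pipe keeps the two sentences as separate index sequences, each with its own length field. The idea can be sketched in plain Python (the vocabulary is a toy assumption for illustration, not fastNLP's real one):

```python
# Conceptual sketch of MatchingPipe's output fields: premise and
# hypothesis are indexed separately, each with its own seq_len.
vocab = {"this": 11, "site": 12, "includes": 13, "the": 2, "government": 7}

def encode(tokens):
    # unknown tokens map to 0 here (fastNLP uses its own unk index)
    return [vocab.get(t, 0) for t in tokens]

words1 = encode(["this", "site", "includes"])  # premise
words2 = encode(["the", "government"])         # hypothesis
seq_len1, seq_len2 = len(words1), len(words2)
```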