Text Data
1. Prepare the Data

The goal of the imdb dataset is to predict the sentiment label of a movie review from its text.
The training set contains 20,000 movie review texts and the test set contains 5,000, with positive and negative reviews each making up half.
Preprocessing text data is fairly tedious: it involves word segmentation for Chinese (not needed in this example), building a vocabulary, converting tokens to ids, padding sequences, building a data pipeline, and so on.
In PyTorch, text data is usually preprocessed with torchtext or with a custom Dataset. torchtext is very powerful and can build datasets for NLP tasks such as text classification, sequence labeling, question answering, and machine translation.
torchtext's common APIs are listed below for reference; the text classification dataset in this example is then built with a hand-written pipeline (vocabulary, token encoding, padding) and a custom Dataset.
A more complete tutorial can be found in the Zhihu article 《pytorch学习笔记—Torchtext》.
An overview of common torchtext APIs (a minimal usage sketch follows the list):

torchtext.data.Example: represents one sample, holding its data and label.
torchtext.vocab.Vocab: the vocabulary; pre-trained word vectors can be loaded into it.
torchtext.data.Dataset: the dataset class; its __getitem__ returns an Example instance. torchtext.data.TabularDataset is a subclass of it.
torchtext.data.Field: defines how a field (a text field or a label field) is processed, covering the preprocessing done when creating an Example and some of the operations applied at batch time.
torchtext.data.Iterator: an iterator used to generate batches.
torchtext.datasets: contains common datasets.
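To make the list concrete, here is a minimal sketch of how these pieces fit together, assuming the legacy torchtext 0.x `torchtext.data` interface (these classes were moved or removed in newer torchtext releases); the file name is hypothetical and stands for a tab-separated label/text file like the one used below:

```python
import torchtext

# legacy torchtext (0.x) sketch: Field defines per-column processing,
# TabularDataset parses the tsv into Examples, Iterator batches them
TEXT = torchtext.data.Field(sequential=True, lower=True, fix_length=200)
LABEL = torchtext.data.LabelField()  # non-sequential field for the class label

ds = torchtext.data.TabularDataset(
    path='train.tsv', format='tsv',  # hypothetical path; lines are "label\ttext"
    fields=[('label', LABEL), ('text', TEXT)])

TEXT.build_vocab(ds, max_size=10000)  # vocabulary built from the data itself
LABEL.build_vocab(ds)
train_iter = torchtext.data.Iterator(ds, batch_size=20)
# each batch then exposes batch.text and batch.label tensors
```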
```python
import numpy as np
import pandas as pd
from collections import OrderedDict
import re, string

MAX_WORDS = 10000  # only consider the 10000 most frequent words
MAX_LEN = 200      # keep 200 words per sample
BATCH_SIZE = 20

train_data_path = '/home/kesci/input/data6936/data/imdb/train.tsv'
test_data_path = '/home/kesci/input/data6936/data/imdb/test.tsv'
train_token_path = '/home/kesci/input/data6936/data/imdb/train_token.tsv'
test_token_path = '/home/kesci/input/data6936/data/imdb/test_token.tsv'
train_samples_path = '/home/kesci/input/data6936/data/imdb/train_samples/'
test_samples_path = '/home/kesci/input/data6936/data/imdb/test_samples/'
```
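The tsv files are expected to hold one sample per line as `label<TAB>review text`, which is the layout the parsing code below relies on; a quick peek can confirm it:

```python
# peek at the raw format: each line should read "<label>\t<review text>"
with open(train_data_path, "r", encoding='utf-8') as f:
    print(f.readline()[:80])
```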
First, build the vocabulary from the training set, keeping only the most frequent MAX_WORDS - 2 words (ids 0 and 1 are reserved, as the code below notes):

```python
word_count_dict = {}

# clean one piece of text: lowercase, drop newlines, strip <br /> tags, remove punctuation
def clean_text(text):
    lowercase = text.lower().replace("\n", " ")
    stripped_html = re.sub('<br />', ' ', lowercase)
    cleaned_punctuation = re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)
    return cleaned_punctuation

with open(train_data_path, "r", encoding='utf-8') as f:
    for line in f:
        label, text = line.split("\t")
        cleaned_text = clean_text(text)
        for word in cleaned_text.split(" "):
            word_count_dict[word] = word_count_dict.get(word, 0) + 1

df_word_dict = pd.DataFrame(pd.Series(word_count_dict, name="count"))
df_word_dict = df_word_dict.sort_values(by="count", ascending=False)

# keep the MAX_WORDS - 2 most frequent words; ids 0 and 1 are reserved
# (0 = unknown word, 1 = padding)
df_word_dict = df_word_dict[0:MAX_WORDS - 2]
df_word_dict["word_id"] = range(2, MAX_WORDS)
word_id_dict = df_word_dict["word_id"].to_dict()

df_word_dict.head(10)
```
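A quick sanity check on the resulting vocabulary (the example word is hypothetical; any frequent corpus token would do):

```python
print(len(word_id_dict))             # MAX_WORDS - 2 = 9998 entries, ids 2..9999
print(word_id_dict.get("movie", 0))  # id of a frequent word, or 0 if out of vocabulary
```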
Using the vocabulary just built, convert each text into a sequence of token ids.
```python
# pad on the left with the padding token 1, or keep only the last pad_length tokens
def pad(data_list, pad_length):
    padded_list = data_list.copy()
    if len(data_list) > pad_length:
        padded_list = data_list[-pad_length:]
    if len(data_list) < pad_length:
        padded_list = [1] * (pad_length - len(data_list)) + data_list
    return padded_list

def text_to_token(text_file, token_file):
    with open(text_file, "r", encoding='utf-8') as fin, \
            open(token_file, "w", encoding='utf-8') as fout:
        for line in fin:
            label, text = line.split("\t")
            cleaned_text = clean_text(text)
            # unknown words map to id 0
            word_token_list = [word_id_dict.get(word, 0) for word in cleaned_text.split(" ")]
            pad_list = pad(word_token_list, MAX_LEN)
            out_line = label + "\t" + " ".join([str(x) for x in pad_list])
            fout.write(out_line + "\n")

text_to_token(train_data_path, train_token_path)
text_to_token(test_data_path, test_token_path)
```
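Note that `pad` pads on the left and truncates from the front, so the end of each review is kept; for example:

```python
print(pad([5, 6, 7], 5))        # [1, 1, 5, 6, 7]  -- left-padded with token 1
print(pad([5, 6, 7, 8, 9], 3))  # [7, 8, 9]        -- keeps only the last 3 tokens
```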
Next, split the token files by sample so that each file holds the data of a single sample.
```python
# split into per-sample files
import os

if not os.path.exists(train_samples_path):
    os.mkdir(train_samples_path)
if not os.path.exists(test_samples_path):
    os.mkdir(test_samples_path)

def split_samples(token_path, samples_dir):
    with open(token_path, "r", encoding='utf-8') as fin:
        i = 0
        for line in fin:
            with open(samples_dir + "%d.txt" % i, "w", encoding="utf-8") as fout:
                fout.write(line)
            i = i + 1

split_samples(train_token_path, train_samples_path)
split_samples(test_token_path, test_samples_path)
```
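If the split ran correctly there should be one file per sample, i.e. 20,000 training files and 5,000 test files; a quick check:

```python
print(len(os.listdir(train_samples_path)))  # expected: 20000
print(len(os.listdir(test_samples_path)))   # expected: 5000
```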
Create a Dataset that reads each file's contents from the list of file names.
```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

class imdbDataset(Dataset):
    def __init__(self, samples_dir):
        self.samples_dir = samples_dir
        self.samples_paths = os.listdir(samples_dir)

    def __len__(self):
        return len(self.samples_paths)

    def __getitem__(self, index):
        path = self.samples_dir + self.samples_paths[index]
        with open(path, "r", encoding="utf-8") as f:
            line = f.readline()
            label, tokens = line.split("\t")
            label = torch.tensor([float(label)], dtype=torch.float)
            feature = torch.tensor([int(x) for x in tokens.split(" ")], dtype=torch.long)
            return (feature, label)

ds_train = imdbDataset(train_samples_path)
ds_test = imdbDataset(test_samples_path)
print(len(ds_train))
print(len(ds_test))

dl_train = DataLoader(ds_train, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
dl_test = DataLoader(ds_test, batch_size=BATCH_SIZE, num_workers=4)

for features, labels in dl_train:
    print(features)
    print(labels)
    break
```
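Given MAX_LEN = 200 and BATCH_SIZE = 20, each batch should come out as a (20, 200) long tensor of token ids and a (20, 1) float tensor of labels; a minimal shape check:

```python
features, labels = next(iter(dl_train))
assert features.shape == (BATCH_SIZE, MAX_LEN) and features.dtype == torch.long
assert labels.shape == (BATCH_SIZE, 1) and labels.dtype == torch.float
```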
2. Define the Model

There are usually three ways to build a model with PyTorch: building it layer by layer with nn.Sequential, subclassing nn.Module to define a custom model, or subclassing nn.Module while organizing the layers with model containers (nn.Sequential, nn.ModuleList, nn.ModuleDict).
Here we use the third approach.
Because a class-style training loop is used next, the model is wrapped in the torchkeras.Model class to obtain a Keras-like high-level model interface.
The Model class actually inherits from nn.Module.
```python
import torch
from torch import nn
import torchkeras

torch.random.seed()

class Net(torchkeras.Model):
    def __init__(self):
        super(Net, self).__init__()
        # with padding_idx set, the embedding of the padding token stays a zero vector during training
        self.embedding = nn.Embedding(num_embeddings=MAX_WORDS, embedding_dim=3, padding_idx=1)
        self.conv = nn.Sequential()
        self.conv.add_module("conv_1", nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5))
        self.conv.add_module("pool_1", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_1", nn.ReLU())
        self.conv.add_module("conv_2", nn.Conv1d(in_channels=16, out_channels=128, kernel_size=2))
        self.conv.add_module("pool_2", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_2", nn.ReLU())

        self.dense = nn.Sequential()
        self.dense.add_module("flatten", nn.Flatten())
        self.dense.add_module("linear", nn.Linear(6144, 1))
        self.dense.add_module("sigmoid", nn.Sigmoid())

    def forward(self, x):
        x = self.embedding(x).transpose(1, 2)
        x = self.conv(x)
        y = self.dense(x)
        return y

model = Net()
print(model)

model.summary(input_shape=(200,), input_dtype=torch.LongTensor)
```
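The 6144 in `nn.Linear(6144, 1)` follows from tracing the sequence length through the conv stack; the snippet below verifies it on a dummy input:

```python
# input after embedding + transpose: (batch, 3, 200)
# conv_1 (kernel 5):  200 - 5 + 1 = 196
# pool_1 (kernel 2):  196 // 2    = 98
# conv_2 (kernel 2):  98 - 2 + 1  = 97
# pool_2 (kernel 2):  97 // 2     = 48
# flatten:            128 channels * 48 = 6144 features
x = torch.zeros((1, 200), dtype=torch.long)
print(model.conv(model.embedding(x).transpose(1, 2)).shape)  # torch.Size([1, 128, 48])
```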
3. Train the Model

Training in PyTorch usually requires writing a custom training loop, and the coding style of such loops varies from person to person.
There are three typical styles: script-style, function-style, and class-style training loops.
Here we use a class-style training loop.
Following Keras, we define a high-level model interface Model that implements the fit, validate, predict, and summary methods, effectively serving as a user-defined high-level API.
```python
# accuracy metric: binarize the predictions at 0.5, then take the fraction of matches
def accuracy(y_pred, y_true):
    y_pred = torch.where(y_pred > 0.5,
                         torch.ones_like(y_pred, dtype=torch.float32),
                         torch.zeros_like(y_pred, dtype=torch.float32))
    acc = torch.mean(1 - torch.abs(y_true - y_pred))
    return acc

model.compile(loss_func=nn.BCELoss(),
              optimizer=torch.optim.Adagrad(model.parameters(), lr=0.02),
              metrics_dict={"accuracy": accuracy})
```
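Since the binarized prediction and the 0/1 label differ by exactly 1 on a wrong prediction and 0 on a correct one, `mean(1 - |y_true - y_pred|)` is just the fraction of correct predictions; a small worked example:

```python
y_true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
y_pred = torch.tensor([[0.9], [0.4], [0.3], [0.6]])  # last two are wrong after thresholding
print(accuracy(y_pred, y_true))                       # tensor(0.5000): two of four correct
```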
```python
dfhistory = model.fit(20, dl_train, dl_val=dl_test, log_step_freq=200)
```
4. Evaluate the Model

```python
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib.pyplot as plt

def plot_metric(dfhistory, metric):
    train_metrics = dfhistory[metric]
    val_metrics = dfhistory['val_' + metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation ' + metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_" + metric, 'val_' + metric])
    plt.show()

plot_metric(dfhistory, "loss")
plot_metric(dfhistory, "accuracy")

model.evaluate(dl_test)
```
5. Use the Model
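The body of this section is missing from the source; as a minimal sketch, the Model interface described above exposes a predict method, so applying the sigmoid-output model to the test DataLoader should yield class probabilities:

```python
# sketch only: predict on the test DataLoader; outputs are sigmoid probabilities in (0, 1)
y_pred_probs = model.predict(dl_test)
print(y_pred_probs[0:5])
```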
6. Save the Model

```python
# save only the model parameters (state_dict)
torch.save(model.state_dict(), "./data/model_parameter.pkl")

model_clone = Net()
model_clone.load_state_dict(torch.load("./data/model_parameter.pkl"))

# note: the optimizer must be built from the clone's parameters, not the original model's
model_clone.compile(loss_func=nn.BCELoss(),
                    optimizer=torch.optim.Adagrad(model_clone.parameters(), lr=0.02),
                    metrics_dict={"accuracy": accuracy})

# evaluate the restored model
model_clone.evaluate(dl_test)
```
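Besides saving the state_dict, plain PyTorch can also pickle the entire model object; a sketch (the file path is illustrative, and this approach ties the saved file to the exact Net class definition):

```python
# alternative sketch: save/load the whole model object (the Net class
# definition must be importable when loading)
torch.save(model, "./data/model_full.pkl")          # illustrative path
model_loaded = torch.load("./data/model_full.pkl")
model_loaded.evaluate(dl_test)
```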
Reposted from: