Author: Ai
Source: 宅码
Editor: 极市平台
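This post shares ways to optimize multi-task models from the loss and gradient side, in order to mitigate negative transfer and the seesaw effect. Corrections and criticism are welcome.
I have recently been using multi-task models at work, and in practice they run into negative transfer, the seesaw effect, and similar problems.
Besides optimizing the model architecture, this post covers optimization strategies on the loss and gradient side. An article from 阿里云云栖号 (Alibaba Cloud Yunqi account) [1] summarizes it well: multi-objective optimization strategies mainly address three questions:
1. Magnitude (loss scale): loss values differ in scale, and the loss with the largest values ends up dominating. What can be done?
2. Velocity (loss learning speed): tasks differ in difficulty, so their losses are learned at different speeds. What can be done?
3. Direction (loss gradient conflict): the backward gradients of several losses point in conflicting update directions, which produces the seesaw effect and negative transfer. What can be done?
To address these problems, this post shares the following 8 methods: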
1. Uncertainty Weighting
Paper: Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7482-7491).
Citations: 2047
import torch
import torch.nn as nn


def UW_Loss(labels, preds,
            log_vars,
            device="cuda:1"):
    """Uncertainty Weighting
    Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses
    for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and
    pattern recognition(pp. 7482-7491).
    """
    # One prediction and one learnable log-variance per task.
    # labels presumably carries one extra column that is not used here.
    assert len(preds) == labels.shape[1]-1
    assert len(log_vars) == labels.shape[1]-1
    labels = labels.to(device)
    # Binary cross-entropy of each stage:
    # ['click','cart', 'order']
    loss1 = nn.functional.binary_cross_entropy(preds[0], labels[:,0])
    loss2 = nn.functional.binary_cross_entropy(preds[1], labels[:,1])
    loss3 = nn.functional.binary_cross_entropy(preds[2], labels[:,2])
    losses = [loss1, loss2, loss3]
    for i, log_var in enumerate(log_vars):
        log_var = log_var.to(device)
        # log_var[0] is used as log(sigma): the task weight is 1 / (2 * sigma^2), and
        # log(1 + sigma) is a non-negative variant of the paper's log(sigma) regularizer.
        losses[i] = (1/2) * (torch.exp(-log_var[0])**2) * losses[i] + torch.log(torch.exp(log_var[0])+1)
    loss = sum(losses)
    return loss, losses
2. Multi-Objective Optimization
Paper: Sener, O., & Koltun, V. (2018). Multi-task learning as multi-objective optimization. *Advances in neural information processing systems*, *31*.
Citations: 638
Code: https://github.com/isl-org/MultiObjectiveOptimization
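As explained in [2], the authors treat MTL as a constrained optimization problem, and solving it amounts to searching for a Pareto-optimal solution. Given a fixed set of tasks and task loss weights to allocate, a move from one allocation to another that makes at least one task better without making any task worse is a Pareto improvement; Pareto optimality is reached when no such improvement remains.
The MTL optimization objective:
In the paper, the authors adopt the multiple gradient descent algorithm (MGDA), which states the Karush-Kuhn-Tucker (KKT) conditions for the shared parameters and the task-specific parameters:
The second condition requires the gradient with respect to each task's own parameters to be zero, which is handled by running gradient descent separately on each task-specific branch. The first condition requires finding a Pareto-optimal point (the best combination of alpha) at which the alpha-weighted gradient with respect to the shared-layer parameters is zero. The authors solve this with the Frank-Wolfe algorithm.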
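To make the first condition concrete, here is a minimal sketch of the two-task case, where the min-norm problem has a closed form (as far as I recall, the paper uses this two-point solution as the building block of its Frank-Wolfe solver). Plain Python lists stand in for flattened shared-layer gradients, and the function names are my own, not taken from the authors' repository.

```python
def min_norm_two_tasks(g1, g2):
    """Return alpha in [0, 1] minimising ||alpha*g1 + (1-alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: every alpha gives the same combination.
        return 0.5
    # Unconstrained minimiser (g2 - g1) . g2 / ||g1 - g2||^2, then clipped to [0, 1].
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, alpha))


def combine(g1, g2, alpha):
    """Alpha-weighted gradient used to update the shared parameters."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


if __name__ == "__main__":
    g1, g2 = [1.0, 0.0], [0.0, 1.0]
    alpha = min_norm_two_tasks(g1, g2)
    print(alpha, combine(g1, g2, alpha))
```

When the two gradients point in exactly opposite directions with equal norm, the combined gradient is the zero vector, which is the Pareto-stationary case described by the first KKT condition.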
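Although the Frank-Wolfe solver is efficient and accurate, it needs the gradient of every task t with respect to the shared-layer parameters, so each update costs one forward pass followed by T backward passes. Since a backward pass is more expensive than a forward pass, the paper proposes a cheaper alternative that needs only one backward pass through the shared layers. First, rewrite the prediction function of task t: instead of taking x as input with theta_sh and theta_t as parameters, it takes the representation function as input with theta_t as parameters, and the representation function in turn takes x as input with theta_sh as parameters.
If z_i denotes the representation of sample x_i under the representation function g, we obtain an upper bound for the shared-layer part of the Pareto-optimality condition:
Put simply, the shared layers' representation vector takes the place of the original x.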
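For reference, the upper bound can be written as follows. This is reproduced from my recollection of the paper and should be checked against it; Z is the matrix of representations z_i, \theta^{sh} the shared parameters, \theta^t the parameters of task t, and \hat{L}^t the empirical loss of task t.

```latex
\Big\| \sum_{t=1}^{T} \alpha^{t} \nabla_{\theta^{sh}} \hat{L}^{t}(\theta^{sh}, \theta^{t}) \Big\|_2^2
\;\le\;
\Big\| \frac{\partial Z}{\partial \theta^{sh}} \Big\|_2^2 \,
\Big\| \sum_{t=1}^{T} \alpha^{t} \nabla_{Z} \hat{L}^{t}(\theta^{sh}, \theta^{t}) \Big\|_2^2

% MGDA-UB therefore solves the cheaper problem over the representation gradients:
\min_{\alpha^{1}, \dots, \alpha^{T}}
\Big\| \sum_{t=1}^{T} \alpha^{t} \nabla_{Z} \hat{L}^{t}(\theta^{sh}, \theta^{t}) \Big\|_2^2
\quad \text{s.t.} \quad \sum_{t=1}^{T} \alpha^{t} = 1, \;\; \alpha^{t} \ge 0 \;\; \forall t
```

The factor involving \partial Z / \partial \theta^{sh} does not depend on alpha, so it can be dropped from the minimization, which is why only gradients with respect to Z are needed.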
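Looking at the source code, start with MGDA: it computes gradients and backpropagates through both the shared layers and the task-specific layers.
MGDA-UB, by contrast, first runs the shared layers without backpropagating through them to obtain the representation vector, then computes gradients and backpropagates through the task-specific layers only. Based on the task-specific gradients it solves for alpha, then computes the alpha-weighted loss and backpropagates it through the shared layers.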
3. GradNorm
Paper: Chen, Z., Badrinarayanan, V., Lee, C. Y., & Rabinovich, A. (2018, July). Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In *International conference on machine learning* (pp. 794-803). PMLR.
Citations: 670
Code: https://github.com/hav4ik/Hydra/blob/master/src/applications/trainers/gradnorm.py
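The article [3] explains it in detail and is worth reading directly. GradNorm adjusts the task weights according to each task's learning speed: the faster a task learns, the smaller its weight.
The inverse training rate r of each task is:
With the training rates in hand, the task weights can be updated: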
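For reference, the quantities described here can be written as below. This follows my recollection of the GradNorm paper and matches the training code in this section; W denotes the weights of the last shared layer, w_i(t) the weight of task i, and L_i(t) its loss at step t.

```latex
% Loss ratio and inverse training rate of task i
\tilde{L}_i(t) = \frac{L_i(t)}{L_i(0)}, \qquad
r_i(t) = \frac{\tilde{L}_i(t)}{\frac{1}{K} \sum_{j=1}^{K} \tilde{L}_j(t)}

% Gradient norm of each weighted task loss and its mean over tasks
G_W^{(i)}(t) = \big\| \nabla_W \, w_i(t) L_i(t) \big\|_2, \qquad
\bar{G}_W(t) = \frac{1}{K} \sum_{i=1}^{K} G_W^{(i)}(t)

% GradNorm loss; the target \bar{G}_W(t) \, r_i(t)^{\alpha} is treated as a constant
L_{grad}(t) = \sum_{i=1}^{K} \Big| \, G_W^{(i)}(t) - \bar{G}_W(t) \, r_i(t)^{\alpha} \Big|_1
```

The task weights w_i are updated with the gradient of L_grad with respect to w_i and then renormalized so that they sum to the number of tasks K.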
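Here alpha controls the strength of the adjustment: the larger it is, the stronger the rebalancing. The full GradNorm training procedure is: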
# Fragment of the training loop: it runs inside `for epoch in ...`, and model, optimizer,
# loss_weights, alpha, labels, np, torch and the history lists are defined elsewhere.
for step, batch in enumerate(train_loader):
    con_feats = batch['features'].float().to(device)
    targets = batch['targets'].float().to(device)
    preds = model(con_feats)  # forward pass
    loss, task_loss, multi_losses = model.loss(targets, preds, loss_weights=loss_weights, device=device)  # compute losses
    # Record the task losses of the very first step
    if epoch == 0 and step == 0:
        initial_task_loss = task_loss.data.cpu().detach().numpy()
    optimizer.zero_grad()  # clear gradients
    loss.backward(retain_graph = True)  # compute gradients
    # Reset the gradient on w; it is replaced by the gradient of L_grad below
    model.weights.grad.data = model.weights.grad.data * 0.0
    # Get the weights W of the last shared layer
    W = model.get_last_shared_layer()
    # Compute G^i_W(t) for each task
    norms = []
    for i in range(len(task_loss)):
        # (gradient of the task loss w.r.t. W) * w_i -> gradient norm
        gn = torch.autograd.grad(task_loss[i], W.parameters(), retain_graph = True)
        norms.append(torch.norm(torch.mul(model.weights[i], gn[0])))
    norms = torch.stack(norms)
    # Mean gradient norm over tasks
    mean_norm = np.mean(norms.data.cpu().detach().numpy())
    # Inverse training rate r
    loss_ratio = task_loss.data.cpu().detach().numpy() / initial_task_loss
    inv_train_rate = loss_ratio / np.mean(loss_ratio)
    # Compute L_grad; the target is a constant, so no gradient flows through it
    const = torch.tensor(mean_norm * (inv_train_rate ** alpha), requires_grad = False)
    gn_loss = torch.sum(torch.abs(norms - const.to(device)))
    # Gradient of L_grad w.r.t. w (the GradNorm loss gradient)
    model.weights.grad = torch.autograd.grad(gn_loss, model.weights)[0]
    # One optimizer step updates both w (with the L_grad gradient) and the network parameters
    optimizer.step()
    epoch_trn_loss += loss.cpu().detach().numpy()
    for i in range(len(labels)-1):
        epoch_trn_multi_loss[i] += np.round(multi_losses[i].cpu().detach().numpy(),4)
        train_history[i].append(multi_losses[i].cpu().detach().numpy())
        # Collect each task's w, gradient norm and GradNorm target
        weight_history[i].append(model.weights.data[i].clone())
        gn_history[i].append(norms[i].detach())  # detach so the history does not keep the graph alive
        const_history[i].append(const[i])
epoch_trn_loss = round(epoch_trn_loss/(step+1),4)
# renormalize the loss weights after each epoch
norm_coeff = (len(labels)-1) / torch.sum(model.weights.data, dim = 0)
model.weights.data = model.weights.data * norm_coeff
4. DWA
Paper: Liu, S., Johns, E., & Davison, A. J. (2019). End-to-end multi-task learning with attention. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition* (pp. 1871-1880).
Citations: 571
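DWA stands for Dynamic Weight Average. Inspired by GradNorm, it also learns to average the task weights over training epochs by looking at how each task's loss changes. GradNorm needs access to the network's internal gradients, whereas DWA needs only the numerical task losses, so it is simpler to implement. The authors define the weight lambda_k of task k as follows:
w_k is the relative descending rate of the loss: the smaller it is, the faster the task is learning, and the less weight it should receive. Feeding w_k into a softmax yields the task weights, so a smaller w_k gives a smaller weight; that is, tasks that have been learning quickly in recent epochs get lower importance. The larger the temperature T, the closer every weight gets to 1, so raising T makes the task weights more even. Multiplying by K makes the task weights sum to K, which keeps them on a common scale. A drawback of DWA is that it is easily dominated by tasks whose losses have a large magnitude [3]. Also, w_k is initialized to 1 for the first 2 epochs.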
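As a minimal sketch of this rule: the weight of task k is K times the softmax of w_k / T over tasks, with w_k the ratio of the task's average loss in the last epoch to the one before it. The function name `dwa_weights` and the default temperature are my own choices for illustration.

```python
import math


def dwa_weights(loss_prev, loss_prev2, temperature=2.0):
    """Task weights for the next epoch.

    loss_prev:  average loss of each task at epoch t-1
    loss_prev2: average loss of each task at epoch t-2
    """
    num_tasks = len(loss_prev)
    # w_k(t-1) = L_k(t-1) / L_k(t-2): relative descending rate of each task
    rates = [a / b for a, b in zip(loss_prev, loss_prev2)]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    # Softmax scaled by K, so the weights always sum to K
    return [num_tasks * e / total for e in exps]


if __name__ == "__main__":
    # Task 0 halved its loss, task 1 barely moved: task 0 gets the smaller weight.
    print(dwa_weights([0.5, 0.9], [1.0, 1.0]))
```

Because only loss ratios enter the formula, two tasks with the same relative decrease get the same weight regardless of how large their losses are.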
import numpy as np

# dynamic weight averaging
# T (the temperature), labels, model, model_params and train_loader are defined elsewhere.
loss_weights = np.ones((model_params['total_epoch'], len(labels)))
avg_task_loss = np.zeros((model_params['total_epoch'], len(labels)))
for epoch in range(model_params['total_epoch']):
    # For the first 2 epochs the weights stay at their initial value of 1;
    # from the third epoch on, compute them from the losses of the last two epochs.
    if epoch >= 2:
        sum_w = []
        for i in range(len(labels)):
            w = avg_task_loss[epoch - 1, i] / avg_task_loss[epoch - 2, i]
            sum_w.append(np.exp(w / T))
        loss_weights[epoch] = [len(labels) * w / np.sum(sum_w) for w in sum_w]
    model.train()
    for step, batch in enumerate(train_loader):
        ...  # model training code, accumulating per-task losses in epoch_trn_multi_loss
    # at the end of each epoch, store the average loss of every task
    avg_task_loss[epoch] = [task_loss / len(train_loader) for task_loss in epoch_trn_multi_loss]
5. PE-LTR
Paper: Lin, X., Chen, H., Pei, C., Sun, F., Xiao, X., Sun, H., ... & Jiang, P. (2019, September). A pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In *Proceedings of the 13th ACM Conference on recommender systems* (pp. 20-28).
Citations: 64
Code: https://github.com/weberrr/PE-LTR/blob/master/PE-LTR.py
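This paper targets Alibaba's e-commerce LTR (Learning to Rank) setting and uses a Pareto-efficient algorithm to find suitable weights for each task, an idea similar to the Multi-Objective Optimization approach described earlier. Let's go straight to the training procedure:
First define the loss function:
To solve this problem, the authors propose PECsolver:
Put simply, it uses the KKT conditions: first solve for w under relaxed constraints, then bring the original constraints back to tighten the solution set [4]. The mathematical details are in the original paper for interested readers.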
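The two stages can be written out as follows. This is my reading of the `pareto_step` and `ASM` code in this section rather than a quotation from the paper: G stacks the per-task gradients as rows, c holds the lower bounds on the weights, e is the all-ones vector, and \lambda is the Lagrange multiplier of the sum constraint.

```latex
% Original problem: weights sum to 1 and respect the lower bounds c
\min_{w} \; \Big\| \sum_{i=1}^{K} w_i \nabla_\theta L_i(\theta) \Big\|_2^2
\quad \text{s.t.} \quad \sum_{i=1}^{K} w_i = 1, \;\; w_i \ge c_i

% Stage 1 (pareto_step): substitute \hat{w} = w - c, drop \hat{w} \ge 0,
% and solve the KKT system of the equality-constrained problem
\begin{pmatrix} G G^{\top} & e \\ e^{\top} & 0 \end{pmatrix}
\begin{pmatrix} \hat{w} \\ \lambda \end{pmatrix}
=
\begin{pmatrix} - G G^{\top} c \\ 1 - e^{\top} c \end{pmatrix}

% Stage 2 (ASM): restore non-negativity by projecting \hat{w}
\tilde{w} = \arg\min_{x} \; \| x - \hat{w} \|_2
\quad \text{s.t.} \quad x \ge 0, \;\; e^{\top} x = 1 - e^{\top} c,
\qquad w = \tilde{w} + c
```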
import numpy as np
from scipy.optimize import minimize, nnls


def pareto_step(w, c, G):
    """
    ref:http://ofey.me/papers/Pareto.pdf
    K : the number of tasks
    M : the dim of NN's params
    :param w: # (K,1)
    :param c: # (K,1)
    :param G: # (K,M)
    :return: new task weights, shape (K,)
    """
    GGT = np.matmul(G, np.transpose(G))  # (K, K)
    e = np.ones(np.shape(w))  # (K, 1), plain ndarray instead of np.mat
    m_up = np.hstack((GGT, e))  # (K, K+1)
    m_down = np.hstack((np.transpose(e), np.zeros((1, 1))))  # (1, K+1)
    M = np.vstack((m_up, m_down))  # (K+1, K+1)
    z = np.vstack((-np.matmul(GGT, c), 1 - np.sum(c)))  # (K+1, 1)
    # Least-squares solution (M^T M)^{-1} M^T z of the relaxed KKT system
    hat_w = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.transpose(M), M)), np.transpose(M)), z)  # (K+1, 1)
    hat_w = hat_w[:-1]  # (K, 1), drop the Lagrange multiplier
    hat_w = np.reshape(np.array(hat_w), (hat_w.shape[0],))  # (K,)
    c = np.reshape(np.array(c), (c.shape[0],))  # (K,)
    new_w = ASM(hat_w, c)
    return new_w


def ASM(hat_w, c):
    """
    ref:
    http://ofey.me/papers/Pareto.pdf,
    https://stackoverflow.com/questions/33385898/how-to-include-constraint-to-scipy-nnls-function-solution-so-that-it-sums-to-1
    :param hat_w: # (K,)
    :param c: # (K,)
    :return: new task weights, shape (K,)
    """
    A = np.eye(len(c))  # identity matrix: the objective is ||x - hat_w||
    b = hat_w
    x0, _ = nnls(A, b)  # non-negative least squares as the starting point

    def _fn(x, A, b):
        return np.linalg.norm(A.dot(x) - b)

    # the weights must sum to 1 once the lower bounds c are added back
    cons = {'type': 'eq', 'fun': lambda x: np.sum(x) + np.sum(c) - 1}
    bounds = [[0., None] for _ in range(len(hat_w))]
    min_out = minimize(_fn, x0, args=(A, b), method='SLSQP', bounds=bounds, constraints=cons)
    new_w = min_out.x + c
    return new_w
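import numpy as np
import torch

# model, optimizer, train_loader, labels, device and model_params are defined elsewhere.
loss_weights = np.full(len(labels)-1, 1/(len(labels)-1))  # (K,)
w_constraint = np.full((len(labels)-1, 1), 0.)  # (K, 1), lower bounds c
params = [p for p in model.parameters() if p.requires_grad]
for epoch in range(model_params['total_epoch']):
    model.train()
    for step, batch in enumerate(train_loader):
        con_feats = batch['features'].float().to(device)
        targets = batch['targets'].float().to(device)
        preds = model(con_feats)  # forward pass
        loss, task_loss, multi_losses = model.loss(targets, preds,
                                                   loss_weights=loss_weights,
                                                   device=device)  # compute losses
        # G: (K, M), one flattened gradient per task. torch.autograd.grad is used so
        # these gradients do not overwrite the ones consumed by optimizer.step().
        grads = []
        for l in task_loss:
            task_grads = torch.autograd.grad(l, params, retain_graph=True, allow_unused=True)
            # parameters a task does not touch contribute zeros
            flat = [(g if g is not None else torch.zeros_like(p)).reshape(-1)
                    for g, p in zip(task_grads, params)]
            grads.append(torch.cat(flat).cpu().numpy())
        G = np.vstack(grads)
        # pareto_step expects the current weights as a (K, 1) column;
        # the new weights take effect from the next step
        loss_weights = pareto_step(loss_weights.reshape(-1, 1), w_constraint, G)
        optimizer.zero_grad()  # clear gradients
        loss.backward()  # gradients of the weighted total loss
        optimizer.step()  # update the parameters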
6. HTW
Paper: Kongyoung, S., Macdonald, C., & Ounis, I. (2020). Multi-task learning using dynamic task weighting for conversational question answering.
Citations: 5
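Existing MTL weighting methods treat all tasks equally, yet some business scenarios distinguish a main task from auxiliary tasks. HTW (Hybrid Task Weighting) therefore applies a different weighting scheme to the main task and to the auxiliary tasks:
The whole training procedure is as follows: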
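# Fragment of one training step: task_loss lists the auxiliary task losses first and the
# main task loss last; loss_weights, alpha, step_count, step_threshold and total_step are defined elsewhere.
# Record the losses of the first step as plain floats
if step == 0:
    initial_task_loss = [l.item() for l in task_loss]
loss = 0
multi_losses = []
# Auxiliary tasks: compute each weight and add the weighted loss
for i in range(len(task_loss) - 1):
    # .item() keeps the weight a constant, so no gradient flows through it
    loss_weights[i] = (task_loss[i].item() / initial_task_loss[i]) ** alpha
    loss += loss_weights[i] * task_loss[i]
    multi_losses.append(loss_weights[i] * task_loss[i])
# Main task: compute the weight and add the weighted loss
if step_count <= step_threshold:
    loss_weights[-1] = step_count / total_step
    step_count += 1
else:
    loss_weights[-1] = 1.0
loss += loss_weights[-1] * task_loss[-1]
multi_losses.append(loss_weights[-1] * task_loss[-1])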
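The weighting rule in the fragment above can be isolated into a small pure function, which makes it easier to test. This is a sketch that mirrors the fragment, not the paper's reference implementation; the function name `htw_weights` and the default `alpha` are assumptions of mine, and losses are passed as plain floats.

```python
def htw_weights(task_losses, initial_losses, step_count,
                step_threshold, total_step, alpha=0.5):
    """Return one weight per task; the last entry belongs to the main task."""
    # Auxiliary tasks: ratio of the current loss to the first-step loss, raised to alpha.
    aux = [(cur / init) ** alpha
           for cur, init in zip(task_losses[:-1], initial_losses[:-1])]
    # Main task: weight grows linearly with the step count up to the threshold,
    # then is fixed at 1.0 (same rule as in the fragment above).
    if step_count <= step_threshold:
        main = step_count / total_step
    else:
        main = 1.0
    return aux + [main]


if __name__ == "__main__":
    print(htw_weights([0.25, 0.5, 1.0], [1.0, 1.0, 1.0],
                      step_count=10, step_threshold=100, total_step=1000))
```

An auxiliary task whose loss has already dropped a lot gets a small weight, while the main task is phased in gradually during the early steps.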
7. PCGrad
Paper: Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., & Finn, C. (2020). Gradient surgery for multi-task learning. *Advances in Neural Information Processing Systems*, *33*, 5824-5836.
Citations: 353
Code: https://github.com/chenllliang/Gradient-Vaccine/blob/17fa758fdd4f87475ee2847db6fc0a013631fee3/fairseq/fairseq/optim/pcgrad.py
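If two gradients conflict in direction, the gradient of task i is projected onto the normal plane of the gradient of any other task j it conflicts with. As the figure below shows, if the cosine similarity between tasks i and j is positive, as in panel (d), they do not conflict, and both tasks keep their original gradients for the update. If the gradients conflict, as in panel (a), the gradient of task i is projected onto the normal plane of task j's gradient, as in panel (b), and the result is used as task i's new gradient. Task j is handled the other way around, as in panel (c).
The projection is computed as follows [5]:
As the figure above shows, g_i and g_j conflict, so g_i is projected onto the normal plane of g_j to obtain the new gradient of task i (the blue dashed line). Let the green vector be a*g_j; by the triangle rule of vector addition, the blue dashed line is x = g_i + a*g_j. The following calculation then gives a and the blue dashed vector (the new gradient of task i):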
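The calculation follows from requiring x to be orthogonal to g_j; the steps below use only the symbols defined in the paragraph above.

```latex
x \perp g_j
\;\Longrightarrow\;
x \cdot g_j = (g_i + a \, g_j) \cdot g_j = g_i \cdot g_j + a \, \| g_j \|^2 = 0
\;\Longrightarrow\;
a = - \frac{g_i \cdot g_j}{\| g_j \|^2}

x = g_i + a \, g_j = g_i - \frac{g_i \cdot g_j}{\| g_j \|^2} \, g_j
```

Since g_i and g_j conflict, g_i \cdot g_j < 0 and therefore a > 0: a positive multiple of g_j is added to g_i to cancel the conflicting component.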
The effect is as follows:
The overall training procedure is as follows:
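import tensorflow as tf

# Compute gradient projections.
# grads_task (the per-task gradients) and num_tasks come from the enclosing scope.
def proj_grad(grad_task):
    """Project grad_task away from every task gradient it conflicts with."""
    for k in range(num_tasks):
        inner_product = tf.reduce_sum(grad_task*grads_task[k])
        proj_direction = inner_product / tf.reduce_sum(grads_task[k]*grads_task[k])
        # only a negative inner product (a conflict) changes the gradient
        grad_task = grad_task - tf.minimum(proj_direction, 0.) * grads_task[k]
    return grad_task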
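For readers without a TensorFlow setup, the same projection can be tried out in plain Python. This is a minimal sketch with lists standing in for flattened gradients; the names `dot` and `pcgrad` are my own, and visiting the other tasks in random order follows my understanding of the paper's algorithm.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcgrad(grads, seed=None):
    """Return a PCGrad-adjusted copy of each task gradient."""
    rng = random.Random(seed)
    projected = []
    for i, g_i in enumerate(grads):
        g = list(g_i)
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)  # visit the other tasks in random order
        for j in others:
            g_j = grads[j]
            inner = dot(g, g_j)
            if inner < 0:  # conflict: remove the component along g_j
                scale = inner / dot(g_j, g_j)
                g = [a - scale * b for a, b in zip(g, g_j)]
        projected.append(g)
    return projected


if __name__ == "__main__":
    # The two gradients conflict (negative inner product), so both are projected.
    print(pcgrad([[1.0, 0.0], [-1.0, 1.0]], seed=0))
```

The shared parameters are then updated with the sum of the adjusted gradients; gradients that do not conflict pass through unchanged.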
8. GradVac
Paper: Wang, Z., Tsvetkov, Y., Firat, O., & Cao, Y. (2020). Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. *arXiv preprint arXiv:2010.05874*.
Citations: 57
Code: https://github.com/chenllliang/Gradient-Vaccine
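PCGrad only takes effect when the cosine similarity between gradients is negative, so it is triggered only sparsely during training, as the left plot below shows. In practice, many pairs of task gradients have positive cosine similarity, yet PCGrad ignores them, so the authors propose GradVac.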
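As a rough sketch of the idea, as I understand the paper's update rule: GradVac keeps a target cosine similarity for each task pair, and whenever the current similarity falls below the target, it adds a multiple of g_j to g_i so that the similarity reaches the target. The target itself tracks an exponential moving average of the observed similarities. The function names and the default `beta` below are assumptions for illustration, and lists stand in for flattened gradients.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def gradvac_pair(g_i, g_j, target):
    """Adjust g_i so that its cosine similarity with g_j is at least `target` (< 1)."""
    phi = cosine(g_i, g_j)
    if phi >= target:
        return list(g_i)  # already similar enough: leave the gradient alone
    norm_i = math.sqrt(dot(g_i, g_i))
    norm_j = math.sqrt(dot(g_j, g_j))
    sin_phi = math.sqrt(max(0.0, 1.0 - phi ** 2))
    sin_target = math.sqrt(1.0 - target ** 2)
    # Coefficient of g_j that brings the cosine similarity up to the target
    coef = norm_i * (target * sin_phi - phi * sin_target) / (norm_j * sin_target)
    return [a + coef * b for a, b in zip(g_i, g_j)]


def ema_target(target, phi, beta=0.01):
    """Exponential moving average used to update the target similarity."""
    return (1.0 - beta) * target + beta * phi


if __name__ == "__main__":
    new_g = gradvac_pair([1.0, 0.0], [-1.0, 1.0], target=0.5)
    print(new_g, cosine(new_g, [-1.0, 1.0]))
```

With a target of 0 and only negative similarities adjusted, the update reduces to the PCGrad projection, which is why GradVac can be seen as a generalization of it.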
Appendix E of the paper gives the detailed derivation:
The whole training procedure is as follows:
In summary:
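References
[1] Multi-task, multi-objective CTR prediction techniques - 阿里云云栖号 (Alibaba Cloud Yunqi account), article: https://baijiahao.baidu.com/s?id=1713545722047735100&wfr=spider&for=pc
[2] How to balance multiple losses in deep learning? - answer by 陈瀚清 - 知乎 (Zhihu): https://www.zhihu.com/question/375794498/answer/2657267272
[3] Multi-objective sample weighting: GradNorm and DWA explained and implemented - 知乎 (Zhihu): https://zhuanlan.zhihu.com/p/542296680
[4] Alibaba multi-objective optimization: PE-LTR - 知乎 (Zhihu): https://zhuanlan.zhihu.com/p/159459480
[5] Multi-task learning: [ICLR 2020] PCGrad - 知乎 (Zhihu): https://zhuanlan.zhihu.com/p/39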
Page updated: 2024-05-10