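Multi-Task Learning: A Collection of Task Loss and Gradient Optimization Strategies

Author: Ai

Source: 宅码

Editor: 极市平台

This article shares how to optimize multi-task models from the loss and gradient perspectives, in order to mitigate negative transfer and the seesaw effect. Corrections and criticism are welcome.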

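Background

I have recently been using multi-task models at work, and in practice they suffer from negative transfer, the seesaw effect, and similar phenomena.

Besides optimizing the model architecture, this article covers optimization strategies on the loss and gradient side. An article from 阿里云云栖号 [1] summarizes the topic well: multi-objective optimization strategies mainly address three problems:

1. Magnitude (loss scale): losses differ in scale, and the loss with the largest values tends to dominate. What can be done?

2. Velocity (loss learning speed): tasks differ in difficulty, so their losses are learned at different speeds. What can be done?

3. Direction (gradient conflict): the backward gradients of multiple losses point in conflicting update directions, causing the seesaw effect and negative transfer. What can be done?

To address these problems, this article covers the following 8 methods: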

1. Uncertainty Weighting

Paper: Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7482-7491).

Citations: 2047

import torch
import torch.nn as nn


def UW_Loss(labels, preds,
            log_vars,
            device="cuda:1"):  # "gpu:1" is not a valid PyTorch device string
    """Uncertainty Weighting
    Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses
    for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and
    pattern recognition(pp. 7482-7491).
    """
    # labels carries one more column than there are task predictions
    assert len(preds) == labels.shape[1] - 1
    assert len(log_vars) == labels.shape[1] - 1
    labels = labels.to(device)

    # Binary cross-entropy of each stage
    # ['click', 'cart', 'order']
    loss1 = nn.functional.binary_cross_entropy(preds[0], labels[:, 0])
    loss2 = nn.functional.binary_cross_entropy(preds[1], labels[:, 1])
    loss3 = nn.functional.binary_cross_entropy(preds[2], labels[:, 2])
    losses = [loss1, loss2, loss3]

    for i, log_var in enumerate(log_vars):
        log_var = log_var.to(device)
        # log_var[0] is treated as log(sigma), so exp(-log_var)**2 = 1 / sigma^2.
        # The regularizer log(1 + sigma) is a strictly positive variant of the paper's log(sigma).
        losses[i] = 0.5 * (torch.exp(-log_var[0]) ** 2) * losses[i] + torch.log(torch.exp(log_var[0]) + 1)

    loss = sum(losses)
    return loss, losses
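To see why learnable uncertainties help with the magnitude problem, here is a minimal, framework-free sketch of the regression form from Kendall et al., written with s_i = log(sigma_i^2). The function names and the loss values are hypothetical. For a fixed loss L_i, the per-task term 0.5 * exp(-s_i) * L_i + 0.5 * s_i is minimized at s_i = log(L_i), so every task ends up contributing the same weighted loss regardless of its scale.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum of 0.5 * exp(-s_i) * L_i + 0.5 * s_i, with s_i = log(sigma_i^2)."""
    return sum(0.5 * math.exp(-s) * l + 0.5 * s
               for l, s in zip(task_losses, log_vars))


def optimal_log_vars(task_losses):
    """Setting d/ds [0.5 * exp(-s) * L + 0.5 * s] = 0 gives s = log(L)."""
    return [math.log(l) for l in task_losses]


# Hypothetical per-task losses with very different scales
task_losses = [0.7, 0.05, 4.0]
s_star = optimal_log_vars(task_losses)

# Effective weight of each task at the optimum: 0.5 * exp(-s_i) = 0.5 / L_i
effective_weights = [0.5 * math.exp(-s) for s in s_star]
total = uncertainty_weighted_loss(task_losses, s_star)
```

In a real model the s_i are trainable parameters updated by the optimizer together with the network, so they track the losses rather than being solved for in closed form.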

2. MGDA

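Paper: Sener, O., & Koltun, V. (2018). Multi-task learning as multi-objective optimization. *Advances in neural information processing systems*, *31*.

Citations: 638

Code: https://github.com/isl-org/MultiObjectiveOptimization

According to [2], the authors treat MTL as a constrained optimization problem whose solution amounts to searching for a Pareto optimum. Given a fixed set of tasks and assignable task loss weights, a move from one allocation to another that makes at least one task better without making any task worse is a Pareto improvement; a state in which no such move remains is Pareto optimal.

The MTL optimization objective is: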
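Written out in the notation of Sener & Koltun (reconstructed from the paper, so treat the symbols as an assumption), with shared parameters \theta^{sh}, task-specific parameters \theta^t, and task weights c^t, the objective is roughly:

```latex
\[
\min_{\theta^{sh},\,\theta^{1},\dots,\theta^{T}}
\;\sum_{t=1}^{T} c^{t}\,\hat{\mathcal{L}}^{t}\!\left(\theta^{sh},\theta^{t}\right)
\]
```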

The paper builds on the multiple gradient descent algorithm (MGDA), which states the KKT (Karush-Kuhn-Tucker) conditions for the shared parameters and the task-specific parameters:
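As a sketch in the paper's notation (reconstructed, so the exact symbols are an assumption), the two conditions read:

```latex
\[
\text{(1)}\quad \exists\,\alpha^{1},\dots,\alpha^{T}\ge 0:\;
\sum_{t=1}^{T}\alpha^{t}=1
\;\text{ and }\;
\sum_{t=1}^{T}\alpha^{t}\,\nabla_{\theta^{sh}}\hat{\mathcal{L}}^{t}\!\left(\theta^{sh},\theta^{t}\right)=0
\]
\[
\text{(2)}\quad \forall t:\;
\nabla_{\theta^{t}}\hat{\mathcal{L}}^{t}\!\left(\theta^{sh},\theta^{t}\right)=0
\]
```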

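The second condition requires the gradient of each task's own parameters to be zero, which is handled by running gradient descent on each task-specific branch separately. The first condition requires finding a Pareto-stationary point (that is, the best combination of alpha) at which the alpha-weighted gradient of the shared parameters is zero. The authors solve this with the Frank-Wolfe algorithm.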
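For intuition, with only two tasks the search for alpha reduces to minimizing ||alpha * g1 + (1 - alpha) * g2||^2 over alpha in [0, 1], which has a closed form. The sketch below illustrates that special case only, not the full Frank-Wolfe solver; `min_norm_alpha` and the gradient values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_alpha(g1, g2):
    """alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every alpha gives the same vector
    # Unconstrained minimizer: (g2 - g1) . g2 / ||g1 - g2||^2, then clip to [0, 1]
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))


def combine(g1, g2, alpha):
    """Convex combination used as the shared-layer update direction."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


# Hypothetical shared-layer gradients of two tasks
g1, g2 = [1.0, 0.0], [0.0, 1.0]
alpha = min_norm_alpha(g1, g2)
direction = combine(g1, g2, alpha)
```

When the two gradients point in exactly opposite directions with equal norm, the combined direction is the zero vector, which is the Pareto-stationary case described by the first KKT condition.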

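Although the Frank-Wolfe solver is efficient and accurate, it needs the gradient of every task t with respect to the shared parameters, so each update requires T backward passes through the shared layers for a single forward pass. Since the backward pass is more expensive than the forward pass, the paper proposes a more efficient approach that needs only one backward pass through the shared layers. First, rewrite the prediction function of task t: instead of taking x as input with theta_sh and theta_t as parameters, it takes the output of a representation function as input with theta_t as its parameters. The representation function in turn takes x as input with theta_sh as its parameters.

Writing z_i for the representation of sample x_i under the representation function g, we obtain an upper bound on the Pareto-optimality objective for the shared layers: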
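In the paper's notation (reconstructed here, so treat it as a sketch), with Z = (z_1, ..., z_N) and z_i = g(x_i; \theta^{sh}), the bound follows from the chain rule:

```latex
\[
\Big\lVert \sum_{t=1}^{T}\alpha^{t}\,\nabla_{\theta^{sh}}\hat{\mathcal{L}}^{t}\!\left(\theta^{sh},\theta^{t}\right)\Big\rVert_2^2
\;\le\;
\Big\lVert \frac{\partial Z}{\partial \theta^{sh}} \Big\rVert_2^2\,
\Big\lVert \sum_{t=1}^{T}\alpha^{t}\,\nabla_{Z}\hat{\mathcal{L}}^{t}\!\left(\theta^{sh},\theta^{t}\right)\Big\rVert_2^2
\]
```

Because the first factor does not depend on alpha, it can be dropped when solving for alpha, leaving only gradients with respect to Z.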

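Put simply, the representation vector produced by the shared layers takes the place of the original x.

In the source code, MGDA first computes gradients and backpropagates through both the shared layers and the task-specific layers.

MGDA-UB instead holds back backpropagation through the shared layers: it first obtains the representation vectors from the shared layers, then computes gradients and backpropagates through the task-specific layers only, solves for alpha from the gradients obtained in that pass (taken with respect to the representation), and finally computes the alpha-weighted loss and backpropagates it through the shared layers.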

3. GradNorm

Paper: Chen, Z., Badrinarayanan, V., Lee, C. Y., & Rabinovich, A. (2018, July). Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In *International conference on machine learning* (pp. 794-803). PMLR.

Citations: 670

Code: https://github.com/hav4ik/Hydra/blob/master/src/applications/trainers/gradnorm.py

Article [3] explains the method in detail and is worth reading directly. GradNorm adjusts task weights according to each task's learning speed: the faster a task learns, the smaller its weight.

The inverse training rate r of each task is:
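Reconstructed from the GradNorm paper and consistent with the `loss_ratio` and `inv_train_rate` variables in the training loop further down (the notation itself is an assumption), the definition is approximately:

```latex
\[
\tilde{L}_i(t) = \frac{L_i(t)}{L_i(0)},
\qquad
r_i(t) = \frac{\tilde{L}_i(t)}{\mathbb{E}_{\text{task}}\!\left[\tilde{L}_i(t)\right]}
\]
```

A task whose loss has dropped more than average has r_i(t) < 1, meaning it is learning faster than the others.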

Given the inverse training rates, the task weights can be updated:
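As a sketch in the paper's notation (reconstructed, and matching `norms`, `mean_norm`, `const`, and `gn_loss` in the training loop further down), the weights w_i are updated by minimizing the gradient loss:

```latex
\[
G_W^{(i)}(t) = \big\lVert \nabla_W\, w_i(t)\,L_i(t) \big\rVert_2,
\qquad
\bar{G}_W(t) = \mathbb{E}_{\text{task}}\!\left[G_W^{(i)}(t)\right]
\]
\[
L_{\text{grad}}\big(t; w_i(t)\big)
= \sum_i \Big| G_W^{(i)}(t) - \bar{G}_W(t)\,\big[r_i(t)\big]^{\alpha} \Big|_1
\]
```

The target \bar{G}_W(t) [r_i(t)]^{\alpha} is treated as a constant when differentiating with respect to w_i.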

Here alpha controls the strength of the adjustment: the larger alpha is, the stronger the balancing. The full GradNorm training loop is:

import numpy as np
import torch

# Body of one epoch; model, optimizer, train_loader, device, alpha, labels, epoch,
# loss_weights and the *_history / epoch_trn_* accumulators come from the surrounding script.
for step, batch in enumerate(train_loader):
    con_feats = batch['features'].float().to(device)
    targets = batch['targets'].float().to(device)

    preds = model(con_feats)  # forward pass
    loss, task_loss, multi_losses = model.loss(targets, preds, loss_weights=loss_weights, device=device)  # compute losses

    # Record the task losses of the very first step, L_i(0)
    if epoch == 0 and step == 0:
        initial_task_loss = task_loss.detach().cpu().numpy()

    optimizer.zero_grad()             # clear gradients
    loss.backward(retain_graph=True)  # gradients of the weighted loss

    # Reset the gradient of w: the task weights are trained by L_grad only
    model.weights.grad.data = model.weights.grad.data * 0.0
    # Weights W of the last shared layer
    W = model.get_last_shared_layer()
    # Compute G^i_W(t) for each task
    norms = []
    for i in range(len(task_loss)):
        # Norm of w_i * dL_i/dW
        gn = torch.autograd.grad(task_loss[i], W.parameters(), retain_graph=True)
        norms.append(torch.norm(torch.mul(model.weights[i], gn[0])))
    norms = torch.stack(norms)
    # Mean gradient norm across tasks
    mean_norm = np.mean(norms.detach().cpu().numpy())

    # Inverse training rate r
    loss_ratio = task_loss.detach().cpu().numpy() / initial_task_loss
    inv_train_rate = loss_ratio / np.mean(loss_ratio)
    # L_grad; the target is a constant (no gradient flows through it)
    const = torch.tensor(mean_norm * (inv_train_rate ** alpha), requires_grad=False)
    gn_loss = torch.sum(torch.abs(norms - const.to(device)))
    # Gradient of L_grad with respect to w
    model.weights.grad = torch.autograd.grad(gn_loss, model.weights)[0]

    # One optimizer step updates both the task weights w and the network parameters
    optimizer.step()

    epoch_trn_loss += loss.detach().cpu().numpy()
    for i in range(len(labels)-1):
        epoch_trn_multi_loss[i] += np.round(multi_losses[i].detach().cpu().numpy(), 4)
        train_history[i].append(multi_losses[i].detach().cpu().numpy())
        # Track each task's w, gradient norm, and GradNorm target
        weight_history[i].append(model.weights.data[i].clone())
        gn_history[i].append(norms[i].detach())
        const_history[i].append(const[i])

epoch_trn_loss = round(epoch_trn_loss/(step+1), 4)
# renormalize the loss weights after each epoch
norm_coeff = (len(labels)-1) / torch.sum(model.weights.data, dim=0)
model.weights.data = model.weights.data * norm_coeff
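To make the numbers in that loop concrete, the following framework-free sketch computes the inverse training rates, the per-task target gradient norms, and the value of L_grad for a toy case. The function name and all input values are hypothetical.

```python
def gradnorm_targets(grad_norms, losses, initial_losses, alpha):
    """Return (inverse training rates, target gradient norms, L_grad)."""
    ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / len(ratios)
    inv_rates = [r / mean_ratio for r in ratios]

    mean_norm = sum(grad_norms) / len(grad_norms)
    # Slow learners (inverse rate > 1) get a larger target gradient norm
    targets = [mean_norm * r ** alpha for r in inv_rates]
    l_grad = sum(abs(g - t) for g, t in zip(grad_norms, targets))
    return inv_rates, targets, l_grad


# Task 0 has halved its loss, task 1 has barely moved (hypothetical values)
inv_rates, targets, l_grad = gradnorm_targets(
    grad_norms=[2.0, 1.0], losses=[0.5, 0.9], initial_losses=[1.0, 1.0], alpha=1.0)

# With alpha = 0 every task is simply pulled toward the mean gradient norm
_, flat_targets, flat_l_grad = gradnorm_targets(
    grad_norms=[2.0, 1.0], losses=[0.5, 0.9], initial_losses=[1.0, 1.0], alpha=0.0)
```

In this toy case task 0 currently has the larger gradient norm but the smaller target, so minimizing L_grad pushes its weight down and the weight of the slower task up.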

4. DWA

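Paper: Liu, S., Johns, E., & Davison, A. J. (2019). End-to-end multi-task learning with attention. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition* (pp. 1871-1880).

Citations: 571

DWA stands for Dynamic Weight Average. Inspired by GradNorm, it also looks at how each task's loss changes over time and learns to average the task weights across training epochs. GradNorm needs access to the network's internal gradients, whereas DWA needs only the numerical task losses, so it is simpler to implement. The authors define the weight lambda_k of task k as follows: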
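Reconstructed from the DWA paper and consistent with the code in this section (K tasks, temperature T, epoch index t; the notation is an assumption), the definition is approximately:

```latex
\[
\lambda_k(t) = \frac{K\,\exp\!\big(w_k(t-1)/T\big)}{\sum_{i}\exp\!\big(w_i(t-1)/T\big)},
\qquad
w_k(t-1) = \frac{L_k(t-1)}{L_k(t-2)}
\]
```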

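w_k is the relative descending rate of the loss: the smaller it is, the faster the task is learning and the less weight it should receive. The w_k values are therefore fed into a softmax to obtain the task weights, so that a smaller w_k yields a smaller weight; tasks that have been learning quickly in recent epochs get lower importance. T acts as a temperature: the larger T is, the closer every weight gets to 1, so increasing T makes the task weights more uniform. Multiplying by K ensures that the weights sum to K, which keeps them on a common scale. A drawback of DWA is that it is easily dominated by tasks with large loss magnitudes [3]. In addition, w_k is initialized to 1 for the first two epochs.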
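The weight rule described above can be isolated as a small pure function. This is a minimal sketch; `dwa_weights`, its default temperature, and the loss values are hypothetical.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Task weights from the loss ratios of the two most recent epochs."""
    k = len(prev_losses)
    # w_k: relative descending rate; below 1 means the loss is still dropping
    rates = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    # Softmax scaled by K so that the weights sum to K
    return [k * e / total for e in exps]


# Task 0 halved its loss in the last epoch, task 1 barely improved
weights = dwa_weights([0.5, 0.9], [1.0, 1.0])

# A very large temperature flattens the weights toward 1
flat = dwa_weights([0.5, 0.9], [1.0, 1.0], temperature=1e6)
```

The fast-learning task 0 receives the smaller weight, and the gap between the weights shrinks as the temperature grows.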

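import numpy as np

# dynamic weight averaging
# model, model_params, labels, train_loader and the temperature T are defined elsewhere
loss_weights = np.ones((model_params['total_epoch'], len(labels)))
avg_task_loss = np.zeros((model_params['total_epoch'], len(labels)))

for epoch in range(model_params['total_epoch']):
    # w_k is initialized to 1 for the first two epochs
    if epoch == 0 or epoch == 1:
        pass
    # Compute the weights from the losses of the previous two epochs
    else:
        sum_w = []
        for i in range(len(labels)):
            w = avg_task_loss[epoch - 1, i] / avg_task_loss[epoch - 2, i]
            sum_w.append(np.exp(w / T))
        loss_weights[epoch] = [len(labels) * w / np.sum(sum_w) for w in sum_w]

    model.train()
    epoch_trn_multi_loss = np.zeros(len(labels))  # assumed per-epoch accumulator of task losses
    for step, batch in enumerate(train_loader):
        ...  # model training code: add each task's loss to epoch_trn_multi_loss

    # Save the average task losses at the end of each epoch
    avg_task_loss[epoch] = [task_loss / len(train_loader) for task_loss in epoch_trn_multi_loss]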

5. PE-LTR

Paper: Lin, X., Chen, H., Pei, C., Sun, F., Xiao, X., Sun, H., ... & Jiang, P. (2019, September). A pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In *Proceedings of the 13th ACM Conference on recommender systems* (pp. 20-28).

Citations: 64

Code: https://github.com/weberrr/PE-LTR/blob/master/PE-LTR.py

This paper targets the LTR (Learning to Rank) scenario in Alibaba's e-commerce platform and proposes a Pareto-efficient algorithm for finding suitable weights for each task. The idea is similar to the multi-objective optimization approach of MGDA described above. Going straight to the training procedure:

First define the loss function:
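As a sketch reconstructed from the PE-LTR paper and from the `c` argument of `pareto_step` in this section (the notation is an assumption), the aggregated loss uses weights \omega_i with lower bounds c_i, and the weights are chosen by a constrained min-norm problem on the task gradients:

```latex
\[
\mathcal{L}(\theta) = \sum_{i=1}^{K}\omega_i\,\mathcal{L}_i(\theta),
\qquad
\sum_{i=1}^{K}\omega_i = 1,\quad \omega_i \ge c_i
\]
\[
\min_{\omega}\;\Big\lVert \sum_{i=1}^{K}\omega_i\,\nabla_{\theta}\mathcal{L}_i(\theta)\Big\rVert_2^2
\quad\text{s.t.}\quad
\sum_{i=1}^{K}\omega_i = 1,\;\; \omega_i \ge c_i
\]
```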

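To solve this problem, the authors propose PECsolver.

In short, the problem is solved through the KKT conditions: first relax the constraints and solve for w, then bring the original constraints back to tighten the solution set [4]. The mathematical details are in the original paper for interested readers.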

import numpy as np
from scipy.optimize import minimize, nnls


def pareto_step(w, c, G):
    """
    ref:http://ofey.me/papers/Pareto.pdf
    K : the number of tasks
    M : the dim of NN's params
    :param w: # (K,1) current task weights (only the shape is used)
    :param c: # (K,1) lower bounds on the task weights
    :param G: # (K,M) flattened gradient of each task
    :return: new task weights, shape (K,)
    """
    GGT = np.matmul(G, np.transpose(G))  # (K, K)
    e = np.ones(np.shape(w))  # (K, 1)
    m_up = np.hstack((GGT, e))  # (K, K+1)
    m_down = np.hstack((np.transpose(e), np.zeros((1, 1))))  # (1, K+1)
    M = np.vstack((m_up, m_down))  # (K+1, K+1)
    z = np.vstack((-np.matmul(GGT, c), 1 - np.sum(c)))  # (K+1, 1)
    # Least-squares solution (M^T M)^-1 M^T z of the relaxed problem
    hat_w = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.transpose(M), M)), np.transpose(M)), z)  # (K+1, 1)
    hat_w = hat_w[:-1]  # (K, 1), drop the Lagrange multiplier
    hat_w = np.reshape(np.array(hat_w), (hat_w.shape[0],))  # (K,)
    c = np.reshape(np.array(c), (c.shape[0],))  # (K,)
    new_w = ASM(hat_w, c)
    return new_w


def ASM(hat_w, c):
    """
    ref:
    http://ofey.me/papers/Pareto.pdf,
    https://stackoverflow.com/questions/33385898/how-to-include-constraint-to-scipy-nnls-function-solution-so-that-it-sums-to-1
    :param hat_w: # (K,)
    :param c: # (K,)
    :return: weights that are >= c and sum to 1, shape (K,)
    """
    A = np.eye(len(c))  # identity matrix
    b = hat_w
    x0, _ = nnls(A, b)  # non-negative starting point

    def _fn(x, A, b):
        return np.linalg.norm(A.dot(x) - b)

    # x is the part of the weight above the lower bound c, so sum(x) + sum(c) = 1
    cons = {'type': 'eq', 'fun': lambda x: np.sum(x) + np.sum(c) - 1}
    bounds = [[0., None] for _ in range(len(hat_w))]
    min_out = minimize(_fn, x0, args=(A, b), method='SLSQP', bounds=bounds, constraints=cons)
    new_w = min_out.x + c
    return new_w

import torch

loss_weights = np.full(len(labels)-1, 1/(len(labels)-1))  # (K,)
w_constraint = np.full((len(labels)-1, 1), 0.)  # (K, 1)
params = [p for p in model.parameters() if p.requires_grad]
for epoch in range(model_params['total_epoch']):
    model.train()
    for step, batch in enumerate(train_loader):
        con_feats = batch['features'].float().to(device)
        targets = batch['targets'].float().to(device)

        preds = model(con_feats)  # forward pass
        loss, task_loss, multi_losses = model.loss(targets, preds,
                                                   loss_weights=loss_weights,
                                                   device=device)  # compute losses

        # G: (K, M), one flattened gradient per task.
        # autograd.grad leaves param.grad untouched; parameters a task does not
        # use are filled with zeros so that every row has the same length.
        grads = []
        for l in task_loss:
            g = torch.autograd.grad(l, params, retain_graph=True, allow_unused=True)
            flat = [(gi if gi is not None else torch.zeros_like(p)).reshape(-1)
                    for gi, p in zip(g, params)]
            grads.append(torch.cat(flat).detach().cpu().numpy())
        G = np.vstack(grads)

        optimizer.zero_grad()  # clear gradients
        loss.backward()        # gradients of the weighted loss
        optimizer.step()       # update the parameters

        # Task weights for the next step (the first argument must have shape (K, 1))
        loss_weights = pareto_step(loss_weights.reshape(-1, 1), w_constraint, G)

6. HTW

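Paper: Kongyoung, S., Macdonald, C., & Ounis, I. (2020). Multi-task learning using dynamic task weighting for conversational question answering.

Citations: 5

Existing MTL weighting methods treat all tasks equally, yet some business scenarios distinguish a main task from auxiliary tasks. HTW (Hybrid Task Weighting) applies a different weighting scheme to each. As the code below shows, each auxiliary task is weighted by the ratio of its current loss to its initial loss raised to the power alpha, while the weight of the main task grows linearly with the step count until a threshold is reached and is then fixed at 1.

The full training procedure is as follows: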

# Record the task losses of the first step
if step == 0:
    initial_task_loss = [l.detach() for l in task_loss]

loss = 0
multi_losses = []
# Auxiliary tasks: compute each weight and add the weighted loss
for i in range(len(task_loss) - 1):
    loss_weights[i] = (task_loss[i] / initial_task_loss[i]) ** alpha
    loss += loss_weights[i] * task_loss[i]
    multi_losses.append(loss_weights[i] * task_loss[i])
# Main task (last entry): linear ramp until step_threshold, then a fixed weight of 1
if step_count <= step_threshold:
    loss_weights[-1] = step_count / total_step
    step_count += 1
else:
    loss_weights[-1] = 1.0
loss += loss_weights[-1] * task_loss[-1]
multi_losses.append(loss_weights[-1] * task_loss[-1])

7. PCGrad

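Paper: Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., & Finn, C. (2020). Gradient surgery for multi-task learning. *Advances in Neural Information Processing Systems*, *33*, 5824-5836.

Citations: 353

Code: https://github.com/chenllliang/Gradient-Vaccine/blob/17fa758fdd4f87475ee2847db6fc0a013631fee3/fairseq/fairseq/optim/pcgrad.py

If two gradients conflict in direction, the gradient of task i is projected onto the normal plane of the gradient of any other task j that it conflicts with. In the paper's figure, if the cosine similarity between the gradients of task i and task j is positive, as in panel (d), they do not conflict and both tasks keep their original gradients for the update. If the directions conflict, as in panel (a), the gradient of task i is projected onto the normal plane of task j's gradient, as in panel (b), and the result is used as the new gradient of task i. The reverse is done for task j, as in panel (c).

The projection is computed as follows [5]:

In the figure, g_i and g_j conflict, so g_i is projected onto the normal plane of g_j to give the new gradient of task i (the blue dashed line). Let the green vector be a*g_j. By the triangle rule of vector addition, the blue dashed vector is x = g_i + a*g_j. The following calculation then yields a and the blue dashed vector (that is, the new gradient of task i):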
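The calculation follows from requiring the projected vector x to be orthogonal to g_j. A short derivation using only the symbols defined above:

```latex
\[
x \cdot g_j = 0
\;\Longrightarrow\;
(g_i + a\,g_j)\cdot g_j = 0
\;\Longrightarrow\;
a = -\frac{g_i \cdot g_j}{\lVert g_j \rVert^{2}}
\]
\[
x = g_i - \frac{g_i \cdot g_j}{\lVert g_j \rVert^{2}}\; g_j
\]
```

Since g_i and g_j conflict, g_i \cdot g_j < 0, so a is positive and the correction adds a multiple of g_j to g_i.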

[Figure: effect of PCGrad]

The overall training procedure is as follows:

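import tensorflow as tf

# Compute gradient projections.
# num_tasks and grads_task (the list of per-task gradients) are defined by the enclosing scope.
def proj_grad(grad_task):
    """Project grad_task away from every task gradient it conflicts with."""
    for k in range(num_tasks):
        inner_product = tf.reduce_sum(grad_task*grads_task[k])
        proj_direction = inner_product / tf.reduce_sum(grads_task[k]*grads_task[k])
        # tf.minimum(..., 0.) makes this a no-op when the inner product is non-negative
        grad_task = grad_task - tf.minimum(proj_direction, 0.) * grads_task[k]
    return grad_task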
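The same projection can be checked by hand on small vectors. The following is a minimal, framework-free sketch of the PCGrad step on flattened gradients; `pcgrad`, the fixed random seed, and the toy gradients are hypothetical and chosen only for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcgrad(task_grads, rng=None):
    """Project each task gradient away from the other gradients it conflicts with."""
    rng = rng or random.Random(0)
    projected = []
    for i, g_i in enumerate(task_grads):
        g = list(g_i)
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)  # the other tasks are visited in random order
        for j in others:
            g_j = task_grads[j]
            inner = dot(g, g_j)
            if inner < 0:  # conflict: remove the component along g_j
                scale = inner / dot(g_j, g_j)
                g = [a - scale * b for a, b in zip(g, g_j)]
        projected.append(g)
    return projected


# Two conflicting gradients (negative inner product)
grads = [[1.0, 0.0], [-1.0, 1.0]]
new_grads = pcgrad(grads)

# Two non-conflicting gradients are left unchanged
same = pcgrad([[1.0, 0.0], [1.0, 1.0]])
```

After the projection each new gradient is orthogonal to the original gradient of the task it conflicted with, and the projected gradients are then summed to form the update.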

8. GradVac

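Paper: Wang, Z., Tsvetkov, Y., Firat, O., & Cao, Y. (2020). Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. *arXiv preprint arXiv:2010.05874*.

Citations: 57

Code: https://github.com/chenllliang/Gradient-Vaccine

PCGrad only takes effect when the cosine similarity between gradients is negative, which makes its interventions very sparse during training, as the left panel of the paper's figure shows. In practice many task gradients have positive cosine similarity, but PCGrad ignores them, which motivates GradVac.

Appendix E of the paper gives the detailed derivation: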
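As a sketch reconstructed from the GradVac paper (treat the notation as an assumption), with \phi_{ij} the current cosine similarity between g_i and g_j and \phi_{ij}^{T} the target similarity, the update applied when \phi_{ij} < \phi_{ij}^{T} is:

```latex
\[
g_i' = g_i +
\frac{\lVert g_i\rVert\left(\phi_{ij}^{T}\sqrt{1-\phi_{ij}^{2}} - \phi_{ij}\sqrt{1-(\phi_{ij}^{T})^{2}}\right)}
     {\lVert g_j\rVert\sqrt{1-(\phi_{ij}^{T})^{2}}}\; g_j
\]
\[
\hat{\phi}_{ij}^{(t)} = (1-\beta)\,\hat{\phi}_{ij}^{(t-1)} + \beta\,\phi_{ij}^{(t)}
\]
```

The first line moves g_i so that its cosine similarity with g_j equals the target. The second line is the exponential moving average used to set that target adaptively; setting the target to 0 recovers the PCGrad projection.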

The full training procedure is as follows:
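A minimal, framework-free sketch of the pairwise step, assuming the GradVac update rule and moving-average target from the paper. `gradvac_pair`, `ema_update`, the value of beta, and the toy gradients are hypothetical; a real implementation applies this per parameter group inside the optimizer.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gradvac_pair(g_i, g_j, target_sim):
    """Return (new g_i, current cosine similarity) for one task pair."""
    norm_i, norm_j = math.sqrt(dot(g_i, g_i)), math.sqrt(dot(g_j, g_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return list(g_i), 0.0
    phi = dot(g_i, g_j) / (norm_i * norm_j)
    if phi >= target_sim or target_sim >= 1.0:
        return list(g_i), phi  # already at least as similar as the target
    root_phi = math.sqrt(max(0.0, 1.0 - phi ** 2))
    root_target = math.sqrt(1.0 - target_sim ** 2)
    coef = norm_i * (target_sim * root_phi - phi * root_target) / (norm_j * root_target)
    return [a + coef * b for a, b in zip(g_i, g_j)], phi


def ema_update(target_sim, phi, beta=0.01):
    """Move the target similarity toward the observed similarity."""
    return (1.0 - beta) * target_sim + beta * phi


# Orthogonal gradients with a target similarity of 0.5
g_i, g_j = [1.0, 0.0], [0.0, 1.0]
new_g_i, phi = gradvac_pair(g_i, g_j, target_sim=0.5)
new_cos = dot(new_g_i, g_j) / (math.sqrt(dot(new_g_i, new_g_i)) * math.sqrt(dot(g_j, g_j)))
next_target = ema_update(0.5, phi)
```

Unlike PCGrad, the adjustment also fires for pairs whose similarity is positive but below the running target, which is what makes the intervention less sparse.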

Summary

[Figure: summary of the methods above]

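References

[1] Multi-task, multi-objective CTR prediction techniques, 阿里云云栖号 article: https://baijiahao.baidu.com/s?id=1713545722047735100&wfr=spider&for=pc

[2] How to balance multiple losses in deep learning? Answer by 陈瀚清, 知乎 (Zhihu): https://www.zhihu.com/question/375794498/answer/2657267272

[3] Multi-objective sample weighting: GradNorm and DWA explained and implemented, 知乎 (Zhihu): https://zhuanlan.zhihu.com/p/542296680

[4] Alibaba multi-objective optimization: PE-LTR, 知乎 (Zhihu): https://zhuanlan.zhihu.com/p/159459480

[5] Multi-task learning: [ICLR 2020] PCGrad, 知乎 (Zhihu): https://zhuanlan.zhihu.com/p/39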


Page updated: 2024-05-10
