1. Numerical Differentiation
Partial derivatives can be computed directly from the definition:
$$\frac{\partial f(\theta)}{\partial \theta_i}=\lim_{\epsilon\rightarrow 0}\frac{f(\theta+\epsilon e_i)-f(\theta)}{\epsilon}$$
where $e_i$ is the unit vector whose $i$-th entry is 1. Computing every partial derivative this way gives the full gradient. A more accurate approximation of the partial derivative is:
$$\frac{\partial f(\theta)}{\partial \theta_i}=\frac{f(\theta+\epsilon e_i)-f(\theta-\epsilon e_i)}{2\epsilon}+O(\epsilon^2)$$
This follows from the Taylor expansion:
$$f(\theta+\delta)=f(\theta)+f'(\theta)\delta+\frac{1}{2}f''(\theta)\delta^2+O(\delta^3)$$
Substituting $\delta=\epsilon e_i$ and $\delta=-\epsilon e_i$ into this expansion and combining the two results yields the central-difference formula. Its error is one order smaller than that of the first formula, so it is the more accurate approximation.
This numerical approach is not used to compute gradients in practice: computers can only represent a limited range of values, so $\epsilon$ cannot be made arbitrarily small and numerical error creeps in. Moreover, it requires evaluating a partial derivative for every element of $\theta$, which makes it very expensive.
Numerical differentiation is, however, commonly used as a reference to check whether an automatic differentiation implementation is correct. Because the full check is expensive, the following form is used instead:
$$\delta^T\nabla_\theta f(\theta)=\frac{f(\theta+\epsilon \delta)-f(\theta-\epsilon\delta)}{2\epsilon}+O(\epsilon^2)$$
where $\delta$ is a unit vector pointing in an arbitrary direction. The directional derivative along any direction equals the dot product of the gradient with the unit direction vector, so the directional derivative computed from the candidate gradient can be compared with the one computed numerically from the definition; if the two agree, the gradient is considered correct. For background on gradients, partial derivatives, and directional derivatives, see here.
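As a quick illustration, here is a minimal NumPy sketch of such a gradient checker (my own example, not from the post; the test function, candidate gradient, and tolerance are made up):

```python
import numpy as np

def check_gradient(f, grad_f, theta, eps=1e-5, tol=1e-4, n_trials=10):
    """Compare the analytic directional derivative delta^T grad_f(theta) against
    the central-difference estimate (f(theta+eps*delta)-f(theta-eps*delta))/(2*eps)
    along a few random unit directions delta."""
    for _ in range(n_trials):
        delta = np.random.randn(*theta.shape)
        delta /= np.linalg.norm(delta)                 # unit direction
        analytic = float(np.sum(delta * grad_f(theta)))
        numeric = (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps)
        if abs(analytic - numeric) > tol * max(1.0, abs(numeric)):
            return False
    return True

# Example: f(theta) = sum(theta^2), whose gradient is 2*theta.
f = lambda t: float(np.sum(t ** 2))
grad_f = lambda t: 2 * t
print(check_gradient(f, grad_f, np.random.randn(5)))   # expected: True
```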
2. Symbolic Differentiation
The gradient can be derived from the following rules together with the chain rule:
$$\frac{\partial (f(\theta)+g(\theta))}{\partial\theta}=\frac{\partial f(\theta)}{\partial\theta}+\frac{\partial g(\theta)}{\partial\theta}$$
$$\frac{\partial (f(\theta)g(\theta))}{\partial\theta}=g(\theta)\frac{\partial f(\theta)}{\partial\theta}+f(\theta)\frac{\partial g(\theta)}{\partial\theta}$$
$$\frac{\partial f(g(\theta))}{\partial\theta}=\frac{\partial f(g(\theta))}{\partial g(\theta)}\frac{\partial g(\theta)}{\partial\theta}$$
Applying these rules naively often leads to a great deal of repeated computation, for example:
$$f(\theta)=\prod_{i=1}^{n}\theta_i,\qquad \frac{\partial f(\theta)}{\partial\theta_k}=\prod_{j\neq k}\theta_j$$
Computing all $n$ partial derivatives this way involves on the order of $n(n-2)$ repeated multiplications.
Automatic differentiation can be derived from these same symbolic differentiation rules.
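For concreteness, a small SymPy sketch (SymPy is my own choice here, not something the post uses) makes the redundancy visible: each partial derivative of the product is an independent symbolic expression that re-multiplies almost all of the same factors.

```python
import sympy as sp

n = 4
theta = sp.symbols(f"theta1:{n + 1}")   # theta1, ..., theta4
f = sp.Mul(*theta)                      # theta1*theta2*theta3*theta4

# Each partial derivative is the product of the other n-1 factors;
# evaluating all n of them separately repeats most multiplications.
grads = [sp.diff(f, t) for t in theta]
for t, g in zip(theta, grads):
    print(f"d f / d {t} = {g}")
```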
3. Automatic Differentiation
3.1 Computational Graphs
The computational graph is at the core of every machine learning framework; it encodes the order in which an expression is evaluated.
The computational graph above represents the expression:
$$y=f(x_1,x_2)=\ln(x_1)+x_1x_2-\sin(x_2)$$
Its forward computation proceeds as follows:
$$
\begin{aligned}
v_1&=x_1=2\\
v_2&=x_2=5\\
v_3&=\ln(v_1)=\ln(2)=0.693\\
v_4&=v_1\times v_2=10\\
v_5&=\sin(v_2)=\sin(5)=-0.959\\
v_6&=v_3+v_4=10.693\\
v_7&=v_6-v_5=10.693+0.959=11.652\\
y&=v_7=11.652
\end{aligned}
$$
Traversing the computational graph in topological order produces the result.
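As a sanity check (my own sketch, not part of the post), evaluating the nodes in topological order in plain Python reproduces the trace above:

```python
import math

x1, x2 = 2.0, 5.0

# Evaluate nodes in topological order: each v_i only uses already-computed values.
v1 = x1
v2 = x2
v3 = math.log(v1)        # 0.693
v4 = v1 * v2             # 10
v5 = math.sin(v2)        # -0.959
v6 = v3 + v4             # 10.693
v7 = v6 - v5             # 11.652
print(round(v7, 3))      # 11.652
```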
3.2 Forward-Mode Automatic Differentiation
Define $\dot{v_i}=\frac{\partial v_i}{\partial x_1}$. Traversing the computational graph in topological order, each $\dot{v_i}$ can be computed iteratively:
$$
\begin{aligned}
\dot{v_1}&=1\\
\dot{v_2}&=0\\
\dot{v_3}&=\frac{\dot{v_1}}{v_1}=1/2=0.5\\
\dot{v_4}&=\dot{v_1}v_2+\dot{v_2}v_1=5\\
\dot{v_5}&=\dot{v_2}\cos(v_2)=0\\
\dot{v_6}&=\dot{v_3}+\dot{v_4}=5.5\\
\dot{v_7}&=\dot{v_6}-\dot{v_5}=5.5
\end{aligned}
$$
Therefore $\frac{\partial y}{\partial x_1}=\dot{v_7}=5.5$. This procedure is forward-mode automatic differentiation. Forward mode has a limitation: for $f:\mathbb{R}^n\rightarrow\mathbb{R}^k$, obtaining the partial derivative with respect to every input element requires $n$ forward passes. In typical deep learning applications $n$ is very large while $k=1$, so this is inefficient, which motivates the other scheme: reverse-mode automatic differentiation.
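The forward pass above can be mimicked with (value, tangent) pairs; this is only a sketch of the idea (my own, not from the post), seeding $x_1$ with tangent 1 and $x_2$ with 0:

```python
import math

# Each node carries (value, d(value)/d(x1)); seeds: x1 -> (2, 1), x2 -> (5, 0).
v1, d1 = 2.0, 1.0
v2, d2 = 5.0, 0.0

v3, d3 = math.log(v1), d1 / v1              # 0.693, 0.5
v4, d4 = v1 * v2,      d1 * v2 + d2 * v1    # 10,    5
v5, d5 = math.sin(v2), d2 * math.cos(v2)    # -0.959, 0
v6, d6 = v3 + v4,      d3 + d4              # 10.693, 5.5
v7, d7 = v6 - v5,      d6 - d5              # 11.652, 5.5

print(round(d7, 1))   # dy/dx1 = 5.5
```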
3.3 Reverse-Mode Automatic Differentiation
Define the adjoint $\overline{v_i}=\frac{\partial y}{\partial v_i}$. Traversing the computational graph in reverse topological order, each $\overline{v_i}$ can be computed iteratively:
$$
\begin{aligned}
\overline{v_7}&=\frac{\partial y}{\partial v_7}=1\\
\overline{v_6}&=\overline{v_7}\frac{\partial v_7}{\partial v_6}=\overline{v_7}\times 1=1\\
\overline{v_5}&=\overline{v_7}\frac{\partial v_7}{\partial v_5}=\overline{v_7}\times (-1)=-1\\
\overline{v_4}&=\overline{v_6}\frac{\partial v_6}{\partial v_4}=\overline{v_6}\times 1=1\\
\overline{v_3}&=\overline{v_6}\frac{\partial v_6}{\partial v_3}=\overline{v_6}\times 1=1\\
\overline{v_2}&=\overline{v_5}\frac{\partial v_5}{\partial v_2}+\overline{v_4}\frac{\partial v_4}{\partial v_2}=\overline{v_5}\cos(v_2)+\overline{v_4}v_1=-\cos(5)+2=-0.284+2=1.716\\
\overline{v_1}&=\overline{v_4}\frac{\partial v_4}{\partial v_1}+\overline{v_3}\frac{\partial v_3}{\partial v_1}=\overline{v_4}v_2+\frac{\overline{v_3}}{v_1}=5+\frac{1}{2}=5.5
\end{aligned}
$$
Derivatives along multiple paths: as the figure below shows, when $v_1$ is used on several paths, with both $v_2$ and $v_3$ computed from $v_1$, then $y$ can be written as a function $f(v_2,v_3)$ of $v_2$ and $v_3$, and:
$$\overline{v_1}=\frac{\partial y}{\partial v_1}=\frac{\partial f(v_2,v_3)}{\partial v_2}\frac{\partial v_2}{\partial v_1}+\frac{\partial f(v_2,v_3)}{\partial v_3}\frac{\partial v_3}{\partial v_1}=\overline{v_2}\frac{\partial v_2}{\partial v_1}+\overline{v_3}\frac{\partial v_3}{\partial v_1}$$
For each input-output node pair $(i,j)$, define the partial adjoint $\overline{v_{i\rightarrow j}}=\overline{v_j}\frac{\partial v_j}{\partial v_i}$. Then:
$$\overline{v_i}=\sum_{j\in \mathrm{next}(i)}\overline{v_{i\rightarrow j}}$$
Implementation of the reverse-mode automatic differentiation algorithm:
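The lecture's pseudocode is not reproduced here; the following is my own minimal Python sketch of the same idea (the node attributes `inputs` and the local-derivative helper `partial` are assumptions for the sketch, not needle's API). Each `adjoint[x] += ...` line accumulates exactly one partial adjoint $\overline{v_{x\rightarrow \text{node}}}$ into $\overline{v_x}$.

```python
from collections import defaultdict

def reverse_ad(output_node, topo_order):
    """Sketch of reverse-mode AD over scalar nodes.

    Assumed interface: each node has `inputs` (list of parent nodes) and
    `partial(x)`, returning the local derivative d(node)/d(x) at the recorded values.
    """
    adjoint = defaultdict(float)          # node -> dy/dnode, i.e. the v-bar values
    adjoint[output_node] = 1.0
    for node in reversed(topo_order):     # reverse topological order
        for x in node.inputs:
            # partial adjoint v_{x->node} = v_bar(node) * d(node)/d(x),
            # summed over all consumers of x
            adjoint[x] += adjoint[node] * node.partial(x)
    return adjoint
```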
One could simply store the $\overline{v_i}$ values in arrays, but in practice this is not done; instead, a reverse-mode differentiation computational graph is constructed. The figure below shows, step by step, how the reverse-mode graph is built from the algorithm above.
Here id denotes the identity function, $\mathrm{id}(x)=x$, which returns whatever value it is given. With an explicit reverse-mode graph, the same graph can be reused to obtain partial derivative values for different inputs without rerunning automatic differentiation; most modern deep learning frameworks take this approach.
Classic backpropagation differs from reverse-mode automatic differentiation in that backpropagation walks the forward graph backwards step by step without creating new nodes, whereas reverse-mode AD extends the forward graph with new nodes.
The tensor form of reverse-mode AD can be derived from the scalar form; the procedure is identical, only the notation differs.
For the computational graph above, the forward computation is:
$$
\begin{aligned}
d&=\{\text{'cat'}:a_0,\ \text{'dog'}:a_1\}\\
b&=d[\text{'cat'}]\\
v&=f(b)
\end{aligned}
$$
Define the adjoint of this data structure as $\overline{d}=\{\text{'cat'}:\frac{\partial y}{\partial a_0},\ \text{'dog'}:\frac{\partial y}{\partial a_1}\}$. The reverse pass is then:
$$
\begin{aligned}
\overline{b}&=\frac{\partial v}{\partial b}\overline{v}\\
\overline{d}&=\{\text{'cat'}:\overline{b}\}
\end{aligned}
$$
This generalization is usually referred to as differentiable programming.
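A toy version of that reverse pass in plain Python (my own illustration; `f` and its derivative `df` are placeholders):

```python
import math

a0, a1 = 2.0, 3.0
f, df = math.exp, math.exp            # placeholder differentiable function

# Forward pass over the dictionary-valued program.
d = {"cat": a0, "dog": a1}
b = d["cat"]
v = f(b)

# Reverse pass: adjoints mirror the data structures of the forward pass.
v_bar = 1.0                           # dy/dv with y = v
b_bar = df(b) * v_bar                 # dv/db * v_bar
d_bar = {"cat": b_bar}                # only the entry that was used gets a gradient
print(d_bar)
```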
4. Experiments
Completed code for the experiments: https://github.com/Redtorm/dlsycourse/tree/main/hw1
4.1 The needle implementation
- The Value class represents a node of the computational graph; it holds the node's inputs, its operator, and the current value computed from those inputs.
- Lazy mode builds the computational graph without performing any computation. Since graph construction time is negligible compared with the computation itself, the graph can be built first and all of its computations fused into one batch, improving efficiency; PyTorch's TPU backend has an optimization of this kind. needle uses eager mode, i.e. values are computed while the graph is being built.
- detach removes a value from the graph so that no intermediate computational graph is kept for it, saving memory.
4.2 Question 1: Implementing forward computation
Implement the following forward-computation ops:
- PowerScalar: raise input to an integer (scalar) power
- EWiseDiv: true division of the inputs, element-wise (2 inputs)
- DivScalar: true division of the input by a scalar, element-wise (1 input, scalar - number)
- MatMul: matrix multiplication of the inputs (2 inputs)
- Summation: sum of array elements over given axes (1 input, axes - tuple)
- BroadcastTo: broadcast an array to a new shape (1 input, shape - tuple)
- Reshape: gives a new shape to an array without changing its data (1 input, shape - tuple)
- Negate: numerical negative, element-wise (1 input)
- Transpose: reverses the order of two axes (axis1, axis2), defaults to the last two axes (1 input, axes - tuple)
The implementation is listed together with the gradients in Section 4.3.
4.3 Question 2: Implementing backward computation
Notes on the trickier gradients:
- MatMul: broadcasting must be handled. When the two input tensors have different numbers of dimensions, matmul broadcasts them, and the gradient must account for this. The gradient with respect to a variable must have the same shape as that variable, so comparing shapes reveals whether broadcasting happened in the forward pass; if it did, the gradient along the broadcast dimensions has to be reduced (summed). The reason is that broadcasting essentially copies data, so each broadcast copy contributes its own gradient term to the same variable, and these contributions must be summed.
- Summation: the gradient of summation is essentially a broadcast, so the output gradient just needs to be broadcast back along the reduced axes.
- BroadcastTo: symmetrically, the gradient of broadcast is a reduction of out_grad along the broadcast axes.
```python
class PowerScalar(TensorOp):
    """Op raise a tensor to an (integer) power."""

    def __init__(self, scalar: int):
        self.scalar = scalar

    def compute(self, a: NDArray) -> NDArray:
        ### BEGIN YOUR SOLUTION
        return array_api.power(a, self.scalar)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        # d(a^n)/da = n * a^(n-1); build the term with tensor ops so it stays in the graph.
        return out_grad * self.scalar * power_scalar(a, self.scalar - 1)
        ### END YOUR SOLUTION


class EWiseDiv(TensorOp):
    """Op to element-wise divide two nodes."""

    def compute(self, a, b):
        ### BEGIN YOUR SOLUTION
        if a.shape != b.shape:
            raise RuntimeError("the shapes are not consistent.")
        if not b.all():
            raise RuntimeError("cannot divide by zero.")
        return a / b
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a, b = node.inputs
        grad_a = out_grad / b
        grad_b = out_grad * (-a / (b * b))
        return grad_a, grad_b
        ### END YOUR SOLUTION


class DivScalar(TensorOp):
    def __init__(self, scalar):
        self.scalar = scalar

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        if self.scalar == 0:
            raise RuntimeError("cannot divide by zero.")
        return a / self.scalar
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return out_grad / self.scalar
        ### END YOUR SOLUTION


class MatMul(TensorOp):
    def compute(self, a, b):
        ### BEGIN YOUR SOLUTION
        return a @ b
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a, b = node.inputs
        grad_a = matmul(out_grad, transpose(b))
        grad_b = matmul(transpose(a), out_grad)
        # If an input was broadcast along leading batch dimensions in the forward
        # pass, sum the gradient over those extra dimensions so it matches the input shape.
        if grad_a.shape != a.shape:
            grad_a = summation(grad_a, tuple(range(len(grad_a.shape) - len(a.shape))))
        if grad_b.shape != b.shape:
            grad_b = summation(grad_b, tuple(range(len(grad_b.shape) - len(b.shape))))
        return grad_a, grad_b
        ### END YOUR SOLUTION


class Summation(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.sum(a, self.axes)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        # Restore the reduced axes as size-1 dimensions, then broadcast back to the input shape.
        shape = list(a.shape)
        if self.axes is None:
            shape = [1 for _ in shape]
        elif isinstance(self.axes, int):
            shape[self.axes] = 1
        else:
            for i in self.axes:
                shape[i] = 1
        return broadcast_to(reshape(out_grad, tuple(shape)), a.shape)
        ### END YOUR SOLUTION


class BroadcastTo(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.broadcast_to(a, self.shape)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        # Find the axes that were broadcast: right-align the input shape against the
        # output shape; every mismatching (or missing) axis was broadcast and must be summed.
        axes = []
        shape = [1 for _ in self.shape]
        dis = len(self.shape) - len(a.shape)
        shape[dis:] = a.shape
        for i in range(len(shape)):
            if shape[i] != self.shape[i]:
                axes.append(i)
        grad_a = summation(out_grad, tuple(axes))
        grad_a = reshape(grad_a, a.shape)
        return grad_a
        ### END YOUR SOLUTION


class Reshape(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.reshape(a, self.shape)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        return reshape(out_grad, a.shape)
        ### END YOUR SOLUTION


class Negate(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return -a
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return negate(out_grad)
        ### END YOUR SOLUTION


class Transpose(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        # Swap the requested pair of axes (the last two by default).
        perm = list(range(a.ndim))
        if self.axes is None:
            perm[-2], perm[-1] = perm[-1], perm[-2]
        else:
            perm[self.axes[0]], perm[self.axes[1]] = self.axes[1], self.axes[0]
        return array_api.transpose(a, perm)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        # Swapping the same pair of axes again undoes the transpose.
        return transpose(out_grad, self.axes)
        ### END YOUR SOLUTION


class Log(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.log(a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        return divide(out_grad, a)
        ### END YOUR SOLUTION


class Exp(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.exp(a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        return out_grad * exp(a)
        ### END YOUR SOLUTION


class ReLU(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.maximum(0, a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        # Gradient passes through only where the input was positive; the 0/1 mask
        # is a constant with respect to the graph.
        a = node.inputs[0]
        mask = Tensor((a.numpy() > 0).astype("float32"))
        return out_grad * mask
        ### END YOUR SOLUTION
```
4.4 Question 3: Topological sort
A post-order depth-first traversal is used to generate the topological ordering.
```python
def find_topo_sort(node_list: List[Value]) -> List[Value]:
    """Given a list of nodes, return a topological sort list of nodes ending in them.

    A simple algorithm is to do a post-order DFS traversal on the given nodes,
    going backwards based on input edges. Since a node is added to the ordering
    after all its predecessors are traversed due to post-order DFS, we get a topological
    sort.
    """
    ### BEGIN YOUR SOLUTION
    visited = set()
    topo_order = []
    for node in node_list:
        if id(node) not in visited:
            topo_sort_dfs(node, visited, topo_order)
    return topo_order
    ### END YOUR SOLUTION


def topo_sort_dfs(node, visited, topo_order):
    """Post-order DFS"""
    ### BEGIN YOUR SOLUTION
    visited.add(id(node))
    for input_node in node.inputs:
        if id(input_node) not in visited:
            topo_sort_dfs(input_node, visited, topo_order)
    # Append after all predecessors have been visited (post-order).
    topo_order.append(node)
    ### END YOUR SOLUTION
```
4.5 Question 4: Implementing reverse mode differentiation
```python
def compute_gradient_of_variables(output_tensor, out_grad):
    """Take gradient of output node with respect to each node in node_list.

    Store the computed result in the grad field of each Variable.
    """
    # a map from node to a list of gradient contributions from each output node
    node_to_output_grads_list: dict[Tensor, List[Tensor]] = {}
    # Special note on initializing gradient:
    # we are really taking a derivative of the scalar reduce_sum(output_node)
    # instead of the vector output_node. But this is the common case for loss function.
    node_to_output_grads_list[output_tensor] = [out_grad]

    # Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))

    ### BEGIN YOUR SOLUTION
    for node in reverse_topo_order:
        # The adjoint of a node is the sum of the partial adjoints contributed by its consumers.
        node.grad = sum(node_to_output_grads_list[node])
        if node.op is None:
            continue  # leaf node: nothing to propagate further
        grad_list = node.op.gradient_as_tuple(node.grad, node)
        for input_node, grad in zip(node.inputs, grad_list):
            node_to_output_grads_list.setdefault(input_node, []).append(grad)
    ### END YOUR SOLUTION
```
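Assuming the rest of the needle skeleton is in place (Tensor construction from NumPy arrays, operator overloads, and Tensor.backward() calling compute_gradient_of_variables), a quick check might look like the following sketch of mine; the values are not from the post:

```python
import numpy as np
import needle as ndl

# y = log(x1) + x1 * x2, so dy/dx1 = 1/x1 + x2 and dy/dx2 = x1.
x1 = ndl.Tensor(np.array([2.0], dtype=np.float32))
x2 = ndl.Tensor(np.array([5.0], dtype=np.float32))
y = ndl.log(x1) + x1 * x2

y.backward()                # runs reverse-mode AD from y
print(x1.grad.numpy())      # expected: [5.5]
print(x2.grad.numpy())      # expected: [2.0]
```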
4.6 Question 5: Softmax loss
```python
def softmax_loss(Z, y_one_hot):
    """Return softmax loss. Note that for the purposes of this assignment,
    you don't need to worry about "nicely" scaling the numerical properties
    of the log-sum-exp computation, but can just compute this directly.

    Args:
        Z (ndl.Tensor[np.float32]): 2D Tensor of shape
            (batch_size, num_classes), containing the logit predictions for
            each class.
        y_one_hot (ndl.Tensor[np.int8]): 2D Tensor of shape (batch_size, num_classes)
            containing a 1 at the index of the true label of each example and
            zeros elsewhere.

    Returns:
        Average softmax loss over the sample. (ndl.Tensor[np.float32])
    """
    ### BEGIN YOUR SOLUTION
    # loss_i = log(sum_j exp(Z_ij)) - Z_{i, y_i}; average over the batch.
    lhs = ndl.log(ndl.summation(ndl.exp(Z), axes=1))
    rhs = ndl.summation(ndl.multiply(Z, y_one_hot), axes=1)
    loss = ndl.summation(lhs - rhs) / Z.shape[0]
    return loss
    ### END YOUR SOLUTION
```
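A quick NumPy cross-check of the formula (my own sketch, independent of needle): the softmax loss per example is the log-sum-exp of the logits minus the logit of the true class.

```python
import numpy as np

Z = np.array([[1.0, 2.0, 0.5],
              [0.2, 0.1, 3.0]], dtype=np.float32)   # (batch, classes) logits
y = np.array([1, 2])                                # true class indices

lse = np.log(np.exp(Z).sum(axis=1))                 # log-sum-exp per example
true_logit = Z[np.arange(Z.shape[0]), y]
loss = (lse - true_logit).mean()
print(loss)   # the value the needle version should match for this input
```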
4.7 Question 6: SGD for a two-layer neural network
```python
def nn_epoch(X, y, W1, W2, lr=0.1, batch=100):
    """Run a single epoch of SGD for a two-layer neural network defined by the
    weights W1 and W2 (with no bias terms):
        logits = ReLU(X * W1) * W2
    The function should use the step size lr, and the specified batch size (and
    again, without randomizing the order of X).

    Args:
        X (np.ndarray[np.float32]): 2D input array of size
            (num_examples x input_dim).
        y (np.ndarray[np.uint8]): 1D class label array of size (num_examples,)
        W1 (ndl.Tensor[np.float32]): 2D array of first layer weights, of shape
            (input_dim, hidden_dim)
        W2 (ndl.Tensor[np.float32]): 2D array of second layer weights, of shape
            (hidden_dim, num_classes)
        lr (float): step size (learning rate) for SGD
        batch (int): size of SGD mini-batch

    Returns:
        Tuple: (W1, W2)
            W1: ndl.Tensor[np.float32]
            W2: ndl.Tensor[np.float32]
    """
    ### BEGIN YOUR SOLUTION
    iterations = (X.shape[0] + batch - 1) // batch
    for i in range(iterations):
        # Slice out the current mini-batch (in order, without shuffling).
        l = i * batch
        r = min(X.shape[0], (i + 1) * batch)
        b_X = X[l:r, :]
        b_y = y[l:r]

        # Forward pass through the two-layer network.
        ndl_X = ndl.Tensor(b_X)
        Z1 = ndl.relu(ndl.matmul(ndl_X, W1))
        Z2 = ndl.matmul(Z1, W2)

        # One-hot encode the labels of this batch.
        Iy = np.zeros(Z2.shape)
        Iy[np.arange(Iy.shape[0]), b_y] = 1
        y_onehot = ndl.Tensor(Iy)

        # Backward pass and SGD update; wrapping the result in a fresh Tensor
        # detaches the updated weights from the old computational graph.
        loss = softmax_loss(Z2, y_onehot)
        loss.backward()
        W1 = ndl.Tensor(W1.numpy() - lr * W1.grad.numpy())
        W2 = ndl.Tensor(W2.numpy() - lr * W2.grad.numpy())
    return (W1, W2)
    ### END YOUR SOLUTION
```
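A hedged usage sketch with random data (my own example; the shapes follow the hw1 MNIST setup, but the data itself is made up):

```python
import numpy as np
import needle as ndl

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 784)).astype(np.float32)   # fake "images"
y = rng.integers(0, 10, size=1000).astype(np.uint8)       # fake labels

hidden = 100
W1 = ndl.Tensor((rng.standard_normal((784, hidden)) / np.sqrt(hidden)).astype(np.float32))
W2 = ndl.Tensor((rng.standard_normal((hidden, 10)) / np.sqrt(10)).astype(np.float32))

W1, W2 = nn_epoch(X, y, W1, W2, lr=0.1, batch=100)
print(W1.shape, W2.shape)   # (784, 100) (100, 10)
```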