This segment explains how the random initialization of weights in deep neural networks scrambles the input into random noise as it passes through many layers, hindering effective training. Multiplying the activations by a random weight matrix at each layer progressively washes out the signal carried by the input, so little meaningful information survives by the time the data reaches the output layer.

This segment details how the scrambled activations and gradients harm training. The loss computed on this mostly random noise at the output layer is not very informative, and the backpropagated gradients are scrambled in the same way on their way back, rendering updates to the early layers ineffective. Training progresses slowly and the loss barely decreases, because gradient descent is essentially wandering at random.

This segment introduces skip connections as a solution to the vanishing gradient problem. Skip connections group layers into blocks and let data flow both through the layers inside a block and around them via a direct connection from the block's input to its output. These two paths allow the block's input to be combined with its output.

In short: deep neural networks are hard to train because of vanishing gradients, and skip connections (residual blocks) mitigate this by adding a path around the layers so gradients can flow directly back to earlier layers. This simplifies learning and speeds up training, making deeper networks feasible. Layer shapes need careful attention, especially with convolutions; a 1x1 convolution is often used to match dimensions so the two paths can be added.

This segment explains the advantages of residual blocks, which are built around skip connections. Residual blocks simplify learning by letting each block focus on augmenting the data it receives rather than computing everything from scratch. They also speed up gradient propagation by giving gradients shorter paths to every layer, leading to more effective updates and faster training.
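To make the idea concrete, here is a minimal sketch of a residual block in PyTorch. It is not the exact architecture discussed in the segment: the class name ResidualBlock, the choice of two 3x3 convolutions with batch normalization, and the ReLU activations are illustrative assumptions. It does show the two points made above: the block's output is its input plus a learned correction, and a 1x1 convolution on the skip path matches dimensions when the shapes of the two paths differ.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Sketch of a residual block: output = ReLU(F(x) + shortcut(x))."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Main path: two 3x3 convolutions with batch norm (an assumed layout).
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3,
                      stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Skip path: identity when shapes already match, otherwise a
        # 1x1 convolution so the two tensors can be added element-wise.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The block only learns a correction to its input, and gradients
        # can flow back through the shortcut without being scrambled by
        # the block's weight matrices.
        return self.relu(self.body(x) + self.shortcut(x))


if __name__ == "__main__":
    block = ResidualBlock(in_channels=64, out_channels=128, stride=2)
    x = torch.randn(1, 64, 32, 32)
    print(block(x).shape)  # torch.Size([1, 128, 16, 16])
```

Because the skip path is (close to) an identity map, the gradient of the loss with respect to the block's input always contains a direct, unattenuated term from the block's output, which is what lets very deep stacks of such blocks train effectively.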