DLN Sharpness
Train a deep linear network of depth $L$ on a matrix factorization task via full-batch gradient descent. All weight matrices are $n \times n$; the goal is to recover a random target $M^*$.
Setup
network ($M = W_1 \cdots W_L$)
$$\mathcal{L} = \left\| M^* - W_1 W_2 \cdots W_L \right\|_F^2$$
target
$$M^*_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0,\,1)$$
initialization
$$(W_k)_{ij}(0) \overset{\text{iid}}{\sim} \mathcal{N}(0,\, \sigma^2)$$
update
$$W_k \leftarrow W_k - \eta \,\nabla_{W_k}\mathcal{L}$$
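The setup above can be sketched in NumPy. This is a minimal illustration, not the page's actual implementation; the function name, step count, and seed are my own choices. The gradient uses the closed form $\nabla_{W_k}\mathcal{L} = 2\,(W_1 \cdots W_{k-1})^\top (M - M^*)\,(W_{k+1} \cdots W_L)^\top$, which follows from differentiating the Frobenius loss through the matrix product.

```python
import numpy as np

def train_dln(L=2, n=4, sigma=0.1, eta=1e-3, steps=50000, seed=0):
    """Full-batch GD on loss = ||M* - W_1 ... W_L||_F^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    M_star = rng.standard_normal((n, n))                  # target: M*_ij ~ N(0, 1)
    Ws = [sigma * rng.standard_normal((n, n)) for _ in range(L)]  # (W_k)_ij ~ N(0, sigma^2)
    for _ in range(steps):
        # prefix[k] = W_1 ... W_k and suffix[k] = W_{k+1} ... W_L (identity at the ends)
        prefix = [np.eye(n)]
        for W in Ws:
            prefix.append(prefix[-1] @ W)
        suffix = [np.eye(n)]
        for W in reversed(Ws):
            suffix.append(W @ suffix[-1])
        suffix = suffix[::-1]
        E = prefix[-1] - M_star                           # residual M - M*
        # grad_{W_k} L = 2 * (W_1...W_{k-1})^T E (W_{k+1}...W_L)^T
        grads = [2.0 * prefix[k].T @ E @ suffix[k + 1].T for k in range(L)]
        Ws = [W - eta * g for W, g in zip(Ws, grads)]     # update W_k <- W_k - eta grad
    M = np.linalg.multi_dot(Ws) if L > 1 else Ws[0]
    return Ws, float(np.sum((M_star - M) ** 2))
```

With the default small learning rate the loss decays to near zero; note the characteristic plateau early in training while the small-initialization weights grow.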
Hyperparameters

| parameter | symbol | value |
| --- | --- | --- |
| depth | $L$ | — |
| width | $n$ | 4 |
| init scale | $\sigma$ | 0.1 |
| learning rate | $\eta$ | 0.001 |
| max steps/sec | — | no limit |
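Since the page's title refers to sharpness, the quantity of interest is presumably the top eigenvalue of the loss Hessian with respect to all weights. A hedged sketch of how it could be estimated, using power iteration on finite-difference Hessian-vector products (function names and tolerances are my own assumptions, not part of the original page):

```python
import numpy as np

def dln_loss_grad(theta, M_star, L, n):
    """Loss ||M* - W_1...W_L||_F^2 and its gradient, with all W_k flattened into theta."""
    Ws = [theta[k * n * n:(k + 1) * n * n].reshape(n, n) for k in range(L)]
    prefix = [np.eye(n)]
    for W in Ws:
        prefix.append(prefix[-1] @ W)
    suffix = [np.eye(n)]
    for W in reversed(Ws):
        suffix.append(W @ suffix[-1])
    suffix = suffix[::-1]
    E = prefix[-1] - M_star
    grads = [2.0 * prefix[k].T @ E @ suffix[k + 1].T for k in range(L)]
    return float(np.sum(E ** 2)), np.concatenate([g.ravel() for g in grads])

def sharpness(theta, M_star, L, n, iters=100, eps=1e-4, seed=0):
    """Estimate the top Hessian eigenvalue by power iteration on
    finite-difference Hessian-vector products: Hv ~ (g(t+eps v) - g(t-eps v)) / (2 eps)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        gp = dln_loss_grad(theta + eps * v, M_star, L, n)[1]
        gm = dln_loss_grad(theta - eps * v, M_star, L, n)[1]
        Hv = (gp - gm) / (2 * eps)
        lam = float(v @ Hv)
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
    return lam
```

A sanity check on the sketch: for depth $L = 1$ the loss is $\|M^* - W_1\|_F^2$, whose Hessian is $2I$, so the estimator should return 2 at any point.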
Simulation

(interactive training panel)

Plots

(interactive plot panel)