DLN Sharpness
Train a deep linear network of depth $L$ on a matrix factorization task via full-batch gradient descent. All weight matrices are $n \times n$; the goal is to recover a random target $M^*$.
Setup
network ($M = W_1 \cdots W_L$)
$$\mathcal{L} = \left\| M^* - W_1 W_2 \cdots W_L \right\|_F^2$$
target
$$M^*_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0,\,1)$$
initialization
$$(W_k)_{ij}(0) \overset{\text{iid}}{\sim} \mathcal{N}(0,\, \sigma^2)$$
update
$$W_k \leftarrow W_k - \eta \,\nabla_{W_k}\mathcal{L}$$
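The setup above can be sketched in NumPy. This is a minimal illustration, not the page's actual implementation; the function name, step count, and seed are my own choices. The gradient uses the closed form $\nabla_{W_k}\mathcal{L} = 2\,(W_1 \cdots W_{k-1})^\top (M - M^*)\,(W_{k+1} \cdots W_L)^\top$, which follows from differentiating the Frobenius loss through the matrix product.

```python
import numpy as np

def train_dln(L=2, n=4, sigma=0.1, eta=1e-3, steps=50000, seed=0):
    """Full-batch GD on loss = ||M* - W_1 ... W_L||_F^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    M_star = rng.standard_normal((n, n))                  # target: M*_ij ~ N(0, 1)
    Ws = [sigma * rng.standard_normal((n, n)) for _ in range(L)]  # (W_k)_ij ~ N(0, sigma^2)
    for _ in range(steps):
        # prefix[k] = W_1 ... W_k and suffix[k] = W_{k+1} ... W_L (identity at the ends)
        prefix = [np.eye(n)]
        for W in Ws:
            prefix.append(prefix[-1] @ W)
        suffix = [np.eye(n)]
        for W in reversed(Ws):
            suffix.append(W @ suffix[-1])
        suffix = suffix[::-1]
        E = prefix[-1] - M_star                           # residual M - M*
        # grad_{W_k} L = 2 * (W_1...W_{k-1})^T E (W_{k+1}...W_L)^T
        grads = [2.0 * prefix[k].T @ E @ suffix[k + 1].T for k in range(L)]
        Ws = [W - eta * g for W, g in zip(Ws, grads)]     # update W_k <- W_k - eta grad
    M = np.linalg.multi_dot(Ws) if L > 1 else Ws[0]
    return Ws, float(np.sum((M_star - M) ** 2))
```

With the default small learning rate the loss decays to near zero; note the characteristic plateau early in training while the small-initialization weights grow.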
Hyperparameters

| parameter | symbol | value |
| --- | --- | --- |
| depth | $L$ | — |
| width | $n$ | 4 |
| init scale | $\sigma$ | 0.1 |
| learning rate | $\eta$ | 0.001 |
| max steps/sec | — | no limit |
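Since the page's title refers to sharpness, the quantity of interest is presumably the top eigenvalue of the loss Hessian with respect to all weights. A hedged sketch of how it could be estimated, using power iteration on finite-difference Hessian-vector products (function names and tolerances are my own assumptions, not part of the original page):

```python
import numpy as np

def dln_loss_grad(theta, M_star, L, n):
    """Loss ||M* - W_1...W_L||_F^2 and its gradient, with all W_k flattened into theta."""
    Ws = [theta[k * n * n:(k + 1) * n * n].reshape(n, n) for k in range(L)]
    prefix = [np.eye(n)]
    for W in Ws:
        prefix.append(prefix[-1] @ W)
    suffix = [np.eye(n)]
    for W in reversed(Ws):
        suffix.append(W @ suffix[-1])
    suffix = suffix[::-1]
    E = prefix[-1] - M_star
    grads = [2.0 * prefix[k].T @ E @ suffix[k + 1].T for k in range(L)]
    return float(np.sum(E ** 2)), np.concatenate([g.ravel() for g in grads])

def sharpness(theta, M_star, L, n, iters=100, eps=1e-4, seed=0):
    """Estimate the top Hessian eigenvalue by power iteration on
    finite-difference Hessian-vector products: Hv ~ (g(t+eps v) - g(t-eps v)) / (2 eps)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        gp = dln_loss_grad(theta + eps * v, M_star, L, n)[1]
        gm = dln_loss_grad(theta - eps * v, M_star, L, n)[1]
        Hv = (gp - gm) / (2 * eps)
        lam = float(v @ Hv)
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
    return lam
```

A sanity check on the sketch: for depth $L = 1$ the loss is $\|M^* - W_1\|_F^2$, whose Hessian is $2I$, so the estimator should return 2 at any point.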
Simulation

(interactive training panel)

Plots

(interactive plot panel)