MLP Sharpness
Train a multilayer perceptron on a 1D regression task via full-batch gradient descent. Weights are initialized and updated under $\mu$P (maximal update parameterization).
Setup
Objective (mean squared error over the $N$ training points)
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \bigl(\hat{f}(x_i) - f^*(x_i)\bigr)^2$$
Hidden-layer init ($\mu$P), where $n_{\text{in}}$ is the layer's fan-in
$$(W_k)_{ij} \overset{\text{iid}}{\sim} \mathcal{N}\!\left(0,\, \frac{\sigma^2}{n_{\text{in}}}\right)$$
Output-layer init ($\mu$P)
$$(W_{\text{out}})_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0,\, \sigma^2)$$
Update (full-batch gradient descent, per layer $k$)
$$W_k \leftarrow W_k - \eta \cdot n_{\text{in}} \cdot \nabla_{W_k}\mathcal{L}$$
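Concretely, the setup above fits in a few lines. A minimal JAX sketch, with explicit assumptions: the activation (tanh), depth, and width stand in for the demo's adjustable controls, biases are omitted, and the $\eta \cdot n_{\text{in}}$ rule is applied to every layer even though the demo states it only for $W_k$.

```python
import jax
import jax.numpy as jnp

def init_params(key, width=64, depth=3, sigma=1.0):
    """muP init: hidden (W_k)_ij ~ N(0, sigma^2 / n_in); output ~ N(0, sigma^2)."""
    dims = [1] + [width] * depth                       # 1-D input, `depth` hidden layers
    keys = jax.random.split(key, depth + 1)
    hidden = [sigma / jnp.sqrt(m) * jax.random.normal(k, (m, n))
              for k, m, n in zip(keys[:-1], dims[:-1], dims[1:])]
    w_out = sigma * jax.random.normal(keys[-1], (width, 1))
    return hidden + [w_out]

def f_hat(params, x):
    h = x
    for w in params[:-1]:
        h = jnp.tanh(h @ w)                            # assumed activation
    return h @ params[-1]

def loss(params, x, y):
    return jnp.mean((f_hat(params, x) - y) ** 2)

@jax.jit
def gd_step(params, x, y, eta=0.01):
    """Full-batch GD with the per-layer muP rule W <- W - eta * n_in * grad."""
    grads = jax.grad(loss)(params, x, y)
    return [w - eta * w.shape[0] * g for w, g in zip(params, grads)]
```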
Hyperparameters

| setting | symbol | value |
| --- | --- | --- |
| depth | $d$ | |
| hidden dim | $n$ | |
| activation | | |
| parameterization | | $\mu$P |
| biases | | |
| centering | | |
| target | | |
| degree | $k$ | 6 |
| data points | $N$ | 20 |
| init scale | $\sigma$ | 1.0 |
| learning rate | $\eta$ | 0.01 |
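For the data, a hedged sketch matching the table: $N = 20$ inputs and a degree-$k$ target with $k = 6$. That $f^*$ is a random polynomial sampled on $[-1, 1]$ is an assumption; the demo exposes separate "target" and "degree" controls whose exact family is not captured here.

```python
import jax
import jax.numpy as jnp

def make_data(key, n_points=20, degree=6):
    """N inputs on [-1, 1]; targets from a (hypothetical) random degree-k polynomial."""
    kx, kc = jax.random.split(key)
    x = jnp.sort(jax.random.uniform(kx, (n_points, 1), minval=-1.0, maxval=1.0), axis=0)
    coeffs = jax.random.normal(kc, (degree + 1,))      # assumed target family
    return x, jnp.polyval(coeffs, x)                   # f*(x)
```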
Plots

[Interactive panels: "training" and "Hessian"]
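The Hessian panel presumably tracks sharpness, $\lambda_{\max}\!\left(\nabla^2_{\theta}\mathcal{L}\right)$. A sketch of one standard estimator, power iteration on Hessian-vector products, reusing the `loss` from the earlier sketch:

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def sharpness(loss_fn, params, x, y, n_iter=100, seed=0):
    """Estimate lambda_max of the full-batch loss Hessian without forming it."""
    flat, unravel = ravel_pytree(params)
    g = jax.grad(lambda p: loss_fn(unravel(p), x, y))
    hvp = lambda v: jax.jvp(g, (flat,), (v,))[1]       # Hessian-vector product
    v = jax.random.normal(jax.random.PRNGKey(seed), flat.shape)
    v = v / jnp.linalg.norm(v)
    for _ in range(n_iter):                            # power iteration
        hv = hvp(v)
        v = hv / jnp.linalg.norm(hv)
    return jnp.vdot(v, hvp(v))                         # Rayleigh quotient
```

A full run is then `params = gd_step(params, x, y)` in a loop, logging `loss(params, x, y)` and `sharpness(loss, params, x, y)` at each step, which is enough to reproduce both panels. (Power iteration returns the eigenvalue of largest magnitude; for these losses that is typically $\lambda_{\max}$ itself.)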