MLP Sharpness

Train a multilayer perceptron on a 1D regression task via full-batch gradient descent. Weights are initialized and updated under $\mu$P (maximal update parameterization).


objective
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \bigl(\hat{f}(x_i) - f^*(x_i)\bigr)^2$$
hidden init ($\mu$P)
$$(W_k)_{ij} \overset{\text{iid}}{\sim} \mathcal{N}\!\left(0,\, \frac{\sigma^2}{n_{\text{in}}}\right)$$
output init ($\mu$P)
$$(W_{\text{out}})_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0,\, \sigma^2)$$
update
$$W_k \leftarrow W_k - \eta \cdot n_{\text{in}} \cdot \nabla_{W_k}\mathcal{L}$$
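As a concrete sketch, here is a minimal one-hidden-layer version of this setup in NumPy, using the table defaults below. The depth, activation (tanh here), and target ($f^*(x) = x^6$ here) are assumptions, as is the $1/n$ output multiplier that $\mu$P typically pairs with an $O(1)$ output init; the demo's exact forward normalization is not shown above. The $\mu$P scaling enters in the init variances and in the fan-in factor on each layer's update:

```python
import numpy as np

rng = np.random.default_rng(0)

# defaults from the table; depth 1, tanh, and the target x^6 are assumptions
n, N, sigma, eta = 20, 20, 1.0, 0.01

x = np.linspace(-1, 1, N)[:, None]            # (N, 1) inputs
y = x[:, 0] ** 6                              # assumed degree-6 target f*

# muP init: hidden weights ~ N(0, sigma^2 / n_in), output ~ N(0, sigma^2)
W1 = rng.normal(0.0, sigma / np.sqrt(1), size=(1, n))   # n_in = 1
W2 = rng.normal(0.0, sigma, size=(n, 1))

losses = []
for step in range(2000):
    h = np.tanh(x @ W1)                       # (N, n) hidden activations
    f = (h @ W2)[:, 0] / n                    # assumed muP 1/n output multiplier
    r = f - y
    losses.append(np.mean(r ** 2))

    # backprop through the MSE loss
    dF = (2.0 / N) * r[:, None]               # (N, 1) dL/df
    gW2 = h.T @ dF / n                        # (n, 1)
    gW1 = x.T @ (dF @ W2.T / n * (1 - h ** 2))  # (1, n)

    # muP update: each layer's gradient step is rescaled by its fan-in n_in
    W1 -= eta * 1 * gW1                       # n_in = 1
    W2 -= eta * n * gW2                       # n_in = n
```

Rescaling each layer's step by its fan-in is what keeps per-layer feature updates $O(1)$ as the width $n$ grows, which is the defining property of $\mu$P.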

hyperparameters (demo controls; selector values that were not captured are marked as such)

depth $d$: (not captured)
hidden dim $n$: 20
activation: (not captured)
parameterization: $\mu$P
biases: (not captured)
centering: (not captured)
target: (not captured)
degree $k$: 6
data points $N$: 20
init scale $\sigma$: 1.0
learning rate $\eta$: 0.01
max steps/sec: no limit


[interactive plots: training / Hessian views]
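The Hessian view presumably tracks sharpness, i.e. the top eigenvalue $\lambda_{\max}$ of $\nabla^2_\theta \mathcal{L}$. A standard way to estimate it without materializing the Hessian is power iteration on Hessian-vector products; a sketch in JAX (the demo's own numerics are not shown, so treat this as illustrative):

```python
import jax
import jax.numpy as jnp

def sharpness(loss_fn, theta, iters=100, seed=0):
    """Estimate the top Hessian eigenvalue of loss_fn at theta.

    Power iteration on Hessian-vector products (forward-over-reverse
    autodiff), so the full Hessian is never formed. Note power iteration
    returns the eigenvalue of largest magnitude, not necessarily the
    largest signed eigenvalue.
    """
    grad = jax.grad(loss_fn)
    hvp = lambda v: jax.jvp(grad, (theta,), (v,))[1]
    v = jax.random.normal(jax.random.PRNGKey(seed), theta.shape)
    v = v / jnp.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / jnp.linalg.norm(hv)
    return float(jnp.vdot(v, hvp(v)))

# sanity check on a quadratic with known top curvature 3.0
A = jnp.diag(jnp.array([3.0, 1.0, 0.5]))
lam = sharpness(lambda th: 0.5 * th @ A @ th, jnp.ones(3))
```

Called on the training loss with the flattened parameter vector, this quantity tracked over GD steps is what a sharpness panel would plot.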

Contact: jsi@berkeley.edu · GitHub: james-simon · Google Scholar · Twitter/X: @fakejamiesimon