Driving Problem
Given a two-layer MLP with fixed random initialization $\theta_0$, and a target parameter vector $\theta^*$, can we engineer a sequence of inputs $x_1, x_2, \ldots$ such that gradient steps drive $\theta \to \theta^*$?
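Here we explore the simplest version: at each step, sample a random input $x$, compute the gradient $g(x)$, normalize it to the unit vector $\hat{g}(x) = g(x)/\|g(x)\|$, and take the step along $\hat{g}(x)$ that leaves the smallest remaining distance to $\theta^*$.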
Update rule
residual gap
$$\Delta\theta_t = \theta^* - \theta_t$$
optimal step size along $\hat{g}(x_t)$
$$\alpha_t^* = \langle \Delta\theta_t,\, \hat{g}(x_t) \rangle$$
update
$$\theta_{t+1} = \theta_t + \alpha_t^*\, \hat{g}(x_t)$$
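A short derivation may help show why this step size is the best one along a fixed direction. It assumes only that $\hat{g}(x_t)$ has unit norm. The gap after a step of size $\alpha$ is $\Delta\theta_t - \alpha\,\hat{g}(x_t)$, so
$$\|\Delta\theta_t - \alpha\,\hat{g}(x_t)\|^2 = \|\Delta\theta_t\|^2 - 2\alpha\,\langle \Delta\theta_t,\, \hat{g}(x_t) \rangle + \alpha^2$$
which is a quadratic in $\alpha$ minimized at $\alpha = \langle \Delta\theta_t,\, \hat{g}(x_t) \rangle = \alpha_t^*$. Substituting back gives
$$\|\Delta\theta_{t+1}\|^2 = \|\Delta\theta_t\|^2 - (\alpha_t^*)^2 = \|\Delta\theta_t\|^2 \left(1 - \cos^2(g(x_t),\, \Delta\theta_t)\right)$$
so the distance never increases, and a step makes no progress exactly when $g(x_t)$ is orthogonal to $\Delta\theta_t$.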
Architecture
input dim $d_{in}$: 5
hidden dim $d_h$: 10
output dim $d_{out}$: 1
depth
activation
biases
$x$ distribution
$x$ scale $\sigma$: 1.0
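The setup above, combined with the update rule, can be sketched in NumPy roughly as follows. Several choices here are assumptions, since the page does not state them: a tanh activation, no biases, Gaussian inputs with scale $\sigma$, a $1/\sqrt{\text{fan-in}}$ initialization, and $g(x)$ taken to be the gradient of the scalar network output with respect to the flattened parameters. The helper names (`init_params`, `param_grad`, `projection_step`) are hypothetical.

```python
import numpy as np

D_IN, D_H, D_OUT = 5, 10, 1  # dimensions listed above

def init_params(rng):
    """Draw W1 (D_H x D_IN) and W2 (D_OUT x D_H); return one flat vector."""
    w1 = rng.normal(0.0, 1.0 / np.sqrt(D_IN), size=(D_H, D_IN))  # assumed init scale
    w2 = rng.normal(0.0, 1.0 / np.sqrt(D_H), size=(D_OUT, D_H))
    return np.concatenate([w1.ravel(), w2.ravel()])

def unpack(theta):
    """Split the flat parameter vector back into the two weight matrices."""
    w1 = theta[: D_H * D_IN].reshape(D_H, D_IN)
    w2 = theta[D_H * D_IN:].reshape(D_OUT, D_H)
    return w1, w2

def forward(theta, x):
    """Scalar output f(x) = W2 tanh(W1 x) (assumed activation, no biases)."""
    w1, w2 = unpack(theta)
    return float((w2 @ np.tanh(w1 @ x))[0])

def param_grad(theta, x):
    """Gradient of the scalar output with respect to the flat parameters."""
    w1, w2 = unpack(theta)
    h = np.tanh(w1 @ x)               # hidden activations, shape (D_H,)
    dpre = w2[0] * (1.0 - h ** 2)     # df / d(pre-activation)
    g_w1 = np.outer(dpre, x)          # df / dW1, shape (D_H, D_IN)
    g_w2 = h[None, :]                 # df / dW2, shape (1, D_H)
    return np.concatenate([g_w1.ravel(), g_w2.ravel()])

def projection_step(theta, theta_star, x):
    """One update: move along the unit gradient by the optimal step size."""
    delta = theta_star - theta        # residual gap
    g = param_grad(theta, x)
    norm = np.linalg.norm(g)
    if norm == 0.0:                   # no usable direction for this input
        return theta
    g_hat = g / norm
    alpha = delta @ g_hat             # optimal step size along g_hat
    return theta + alpha * g_hat

rng = np.random.default_rng(0)
theta = init_params(rng)              # theta_0
theta_star = init_params(rng)         # target, drawn the same way (assumption)
sigma = 1.0
distances = [np.linalg.norm(theta_star - theta)]
for _ in range(2000):
    x = sigma * rng.normal(size=D_IN) # assumed Gaussian input distribution
    theta = projection_step(theta, theta_star, x)
    distances.append(np.linalg.norm(theta_star - theta))
```

With these sizes the flat parameter vector has $5 \cdot 10 + 10 \cdot 1 = 60$ entries. Because each step is an orthogonal projection of the gap onto one direction, `distances` should be non-increasing up to floating-point error.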
Speed
max steps/sec: no limit
Simulation
Distance to target $\|\theta^* - \theta\|$
Convergence rate $-d\|\Delta\theta\|/dt$
EMA window: 20
Per-step $\|\Delta\theta\| \cdot \cos^2(g(x),\, \Delta\theta)$
Cosine similarity — distribution (last 1000 steps)
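The plotted quantities above could be computed along the following lines. This is a sketch under assumptions: $t$ is measured in steps, the EMA window $N$ maps to a decay factor $2/(N+1)$ (one common convention, not confirmed by the page), and the function names and histogram bin count are hypothetical.

```python
import numpy as np

def cosine(g, delta):
    """Cosine of the angle between the gradient g and the residual gap delta."""
    denom = np.linalg.norm(g) * np.linalg.norm(delta)
    return 0.0 if denom == 0.0 else float(g @ delta) / denom

def per_step_metric(g, delta):
    """The plotted per-step quantity ||delta|| * cos^2(g, delta)."""
    return float(np.linalg.norm(delta)) * cosine(g, delta) ** 2

def distance_after_step(g, delta):
    """Distance left after one projection step: ||delta|| * sqrt(1 - cos^2)."""
    c2 = cosine(g, delta) ** 2
    return float(np.linalg.norm(delta)) * np.sqrt(max(0.0, 1.0 - c2))

def ema(values, window=20):
    """Exponential moving average with assumed decay 2 / (window + 1)."""
    v = np.asarray(values, dtype=float)
    if v.size == 0:
        return v
    beta = 2.0 / (window + 1.0)
    out = np.empty_like(v)
    acc = v[0]                        # seed the average with the first value
    for i, val in enumerate(v):
        acc = beta * val + (1.0 - beta) * acc
        out[i] = acc
    return out

def convergence_rate(distances, window=20):
    """Smoothed per-step decrease in distance, a discrete stand-in for -d||delta||/dt."""
    d = np.asarray(distances, dtype=float)
    return ema(-np.diff(d), window)

def cosine_histogram(cosines, last=1000, bins=40):
    """Histogram of the most recent cosine values over [-1, 1]."""
    recent = np.asarray(cosines, dtype=float)[-last:]
    return np.histogram(recent, bins=bins, range=(-1.0, 1.0))
```

For example, `convergence_rate(distances)` takes the list of distances recorded once per step, and `cosine_histogram(cosines)` takes the per-step cosine values collected during the run.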
Optimize $x$ to maximize $|\cos(g(x),\,\Delta\theta)|$
Gradient ascent on $x$, with the gradient of the scalar objective $|\cos(g(x),\,\Delta\theta)|$ estimated by finite differences.
learning rate $\eta_x$: 0.1
steps: 500
$|\cos(g(x),\,\Delta\theta)|$ vs. step
$\|x\|$ vs. step
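The input optimization described in this section might look like the sketch below, using the listed learning rate and step count. It assumes central differences with a hypothetical perturbation size `eps`, a tanh network without biases, and $g(x)$ defined as the gradient of the scalar output with respect to the flattened parameters. None of these details are stated on the page, and all function names are illustrative.

```python
import numpy as np

D_IN, D_H = 5, 10  # input and hidden sizes; output dimension is 1

def param_grad(theta, x):
    """Gradient of f(x) = w2 . tanh(W1 x) with respect to the flat parameters."""
    w1 = theta[: D_H * D_IN].reshape(D_H, D_IN)
    w2 = theta[D_H * D_IN:]
    h = np.tanh(w1 @ x)
    g_w1 = np.outer(w2 * (1.0 - h ** 2), x)
    return np.concatenate([g_w1.ravel(), h])

def abs_cos(theta, delta, x):
    """Scalar objective |cos(g(x), delta)|."""
    g = param_grad(theta, x)
    denom = np.linalg.norm(g) * np.linalg.norm(delta)
    return 0.0 if denom == 0.0 else abs(float(g @ delta)) / denom

def fd_grad(fn, x, eps=1e-4):
    """Central finite-difference estimate of the gradient of fn at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (fn(x + e) - fn(x - e)) / (2.0 * eps)
    return grad

def optimize_x(theta, delta, x0, lr=0.1, steps=500):
    """Plain gradient ascent on x; records the objective and ||x|| at every step."""
    x = x0.copy()
    objective = lambda z: abs_cos(theta, delta, z)
    cos_hist = [objective(x)]
    norm_hist = [float(np.linalg.norm(x))]
    for _ in range(steps):
        x = x + lr * fd_grad(objective, x)
        cos_hist.append(objective(x))
        norm_hist.append(float(np.linalg.norm(x)))
    return x, np.array(cos_hist), np.array(norm_hist)

rng = np.random.default_rng(0)
theta = rng.normal(size=D_H * D_IN + D_H)        # stand-in for the current parameters
theta_star = rng.normal(size=D_H * D_IN + D_H)   # stand-in for the target
x_opt, cos_hist, norm_hist = optimize_x(
    theta, theta_star - theta, rng.normal(size=D_IN)
)
```

`cos_hist` and `norm_hist` correspond to the two curves listed above. With a fixed learning rate the objective is not guaranteed to rise at every step, so tracking the best value seen so far may be a sensible addition.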
Weight matrices