Driving Problem

Given a two-layer MLP with fixed random initialization $\theta_0$, and a target parameter vector $\theta^*$, can we engineer a sequence of inputs $x_1, x_2, \ldots$ such that gradient steps drive $\theta \to \theta^*$?

Here we explore the simplest version: at each step, sample a random $x$, compute the gradient $g(x)$, and take the step along the unit direction $\hat{g}(x) = g(x)/\|g(x)\|$ that best closes the gap $\theta^* - \theta$.

Update rule

residual gap

$$\Delta\theta_t = \theta^* - \theta_t$$

optimal step size along $\hat{g}(x_t)$

$$\alpha_t^* = \langle \Delta\theta_t,\, \hat{g}(x_t) \rangle$$

update

$$\theta_{t+1} = \theta_t + \alpha_t^*\, \hat{g}(x_t)$$
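The three quantities above can be put together in a short simulation. What follows is a minimal sketch, not the page's own implementation, and it rests on several assumptions the text does not fix: $g(x)$ is taken to be the gradient of the scalar network output with respect to the parameters, the activation is tanh, there are no biases, $x$ is drawn from an isotropic Gaussian with scale $\sigma = 1$, and the parameters are flattened as `W1` (row-major) followed by `w2`. The helper names `mlp_grad` and `projected_step` are hypothetical.

```python
import math
import random

D_IN, D_H = 5, 10  # dimensions from the Architecture panel; d_out = 1


def mlp_grad(theta, x):
    """Gradient of f(x) = w2 . tanh(W1 x) with respect to the flat theta.

    Assumed layout: the first D_H * D_IN entries are W1 (row-major),
    the last D_H entries are w2. No biases (an assumption).
    """
    n1 = D_H * D_IN
    grad = [0.0] * len(theta)
    for i in range(D_H):
        row = theta[i * D_IN:(i + 1) * D_IN]
        h = math.tanh(sum(w * xj for w, xj in zip(row, x)))
        w2_i = theta[n1 + i]
        for j in range(D_IN):
            grad[i * D_IN + j] = w2_i * (1.0 - h * h) * x[j]  # df/dW1[i][j]
        grad[n1 + i] = h  # df/dw2[i]
    return grad


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def projected_step(theta, theta_star, x):
    """theta_{t+1} = theta_t + <delta, g_hat> g_hat, with g_hat = g / ||g||."""
    g = mlp_grad(theta, x)
    g_norm = norm(g)
    if g_norm == 0.0:  # no usable direction for this x; leave theta unchanged
        return theta
    g_hat = [a / g_norm for a in g]
    delta = [s - t for s, t in zip(theta_star, theta)]
    alpha = sum(d * gh for d, gh in zip(delta, g_hat))
    return [t + alpha * gh for t, gh in zip(theta, g_hat)]


rng = random.Random(0)
n_params = D_H * D_IN + D_H
theta = [rng.gauss(0, 1) / math.sqrt(D_IN) for _ in range(n_params)]
theta_star = [rng.gauss(0, 1) / math.sqrt(D_IN) for _ in range(n_params)]

distances = [norm([s - t for s, t in zip(theta_star, theta)])]
for _ in range(200):
    x = [rng.gauss(0, 1.0) for _ in range(D_IN)]  # sigma = 1.0
    theta = projected_step(theta, theta_star, x)
    distances.append(norm([s - t for s, t in zip(theta_star, theta)]))
```

Because each step removes the component of the gap along $\hat{g}(x_t)$, the recorded distances cannot increase. How quickly they shrink depends on how well the sampled gradients align with the gap.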

Architecture

input dim $d_{in}$ 5

hidden dim $d_h$ 10

output dim $d_{out}$ 1

depth

activation

biases

$x$ distribution

$x$ scale $\sigma$ 1.0

Speed

max steps/sec no limit

Simulation


Distance to target $\|\theta^* - \theta\|$

Convergence rate $-d\|\Delta\theta\|/dt$

EMA window 20
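A smoothed convergence rate of this kind could be computed roughly as follows. This is a sketch under assumptions: time is measured in steps, so the rate is the per-step decrease in distance, and a window of $N$ is mapped to a smoothing factor $2/(N+1)$, which is a common convention but not necessarily the one used here. The name `ema_rates` is hypothetical.

```python
def ema_rates(distances, window=20):
    """Exponentially smoothed per-step decrease of a distance sequence.

    Assumes smoothing factor 2 / (window + 1); other conventions exist.
    """
    alpha = 2.0 / (window + 1)
    smoothed = []
    ema = None
    for prev, cur in zip(distances, distances[1:]):
        rate = prev - cur  # discrete stand-in for -d||delta||/dt, dt = 1 step
        ema = rate if ema is None else alpha * rate + (1.0 - alpha) * ema
        smoothed.append(ema)
    return smoothed


# Usage: a distance that drops by 0.5 every step has a constant rate of 0.5.
example = ema_rates([10.0 - 0.5 * k for k in range(30)])
```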

Per-step $\|\Delta\theta\| \cdot \cos^2(g(x),\, \Delta\theta)$
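The role of $\cos^2$ can be seen by working out how one update changes the gap. This short derivation uses only the update rule and the fact that $\hat{g}$ has unit norm. Writing $\hat{g}_t$ for $\hat{g}(x_t)$ and $\cos_t$ for $\cos(g(x_t),\, \Delta\theta_t)$, both shorthand introduced here:

$$
\begin{aligned}
\Delta\theta_{t+1} &= \Delta\theta_t - \alpha_t^*\, \hat{g}_t \\
\|\Delta\theta_{t+1}\|^2 &= \|\Delta\theta_t\|^2 - 2\alpha_t^* \langle \Delta\theta_t,\, \hat{g}_t \rangle + (\alpha_t^*)^2 \\
&= \|\Delta\theta_t\|^2 - (\alpha_t^*)^2 \\
&= \|\Delta\theta_t\|^2 \left(1 - \cos_t^2\right)
\end{aligned}
$$

The last line follows from $\alpha_t^* = \|\Delta\theta_t\| \cos_t$. A step therefore makes progress only to the extent that the gradient direction is aligned, or anti-aligned, with the remaining gap.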

Cosine similarity — distribution (last 1000 steps)

Optimize $x$ to maximize $|\cos(g(x),\,\Delta\theta)|$

Gradient ascent on $x$ via finite differences on the scalar objective.

learning rate $\eta_x$ 0.1

steps 500
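The procedure described above could look roughly like the sketch below. It uses central finite differences on the scalar objective with the listed learning rate and step count. As assumptions not fixed by the text: tanh activation, no biases, $g(x)$ as the gradient of the scalar output with respect to the parameters, and a difference width `eps` of `1e-4`. The names `mlp_grad`, `abs_cos`, and `optimize_x` are hypothetical.

```python
import math
import random

D_IN, D_H = 5, 10  # dimensions from the Architecture panel; d_out = 1


def mlp_grad(theta, x):
    """Gradient of f(x) = w2 . tanh(W1 x) w.r.t. flat theta (W1 rows, then w2)."""
    n1 = D_H * D_IN
    grad = [0.0] * len(theta)
    for i in range(D_H):
        row = theta[i * D_IN:(i + 1) * D_IN]
        h = math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for j in range(D_IN):
            grad[i * D_IN + j] = theta[n1 + i] * (1.0 - h * h) * x[j]
        grad[n1 + i] = h
    return grad


def abs_cos(theta, delta, x):
    """Scalar objective |cos(g(x), delta)|; returns 0.0 for a zero vector."""
    g = mlp_grad(theta, x)
    g_norm = math.sqrt(sum(a * a for a in g))
    d_norm = math.sqrt(sum(a * a for a in delta))
    if g_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return abs(sum(a * b for a, b in zip(g, delta))) / (g_norm * d_norm)


def optimize_x(theta, delta, x0, lr=0.1, steps=500, eps=1e-4):
    """Gradient ascent on x using central finite differences.

    Returns the best x seen, its objective value, and the per-step history.
    """
    x = list(x0)
    best_x, best_val = list(x), abs_cos(theta, delta, x)
    history = [best_val]
    for _ in range(steps):
        grad_x = []
        for j in range(len(x)):
            x_plus = x[:j] + [x[j] + eps] + x[j + 1:]
            x_minus = x[:j] + [x[j] - eps] + x[j + 1:]
            diff = abs_cos(theta, delta, x_plus) - abs_cos(theta, delta, x_minus)
            grad_x.append(diff / (2.0 * eps))
        x = [xj + lr * gj for xj, gj in zip(x, grad_x)]
        val = abs_cos(theta, delta, x)
        history.append(val)
        if val > best_val:
            best_x, best_val = list(x), val
    return best_x, best_val, history


rng = random.Random(1)
n_params = D_H * D_IN + D_H
theta = [rng.gauss(0, 1) / math.sqrt(D_IN) for _ in range(n_params)]
delta = [rng.gauss(0, 1) for _ in range(n_params)]  # stand-in for theta* - theta
x0 = [rng.gauss(0, 1.0) for _ in range(D_IN)]
x_best, best, history = optimize_x(theta, delta, x0)
```

Plain gradient ascent with a fixed learning rate is not guaranteed to improve at every step, which is why the sketch keeps track of the best $x$ seen rather than returning the final iterate.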

$|\cos(g(x),\,\Delta\theta)|$ vs. step

$\|x\|$ vs. step

Weight matrices