Optimizers
Optimizers live under the School.Optim prefix. Every optimizer follows the same
shape:
- A config record holds the hyperparameters (learning rate and any momentum, beta, epsilon, or weight-decay terms).
- An init function builds the initial optimizer state from the parameters, for example sgd_init_like(&params).
- A step function takes the parameters, the gradients, the state, and the config, and returns a pair of the next parameters and the next state.
Because values are immutable, the step does not mutate anything in place. The
caller threads the returned parameters and state into the next step. Each
optimizer comes in two forms: a flat tensor[n, f32] step (_step) for a single
parameter tensor, and a _step_list form that operates over a
List[tensor[k, f32]] of parameter tensors at the optimizer boundary. A
_step_list_trajectory driver runs several list-steps against a fixed gradient
list.
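The shape described above can be sketched outside the library. The following Python is a minimal illustration only, not the School.Optim API: the names MomConfig, mom_init_like, mom_step, mom_step_list, and mom_step_list_trajectory are hypothetical, Python lists stand in for tensor[n, f32], and the momentum rule shown is one common formulation that may differ from the one in src/optim/sgdmom.ch.

```python
from typing import List, NamedTuple, Tuple

Tensor = List[float]  # stand-in for a flat tensor[n, f32]


class MomConfig(NamedTuple):
    """Hypothetical config record holding the hyperparameters."""
    lr: float
    momentum: float


def mom_init_like(params: Tensor) -> Tensor:
    """Initial state: one zero velocity per parameter."""
    return [0.0 for _ in params]


def mom_step(params: Tensor, grads: Tensor, state: Tensor,
             cfg: MomConfig) -> Tuple[Tensor, Tensor]:
    """Flat step: returns (next_params, next_state) and mutates nothing."""
    next_state = [cfg.momentum * v + g for v, g in zip(state, grads)]
    next_params = [p - cfg.lr * v for p, v in zip(params, next_state)]
    return next_params, next_state


def mom_step_list(params_list: List[Tensor], grads_list: List[Tensor],
                  state_list: List[Tensor],
                  cfg: MomConfig) -> Tuple[List[Tensor], List[Tensor]]:
    """List form: apply the flat step to each parameter tensor in turn."""
    pairs = [mom_step(p, g, s, cfg)
             for p, g, s in zip(params_list, grads_list, state_list)]
    return [p for p, _ in pairs], [s for _, s in pairs]


def mom_step_list_trajectory(params_list: List[Tensor],
                             grads_list: List[Tensor],
                             state_list: List[Tensor], cfg: MomConfig,
                             n_steps: int) -> Tuple[List[Tensor], List[Tensor]]:
    """Driver: run n_steps list-steps against a fixed gradient list."""
    for _ in range(n_steps):
        # Thread the returned parameters and state into the next step.
        params_list, state_list = mom_step_list(
            params_list, grads_list, state_list, cfg)
    return params_list, state_list


params = [[1.0, 2.0], [3.0]]
grads = [[0.5, -0.5], [1.0]]
state = [mom_init_like(p) for p in params]
cfg = MomConfig(lr=0.1, momentum=0.9)
final_params, final_state = mom_step_list_trajectory(
    params, grads, state, cfg, 2)
```

Because each step returns fresh values, the original params list is still intact after the trajectory, which is the property the immutable design relies on.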
This chapter is a reference for the optimizers defined under src/optim/ and
src/optim.ch.
The optimizer set
| Optimizer | Module | Config fields |
|---|---|---|
| SGD | School.Optim.Sgd | lr |
| SGD with momentum | School.Optim.SgdMom | lr, momentum |
| SGD with Nesterov momentum | School.Optim.SgdNes | lr, momentum |
| Adam | School.Optim.Adam | lr, beta1, beta2, eps |
| AdamW | School.Optim | lr, beta1, beta2, eps, weight_decay |
| Adagrad | School.Optim.Adagrad | lr, eps |
| RMSProp | School.Optim.RmsProp | lr, alpha, eps, weight_decay, momentum |
| Lion | School.Optim.Lion | lr, beta1, beta2, weight_decay |
| LAMB | School.Optim | lr, beta1, beta2, eps, weight_decay |
The SGD family is in src/optim/sgd.ch, src/optim/sgdmom.ch, and
src/optim/sgdnes.ch. Adam is in src/optim/adam.ch, Adagrad in
src/optim/adagrad.ch, RMSProp in src/optim/rmsprop.ch, and Lion in
src/optim/lion.ch. AdamW and LAMB are in the top-level src/optim.ch.
School.Optim.ListOps (src/optim/listops.ch) provides the list helpers the
_step_list forms share.
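As a rough illustration of how config fields like those in the table can enter an update, here is a Python sketch of two textbook rules: RMSProp in a widely used formulation, and Lion as commonly published. It is not taken from src/optim/rmsprop.ch or src/optim/lion.ch, which may differ in detail. The config field names follow the table; the state layout (RmsPropState) and the function names rmsprop_step_sketch and lion_step_sketch are assumptions made for this example.

```python
import math
from typing import List, NamedTuple, Tuple

Vec = List[float]  # stand-in for a flat tensor[n, f32]


class RmsPropConfig(NamedTuple):
    lr: float
    alpha: float
    eps: float
    weight_decay: float
    momentum: float


class RmsPropState(NamedTuple):  # hypothetical state layout
    square_avg: Vec
    momentum_buf: Vec


def rmsprop_step_sketch(params: Vec, grads: Vec, state: RmsPropState,
                        cfg: RmsPropConfig) -> Tuple[Vec, RmsPropState]:
    next_params, next_sq, next_buf = [], [], []
    for p, g, v, b in zip(params, grads, state.square_avg, state.momentum_buf):
        g = g + cfg.weight_decay * p                   # L2-style weight decay
        v = cfg.alpha * v + (1.0 - cfg.alpha) * g * g  # running avg of g^2
        update = g / (math.sqrt(v) + cfg.eps)          # scaled gradient
        b = cfg.momentum * b + update                  # equals update if momentum == 0
        next_params.append(p - cfg.lr * b)
        next_sq.append(v)
        next_buf.append(b)
    return next_params, RmsPropState(next_sq, next_buf)


class LionConfig(NamedTuple):
    lr: float
    beta1: float
    beta2: float
    weight_decay: float


def sign(x: float) -> float:
    return float((x > 0) - (x < 0))


def lion_step_sketch(params: Vec, grads: Vec, m: Vec,
                     cfg: LionConfig) -> Tuple[Vec, Vec]:
    next_params, next_m = [], []
    for p, g, mi in zip(params, grads, m):
        c = cfg.beta1 * mi + (1.0 - cfg.beta1) * g     # interpolated momentum
        next_params.append(p - cfg.lr * (sign(c) + cfg.weight_decay * p))
        next_m.append(cfg.beta2 * mi + (1.0 - cfg.beta2) * g)
    return next_params, next_m


rms_cfg = RmsPropConfig(lr=0.01, alpha=0.99, eps=1e-8,
                        weight_decay=0.0, momentum=0.0)
rms_params, rms_state = rmsprop_step_sketch(
    [1.0], [2.0], RmsPropState([0.0], [0.0]), rms_cfg)

lion_cfg = LionConfig(lr=0.1, beta1=0.9, beta2=0.99, weight_decay=0.0)
lion_params, lion_m = lion_step_sketch(
    [1.0, -1.0], [0.2, -0.3], [0.0, 0.0], lion_cfg)
```

In the Lion sketch, every parameter moves by exactly lr per step (ignoring weight decay), since only the sign of the interpolated momentum is used.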
Plain SGD
The entry points are sgd_init_like, sgd_step, and the config SgdConfig
(examples/p2/sgd_step.ch):

    import School.Optim.Sgd (SgdConfig, sgd_init_like, sgd_step)
    ...
    s0 = sgd_init_like(&params)
    cfg = SgdConfig { lr: cast(0.1, f32) }
    result = sgd_step(params, grads, s0, cfg)

result.0 is the next parameter tensor and result.1 is the next state.
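For the arithmetic behind a single plain SGD step, a small Python sketch may help. It assumes the standard rule (next parameter = parameter minus lr times gradient) and uses a step counter as a placeholder state, since the contents of the real SGD state are not specified here. The name sgd_step_sketch is hypothetical, and the tuple indices result[0] and result[1] play the role of result.0 and result.1.

```python
from typing import List, Tuple


def sgd_step_sketch(params: List[float], grads: List[float],
                    state: int, lr: float) -> Tuple[List[float], int]:
    """Return (next_params, next_state) without touching the inputs."""
    next_params = [p - lr * g for p, g in zip(params, grads)]
    return next_params, state + 1  # placeholder state: number of steps taken


params = [1.0, -2.0]
grads = [0.5, 0.25]
result = sgd_step_sketch(params, grads, 0, 0.1)

next_params = result[0]  # approximately [0.95, -2.025]
next_state = result[1]   # 1
```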
RMSProp
RMSProp carries a running average of the squared gradient, plus optional momentum and
weight decay (examples/p2/rmsprop_demo.ch):

    import School.Optim.RmsProp (RmsPropConfig, rmsprop_init_like, rmsprop_step)
    ...
    s0 = rmsprop_init_like(&params)
    cfg = RmsPropConfig { lr: cast(0.01, f32), alpha: cast(0.99, f32), eps: cast(0.00000001, f32), weight_decay: cast(0.0, f32), momentum: cast(0.0, f32) }
    result = rmsprop_step(params, grads, s0, cfg)

Lion
Lion tracks a momentum buffer and updates by the sign of an interpolated momentum
(examples/p2/lion_demo.ch):

    import School.Optim.Lion (LionConfig, lion_init_like, lion_step)
    ...
    s0 = lion_init_like(&params)
    cfg = LionConfig { lr: cast(0.1, f32), beta1: cast(0.9, f32), beta2: cast(0.99, f32), weight_decay: cast(0.0, f32) }
    result = lion_step(params, grads, s0, cfg)

Where the gradients come from
The gradients passed to a step are computed with the language's grad transform.
The MLP training step in the Model library shows the full pattern:
take grad of a scalar loss to get the per-parameter gradient, then call the
optimizer step. The optimizer itself does no differentiation; it consumes
gradients and produces updated parameters.
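The split between differentiation and the optimizer can be sketched in Python. This is an illustration under stated assumptions, not the Model library's MLP training step: central finite differences stand in for the language's grad transform purely to keep the example self-contained, and the names numeric_grad, loss, and train_step are hypothetical.

```python
from typing import Callable, List


def numeric_grad(loss_fn: Callable[[List[float]], float],
                 params: List[float], h: float = 1e-6) -> List[float]:
    """Stand-in for a grad transform: central differences of a scalar loss."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + h] + params[i + 1:]
        down = params[:i] + [params[i] - h] + params[i + 1:]
        grads.append((loss_fn(up) - loss_fn(down)) / (2.0 * h))
    return grads


def loss(w: List[float]) -> float:
    """Squared error of the linear model w[0] * x + w[1] on one data point."""
    x, y = 2.0, 3.0
    return (w[0] * x + w[1] - y) ** 2


def train_step(params: List[float], lr: float) -> List[float]:
    grads = numeric_grad(loss, params)                     # differentiation happens here
    return [p - lr * g for p, g in zip(params, grads)]     # optimizer only consumes grads


initial_loss = loss([0.0, 0.0])
params = [0.0, 0.0]
for _ in range(50):
    params = train_step(params, 0.05)
final_loss = loss(params)
```

The update line never looks at the loss function itself, only at the gradient values, which mirrors the division of labor described above.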