Optimizers

Optimizers live under the School.Optim prefix. Every optimizer follows the same shape:

  • A config record holds the hyperparameters (learning rate and any momentum, beta, epsilon, or weight-decay terms).
  • An init function builds the initial optimizer state from the parameters, for example sgd_init_like(&params).
  • A step function takes the parameters, the gradients, the state, and the config, and returns a pair of the next parameters and the next state.

Because values are immutable, a step never mutates anything in place; the caller threads the returned parameters and state into the next step. Each optimizer comes in two forms: a _step form that updates a single flat parameter tensor of type tensor[n, f32], and a _step_list form that operates over a List[tensor[k, f32]] of parameter tensors at the optimizer boundary. A _step_list_trajectory driver runs several list-steps against a fixed gradient list.
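The shape described above (config record, init function, pure step, list form, trajectory driver) can be sketched in Python. This is an illustration of the pattern only, not School code: the names mirror the SGD entry points, but the SgdState field, the single shared state in the list form, and the n_steps argument of the trajectory driver are assumptions made for the sketch.

```python
from typing import NamedTuple, Tuple

Vector = Tuple[float, ...]  # immutable stand-in for a flat tensor[n, f32]


class SgdConfig(NamedTuple):
    lr: float


class SgdState(NamedTuple):
    step: int  # assumed field; the real state layout is not documented here


def sgd_init_like(params: Vector) -> SgdState:
    """Build the initial state; plain SGD keeps no per-parameter buffers."""
    return SgdState(step=0)


def sgd_step(params: Vector, grads: Vector, state: SgdState, cfg: SgdConfig):
    """Return (next_params, next_state) without touching the inputs."""
    next_params = tuple(p - cfg.lr * g for p, g in zip(params, grads))
    return next_params, SgdState(step=state.step + 1)


def sgd_step_list(params_list, grads_list, state, cfg):
    """Apply the same update to every tensor in a list of tensors."""
    next_list = tuple(
        tuple(p - cfg.lr * g for p, g in zip(ps, gs))
        for ps, gs in zip(params_list, grads_list)
    )
    return next_list, SgdState(step=state.step + 1)


def sgd_step_list_trajectory(params_list, grads_list, state, cfg, n_steps):
    """Run n_steps list-steps against one fixed gradient list (assumed signature)."""
    for _ in range(n_steps):
        params_list, state = sgd_step_list(params_list, grads_list, state, cfg)
    return params_list, state


params = (1.0, 2.0)
grads = (0.5, -1.0)
cfg = SgdConfig(lr=0.1)
s0 = sgd_init_like(params)
p1, s1 = sgd_step(params, grads, s0, cfg)  # params and s0 are left unchanged
p2, s2 = sgd_step(p1, grads, s1, cfg)      # thread the results into the next step
```

Because each call returns fresh values, earlier parameters and states remain valid and can be kept for comparison or rollback.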

This chapter is a reference for the optimizers defined under src/optim/ and src/optim.ch.

Optimizer                    Module                 Config fields
SGD                          School.Optim.Sgd       lr
SGD with momentum            School.Optim.SgdMom    lr, momentum
SGD with Nesterov momentum   School.Optim.SgdNes    lr, momentum
Adam                         School.Optim.Adam      lr, beta1, beta2, eps
AdamW                        School.Optim           lr, beta1, beta2, eps, weight_decay
Adagrad                      School.Optim.Adagrad   lr, eps
RMSProp                      School.Optim.RmsProp   lr, alpha, eps, weight_decay, momentum
Lion                         School.Optim.Lion      lr, beta1, beta2, weight_decay
LAMB                         School.Optim           lr, beta1, beta2, eps, weight_decay

The SGD family is in src/optim/sgd.ch, src/optim/sgdmom.ch, and src/optim/sgdnes.ch. Adam is in src/optim/adam.ch, Adagrad in src/optim/adagrad.ch, RMSProp in src/optim/rmsprop.ch, and Lion in src/optim/lion.ch. AdamW and LAMB are in the top-level src/optim.ch. School.Optim.ListOps (src/optim/listops.ch) provides the list helpers the _step_list forms share.

For plain SGD, the entry points are sgd_init_like and sgd_step, together with the config record SgdConfig (examples/p2/sgd_step.ch):

import School.Optim.Sgd (SgdConfig, sgd_init_like, sgd_step)
...
s0 = sgd_init_like(&params)
cfg = SgdConfig { lr: cast(0.1, f32) }
result = sgd_step(params, grads, s0, cfg)

result.0 is the next parameter tensor and result.1 is the next state.

RMSProp carries the squared-gradient running average plus optional momentum and weight decay (examples/p2/rmsprop_demo.ch):

import School.Optim.RmsProp (RmsPropConfig, rmsprop_init_like, rmsprop_step)
...
s0 = rmsprop_init_like(&params)
cfg = RmsPropConfig { lr: cast(0.01, f32), alpha: cast(0.99, f32), eps: cast(0.00000001, f32), weight_decay: cast(0.0, f32), momentum: cast(0.0, f32) }
result = rmsprop_step(params, grads, s0, cfg)
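To make the role of each RmsPropConfig field concrete, here is a Python sketch of one RMSProp step. It assumes the conventional formulation (weight decay folded into the gradient, eps added after the square root, momentum applied to the scaled gradient); the source does not state the exact order School uses, and the RmsPropState fields are hypothetical.

```python
import math
from typing import NamedTuple, Tuple

Vector = Tuple[float, ...]  # immutable stand-in for tensor[n, f32]


class RmsPropConfig(NamedTuple):
    lr: float
    alpha: float
    eps: float
    weight_decay: float
    momentum: float


class RmsPropState(NamedTuple):
    square_avg: Vector  # running average of squared gradients (assumed name)
    buf: Vector         # momentum buffer (assumed name)


def rmsprop_init_like(params: Vector) -> RmsPropState:
    zeros = tuple(0.0 for _ in params)
    return RmsPropState(square_avg=zeros, buf=zeros)


def rmsprop_step(params, grads, state, cfg):
    next_params, next_sq, next_buf = [], [], []
    for p, g, v, b in zip(params, grads, state.square_avg, state.buf):
        g = g + cfg.weight_decay * p                   # L2-style weight decay
        v = cfg.alpha * v + (1.0 - cfg.alpha) * g * g  # squared-gradient average
        scaled = g / (math.sqrt(v) + cfg.eps)          # per-element rescaling
        b = cfg.momentum * b + scaled                  # reduces to scaled when momentum is 0
        next_params.append(p - cfg.lr * b)
        next_sq.append(v)
        next_buf.append(b)
    return tuple(next_params), RmsPropState(tuple(next_sq), tuple(next_buf))


cfg = RmsPropConfig(lr=0.01, alpha=0.99, eps=1e-8, weight_decay=0.0, momentum=0.0)
params = (1.0,)
grads = (2.0,)
s0 = rmsprop_init_like(params)
p1, s1 = rmsprop_step(params, grads, s0, cfg)
```

With a zero initial average, the first step rescales the gradient by roughly 1 / sqrt(1 - alpha), so early updates are larger than lr times the raw gradient sign would suggest.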

Lion tracks a momentum buffer and updates by the sign of an interpolated momentum (examples/p2/lion_demo.ch):

import School.Optim.Lion (LionConfig, lion_init_like, lion_step)
...
s0 = lion_init_like(&params)
cfg = LionConfig { lr: cast(0.1, f32), beta1: cast(0.9, f32), beta2: cast(0.99, f32), weight_decay: cast(0.0, f32) }
result = lion_step(params, grads, s0, cfg)
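The sentence above compresses the Lion update into one line; a Python sketch spells it out. It assumes the commonly published form of Lion (beta1 interpolates the update direction, beta2 updates the stored momentum, weight decay is decoupled). Whether School's lion_step matches this term for term is not confirmed by the source, and the LionState field name is hypothetical.

```python
from typing import NamedTuple, Tuple

Vector = Tuple[float, ...]  # immutable stand-in for tensor[n, f32]


class LionConfig(NamedTuple):
    lr: float
    beta1: float
    beta2: float
    weight_decay: float


class LionState(NamedTuple):
    m: Vector  # momentum buffer (assumed name)


def sign(x: float) -> float:
    """Return 1.0, -1.0, or 0.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def lion_init_like(params: Vector) -> LionState:
    return LionState(m=tuple(0.0 for _ in params))


def lion_step(params, grads, state, cfg):
    next_params, next_m = [], []
    for p, g, m in zip(params, grads, state.m):
        c = cfg.beta1 * m + (1.0 - cfg.beta1) * g   # interpolated momentum
        update = sign(c) + cfg.weight_decay * p     # sign step plus decoupled decay
        next_params.append(p - cfg.lr * update)
        next_m.append(cfg.beta2 * m + (1.0 - cfg.beta2) * g)  # stored momentum
    return tuple(next_params), LionState(m=tuple(next_m))


cfg = LionConfig(lr=0.1, beta1=0.9, beta2=0.99, weight_decay=0.0)
params = (1.0, -1.0, 0.5)
grads = (0.3, -0.2, 0.0)
s0 = lion_init_like(params)
p1, s1 = lion_step(params, grads, s0, cfg)
```

Since only the sign of the interpolated momentum is used, every nonzero element moves by exactly lr per step (before weight decay), regardless of gradient magnitude.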

The gradients passed to a step are computed with the language's grad transform. The MLP training step in the Model library shows the full pattern: take grad of a scalar loss to get the per-parameter gradient, then call the optimizer step. The optimizer itself does no differentiation; it consumes gradients and produces updated parameters.
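The division of labor described above can be sketched in Python. This is not the Model library's MLP training step: loss and loss_grad are hypothetical, and the hand-written gradient stands in for the language's grad transform. The point is only that the optimizer step receives gradients as plain data.

```python
from typing import Tuple

Vector = Tuple[float, ...]  # immutable stand-in for tensor[n, f32]

TARGET: Vector = (3.0, -1.0)  # hypothetical optimum for the toy loss


def loss(params: Vector) -> float:
    """Scalar loss: squared distance from TARGET."""
    return sum((p - t) ** 2 for p, t in zip(params, TARGET))


def loss_grad(params: Vector) -> Vector:
    """Hand-written gradient of loss; stands in for grad(loss)."""
    return tuple(2.0 * (p - t) for p, t in zip(params, TARGET))


def sgd_step(params: Vector, grads: Vector, lr: float) -> Vector:
    """Optimizer step: consumes gradients, performs no differentiation."""
    return tuple(p - lr * g for p, g in zip(params, grads))


params: Vector = (0.0, 0.0)
initial_loss = loss(params)
for _ in range(50):
    grads = loss_grad(params)             # differentiation happens here
    params = sgd_step(params, grads, 0.1)  # the optimizer only applies the update
final_loss = loss(params)
```

Keeping the two stages separate means any optimizer in this chapter can be swapped in without changing how gradients are produced.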