Getting started
This chapter builds up from a single activation to a trained model, using
examples from the corpus under examples/. Each example is a self-contained
Chelis module with a main function, and each is run as a test:
    chelis test tests/examples/p2/mlp_module.ch

Requirements
School is a reef package. From a checkout, build the package and run the tests
with the Chelis toolchain pinned in reef.toml:
    chelis reef build
    chelis test tests/ --jobs auto

A single layer
The smallest thing you can do is apply an activation. School.Nn.Relu.relu_forward
takes a tensor of any rank and returns the elementwise rectified value
(examples/p0/relu_demo.ch):

    module School.Examples.P0.Relu_Demo

    import School.Nn.Relu (relu_forward)

    export (main)

    def main() -> f32 = {
      xs = to_tensor([cast(-2.0, f32), cast(-0.5, f32), cast(0.0, f32), cast(0.5, f32), cast(2.0, f32)])
      ys = relu_forward(xs)
      total = fold(fn (acc: f32, x: f32) -> add(acc, x), cast(0.0, f32), to_list(ys))
      total
    }

The negative inputs clamp to zero, so the sum is 0.5 + 2.0 = 2.5.
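
The arithmetic can be checked outside Chelis. The following is a minimal plain-Python sketch, not School's API: the `relu` helper is a hypothetical stand-in for relu_forward, and a list stands in for the tensor.

```python
def relu(xs):
    # Elementwise rectification: negative values become 0.0.
    return [max(0.0, x) for x in xs]


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
ys = relu(xs)    # the two negative inputs clamp to 0.0
total = sum(ys)  # 0.5 + 2.0
print(ys, total)
```

Only the two positive inputs survive, which is why the sum matches the 2.5 stated above.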
A loss
A loss reduces a prediction and a target to a scalar. School.Loss.Mse.mse_loss
returns the mean squared error (examples/p1/mse_smoke.ch):

    module School.Examples.P1.Mse_Smoke

    import School.Loss.Mse (mse_loss)

    export (main)

    def main() -> f32 = {
      pred = to_tensor([cast(1.0, f32), cast(2.0, f32), cast(3.0, f32)])
      target = to_tensor([cast(0.0, f32), cast(0.0, f32), cast(0.0, f32)])
      v = mse_loss(pred, target)
      v
    }

The squared errors are 1, 4, and 9, so the mean is 14 / 3, about 4.667.
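
As a cross-check on that value, here is a minimal plain-Python sketch of mean squared error. The `mse` function is a hypothetical stand-in written for illustration, not the School implementation.

```python
def mse(pred, target):
    # Mean of the squared elementwise differences.
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return sum(squared) / len(squared)


pred = [1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0]
v = mse(pred, target)  # (1 + 4 + 9) / 3
print(v)
```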
An optimizer step
An optimizer turns a gradient into a parameter update. Every optimizer follows
the same shape: build a config, initialize state from the parameters, then call
the step, which returns the next parameters and the next state as a pair
(examples/p2/sgd_step.ch):

    module School.Examples.P2.Sgd_Step

    import School.Optim.Sgd (SgdConfig, sgd_init_like, sgd_step)

    export (main)

    def main() -> f32 = {
      params = to_tensor([cast(1.0, f32), cast(2.0, f32), cast(3.0, f32)])
      grads = to_tensor([cast(0.1, f32), cast(0.1, f32), cast(0.1, f32)])
      s0 = sgd_init_like(&params)
      cfg = SgdConfig { lr: cast(0.1, f32) }
      result = sgd_step(params, grads, s0, cfg)
      first = index(to_list(result.0), cast(0, int64))
      first
    }

With a learning rate of 0.1 and a gradient of 0.1, the update is 0.1 × 0.1 = 0.01,
so the first parameter moves from 1.0 to 0.99. The next parameters are result.0
and the next state is result.1.
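
The config, init, step shape can be sketched in plain Python. This is an illustration under stated assumptions, not School's implementation: the names mirror the Chelis example, but the state layout (a step counter) is hypothetical, since this chapter does not show what the SGD state contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SgdConfig:
    lr: float


def sgd_init_like(params):
    # Hypothetical state: a step counter. The real state layout is an
    # assumption; the chapter does not show it.
    return {"step": 0}


def sgd_step(params, grads, state, cfg):
    # Plain gradient descent: p - lr * g for each parameter.
    next_params = [p - cfg.lr * g for p, g in zip(params, grads)]
    next_state = {"step": state["step"] + 1}
    return next_params, next_state  # returned as a pair


params = [1.0, 2.0, 3.0]
grads = [0.1, 0.1, 0.1]
cfg = SgdConfig(lr=0.1)
result = sgd_step(params, grads, sgd_init_like(params), cfg)
first = result[0][0]  # approximately 0.99
print(first)
```

Returning the new parameters and state instead of mutating them is what lets a caller thread both values through repeated steps.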
Putting it together: training an MLP
The pieces compose into a training run. The example below initializes an MLP from
a config, then runs full-batch gradient descent over a small toy set for thirty
steps and returns the final loss (examples/p2/mlp_module.ch):

    module School.Examples.P2.Mlp_Module

    import School.Models.Mlp (MlpConfig, MlpParams, mlp_init_like, mlp_sgd_step, mlp_sq_loss_value)

    export (main)

    def features() -> tensor[4, 2, f32] =
      reshape(
        to_tensor([
          cast(0.9, f32), cast(0.1, f32),
          cast(0.8, f32), cast(0.2, f32),
          cast(0.1, f32), cast(0.9, f32),
          cast(0.2, f32), cast(0.8, f32)
        ]),
        [cast(4, int64), cast(2, int64)]
      )

    def targets() -> tensor[4, 2, f32] =
      reshape(
        to_tensor([
          cast(1.0, f32), cast(0.0, f32),
          cast(1.0, f32), cast(0.0, f32),
          cast(0.0, f32), cast(1.0, f32),
          cast(0.0, f32), cast(1.0, f32)
        ]),
        [cast(4, int64), cast(2, int64)]
      )

    def train_steps(params: MlpParams, epochs: int64, i: int64) -> MlpParams = {
      if gte(i, epochs) then params
      else {
        stepped = mlp_sgd_step(params, features(), targets(), cast(0.1, f32))
        train_steps(stepped.0, epochs, add(i, cast(1, int64)))
      }
    }

    def main() -> f32 = {
      cfg = MlpConfig {
        in_dim: cast(2, int64),
        hidden_dim: cast(4, int64),
        out_dim: cast(2, int64),
        seed: cast(7, int64)
      }
      params0 = mlp_init_like(cast(0, int64), cfg)
      trained = train_steps(params0, cast(30, int64), cast(0, int64))
      mlp_sq_loss_value(trained, features(), targets())
    }

The mechanism worth noting is mlp_sgd_step. It calls
grad(mlp_sq_loss)(...), which returns the gradient of the sum-of-squares loss
with respect to each weight and bias tensor, then subtracts the scaled gradient
from each parameter. The training driver train_steps is an ordinary recursive
function. The loop itself is not differentiated; each step differentiates the
loss, and the recursion threads the updated parameters forward. The returned loss
is far below the loss at initialization, which is what tells you the model
trained.
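
The same pattern can be sketched in plain Python to show where differentiation happens. This is an illustration under loud assumptions, not School's code: a linear model stands in for the MLP, a central-difference `grad` helper stands in for Chelis's grad(...), and parameters start at zero instead of being seeded. Only the toy data, the learning rate of 0.1, and the thirty steps come from the example above.

```python
FEATURES = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
TARGETS = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]


def predict(params, x):
    # params is a flat list: 4 weights (2x2, row-major), then 2 biases.
    w, b = params[:4], params[4:]
    return [x[0] * w[k] + x[1] * w[2 + k] + b[k] for k in range(2)]


def sq_loss(params, xs, ts):
    # Sum of squared errors over every example and output.
    return sum(
        (p - t) ** 2
        for x, row in zip(xs, ts)
        for p, t in zip(predict(params, x), row)
    )


def grad(loss, eps=1e-5):
    # Stand-in for grad(...): returns a function that estimates
    # d(loss)/d(params) by central differences.
    def grad_fn(params, *args):
        out = []
        for i in range(len(params)):
            up = params[:i] + [params[i] + eps] + params[i + 1:]
            down = params[:i] + [params[i] - eps] + params[i + 1:]
            out.append((loss(up, *args) - loss(down, *args)) / (2 * eps))
        return out
    return grad_fn


def sgd_step(params, xs, ts, lr):
    # Differentiate the loss, then subtract the scaled gradient.
    g = grad(sq_loss)(params, xs, ts)
    return [p - lr * gi for p, gi in zip(params, g)]


def train_steps(params, epochs, i=0):
    # Ordinary recursion: only sq_loss is differentiated, never this loop.
    if i >= epochs:
        return params
    return train_steps(sgd_step(params, FEATURES, TARGETS, 0.1), epochs, i + 1)


params0 = [0.0] * 6
trained = train_steps(params0, 30)
initial_loss = sq_loss(params0, FEATURES, TARGETS)
final_loss = sq_loss(trained, FEATURES, TARGETS)
print(initial_loss, final_loss)
```

In this sketch the initial loss is 4.0, because every prediction starts at zero and four targets equal one. Comparing `final_loss` against it is the same check the Chelis example relies on.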
From here, the reference chapters cover the full surface: Layers, Losses, Optimizers, Schedules, the Training loop and data, and the Model library.