Getting started

This chapter builds up from a single activation to a trained model, using examples from the corpus under tests/examples/ (cited below by their path relative to tests/). Each example is a self-contained Chelis module with a main function, and each is run as a test:

chelis test tests/examples/p2/mlp_module.ch

Requirements

School is a reef package. From a checkout, build the package and run the tests with the Chelis toolchain pinned in reef.toml:

chelis reef build
chelis test tests/ --jobs auto

A single layer

The smallest thing you can do is apply an activation. School.Nn.Relu.relu_forward takes a tensor of any rank and returns a tensor of the same shape with each element rectified, that is, max(x, 0) (examples/p0/relu_demo.ch):

module School.Examples.P0.Relu_Demo
import School.Nn.Relu (relu_forward)
export (main)
def main() -> f32 = {
  xs = to_tensor([cast(-2.0, f32), cast(-0.5, f32), cast(0.0, f32), cast(0.5, f32), cast(2.0, f32)])
  ys = relu_forward(xs)
  total = fold(fn (acc: f32, x: f32) -> add(acc, x), cast(0.0, f32), to_list(ys))
  total
}

The negative inputs clamp to zero, so the sum is 0.5 + 2.0 = 2.5.
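The arithmetic can be checked outside Chelis. The block below is a minimal Python sketch, not School's implementation: the helper relu is a hypothetical stand-in for relu_forward applied to a flat list, and Python's 64-bit floats stand in for f32.

```python
# Illustrative stand-in for relu_forward on a flat list.
# `relu` is a hypothetical helper, not part of School.

def relu(xs):
    """Return max(x, 0) for every element of xs."""
    return [x if x > 0.0 else 0.0 for x in xs]


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
ys = relu(xs)      # negatives and zero map to 0.0
total = sum(ys)    # same reduction as the fold in the example
print(ys, total)   # prints [0.0, 0.0, 0.0, 0.5, 2.0] 2.5
```

The output has the same length as the input; only the values change.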

A loss

A loss reduces a prediction and a target to a scalar. School.Loss.Mse.mse_loss returns the mean squared error (examples/p1/mse_smoke.ch):

module School.Examples.P1.Mse_Smoke
import School.Loss.Mse (mse_loss)
export (main)
def main() -> f32 = {
  pred = to_tensor([cast(1.0, f32), cast(2.0, f32), cast(3.0, f32)])
  target = to_tensor([cast(0.0, f32), cast(0.0, f32), cast(0.0, f32)])
  v = mse_loss(pred, target)
  v
}

Every target is zero, so the squared errors are 1, 4, and 9, and their mean is 14 / 3 (about 4.667).
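As a cross-check of that value, here is a small Python sketch of mean squared error over flat lists. The function mse is a hypothetical helper written for this illustration; it is not School's mse_loss, and it computes in 64-bit floats rather than f32, so the trailing digits may differ from the Chelis result.

```python
# Illustrative mean squared error over two equal-length lists.
# `mse` is a hypothetical helper, not part of School.

def mse(pred, target):
    """Average of the squared elementwise differences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return sum(squared) / len(squared)


loss = mse([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(loss)  # approximately 4.667, i.e. 14 / 3
```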

An optimizer step

An optimizer turns a gradient into a parameter update. Every optimizer follows the same shape: build a config, initialize state from the parameters, then call the step, which returns the next parameters and the next state as a pair (examples/p2/sgd_step.ch):

module School.Examples.P2.Sgd_Step
import School.Optim.Sgd (SgdConfig, sgd_init_like, sgd_step)
export (main)
def main() -> f32 = {
  params = to_tensor([cast(1.0, f32), cast(2.0, f32), cast(3.0, f32)])
  grads = to_tensor([cast(0.1, f32), cast(0.1, f32), cast(0.1, f32)])
  s0 = sgd_init_like(&params)
  cfg = SgdConfig { lr: cast(0.1, f32) }
  result = sgd_step(params, grads, s0, cfg)
  first = index(to_list(result.0), cast(0, int64))
  first
}

With a learning rate of 0.1 and a gradient of 0.1, the first parameter moves from 1.0 to 1.0 - 0.1 * 0.1 = 0.99. The next parameters are result.0 and the next state is result.1.
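The config, init, step shape can be sketched in Python. The names below mirror those in the example, but they are Python stand-ins written for this illustration, not bindings to School. In particular, the source does not say what School's SGD state contains; the step counter used here is an assumption chosen only so that there is some state to thread through.

```python
from dataclasses import dataclass

# Illustrative stand-ins; none of this is School's actual implementation.


@dataclass(frozen=True)
class SgdConfig:
    lr: float


def sgd_init_like(params):
    """Build the initial optimizer state (assumed here: just a step counter)."""
    return {"step": 0}


def sgd_step(params, grads, state, cfg):
    """Return (next_params, next_state) without mutating the inputs."""
    next_params = [p - cfg.lr * g for p, g in zip(params, grads)]
    next_state = {"step": state["step"] + 1}
    return next_params, next_state


params = [1.0, 2.0, 3.0]
grads = [0.1, 0.1, 0.1]
cfg = SgdConfig(lr=0.1)
s0 = sgd_init_like(params)
next_params, s1 = sgd_step(params, grads, s0, cfg)
print(round(next_params[0], 6))  # prints 0.99
```

Returning the new parameters and state as a pair, rather than updating in place, means the caller decides what to keep, which is what lets a training loop thread both values from one step to the next.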

Putting it together: training an MLP

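The pieces compose into a training run. The example below initializes an MLP from a config, then runs full-batch gradient descent over a small toy set for thirty steps and returns the final loss. Because every step uses the whole batch, a step is also an epoch, which is why the step count is named epochs in the code (examples/p2/mlp_module.ch):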

module School.Examples.P2.Mlp_Module
import School.Models.Mlp (MlpConfig, MlpParams, mlp_init_like, mlp_sgd_step, mlp_sq_loss_value)
export (main)
def features() -> tensor[4, 2, f32] = reshape(to_tensor([cast(0.9, f32), cast(0.1, f32), cast(0.8, f32), cast(0.2, f32), cast(0.1, f32), cast(0.9, f32), cast(0.2, f32), cast(0.8, f32)]), [cast(4, int64), cast(2, int64)])
def targets() -> tensor[4, 2, f32] = reshape(to_tensor([cast(1.0, f32), cast(0.0, f32), cast(1.0, f32), cast(0.0, f32), cast(0.0, f32), cast(1.0, f32), cast(0.0, f32), cast(1.0, f32)]), [cast(4, int64), cast(2, int64)])
def train_steps(params: MlpParams, epochs: int64, i: int64) -> MlpParams = {
  if gte(i, epochs) then params else {
    stepped = mlp_sgd_step(params, features(), targets(), cast(0.1, f32))
    train_steps(stepped.0, epochs, add(i, cast(1, int64)))
  }
}
def main() -> f32 = {
  cfg = MlpConfig { in_dim: cast(2, int64), hidden_dim: cast(4, int64), out_dim: cast(2, int64), seed: cast(7, int64) }
  params0 = mlp_init_like(cast(0, int64), cfg)
  trained = train_steps(params0, cast(30, int64), cast(0, int64))
  mlp_sq_loss_value(trained, features(), targets())
}

The mechanism worth noting is mlp_sgd_step. It calls grad(mlp_sq_loss)(...), which returns the gradient of the sum-of-squares loss with respect to each weight and bias tensor, then subtracts the scaled gradient from each parameter. The training driver train_steps is an ordinary recursive function. The loop itself is not differentiated; each step differentiates the loss, and the recursion threads the updated parameters forward. The returned loss is far below the loss at initialization, which is what tells you the model trained.
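The structure described above, a recursive driver that threads parameters forward while each step takes a fresh gradient, can be sketched in Python. This is a deliberately simplified illustration under stated assumptions: the model is a two-parameter linear fit rather than an MLP, the gradient is derived by hand rather than obtained from grad(...), and the data, learning rate, and all function names are hypothetical.

```python
# Illustrative sketch: linear model y = w * x + b with a sum-of-squares loss.
# The gradient is written out by hand; School's mlp_sgd_step uses grad(...).


def loss(params, xs, ys):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def grad_loss(params, xs, ys):
    """Partial derivatives of the sum-of-squares loss with respect to w and b."""
    w, b = params
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys))
    return dw, db


def sgd_step(params, xs, ys, lr):
    """One full-batch step: differentiate the loss, subtract the scaled gradient."""
    w, b = params
    dw, db = grad_loss(params, xs, ys)
    return (w - lr * dw, b - lr * db)


def train_steps(params, steps, i, xs, ys, lr):
    """Plain recursion; only the loss inside each step is differentiated."""
    if i >= steps:
        return params
    return train_steps(sgd_step(params, xs, ys, lr), steps, i + 1, xs, ys, lr)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
params0 = (0.0, 0.0)
initial_loss = loss(params0, xs, ys)
trained = train_steps(params0, 30, 0, xs, ys, 0.05)
final_loss = loss(trained, xs, ys)
print(initial_loss, final_loss)
```

The initial loss here is 84.0, and the final loss after thirty steps is far smaller. As in the Chelis example, comparing the two is the check that training worked.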

From here, the reference chapters cover the full surface: Layers, Losses, Optimizers, Schedules, the Training loop and data, and the Model library.