
Scope and limitations

School is a deep-learning framework built on a tensor-first language, and the shape of that language shows through in how you write models. This chapter sets out what the framework covers and the constraints that come with it.

School's public surface is exactly what this book documents:

  • The layers under School.Nn in the Layers chapter.
  • The losses and metrics under School.Loss in the Losses chapter.
  • The optimizers under School.Optim in the Optimizers chapter.
  • The schedules under School.Schedule and School.SchedExt in the Schedules chapter.
  • The data layer, training loop, and metrics under School.Data and School.Train in the Training loop and data chapter.
  • The models under School.Models in the Model library chapter.

The statistical primitives (distributions, special functions, linear algebra) come from the upstream nautilus shell, and the weight initializers come from upstream Std.Init. School does not reimplement these.

The framework differentiates with the language's grad transform, and that transform has a defined contract that shapes how loss functions are written:

  • grad(f)(args) differentiates a scalar-returning, pure-tensor function and returns the per-argument gradient as a tuple. The arguments are tensors and scalars.
  • The differentiated function does not take ADT, struct, or record arguments, and its body does not pattern-match or walk lists. The idiom is to unpack a parameter record in the caller and pass flat tensors to a pure local function. The MLP's mlp_sq_loss and mlp_sq_loss_value in the Model library are the reference for this pattern.
  • An if/then/else is differentiated through the branch that is taken. Loops and recursion are not differentiated through; instead, the training driver takes grad once per step, and the recursion threads the updated parameters forward outside the differentiated region.
  • There is no broadcasting. Where two tensors must align, the code uses expand and reshape explicitly. School's layers handle this inside their forward functions, so a caller passes a rank-1 bias to a linear layer and the layer expands it. When you write your own forward code, you align shapes yourself.
  • Reduction and expansion axes are compile-time constants. An axis is a literal, not a runtime value.
  • Values are immutable. Optimizers and the training loop thread new values explicitly rather than mutating in place. Every optimizer step and every loop step returns its next state.
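
School's own syntax is not reproduced in this chapter, so the contract above is sketched here in JAX as a rough analogue. JAX's grad is looser than the transform described above (it accepts records and broadcasts implicitly), so the sketch follows the restrictions by convention only. Every name in it (the params dict, sq_loss, sgd_step, train) is hypothetical and is not School API.

```python
import jax
import jax.numpy as jnp

# Hypothetical parameter record; School's real record types are not shown here.
params = {"w": jnp.array([[1.0, 2.0], [3.0, 4.0]]), "b": jnp.array([0.5, -0.5])}
x = jnp.eye(2)          # two examples, two features
y = jnp.zeros((2, 2))   # regression targets


def sq_loss(w, b, x, y):
    """Pure, scalar-returning loss that takes flat tensors only."""
    batch = x.shape[0]
    # Align the rank-1 bias by hand (reshape, then expand) rather than
    # relying on implicit broadcasting.
    b_full = jnp.broadcast_to(jnp.reshape(b, (1, -1)), (batch, b.shape[0]))
    pred = x @ w + b_full
    return jnp.sum((pred - y) ** 2)


def sgd_step(params, x, y, lr):
    # Unpack the record in the caller; the differentiated function sees tensors.
    w, b = params["w"], params["b"]
    # One gradient per differentiated argument, returned as a tuple.
    gw, gb = jax.grad(sq_loss, argnums=(0, 1))(w, b, x, y)
    # Return the next state; the input record is left untouched.
    return {"w": w - lr * gw, "b": b - lr * gb}


def train(params, x, y, lr, steps):
    # The recursion sits outside the differentiated region: grad is taken
    # once per step and the updated parameters are threaded forward.
    if steps == 0:
        return params
    return train(sgd_step(params, x, y, lr), x, y, lr, steps - 1)


before = sq_loss(params["w"], params["b"], x, y)
trained = train(params, x, y, lr=0.05, steps=5)
after = sq_loss(trained["w"], trained["b"], x, y)
```

The differentiated function never sees the record, the step function returns a new record instead of mutating the old one, and the loop is ordinary recursion around a single grad call per step.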

Some models in the library run at a fixed input shape rather than over an arbitrary batch and resolution. ResnetV2 runs at batch 1 with an 8x8 spatial input, and SimpleVisionCnn runs at a 32x32 input. The configs for these models expose the channel and class counts; the spatial dimensions are fixed by the architecture. The Model library chapter notes which dimensions each model leaves polymorphic.

Initialization, the data loader, and the training loop are deterministic functions of their seeds. The same configuration and seed reproduce the same parameters, the same batch order, and the same trained result. This is what lets a training run be checked against a reference trajectory.