Scope and limitations
School is a deep-learning framework built on a tensor-first language, and the shape of the language shows through in how you write models. This chapter sets out what the framework covers and the constraints that come with it.
What the framework covers
The surface documented in this book is the surface School provides:
- The layers under School.Nn in the Layers chapter.
- The losses and metrics under School.Loss in the Losses chapter.
- The optimizers under School.Optim in the Optimizers chapter.
- The schedules under School.Schedule and School.SchedExt in the Schedules chapter.
- The data layer, training loop, and metrics under School.Data and School.Train in the Training loop and data chapter.
- The models under School.Models in the Model library chapter.
The statistical primitives (distributions, special functions, linear algebra) come
from the upstream nautilus shell, and the weight initializers come from upstream
Std.Init. School does not reimplement these.
How differentiation constrains your code
The framework differentiates with the language's grad transform, and that
transform has a defined contract that shapes how loss functions are written:
- grad(f)(args) differentiates a scalar-returning, pure-tensor function and returns the per-argument gradient as a tuple. The arguments are tensors and scalars.
- The differentiated function does not take ADT, struct, or record arguments, and its body does not pattern-match or walk lists. The idiom is to unpack a parameter record in the caller and pass flat tensors to a pure local function. The MLP's mlp_sq_loss and mlp_sq_loss_value in the Model library are the reference for this pattern.
- An if/then/else differentiates through the chosen branch. Loops and recursion are not differentiated through; instead, the training driver takes grad once per step and the recursion threads the updated parameters forward, outside the differentiated region.
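The contract above can be sketched by analogy in plain Python. This is not School syntax. numeric_grad is a hypothetical finite-difference stand-in for the language's grad transform, LinearParams is a made-up parameter record, and Python lists stand in for rank-1 tensors. Only the shape of the code carries over: the record is unpacked in the caller, a pure local function returns a scalar, the gradient is taken once per step, and recursion threads the new parameters forward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]  # stand-in for a rank-1 tensor


@dataclass(frozen=True)
class LinearParams:
    """Hypothetical parameter record, used only for this illustration."""
    w: Vector
    b: float


def numeric_grad(f: Callable[[Vector, float], float], eps: float = 1e-6):
    """Stand-in for a grad transform: central differences, one gradient per argument."""
    def grad_f(w: Vector, b: float) -> Tuple[Vector, float]:
        dw = []
        for i in range(len(w)):
            hi = w[:i] + [w[i] + eps] + w[i + 1:]
            lo = w[:i] + [w[i] - eps] + w[i + 1:]
            dw.append((f(hi, b) - f(lo, b)) / (2 * eps))
        db = (f(w, b + eps) - f(w, b - eps)) / (2 * eps)
        return dw, db
    return grad_f


def train_step(params: LinearParams, x: Vector, y: float, lr: float) -> LinearParams:
    # The pure local function sees only flat values and returns a scalar.
    def sq_loss(w: Vector, b: float) -> float:
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        return (pred - y) ** 2

    # Unpack the record in the caller and take the gradient once for this step.
    dw, db = numeric_grad(sq_loss)(params.w, params.b)
    new_w = [wi - lr * gi for wi, gi in zip(params.w, dw)]
    return LinearParams(w=new_w, b=params.b - lr * db)


def train(params: LinearParams, x: Vector, y: float, lr: float, steps: int) -> LinearParams:
    # The recursion lives outside the differentiated function and threads
    # the updated parameters forward from one step to the next.
    if steps == 0:
        return params
    return train(train_step(params, x, y, lr), x, y, lr, steps - 1)
```

Calling train(LinearParams(w=[0.0, 0.0], b=0.0), [1.0, 2.0], 3.0, 0.05, 50) returns a new record and leaves the starting record untouched, which mirrors the way the driver recursion sits outside the differentiated region.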
Tensor conventions
- There is no broadcasting. Where two tensors must align, the code uses expand and reshape explicitly. School's layers handle this inside their forward functions, so a caller passes a rank-1 bias to a linear layer and the layer expands it. When you write your own forward code, you align shapes yourself.
- Reduction and expansion axes are compile-time constants. An axis is a literal, not a runtime value.
- Values are immutable. Optimizers and the training loop thread new values explicitly rather than mutating in place. Every optimizer step and every loop step returns its next state.
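As an illustration of the no-broadcasting rule, here is a minimal Python sketch. It is an analogy, not School code: expand_rows, add_same_shape, and linear_forward are hypothetical helpers, and nested lists stand in for rank-2 tensors. The elementwise add refuses mismatched shapes, so the layer has to expand the rank-1 bias itself before adding.

```python
from typing import List

Matrix = List[List[float]]  # stand-in for a rank-2 tensor


def expand_rows(bias: List[float], rows: int) -> Matrix:
    """Explicitly repeat a rank-1 bias along a new leading axis of size `rows`."""
    return [list(bias) for _ in range(rows)]


def add_same_shape(a: Matrix, b: Matrix) -> Matrix:
    """Elementwise add that rejects mismatched shapes instead of broadcasting."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("shapes must match exactly; expand or reshape first")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def linear_forward(x: Matrix, w: Matrix, bias: List[float]) -> Matrix:
    """x is (batch, in), w is (in, out), bias is (out,).

    The layer expands the bias to (batch, out) itself, so the caller can
    pass the rank-1 bias directly.
    """
    xw = [[sum(xi * w[k][j] for k, xi in enumerate(row)) for j in range(len(bias))]
          for row in x]
    return add_same_shape(xw, expand_rows(bias, len(x)))
```

Passing a (1, 1) and a (1, 2) matrix to add_same_shape raises ValueError rather than silently stretching one operand, which is the behavior to expect when writing forward code by hand.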
Fixed-shape models
Some models in the library run at a fixed input shape rather than over an arbitrary
batch and resolution. ResnetV2 runs at batch 1 with an 8x8 spatial input, and
SimpleVisionCnn runs at a 32x32 input. The configs for these models expose the
channel and class counts; the spatial dimensions are fixed by the architecture. The
model chapter notes which dimensions each model leaves polymorphic.
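One way to picture the split between configurable and fixed dimensions is the Python sketch below. Everything in it is hypothetical: FixedShapeConfig, its field names, and the (batch, channels, height, width) axis order are assumptions for illustration, not the library's actual config. The constants mirror the batch-1, 8x8 case described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FixedShapeConfig:
    """Hypothetical config: channel and class counts vary, spatial size does not."""
    in_channels: int
    num_classes: int


# Fixed by the architecture in this sketch (batch 1, 8x8 spatial input).
BATCH, HEIGHT, WIDTH = 1, 8, 8


def expected_input_shape(cfg: FixedShapeConfig) -> Tuple[int, int, int, int]:
    # Axis order (batch, channels, height, width) is an assumption of this sketch.
    return (BATCH, cfg.in_channels, HEIGHT, WIDTH)


def check_input_shape(cfg: FixedShapeConfig, shape: Tuple[int, ...]) -> None:
    """Raise if the input does not match the one shape the model runs at."""
    want = expected_input_shape(cfg)
    if tuple(shape) != want:
        raise ValueError(f"expected input shape {want}, got {tuple(shape)}")
```

Changing in_channels changes the accepted shape, while no config field can change the batch size or the spatial dimensions.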
Determinism
Initialization, the data loader, and the training loop are deterministic functions of their seeds. The same configuration and seed reproduce the same parameters, the same batch order, and the same trained result. This is what lets a training run be checked against a reference trajectory.
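The property can be sketched in plain Python, assuming a seeded private generator in place of School's own initializers and loader. The names init_weights, batch_order, and matches_reference are hypothetical. Each function depends only on its arguments, so the same seed always yields the same values and a run can be compared against a stored reference.

```python
import random
from typing import List


def init_weights(seed: int, n: int) -> List[float]:
    """Draw n initial weights from a generator owned by this call."""
    rng = random.Random(seed)  # private generator; no global state involved
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


def batch_order(seed: int, num_examples: int, batch_size: int) -> List[List[int]]:
    """Shuffle example indices with the given seed and cut them into batches."""
    rng = random.Random(seed)
    idx = list(range(num_examples))
    rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, num_examples, batch_size)]


def matches_reference(seed: int, reference: List[float]) -> bool:
    """Check a fresh initialization against a previously recorded one."""
    return init_weights(seed, len(reference)) == reference
```

Recording init_weights(seed, n) once and later calling matches_reference with the same seed is the small-scale version of checking a training run against a reference trajectory.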