Introduction

School is a deep-learning framework for the Chelis programming language. It ships as a reef package under the School module prefix and is written entirely in Chelis.

The idea is a mini-JAX on a tensor-first language. You compose an architecture out of School layers, differentiate it with the language's grad transform, and train it with real optimizer updates and a real training loop. Automatic differentiation flows through the composition of tensor operations, so a forward function built from differentiable primitives is differentiable end to end, with no hand-written backward pass.

  • Typed layers under School.Nn: linear and convolutional layers, embeddings and positional encodings, normalization, attention, pooling, padding, dropout, and the common activations. Every layer is a plain function with an explicit tensor signature, so shapes are checked at compile time.
  • Losses under School.Loss: classification, regression, and embedding losses, plus accuracy and perplexity metrics.
  • Optimizers under School.Optim: the SGD family, Adam, AdamW, Adagrad, RMSProp, Lion, and LAMB. Each is a pure step function that threads new parameters and optimizer state explicitly.
  • Learning-rate schedules under School.Schedule and School.SchedExt.
  • A training loop and a data layer under School.Train and School.Data: a dataset abstraction, a shuffling mini-batch loader, an epoch-by-batch loop combinator, and training metrics.
  • A model library under School.Models: an MLP reference architecture and a set of vision and transformer architectures.
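The "pure step function" shape described for School.Optim can be illustrated outside Chelis. The following is a minimal Python sketch, not School's API: the names sgd_momentum_step and loss_grad are hypothetical, plain lists stand in for tensors, and the gradient of the toy loss is written by hand. It only shows how a step takes parameters and optimizer state and returns new ones, with the caller threading them forward.

```python
from typing import List, Tuple

Params = List[float]  # stand-in for a parameter tensor
State = List[float]   # momentum buffer, same length as Params


def sgd_momentum_step(
    params: Params, grads: Params, state: State, lr: float, momentum: float
) -> Tuple[Params, State]:
    """Return (next_params, next_state) without modifying any input."""
    next_state = [momentum * v + g for v, g in zip(state, grads)]
    next_params = [p - lr * v for p, v in zip(params, next_state)]
    return next_params, next_state


def loss_grad(params: Params) -> Params:
    """Gradient of the toy loss sum(p**2), written out by hand for this sketch."""
    return [2.0 * p for p in params]


params0 = [1.0, -2.0]
state0 = [0.0, 0.0]

# The caller threads parameters and state from one step to the next.
params, state = params0, state0
for _ in range(3):
    grads = loss_grad(params)
    params, state = sgd_momentum_step(params, grads, state, lr=0.1, momentum=0.9)
```

Because nothing is updated in place, params0 and state0 still hold their initial values after the loop, and any intermediate (params, state) pair can be kept or replayed.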

School follows the conventions of the language it targets.

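  • Tensors are typed by shape. A signature like tensor[a, b, f32] -> tensor[b, c, f32] -> tensor[c, f32] -> tensor[a, c, f32] is the type of a linear layer: an input of shape [a, b], a weight of shape [b, c], and a bias of shape [c] produce an output of shape [a, c]. Shape mismatches are compile-time errors.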
  • There is no broadcasting. Where two tensors must line up, the code uses expand and reshape explicitly. School layers do this for you inside their forward functions.
  • Values are immutable. Optimizers and the training loop do not mutate parameters in place. A step takes the current parameters and returns the next parameters, and the caller threads them forward.
  • Differentiation is a transform, not a tape. grad(f)(args) returns the per-argument gradient of a scalar-returning function f. School's layers and losses are written as compositions of differentiable tensor operations so that grad can run through them.
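The conventions above can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not Chelis code and not School's implementation: nested lists stand in for tensors, the helpers expand_rows, matmul, and linear are hypothetical, shape checks happen at run time rather than compile time, and grad here uses central finite differences as a stand-in for automatic differentiation and takes a single vector argument. It shows the [a, b] -> [b, c] -> [c] -> [a, c] linear layer with the bias expanded by hand, and grad used as a function-to-function transform.

```python
from typing import Callable, List

Matrix = List[List[float]]
Vector = List[float]


def expand_rows(v: Vector, rows: int) -> Matrix:
    """Explicitly expand a length-c vector to a rows-by-c matrix (no broadcasting)."""
    return [list(v) for _ in range(rows)]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """(a, b) times (b, c) gives (a, c); raises if the inner dimensions differ."""
    if any(len(row) != len(w) for row in x):
        raise ValueError("inner dimensions do not match")
    cols = len(w[0])
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for row in x]


def linear(x: Matrix, w: Matrix, bias: Vector) -> Matrix:
    """[a, b] -> [b, c] -> [c] -> [a, c], with the bias expanded by hand."""
    xw = matmul(x, w)
    b = expand_rows(bias, len(xw))
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(xw, b)]


def grad(f: Callable[[Vector], float], eps: float = 1e-6) -> Callable[[Vector], Vector]:
    """Transform scalar-valued f into a function returning its gradient.

    Central finite differences stand in for automatic differentiation here.
    """
    def df(args: Vector) -> Vector:
        out = []
        for i in range(len(args)):
            hi = args[:i] + [args[i] + eps] + args[i + 1:]
            lo = args[:i] + [args[i] - eps] + args[i + 1:]
            out.append((f(hi) - f(lo)) / (2.0 * eps))
        return out
    return df


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape (3, 2)
w = [[1.0], [-1.0]]                        # shape (2, 1)


def loss(bias: Vector) -> float:
    """Scalar loss: sum of squared outputs, as a function of the bias only."""
    y = linear(x, w, bias)
    return sum(v * v for row in y for v in row)


y = linear(x, w, [0.5])          # shape (3, 1)
bias_grad = grad(loss)([0.5])    # gradient of the loss with respect to the bias
```

Note that grad(loss) is itself a function; no state is recorded while loss runs, which is the sense in which differentiation is a transform rather than a tape.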

The statistical primitives School builds on (distributions, special functions, linear algebra) come from the upstream nautilus shell, and weight initializers (Xavier, Kaiming) come from upstream Std.Init. School does not reimplement these.

Getting started walks through a small example, from a forward pass to a trained MLP. The reference chapters that follow document the layers, losses, optimizers, schedules, training loop, data layer, and model library. Each entry traces to the module that defines it under src/, and every code example is taken from the validated example corpus under examples/. The final chapter sets out the scope and limitations of the current surface.