Provenance and span tracking

Provenance is the reason Octant exists. The translation alone is useful, but the property that makes Octant part of a trust stack is that every piece of the output can be traced back to the formula it came from, and that the trace survives compilation. This chapter describes the span model, the manifest Octant writes, and how a compiled trace points back to the source formula.

The span model

Every value that flows through Octant's pipeline carries a span. A span records where in the LaTeX source a node originated:

Span {
  source_id,
  start_byte,
  end_byte,
  start_line,
  start_column,
  end_line,
  end_column,
  label,        // the \label{...} of a containing equation, if any
}

Byte offsets are zero-based; lines and columns are one-based to match editor convention. The label is the equation label of a containing \begin{equation} block when one is present.

Spans flow from the tokenizer through the parser, the normalizer, and the emitter. When a pass produces a node that did not appear literally in the source, the synthesized node inherits the span of its parent LaTeX node. For example, \frac{a}{b} becomes a division node that did not appear as a token in the source; that division node carries the span of the whole \frac expression. The synthesized (lit 1.0) in the x^{-1} lowering inherits the span of the power expression. A synthesized child never has a missing span; it always points back into the customer's LaTeX.

The spans manifest

Alongside the Deep .dp, translate writes a .spans.json provenance manifest. It is a flat list of entries plus a header:

{
  "source": "input.tex",
  "source_hash": "sha256:...",
  "spans": [
    {
      "deep_node_id": "eq:bs_001",
      "deep_path": "/__top[0]/def[0]/app[1]/app[1]",
      "latex": {
        "start_byte": 142,
        "end_byte": 168,
        "start_line": 7,
        "start_column": 12,
        "end_line": 7,
        "end_column": 38,
        "label": "eq:bs"
      },
      "latex_text": "s \\Phi(d_1) - k e^{-r t} \\Phi(d_2)"
    }
  ]
}

The header records the source filename and a hash of the source bytes, so a downstream tool can detect drift between the manifest and a changed source. Each entry has four fields:

deep_node_id: a stable identifier for the emitted Deep node. This is the identifier octant explain --target takes, and the identifier the Chelis compiler preserves through its passes.
deep_path: a slash-prefixed sequence of <tag>[<index>] segments locating the node in the Deep AST. Described below.
latex: the full span, byte range and line/column position, plus the equation label if present.
latex_text: the literal LaTeX text the span covers. This is the exact source slice, the bytes from start_byte to end_byte, verbatim.

Node identifiers

The emitter generates identifiers as <label-or-n>_<NNN>. The prefix is the containing equation's \label{...} if one is in scope, or n if no label is present. The suffix is a per-prefix sequential counter, zero-padded to three digits. So a labelled equation eq:bs produces eq:bs_001, eq:bs_002, and so on; an unlabelled formula produces n_001, n_002. The label prefix lets explain locate any node within a labelled equation, and the counter guarantees uniqueness even when the same LaTeX text appears more than once.

The deep_path notation

deep_path is a slash-prefixed sequence of <tag>[<index>] segments locating the node in the Deep AST. It is not JSONPath.

<tag> is the Chelis Deep node tag: def, app, let, bind, if, var, lit, access, params, and so on.
<index> is the zero-based position of the entry within its parent's child list, counting after the tag and the metadata map. In (app {meta} (var mul) (var a) (var b)), the (var mul) is at app[0], (var a) at app[1], and (var b) at app[2].
/__top[i] is the synthetic root segment for the i-th top-level expression in the program.

The notation is unambiguous and parseable: split on / and walk the Deep AST element by element. It matches the positional list shape of Chelis Deep, where JSONPath, which assumes named keys, would not.

How provenance survives compilation

Octant emits Deep in the canonical three-tuple form (tag {meta} ...children). Each node that originates from the source carries its stable identifier in the metadata map as {span: "<id>"}:

(def {span: "eq1_001"}
  call_price
  (fn {span: "eq1_002"}
    (params {} ...)
    ...
    (app {span: "eq1_006"}
      (var {} sub)
      (app {span: "eq1_007"} (var {} mul) (var {} s) ...)
      (app {span: "eq1_008"} ...))))

The {span: "..."} entry carries the identifier; the full span data lives in the sidecar manifest. The Chelis compiler preserves these identifiers through every pass. Parsing keeps them on AST nodes, type checking copies them to typed nodes, lowering copies them to RISC IR nodes, and code generation emits them as comments in the generated host source.

The result is that a runtime trace of the binary references identifiers that point back into the LaTeX source. A model-validation team examining a risk number can trace the computation through the binary back to the original formula in the original notation. "Show me where this number comes from" has a mechanical answer.

Tracing with `explain`

Given an identifier from the manifest, explain prints the LaTeX it came from:

octant explain references/black_scholes_call.tex --target n_002

The output reports the byte range, the line and column position, the literal source slice, the equation label if present, and a few lines of surrounding context with the matched line marked. Translate first to produce the manifest, then read identifiers from it. If you pass an identifier that is not in the manifest, explain reports the error and lists valid identifiers.

Nautilus references

When the source uses a function that lives in a Nautilus module, for example \Phi, mapping to Nautilus.Distributions.normal_cdf, the way Octant emits the reference depends on --emit-nautilus.

Chelis Deep symbols are restricted to a single bare name, so a dotted target like Nautilus.Distributions.normal_cdf cannot appear as one symbol. There are two ways to encode it.

In the default stub mode, the emitter walks the body, collects every distinct Nautilus reference, and prepends a top-level identity-stub def for each at the file top:

(def {span: "__synthesized_nautilus_stub__"}
  normal_cdf
  (fn {span: "__synthesized_nautilus_stub__"}
    (params {span: "__synthesized_nautilus_stub__"}
      (x_1 {type: (t-prim {} f32)}))
    (var {} x_1)))

The stub's parameter count matches the call-site arity. Its body returns the first argument unchanged. Each call site is rewritten to a bare (var {} normal_cdf). The emitted Deep is self-contained and compiles through the Chelis Deep build path.

The stub returns its first argument unchanged, so the generated code is correct for the audit-chain claim but numerically a placeholder if executed standalone. Customers link the generated code against a real Nautilus implementation at deploy time, which shadows the stub with the actual function. Octant prints a loud warning to stderr whenever it inlines a stub.

The stub-introduced nodes carry a reserved synthesized-marker span, __synthesized_nautilus_stub__. The manifest gains one entry per stub with that marker as its identifier and a zero byte range; the marker entries have no LaTeX origin and are exempt from the byte-exact rule that applies to every LaTeX-derived entry.

The --emit-nautilus access opt-out reverts to the access-chain callee form, (access (access (var Nautilus) Distributions) normal_cdf), with no stubs and no warning. It suits callers that resolve the Nautilus package through a Chelis-side build pipeline.

In both modes the call-site application node keeps its LaTeX-derived span, so the reference is traceable back to the byte range of the original \Phi(...) regardless of which emission mode you use.