Evidence normalisation

Evidence arrives in many forms. Two organisations publishing the same fact may use different key ordering, different encoding, different whitespace, different locale conventions for numbers and dates. Without normalisation, the same semantic content produces two different cryptographic hashes — defeating reproducible verification.

AIEP’s deterministic normalisation engine (P10) solves this by converting all evidence to a canonical form before hashing, indexing, or cross-node processing. The key property: two independent normalisers processing semantically equivalent source material must produce bit-identical output.


The problem with non-deterministic normalisation

| Issue | What goes wrong |
| --- | --- |
| Key ordering variance | `{"a":1,"b":2}` and `{"b":2,"a":1}` produce different hashes despite being identical |
| Whitespace variance | Additional spaces, newlines, or indentation alter the hash |
| Encoding variance | UTF-8 vs UTF-16, BOM presence, normalisation form (NFC vs NFD) |
| Locale variance | `1,234.56` vs `1.234,56` for the same number |
| Timestamp variance | `2026-03-05T09:00:00Z` vs `2026-03-05T09:00:00+00:00` |
| Container metadata | Spreadsheet internal metadata, PDF creator fields, EXIF data |

Any of these variances produces a different hash for semantically identical content. Non-deterministic normalisation makes distributed verification impossible — two nodes independently hashing the same evidence produce different identifiers.
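The first two rows of the table can be demonstrated directly: serialising with sorted keys and no insignificant whitespace collapses the variants to a single byte sequence. A minimal sketch — the `canonicalise` helper is illustrative, not AIEP's engine:

```python
import hashlib
import json

def canonicalise(obj) -> bytes:
    """Toy canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")

a = json.loads('{"a": 1, "b": 2}')
b = json.loads('{"b": 2,  "a": 1}')

# The raw texts differ, so their hashes differ...
assert (hashlib.sha256(b'{"a": 1, "b": 2}').hexdigest()
        != hashlib.sha256(b'{"b": 2,  "a": 1}').hexdigest())
# ...but the canonical forms are bit-identical.
assert canonicalise(a) == canonicalise(b)
```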


The NormalisationProfile

The normalisation engine operates under a version-bound NormalisationProfile — a declared, schema-pinned set of rules that fully specifies how any input of a given type is to be normalised.

A NormalisationProfile declares:

| Rule set | What it specifies |
| --- | --- |
| Parsing rules | How to parse each supported input type |
| Encoding rules | Target encoding (UTF-8 NFC, no BOM) and encoding detection logic |
| Locale rules | Canonical number, date, and currency representation |
| Key ordering rules | Lexicographic stable sort of all object keys |
| Whitespace rules | Elimination of insignificant whitespace |
| Container metadata stripping | Which metadata fields to strip from container formats |
| Lossless constraints | Which transformations are permitted — lossy transformations are rejected |

The profile is versioned and pinned. A normaliser running ProfileVersion 1.0.3 and another normaliser also running 1.0.3 will produce identical output from identical input — regardless of operating system, locale, or hardware.
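A pinned profile can be sketched as an immutable value object: two normalisers configured with the same version compare equal, so agreement on the rule set is checkable. Field names here are illustrative assumptions, not AIEP's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NormalisationProfile:
    """Version-bound rule set; immutable so a pinned version cannot drift."""
    version_id: str                  # e.g. "1.0.3" — also fed into the hash input
    target_encoding: str = "utf-8"   # UTF-8 NFC, no BOM
    unicode_form: str = "NFC"
    sort_keys: bool = True           # lexicographic stable sort of object keys
    strip_whitespace: bool = True    # eliminate insignificant whitespace
    stripped_metadata: tuple = ()    # container metadata fields to remove

# Two independent normalisers pinned to the same version share identical rules.
assert NormalisationProfile(version_id="1.0.3") == NormalisationProfile(version_id="1.0.3")
```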


The normalisation pipeline

Input object
    → InputType detection (declared type + signature classification)
    → NormalisationProfile selection (version-bound)
    → Deterministic parsing under profile rules
    → CanonicalForm generation
    → CanonicalHash = SHA-256(CanonicalForm || ProfileVersionId)
    → NormalisationManifest generation
    → Acceptance or fail-closed rejection

The CanonicalHash formula:

$$\text{CanonicalHash} = H(\text{CanonicalForm} \,\Vert\, \text{ProfileVersionId})$$

The profile version is included in the hash input. A document normalised under profile 1.0.2 produces a different hash from the same document normalised under 1.0.3 — even if the canonical forms are identical. This makes profile version changes detectable.
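The version-binding step can be shown in a few lines. The concatenation order and the UTF-8 encoding of the version id are assumptions for illustration; the property that matters is that identical canonical forms under different profile versions hash differently:

```python
import hashlib

def canonical_hash(canonical_form: bytes, profile_version_id: str) -> str:
    """CanonicalHash = SHA-256(CanonicalForm || ProfileVersionId)."""
    return hashlib.sha256(
        canonical_form + profile_version_id.encode("utf-8")
    ).hexdigest()

form = b'{"a":1,"b":2}'
h_102 = canonical_hash(form, "1.0.2")
h_103 = canonical_hash(form, "1.0.3")
assert h_102 != h_103  # same canonical form, different profile version → different hash
```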


Fail-closed rejection

The normalisation engine does not attempt to recover from ambiguous inputs. If a transformation cannot be performed deterministically under the declared profile — if it would be lossy, or if it depends on parameters not declared in the profile — the upload is rejected and a deterministic rejection record is generated.

The rejection record identifies:

  • Which constraint was violated
  • At which transformation step
  • The input type and profile version in force

This record is itself appended to the evidence ledger. The rejection is auditable — you can see not just what evidence was accepted but what was refused and why.
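A rejection record carrying the three identifiers above might look like the following; the field names and constraint labels are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RejectionRecord:
    """Deterministic fail-closed rejection record (illustrative field names)."""
    violated_constraint: str   # which constraint was violated
    failed_step: str           # at which transformation step
    input_type: str            # the input type in force
    profile_version_id: str    # the profile version in force

record = RejectionRecord(
    violated_constraint="lossless",        # transformation would discard information
    failed_step="locale-normalisation",
    input_type="text/csv",
    profile_version_id="1.0.3",
)
```

Because the record is a pure function of the input and the profile, two nodes rejecting the same upload produce the same record — keeping even refusals reproducible.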


NormalisationManifest

Every accepted normalisation produces a NormalisationManifest that binds:

| Field | Contents |
| --- | --- |
| inputType | Detected and declared input type |
| profileVersionId | The NormalisationProfile version applied |
| transformationSteps | Ordered list of transformation step identifiers |
| canonicalHash | SHA-256 of the canonical form under the profile |
| timestamp | Recorded as data, not as system time dependency |

The manifest is itself hash-bound and appended to the evidence ledger. This means the normalisation process itself is auditable — any node can replay the transformation and verify that the same manifest would result.
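Hash-binding the manifest can be sketched by serialising it canonically (the same sorted-key, compact-separator discipline applied to evidence) and hashing the result. The structure mirrors the table above; the serialisation details are assumptions:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class NormalisationManifest:
    input_type: str
    profile_version_id: str
    transformation_steps: tuple   # ordered step identifiers
    canonical_hash: str
    timestamp: str                # recorded as data, not read from the system clock

def manifest_hash(m: NormalisationManifest) -> str:
    """Hash-bind the manifest so it can itself be appended to the ledger."""
    body = json.dumps(asdict(m), sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(body).hexdigest()
```

Since the manifest bytes are themselves canonical, any node replaying the transformation derives the same manifest hash.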


Cross-jurisdiction normalisation (P17)

Evidence originating in different jurisdictions may use jurisdiction-specific conventions — legal date formats, currency representations, statutory reference formats, regulatory identifier schemes. P17 extends the NormalisationProfile model to multi-jurisdiction normalisation profiles.

A multi-jurisdiction profile declares:

  • The source jurisdiction
  • The target canonical representation
  • Jurisdiction-specific parsing rules for known format variants
  • A deterministic mapping from jurisdiction-specific identifiers to canonical identifiers

This enables evidence from a UK regulatory filing and a US regulatory filing describing the same underlying fact to be normalised to the same canonical form — and therefore to the same hash — enabling cross-jurisdiction comparison.
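One concrete jurisdiction-specific rule is date order: UK filings conventionally write day-first, US filings month-first. A sketch of a deterministic mapping to a canonical ISO 8601 form — the format table is a simplified assumption, not P17's actual rule set:

```python
from datetime import datetime

# Illustrative jurisdiction-specific parsing rules.
JURISDICTION_DATE_FORMATS = {
    "UK": "%d/%m/%Y",   # day-first
    "US": "%m/%d/%Y",   # month-first
}

def canonical_date(raw: str, jurisdiction: str) -> str:
    """Map a jurisdiction-specific date to the canonical ISO 8601 form."""
    fmt = JURISDICTION_DATE_FORMATS[jurisdiction]  # unknown jurisdiction → KeyError (fail closed)
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")

# The same underlying fact, written per local convention, normalises identically.
assert canonical_date("05/03/2026", "UK") == canonical_date("03/05/2026", "US") == "2026-03-05"
```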


Deterministic replay

Any normalisation can be replayed deterministically. Given:

  • The original input object
  • The NormalisationManifest (which records the profile version and transformation steps)

Any node can re-run the normalisation and verify that the resulting CanonicalHash matches the manifest. This is the basis of cross-node verification in distributed AIEP deployments.
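Replay verification reduces to recomputing the hash and comparing it against the manifest. A sketch with a stand-in normalisation engine (the real engine is driven by the profile; the toy function and the dict-shaped manifest are assumptions):

```python
import hashlib

def verify_replay(input_bytes: bytes, manifest: dict, normalise) -> bool:
    """Re-run the normalisation recorded in a manifest and check the hash matches.

    `normalise` stands in for the deterministic engine:
    (input bytes, profile version id) -> canonical form bytes.
    """
    canonical_form = normalise(input_bytes, manifest["profileVersionId"])
    recomputed = hashlib.sha256(
        canonical_form + manifest["profileVersionId"].encode("utf-8")
    ).hexdigest()
    return recomputed == manifest["canonicalHash"]

# Toy engine: collapse whitespace. Any deterministic function works for the sketch.
def toy_normalise(data: bytes, _version: str) -> bytes:
    return b"".join(data.split())

manifest = {
    "profileVersionId": "1.0.3",
    "canonicalHash": hashlib.sha256(b'{"a":1}' + b"1.0.3").hexdigest(),
}
assert verify_replay(b'{ "a": 1 }', manifest, toy_normalise)
```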

