
P159 — AIEP — Evidence Language Normalisation Protocol

Publication Date: 2026-03-27
Status: Open Source Prior Art Disclosure
Licence: Apache License 2.0
Author/Organisation: Phatfella Ltd
Schema: AIEP_OS_SPEC_TEMPLATE v1.0.1 (https://aiep.dev/schemas/aiep-os-spec-template/v1.0.1)

Framework Context

[0001] This disclosure operates within an Architected Instruction and Evidence Protocol (AIEP) environment as defined in United Kingdom patent application number GB2519711.2, filed 20 November 2025, the entire contents of which are incorporated herein by reference.

[0002] The present disclosure defines a protocol for normalising evidence artefacts authored in non-primary languages into canonical working-language representations. The protocol preserves the original-language content, records the translation operation as a typed ProvenanceLink in the Evidence Provenance Chain (P150), and attaches language metadata to each EvidenceNode. Together these allow AIEP-governed reasoning systems to reason equitably across multilingual evidence corpora without privileging monolingual sources.

Field of the Disclosure

[0003] This disclosure relates to evidence language normalisation protocols for governed artificial intelligence reasoning systems operating across multilingual evidence corpora.

[0004] More particularly, the disclosure concerns a LanguageDetection step, a TranslationAdapter interface, a TRANSLATE-typed ProvenanceLink schema extension, a bilingual EvidenceNode storage model retaining the original-language text alongside the translated representation, and a LanguageParity weight correction mechanism preventing under-representation of non-primary-language evidence in reasoning confidence calculations.

Background

[0005] AIEP-governed reasoning systems ingest evidence from global sources. A substantial fraction of high-quality scientific, governmental, and institutional evidence is authored primarily or exclusively in languages other than the reasoning system’s working language. Without a language normalisation protocol, such evidence is either excluded from reasoning — reducing evidence diversity — or ingested without normalisation, creating inconsistencies in how the normalisation pipeline (P10) processes text across artefacts.

[0006] Translation is an inherently lossy transformation: meaning may be partially altered, idiomatic constructs may not have direct equivalents, and technical terminology may differ across language traditions. The AIEP deterministic evidence model requires that this lossy transformation be explicitly recorded as a ProvenanceLink (P150) of type TRANSLATE, enabling any downstream system to inspect whether a given artefact is a translation and to retrieve the original-language text if required.

[0007] No existing AIEP specification defines language detection, translation adapter integration, the TRANSLATE ProvenanceLink extension, or the bilingual storage model. Cross-language reasoning is a structural gap in the core evidence stack.

Summary of the Disclosure

[0008] Language Detection: The LanguageDetection step, applied in the AIEP normalisation pipeline (P10) before any text processing, determines the BCP 47 language tag of the evidence artefact’s primary text content. The detected language tag is stored in the EvidenceNode language field. Where language detection confidence is below a configurable threshold (default: 0.85), the language field is set to "und" (undetermined) and the artefact is flagged for manual review.
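The threshold rule in [0008] can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `DetectionOutcome` and `resolve_language` are hypothetical, and the detector that produces the raw tag and confidence is assumed to exist elsewhere.

```python
from dataclasses import dataclass

UNDETERMINED = "und"  # tag used when the language cannot be determined


@dataclass(frozen=True)
class DetectionOutcome:
    """Hypothetical container for the result of the LanguageDetection step."""
    language: str              # value to store in EvidenceNode.language
    needs_manual_review: bool  # True when the artefact should be flagged


def resolve_language(detected_tag: str, confidence: float,
                     threshold: float = 0.85) -> DetectionOutcome:
    """Apply the configurable confidence threshold to a detector's raw output."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < threshold:
        # Below the threshold: store "und" and flag for manual review.
        return DetectionOutcome(UNDETERMINED, needs_manual_review=True)
    return DetectionOutcome(detected_tag, needs_manual_review=False)
```

A confidence exactly equal to the threshold is accepted here, since the text only flags values below it.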

[0009] TranslationAdapter Interface: The TranslationAdapter is a pluggable interface with one required method:

translate(text: str, source_language: BCP47Tag, target_language: BCP47Tag,
          context_domain: str | None) -> TranslationResult

A TranslationResult contains: translated_text, source_language, target_language, engine_id (identifier of the translation engine or model used), engine_version, translation_confidence (float 0–1), and translation_timestamp. The TranslationAdapter interface is engine-agnostic; implementations may use neural machine translation services, dedicated domain-trained models, or rule-based systems.
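One way to express this interface in Python is a `typing.Protocol` paired with a frozen dataclass for the result. The sketch below assumes BCP 47 tags are carried as plain strings. `GlossaryAdapter` is a hypothetical toy engine included only to show that any object with a matching `translate` method satisfies the interface; its engine identifier and version are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Protocol

BCP47Tag = str  # assumption: tags are plain strings, not a validated type


@dataclass(frozen=True)
class TranslationResult:
    translated_text: str
    source_language: BCP47Tag
    target_language: BCP47Tag
    engine_id: str
    engine_version: str
    translation_confidence: float  # 0.0 to 1.0
    translation_timestamp: datetime


class TranslationAdapter(Protocol):
    def translate(self, text: str, source_language: BCP47Tag,
                  target_language: BCP47Tag,
                  context_domain: Optional[str]) -> TranslationResult: ...


class GlossaryAdapter:
    """Hypothetical rule-based engine: whole-string glossary lookup."""

    def __init__(self, glossary: dict[tuple[str, str], dict[str, str]]):
        self._glossary = glossary

    def translate(self, text, source_language, target_language,
                  context_domain=None):
        table = self._glossary.get((source_language, target_language), {})
        hit = table.get(text)
        return TranslationResult(
            translated_text=hit if hit is not None else text,
            source_language=source_language,
            target_language=target_language,
            engine_id="glossary-demo",   # illustrative identifier
            engine_version="0.0.1",      # illustrative version
            # A miss returns the input unchanged with zero confidence.
            translation_confidence=1.0 if hit is not None else 0.0,
            translation_timestamp=datetime.now(timezone.utc),
        )
```

Because the protocol is structural, a neural or domain-trained engine can be swapped in without the caller changing, provided it returns a `TranslationResult`.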

[0010] TRANSLATE ProvenanceLink: Following translation, a ProvenanceLink with transform_type = TRANSLATE is appended to the EvidenceNode’s ProvenanceChain (P150). The transform_params field of this link contains:

source_language — BCP 47 tag of the original text
target_language — BCP 47 tag of the translated text
engine_id — translation engine identifier
engine_version — engine version string
translation_confidence — confidence score from the TranslationResult
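As an illustration, the link could be assembled as a plain dictionary. The full ProvenanceLink schema is defined in P150 and is not reproduced in this disclosure, so the sketch below shows only the two fields named above; the function name `build_translate_link` and the dictionary representation are assumptions.

```python
from typing import Any


def build_translate_link(source_language: str, target_language: str,
                         engine_id: str, engine_version: str,
                         translation_confidence: float) -> dict[str, Any]:
    """Build the TRANSLATE-specific part of a ProvenanceLink (sketch only)."""
    if not 0.0 <= translation_confidence <= 1.0:
        raise ValueError("translation_confidence must lie in [0, 1]")
    return {
        "transform_type": "TRANSLATE",
        "transform_params": {
            "source_language": source_language,
            "target_language": target_language,
            "engine_id": engine_id,
            "engine_version": engine_version,
            "translation_confidence": translation_confidence,
        },
    }


# Example: a French artefact translated into English.
link = build_translate_link("fr", "en", "demo-engine", "1.0", 0.92)
```

Any further fields that P150 requires on every link (identifiers, hashes, timestamps) would need to be added by the caller.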

[0011] Bilingual Storage Model: The EvidenceNode schema is extended with two content fields: content_original (the unmodified original-language text) and content_canonical (the working-language translation or, for artefacts already in the working language, a copy of the original). All AIEP reasoning operations reference content_canonical. Any human review or audit operation may access content_original via the EvidenceNode or via the OriginLink in the ProvenanceChain.
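A minimal sketch of the bilingual model follows, showing only the three fields discussed in this disclosure. The helper `make_node` is hypothetical, and comparing language tags by exact string equality is a simplifying assumption; a real node carries further fields defined elsewhere in the AIEP stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceNode:
    language: str           # BCP 47 tag of the original text
    content_original: str   # unmodified original-language text
    content_canonical: str  # working-language text used for reasoning


def make_node(text: str, language: str, working_language: str = "en",
              translated_text: Optional[str] = None) -> EvidenceNode:
    """Populate both content fields for a newly ingested artefact."""
    if language == working_language:
        # Already in the working language: canonical equals original.
        return EvidenceNode(language, text, text)
    if translated_text is None:
        raise ValueError("a non-working-language artefact needs a translation")
    return EvidenceNode(language, text, translated_text)


french = make_node("Bonjour", "fr", translated_text="Hello")
english = make_node("Hello", "en")
```

Keeping the node immutable in this sketch means the original text cannot be overwritten once the node has been built.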

[0012] Language Parity Weight Correction: Evidence corpora that skew heavily towards a single language may produce reasoning outputs that over-weight evidence from that language tradition. The LanguageParity component applies a language_diversity_weight correction to the confidence contribution of each evidence artefact in a reasoning chain, computed as:

language_diversity_weight = 1.0 / (corpus_fraction_for_language × scaling_factor)

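where corpus_fraction_for_language is the proportion of artefacts in the active corpus sharing the same source language as this artefact, and scaling_factor is a governance-configurable constant (default: 1.0). Setting scaling_factor to 0.0 disables the correction; because the expression is undefined at 0.0, an implementation should treat that value as a bypass (for example, a fixed weight of 1.0) rather than evaluate the division. This correction prevents monolingual over-representation without discarding evidence from any language.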
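The weight computation can be sketched per language over a corpus of source-language tags. The function name `language_diversity_weights` is hypothetical. Treating a scaling_factor of 0.0 as a bypass that yields a weight of 1.0 is an assumption of this sketch, since the formula itself cannot be evaluated at that value.

```python
from collections import Counter
from typing import Iterable


def language_diversity_weights(source_languages: Iterable[str],
                               scaling_factor: float = 1.0) -> dict[str, float]:
    """Return language_diversity_weight for each language in the corpus."""
    if scaling_factor < 0.0:
        raise ValueError("scaling_factor must be non-negative")
    counts = Counter(source_languages)
    if scaling_factor == 0.0:
        # Assumption: 0.0 disables the correction, so every weight is 1.0.
        return {lang: 1.0 for lang in counts}
    total = sum(counts.values())
    return {
        lang: 1.0 / ((n / total) * scaling_factor)
        for lang, n in counts.items()
    }


# Example: a corpus of eight English and two French artefacts.
weights = language_diversity_weights(["en"] * 8 + ["fr"] * 2)
```

With the default scaling_factor, the English artefacts (fraction 0.8) receive a weight of 1.25 and the French artefacts (fraction 0.2) a weight of 5.0, so the less-represented language contributes more per artefact.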

[0013] Working Language Configuration: The AIEP node’s working_language configuration parameter (BCP 47 tag, default: "en") specifies the target language for all translation operations. Where an artefact is already in the working language, the LanguageDetection step sets content_canonical = content_original and no TranslationAdapter call is made.

ASCII Architecture

Raw Evidence Artefact
        │
        ▼
┌─────────────────────┐
│  LanguageDetection  │──▶ EvidenceNode.language = "fr"
│  (BCP 47 detection) │
└──────────┬──────────┘
           │ (non-working-language)
           ▼
┌─────────────────────┐
│  TranslationAdapter │──▶ TranslationResult
│  (pluggable engine) │    { translated_text,
│                     │      engine_id, confidence }
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐    ┌────────────────────────┐
│  EvidenceNode       │    │  ProvenanceChain (P150) │
│  .content_original  │───▶│  TRANSLATE link:        │
│  .content_canonical │    │  src=fr, tgt=en,        │
│  .language = "fr"   │    │  engine, confidence     │
└──────────┬──────────┘    └────────────────────────┘
           │
           ▼
┌─────────────────────┐
│  LanguageParity     │
│  Weight Correction  │
└─────────────────────┘

Operational Detail

[0014] Domain Context: The context_domain parameter passed to the TranslationAdapter should be populated with the primary taxonomy code (P160) of the artefact, where available. Domain-aware translation engines use this to select domain-specific terminology models, improving translation accuracy for specialised corpora (medical, legal, scientific).

[0015] Low-Confidence Translation Handling: Where translation_confidence falls below a configurable threshold (default: 0.65), the EvidenceNode is tagged with translation_quality = LOW and the artefact’s contribution to any reasoning chain confidence score is reduced by a configurable penalty factor (default: 0.5). The artefact is not excluded from reasoning, ensuring no evidence is silently lost.
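The handling rule in [0015] might look like the sketch below. It assumes the penalty is multiplicative (the contribution is multiplied by the penalty factor), which is one reading of "reduced by a configurable penalty factor". The function name and the use of `None` for artefacts that are not tagged LOW are also assumptions.

```python
from typing import Optional


def apply_translation_penalty(base_contribution: float,
                              translation_confidence: float,
                              low_threshold: float = 0.65,
                              penalty_factor: float = 0.5,
                              ) -> tuple[Optional[str], float]:
    """Return (translation_quality tag, adjusted confidence contribution)."""
    if translation_confidence < low_threshold:
        # Tagged LOW and down-weighted, but never dropped from reasoning.
        return "LOW", base_contribution * penalty_factor
    return None, base_contribution


low_tag, low_value = apply_translation_penalty(0.8, translation_confidence=0.5)
ok_tag, ok_value = apply_translation_penalty(0.8, translation_confidence=0.9)
```

In this example the low-confidence artefact keeps half of its original contribution, and the second artefact is left unchanged.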

[0016] Retranslation: Where an improved translation engine is deployed, a system administrator may trigger retranslation of artefacts with translation_quality = LOW or those translated by a named engine version. Retranslation produces a new TRANSLATE ProvenanceLink and updates content_canonical, preserving the previous translation in the ledger history.
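A rough sketch of the retranslation flow is given below. The ledger is modelled as a simple in-memory list (`canonical_history`), which stands in for whatever ledger mechanism the AIEP node actually provides. The class `TranslatedNode` and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TranslatedNode:
    content_original: str
    content_canonical: str
    engine_version: str
    translation_quality: Optional[str] = None  # "LOW" or None
    provenance_chain: list[dict[str, Any]] = field(default_factory=list)
    canonical_history: list[str] = field(default_factory=list)  # ledger stand-in


def needs_retranslation(node: TranslatedNode,
                        engine_version: Optional[str] = None) -> bool:
    """Select LOW-quality artefacts or those from a named engine version."""
    return node.translation_quality == "LOW" or (
        engine_version is not None and node.engine_version == engine_version)


def retranslate(node: TranslatedNode, new_text: str,
                new_link: dict[str, Any],
                new_quality: Optional[str] = None) -> None:
    """Replace content_canonical while keeping the previous translation."""
    node.canonical_history.append(node.content_canonical)
    node.content_canonical = new_text
    node.engine_version = new_link["transform_params"]["engine_version"]
    node.translation_quality = new_quality
    node.provenance_chain.append(new_link)  # new TRANSLATE link; old links stay
```

The original-language text is never touched, and earlier TRANSLATE links remain in the chain, so the sequence of translations stays inspectable.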

Claims-Exclusion Notice

This specification is published as open-source prior art. No patent claims are asserted by the author in respect of the mechanisms described. Any third party seeking to patent mechanisms substantially equivalent to those described herein is placed on notice of this prior art disclosure.