
P148 — AIEP — Semantic Evidence Deduplication Protocol

Publication Date: 2026-03-27
Status: Open Source Prior Art Disclosure
Licence: Apache License 2.0
Author/Organisation: Phatfella Ltd
Schema: AIEP_OS_SPEC_TEMPLATE v1.0.1 (https://aiep.dev/schemas/aiep-os-spec-template/v1.0.1)


Framework Context

[0001] This disclosure operates within an Architected Instruction and Evidence Protocol (AIEP) environment as defined in United Kingdom patent application number GB2519711.2, filed 20 November 2025, the entire contents of which are incorporated herein by reference.

[0002] The present disclosure defines a protocol for detecting semantically equivalent evidence artefacts from different sources and collapsing them into a canonical evidence record, while preserving full provenance from all contributing source artefacts. The protocol prevents redundant storage and avoids the inflated confidence scores that result from citing the same underlying fact from multiple nominally independent sources.


Field of the Disclosure

[0003] This disclosure relates to semantic deduplication protocols for verifiable evidence artefacts in governed artificial intelligence reasoning systems.

[0004] More particularly, the disclosure concerns a two-stage deduplication process: exact-hash deduplication that detects byte-for-byte identical content, followed by semantic similarity detection using fingerprint comparison to identify artefacts that differ in representation but convey the same underlying claim.


Background

[0005] AIEP evidence bases routinely contain multiple artefacts drawn from different sources that report the same underlying fact. A scientific finding may be reported in the original paper, a press release, a news article, and a blog post: four nominally independent sources conveying essentially the same claim. If reasoning chains cite all four as independent evidence, confidence computation overestimates the strength of support for the claim.

[0006] Content-hash deduplication as defined in P133 detects byte-for-byte identical artefacts but does not detect semantic equivalence across different textual representations of the same fact. A semantic deduplication layer is required to identify these cases and collapse them into a canonical record with a consolidated provenance list.

[0007] No existing AIEP specification defines a semantic similarity threshold, a canonical record selection policy, or a provenance merge procedure for semantically equivalent evidence artefacts admitted from different sources.


Summary of the Disclosure

[0008] Stage 1 — Exact Hash Deduplication: On evidence admission, the incoming content_hash is compared against all existing EvidenceIndexEntry hashes. An exact match produces a DuplicateRejectionRecord referencing the existing artefact; the incoming artefact is not admitted independently. The existing artefact’s provenance list is updated to include the incoming source URL.
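As an illustration only, Stage 1 can be sketched as a lookup keyed by content hash. The use of SHA-256, the in-memory dict index, and the field names on DuplicateRejectionRecord are assumptions made for this sketch; the disclosure does not fix them.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceIndexEntry:
    evidence_id: str
    content_hash: str
    provenance_sources: list = field(default_factory=list)


@dataclass
class DuplicateRejectionRecord:
    # Field names are hypothetical; the disclosure only says the record
    # references the existing artefact.
    existing_evidence_id: str
    rejected_source_url: str


def compute_content_hash(content: bytes) -> str:
    # Assumption: SHA-256. The actual hash function is defined elsewhere (P133).
    return hashlib.sha256(content).hexdigest()


def admit_exact(index: dict, evidence_id: str, content: bytes,
                source_url: str) -> Optional[DuplicateRejectionRecord]:
    """Return a rejection record on an exact match, else admit and return None."""
    digest = compute_content_hash(content)
    existing = index.get(digest)
    if existing is not None:
        # Exact match: do not admit independently, but record the new source.
        existing.provenance_sources.append(source_url)
        return DuplicateRejectionRecord(existing.evidence_id, source_url)
    index[digest] = EvidenceIndexEntry(evidence_id, digest, [source_url])
    return None
```

An artefact that returns None here would then proceed to the semantic comparison stage.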

[0009] Stage 2 — Semantic Fingerprint Comparison: Each evidence artefact is assigned a semantic_fingerprint — a compact vector representation of the artefact’s textual content computed using the Feature Extraction Similarity Computation Model (P55). On admission, the incoming fingerprint is compared against the fingerprints of existing artefacts in the same topic and jurisdiction class. A similarity_score is computed as the cosine similarity of the two fingerprint vectors.
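The similarity computation itself is standard cosine similarity. The sketch below assumes fingerprints are plain lists of floats and that the caller has already restricted the candidates to the same topic and jurisdiction class. The helper name best_match and the zero-vector handling are illustrative choices, not part of the disclosure, and the P55 model that produces the fingerprints is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length fingerprint vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(incoming, candidates):
    """Return (evidence_id, similarity_score) of the closest candidate.

    `candidates` maps evidence_id -> fingerprint and is assumed to be
    pre-filtered to the incoming artefact's topic and jurisdiction class.
    Returns (None, -1.0) when there are no candidates.
    """
    best_id, best_score = None, -1.0
    for evidence_id, fingerprint in candidates.items():
        score = cosine_similarity(incoming, fingerprint)
        if score > best_score:
            best_id, best_score = evidence_id, score
    return best_id, best_score
```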

[0010] A SemanticDuplicateRecord is produced when similarity_score ≥ semantic_dedup_threshold (default: 0.92). The SemanticDuplicateRecord comprises: primary_evidence_id (the existing canonical artefact); duplicate_evidence_id (the incoming artefact); similarity_score; and merge_action from the enumeration: AUTO_MERGE, FLAG_FOR_REVIEW, REJECT_DUPLICATE.
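A minimal sketch of the record and the threshold check follows. The disclosure does not state how merge_action is chosen, so the second cut-off used here (AUTO_MERGE_THRESHOLD, with scores between the two thresholds routed to review) is purely a hypothetical policy for illustration. REJECT_DUPLICATE is defined but never selected by this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MergeAction(Enum):
    AUTO_MERGE = "AUTO_MERGE"
    FLAG_FOR_REVIEW = "FLAG_FOR_REVIEW"
    REJECT_DUPLICATE = "REJECT_DUPLICATE"


@dataclass(frozen=True)
class SemanticDuplicateRecord:
    primary_evidence_id: str    # the existing canonical artefact
    duplicate_evidence_id: str  # the incoming artefact
    similarity_score: float
    merge_action: MergeAction


SEMANTIC_DEDUP_THRESHOLD = 0.92  # default stated in the disclosure
AUTO_MERGE_THRESHOLD = 0.97      # hypothetical; not defined in the disclosure


def evaluate(primary_id: str, incoming_id: str, similarity_score: float,
             dedup_threshold: float = SEMANTIC_DEDUP_THRESHOLD,
             auto_merge_threshold: float = AUTO_MERGE_THRESHOLD,
             ) -> Optional[SemanticDuplicateRecord]:
    """Produce a record only when the score meets or exceeds the threshold."""
    if similarity_score < dedup_threshold:
        return None  # not a semantic duplicate; admit independently
    # Assumed policy: very high scores merge automatically, the rest are reviewed.
    if similarity_score >= auto_merge_threshold:
        action = MergeAction.AUTO_MERGE
    else:
        action = MergeAction.FLAG_FOR_REVIEW
    return SemanticDuplicateRecord(primary_id, incoming_id,
                                   similarity_score, action)
```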

[0011] Canonical Record Selection: When merge_action = AUTO_MERGE, the existing artefact with the higher trust_score is designated the canonical record. The incoming artefact’s source_url and trust_score are appended to the canonical record’s provenance_sources array. The incoming artefact is not independently admitted. The canonical record’s provenance_count is incremented.
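The auto-merge step might look like the sketch below. It reads "the existing artefact with the higher trust_score" as a choice among existing artefacts that matched the incoming one, which is one possible interpretation. It also assumes that a canonical record starts with provenance_count = 1 and lists its own original source. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceSource:
    source_url: str
    trust_score: float


@dataclass
class CanonicalRecord:
    evidence_id: str
    trust_score: float
    provenance_sources: list = field(default_factory=list)
    provenance_count: int = 1  # assumption: counts the record's own source


def select_canonical(matching_existing):
    """Pick the matching existing artefact with the highest trust_score."""
    return max(matching_existing, key=lambda rec: rec.trust_score)


def auto_merge(canonical: CanonicalRecord, incoming_source_url: str,
               incoming_trust_score: float) -> CanonicalRecord:
    """Fold the incoming artefact's provenance into the canonical record.

    The incoming artefact itself is not admitted; only its source URL and
    trust score are retained.
    """
    canonical.provenance_sources.append(
        ProvenanceSource(incoming_source_url, incoming_trust_score))
    canonical.provenance_count += 1
    return canonical
```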

[0012] Confidence Correction: Reasoning chains that cite a canonical record with provenance_count > 1 are eligible for a provenance diversity bonus, an increase in confidence applied only if the provenance sources are from genuinely independent domains (detected by comparing registered domain names). If provenance_sources contains multiple URLs from the same registered domain, no diversity bonus is applied.
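The eligibility rule can be sketched as follows. Two points are assumptions: the bonus magnitude (0.05 here) is not specified in the disclosure, and the registered domain is approximated by the last two host labels. That heuristic is wrong for suffixes such as co.uk, so a real deployment would more likely consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def registered_domain(url: str) -> str:
    """Naive registered-domain extraction: the last two labels of the host.

    Simplification for illustration only; multi-label public suffixes
    (for example co.uk) are not handled.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def diversity_bonus(source_urls, bonus: float = 0.05) -> float:
    """Return the bonus if every source is on a distinct registered domain.

    `bonus` is a hypothetical magnitude; the disclosure does not define one.
    """
    if len(source_urls) < 2:
        return 0.0  # provenance_count must exceed 1
    domains = [registered_domain(u) for u in source_urls]
    if len(set(domains)) < len(domains):
        return 0.0  # at least two sources share a registered domain
    return bonus
```

For example, two URLs on news.example.org and blog.example.org share the registered domain example.org under this heuristic and so earn no bonus.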

[0013] Review Queue: When merge_action = FLAG_FOR_REVIEW, the SemanticDuplicateRecord is placed in the deduplication review queue. A governance operator may confirm the merge, reject the proposed deduplication (preserving both artefacts as independent), or split an existing canonical record back into independent artefacts.
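The three operator actions could be modelled roughly as below. The in-memory store, the PendingReview wrapper, and the identifier scheme used when splitting are all hypothetical. The sketch also tracks provenance URLs only and says nothing about how the original artefact content is retained for a later split.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Record:
    evidence_id: str
    provenance_sources: list = field(default_factory=list)


@dataclass
class PendingReview:
    primary: Record
    duplicate: Record
    similarity_score: float


review_queue = deque()  # FIFO queue of PendingReview items


def confirm_merge(item: PendingReview, store: dict) -> None:
    """Operator confirms: fold the duplicate's sources into the primary."""
    item.primary.provenance_sources.extend(item.duplicate.provenance_sources)
    store[item.primary.evidence_id] = item.primary
    store.pop(item.duplicate.evidence_id, None)


def reject_dedup(item: PendingReview, store: dict) -> None:
    """Operator rejects: both artefacts are kept as independent evidence."""
    store[item.primary.evidence_id] = item.primary
    store[item.duplicate.evidence_id] = item.duplicate


def split_canonical(record: Record, store: dict, new_id_for) -> list:
    """Operator splits a canonical record into one record per source.

    `new_id_for(i)` is a caller-supplied, hypothetical ID scheme.
    """
    store.pop(record.evidence_id, None)
    parts = [Record(new_id_for(i), [url])
             for i, url in enumerate(record.provenance_sources)]
    for part in parts:
        store[part.evidence_id] = part
    return parts
```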

[0014] Semantic Dedup Threshold Governance: The semantic_dedup_threshold is a schema-governed parameter. Nodes may configure per-source-class thresholds to reflect different tolerance for semantic variation in different evidence domains. Threshold changes are recorded in the Reasoning Ledger as governance amendments.
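One way to picture per-source-class thresholds with recorded amendments is sketched below. The list standing in for the Reasoning Ledger, the shape of the amendment entries, and the source-class names are assumptions; the actual ledger format is defined elsewhere in AIEP.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ThresholdConfig:
    default: float = 0.92  # default semantic_dedup_threshold from the disclosure
    per_source_class: dict = field(default_factory=dict)
    ledger: list = field(default_factory=list)  # stand-in for the Reasoning Ledger

    def threshold_for(self, source_class: str) -> float:
        """Per-class override if configured, otherwise the default."""
        return self.per_source_class.get(source_class, self.default)

    def amend(self, source_class: str, new_value: float, operator: str) -> None:
        """Change a threshold and record the change as a governance amendment."""
        if not 0.0 <= new_value <= 1.0:
            raise ValueError("threshold must lie in [0, 1]")
        old_value = self.threshold_for(source_class)
        self.per_source_class[source_class] = new_value
        # Hypothetical amendment entry; field names are illustrative.
        self.ledger.append({
            "type": "governance_amendment",
            "parameter": "semantic_dedup_threshold",
            "source_class": source_class,
            "old_value": old_value,
            "new_value": new_value,
            "operator": operator,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
```

A node might, for instance, set a stricter threshold for a hypothetical "legal_text" class, where small wording differences can change the claim being made.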


Technical Effect

[0015] Two-stage deduplication — exact hash followed by semantic fingerprint — collapses both byte-identical and semantically equivalent artefacts without requiring a centralised deduplication authority, as each node independently computes fingerprints and similarity scores.

[0016] The provenance diversity bonus corrects confidence inflation from nominally multi-source citation of a single underlying fact, while still allowing legitimate multi-perspective corroboration from genuinely independent sources to contribute to confidence.

[0017] Routing borderline similarity scores to the review queue, rather than automatically rejecting or merging them, prevents false positives that would incorrectly collapse genuinely distinct artefacts with high surface-level similarity.


Claims

  1. A semantic evidence deduplication protocol for a governed AI reasoning system, the protocol comprising: exact content hash deduplication rejecting byte-for-byte identical incoming artefacts and updating the existing artefact’s provenance list; semantic fingerprint comparison computing a similarity score between incoming and existing artefacts in the same topic and jurisdiction class; and a SemanticDuplicateRecord produced when the similarity score meets or exceeds a configurable semantic dedup threshold.

  2. The protocol of claim 1, wherein canonical record selection on auto-merge designates the artefact with the higher trust score as the canonical record and appends the incoming source URL to the canonical record’s provenance sources array, preserving full multi-source provenance without independent admission of the duplicate.

  3. The protocol of claim 1, wherein reasoning chains citing a canonical record with multiple provenance sources are eligible for a provenance diversity bonus only when the provenance sources are from genuinely independent registered domains, preventing confidence inflation from citing the same underlying fact across multiple pages on the same domain.

  4. The protocol of claim 1, wherein artefacts crossing the similarity threshold with merge action FLAG_FOR_REVIEW are placed in a deduplication review queue, and wherein a governance operator may confirm the merge, reject the proposed deduplication, or split an existing canonical record back into independent artefacts.


Brief Description of the Drawing

FIG. 1 — Two-stage deduplication pipeline: incoming artefact → exact hash check → semantic fingerprint comparison → SemanticDuplicateRecord → canonical record merge or review queue.

FIG. 2 — Canonical record provenance structure showing primary evidence entry with provenance_sources array and provenance_count field.


Abstract

A semantic evidence deduplication protocol prevents inflated confidence scoring from nominally independent sources reporting the same underlying fact. A two-stage pipeline performs exact content hash deduplication followed by semantic fingerprint similarity comparison. When similarity meets the configured threshold, a SemanticDuplicateRecord is produced and the incoming artefact is merged into the canonical record or flagged for governance review. Canonical records carry a consolidated provenance list, and reasoning chains citing them receive a provenance diversity bonus only when provenance sources are from genuinely independent domains.


Licence

Apache License 2.0 — https://www.apache.org/licenses/LICENSE-2.0

Copyright 2026 Phatfella Ltd. Licensed under the Apache License, Version 2.0. You may use this specification in compliance with the Licence.

Dependencies