P119 — AIEP — Multimodal Document Ingestion Protocol

Publication Date: 2026-03-01
Status: Open Source Prior Art Disclosure
Licence: Apache License 2.0
Author/Organisation: Phatfella Ltd
Schema: AIEP_OS_SPEC_TEMPLATE v1.0.1 (https://aiep.dev/schemas/aiep-os-spec-template/v1.0.1)


Framework Context

[0001] This disclosure operates within an Architected Instruction and Evidence Protocol (AIEP) environment as defined in United Kingdom patent application numbers GB2519711.2 and GB2519798.9, both filed 20 November 2025, the entire contents of which are incorporated herein by reference.

[0002] The present disclosure defines a governed document ingestion protocol by which documents in multiple formats (PDF, DOCX, and plain text) are ingested into a PIEA evidence substrate, normalised to canonical evidence artefacts, and committed to the Evidence Ledger with a full provenance chain.


Field of the Disclosure

[0003] This disclosure relates to governed artificial intelligence evidence substrates that accept user-supplied documents as source artefacts for evidence-grounded inference.

[0004] More particularly, the disclosure concerns a multimodal ingestion pipeline that: accepts file uploads across defined MIME types; extracts plain text from binary formats; normalises the extracted text to the AIEP canonical evidence artefact schema; applies the source integrity inspection defined in P124 where applicable; commits the resulting EvidenceRef to the session evidence store; and returns an ingestion_hash committing the ingestion event.


Background

[0005] Evidence-grounded AI systems that operate only on web-retrieved sources cannot operate on private organisational documents — contracts, reports, policies, meeting transcripts — that are not publicly accessible. Meaningful enterprise AI assistance requires the ability to ingest documents supplied directly by the user.

[0006] Existing AI file ingestion approaches typically: (a) treat uploaded content as part of the conversation prompt rather than as governed evidence artefacts; (b) do not generate provenance records for ingested content; (c) do not apply confidence tier classification to ingested content; and (d) do not commit an ingestion event that enables independent verification that a specific document contributed to a specific response.


Summary of the Disclosure

[0007] The Ingestion Pipeline accepts file uploads via multipart/form-data at the /ingest endpoint. Supported MIME types are: application/pdf; application/vnd.openxmlformats-officedocument.wordprocessingml.document; and text/plain. Files in unsupported formats are rejected with a structured 415 error before processing.
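
The MIME type gate described in [0007] can be sketched as follows. This is a minimal illustration only: the IngestError class, its field names, and the check_mime_type function are hypothetical and are not defined by this disclosure; only the three supported MIME types and the 415 status come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# The three MIME types listed in paragraph [0007].
SUPPORTED_MIME_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "text/plain",
})


@dataclass(frozen=True)
class IngestError:
    """Hypothetical structured error body; the field names are assumptions."""
    status: int
    error: str
    detail: str


def check_mime_type(declared_type: str) -> Optional[IngestError]:
    """Return None when the type is supported, else a structured 415 error."""
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = declared_type.split(";", 1)[0].strip().lower()
    if media_type in SUPPORTED_MIME_TYPES:
        return None
    return IngestError(
        status=415,
        error="unsupported_media_type",
        detail=f"{media_type or '(none)'} is not an accepted document type",
    )
```

Running the check before any extraction work means an unsupported upload is rejected without the file body being parsed.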

[0008] For PDF inputs, the pipeline applies a PDF text extraction routine that handles: single-page and multi-page documents; embedded fonts; and embedded text content that does not require OCR. For DOCX inputs, the pipeline extracts XML body content from the word/document.xml entry, strips XML tags, and normalises whitespace. For plain text inputs, encoding is validated and the content is used directly.
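
The DOCX and plain text branches of [0008] might look like the sketch below, which uses only the Python standard library. Several points are assumptions rather than part of the disclosure: the function names are hypothetical; the XML is parsed and its text nodes collected, which has the effect of stripping tags; whitespace normalisation is taken to mean collapsing runs of whitespace and dropping empty lines; and plain text validation is taken to mean strict UTF-8 decoding. PDF extraction is omitted because the disclosure does not name a routine for it.

```python
import io
import re
import zipfile
from xml.etree import ElementTree

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def normalise_whitespace(text: str) -> str:
    """Collapse whitespace runs within each line and drop empty lines (assumed rule)."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def extract_docx_text(data: bytes) -> str:
    """Read word/document.xml from the DOCX archive and return its text."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ElementTree.fromstring(xml_bytes)
    paragraphs = []
    # Each w:p element is a paragraph; its w:t descendants hold the text runs.
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return normalise_whitespace("\n".join(paragraphs))


def decode_plain_text(data: bytes) -> str:
    """Validate the encoding by strict UTF-8 decoding; content is used as-is."""
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return data.decode("utf-8")
```

A production pipeline handling untrusted uploads would also need limits on archive size and XML complexity, which this sketch does not attempt.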

[0009] The extracted text is normalised to a canonical evidence artefact comprising: a content_hash (SHA-256 of the extracted text in UTF-8); a source_label (the original filename); a source_type set to document; a provenance_class of user_supplied; a confidence_tier of qualified; an ingested_at ISO 8601 timestamp; and a tenant_id binding the artefact to its tenant.
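
As an illustration of [0009], the canonical evidence artefact can be modelled as an immutable record. The field names and fixed values are taken from the paragraph above; the EvidenceArtefact class, the build_artefact helper, the hexadecimal encoding of the hash, and the use of a UTC timestamp are assumptions made for this sketch.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class EvidenceArtefact:
    content_hash: str      # SHA-256 of the extracted text in UTF-8 (hex is assumed)
    source_label: str      # the original filename
    source_type: str       # always "document" for this pipeline
    provenance_class: str  # always "user_supplied"
    confidence_tier: str   # always "qualified"
    ingested_at: str       # ISO 8601 timestamp
    tenant_id: str         # binds the artefact to its tenant


def build_artefact(text: str, filename: str, tenant_id: str,
                   now: Optional[datetime] = None) -> EvidenceArtefact:
    """Normalise extracted text into the artefact described in [0009]."""
    # Assumption: timestamps are recorded in UTC with second precision.
    now = now or datetime.now(timezone.utc)
    return EvidenceArtefact(
        content_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        source_label=filename,
        source_type="document",
        provenance_class="user_supplied",
        confidence_tier="qualified",
        ingested_at=now.isoformat(timespec="seconds"),
        tenant_id=tenant_id,
    )
```

Because content_hash depends only on the extracted text, two uploads by the same tenant that yield identical text produce the same hash, whatever their filenames.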

[0010] The canonical artefact is stored in the tenant’s document store (Cloudflare R2) at a key composed of tenant_id/content_hash. The artefact metadata is inserted into the Evidence Ledger (D1) documents table. An ingestion_hash — SHA-256 over the ingestion record in canonical JSON form — is returned in the ingestion response alongside the content_hash.
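
The key construction and ingestion_hash in [0010] can be sketched as below. This section does not define the canonical JSON form, so the sketch assumes sorted keys, compact separators, and UTF-8 output; a conforming implementation may use a different canonicalisation. The function names are hypothetical, and the R2 and D1 writes are omitted.

```python
import hashlib
import json


def storage_key(tenant_id: str, content_hash: str) -> str:
    """Document store key of the form tenant_id/content_hash."""
    return f"{tenant_id}/{content_hash}"


def canonical_json(record: dict) -> bytes:
    """Assumed canonical form: sorted keys, no extra whitespace, UTF-8."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def ingestion_hash(record: dict) -> str:
    """SHA-256 over the ingestion record in canonical JSON form (hex is assumed)."""
    return hashlib.sha256(canonical_json(record)).hexdigest()
```

Serialising with a fixed key order means that two parties holding the same ingestion record derive the same hash regardless of how their JSON library orders fields, which is what allows the ingestion event to be verified independently.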

[0011] The ingested document enters the tenant’s evidence pool and is eligible for retrieval by the Evidence Rail on subsequent queries. Relevance routing for document-type sources uses the same confidence tier and provenance classification pipeline as web-retrieved sources, ensuring that user-supplied documents and web sources are governed under a unified schema.

[0012] User-supplied documents carry the user_supplied provenance class. Under the confidence ceiling enforcement defined in P124, this class may be subject to tenant-configured provenance constraints. The ingestion pipeline does not apply retraction checks (defined in P114) to user-supplied documents at ingestion time; such checks apply only when an ingested document is subsequently cross-referenced against the Retraction Registry during evidence retrieval.