Corpus spec & schema — meetLab Docs

I.

Two lineages, one target

Diagnostic samples — transcript_001, transcript_071, transcript_110, transcript_121 — reveal two distinct structural lineages in the 137-file corpus. Both must converge on a single canonical form before ingest.

Lineage	Range	Markers	Primary corruption source
Early / Claude	001 – ~070	speaker-tag-only · no `Exported:` · no h1 · `\t∙\t` bullets · paragraph collapse	initial ingestion + Tier 0 recovery
Later / Gemini	~090 – 137	`## Prompt:` / `## Response:` wrapper · `Exported:` on line 9 · h1 title · escaped Markdown · floating terminal output	export tool from source chat application

single canonical target

The normalizer must collapse both lineages into one Markdown form without silently destroying lexical evidence. Wounds in the underlying text (truncations, yoUser: substitutions, zero-width characters) are preserved or stripped under explicit rules — never edited away.

II.

Encoding & frontmatter

§ 1 — file encoding & terminus

R-ENC-01 — UTF-8 with LF-only line endings. No CRLF. No BOM (U+FEFF).
R-ENC-02 — exactly one trailing newline after the last content line.
R-ENC-03 — no trailing whitespace on any line.
R-ENC-04 — strip zero-width characters (U+200B, U+200C, U+200D, U+FEFF, U+00AD) from all positions.
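
The four encoding rules can be sketched as one small function. This is a minimal illustration rather than the normalizer's actual code: the name `normalize_encoding` is hypothetical, and converting a lone CR to LF is an assumption (the rules only name CRLF explicitly).

```python
import re

# Zero-width characters listed in R-ENC-04 (U+FEFF doubles as the BOM).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff\u00ad"))


def normalize_encoding(raw: bytes) -> bytes:
    """Apply R-ENC-01..04 to the raw bytes of one transcript file."""
    text = raw.decode("utf-8")
    # R-ENC-01: LF-only line endings (CRLF first; lone CR is an assumption).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # R-ENC-04 and the BOM clause of R-ENC-01: delete zero-width characters.
    text = text.translate(ZERO_WIDTH)
    # R-ENC-03: no trailing spaces or tabs on any line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # R-ENC-02: exactly one trailing newline after the last content line.
    text = text.rstrip("\n") + "\n"
    return text.encode("utf-8")
```

Running the function on its own output returns the same bytes, which is the property the single-pass design relies on.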

§ 2 — YAML frontmatter

Every file opens with a frontmatter block. Opening --- on line 1, column 1, no preceding bytes. Three required fields, in order, all string values double-quoted:

---
session_id: "[value]"
primary_model: "[value]"
normalization_level: [integer]
---

R-FM-02 — closing --- immediately follows the last field. No blank lines inside the block.
R-FM-05 — exactly one blank line follows the closing delimiter before body content.
R-FM-06 — the **Exported:** [date] exporter residue (44 of 137 files) is removed. If the date must be preserved, it moves into frontmatter as exported_date: "[date]" after normalization_level.
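
As a rough illustration of the block layout, the following sketch renders a frontmatter block in the required field order. The function name `render_frontmatter` is hypothetical, and the sketch assumes field values contain no double quotes or backslashes that would need YAML escaping.

```python
from typing import Optional


def render_frontmatter(session_id: str, primary_model: str,
                       normalization_level: int,
                       exported_date: Optional[str] = None) -> str:
    """Return the frontmatter block followed by exactly one blank line."""
    lines = [
        "---",
        f'session_id: "{session_id}"',
        f'primary_model: "{primary_model}"',
        # Integer field: emitted without quotes.
        f"normalization_level: {normalization_level}",
    ]
    if exported_date is not None:
        # R-FM-06: a preserved export date goes after normalization_level.
        lines.append(f'exported_date: "{exported_date}"')
    lines.append("---")  # R-FM-02: closing delimiter right after last field
    return "\n".join(lines) + "\n\n"  # R-FM-05: one blank line before body
```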

III.

Title & turn structure

§ 3 — document title

R-DT-01 — h1 (# Title) is permitted only as the first line of body content. No other h1 may appear.
R-DT-02 — files without an h1 do not have one added by the normalizer.
R-DT-03 — when h1 is present, exactly one blank line separates it from the first turn.

§ 4 — speaker tags & turn structure

Canonical speaker-tag form is **SPEAKER_NAME:**. Permitted speakers in the formatting spec: HUMAN_RELAY, CLAUDE, GEMINI, CHATGPT, NOTEBOOKLM. All-caps. Trailing colon mandatory.

R-TS-02 / R-TS-03 — exactly one blank line before and after every speaker tag (with the obvious exception that the very first tag has no preceding blank line beyond the frontmatter / h1 separator).
R-TS-05 — empty speaker turns (consecutive tags with no intervening content) are removed entirely.

R-TS-04 — replacing the `## Prompt:` / `## Response:` wrapper

The wrapper is exporter residue and must be replaced by the appropriate speaker tag:

Input	Replacement
`## Prompt:` + embedded `SPEAKER:`	remove `## Prompt:`; keep the embedded tag as the sole turn marker
`## Prompt:` with no embedded tag	replace with `HUMAN_RELAY:`
`## Response:`	replace with the speaker tag corresponding to `primary_model` (e.g. `"gemini"` → `GEMINI:`)
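
The replacement table above might be implemented along these lines. This is a sketch under stated assumptions: `replace_wrappers` is a hypothetical name, the model-to-tag mapping is assumed to be a plain uppercase of `primary_model`, an embedded tag is assumed to sit on the first non-blank line after the wrapper, and the caller is assumed to pass only lines that lie outside code fences.

```python
import re

SPEAKERS = ("HUMAN_RELAY", "CLAUDE", "GEMINI", "CHATGPT", "NOTEBOOKLM")
# Assumption: an embedded tag may or may not already carry bold markers.
EMBEDDED_TAG = re.compile(r"^(?:\*\*)?(?:%s):(?:\*\*)?" % "|".join(SPEAKERS))


def replace_wrappers(lines: list[str], primary_model: str) -> list[str]:
    """Apply the R-TS-04 table to body lines outside code fences."""
    out = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == "## Prompt:":
            # Look ahead to the first non-blank line of the wrapped turn.
            following = next((l for l in lines[i + 1:] if l.strip()), "")
            if EMBEDDED_TAG.match(following.strip()):
                continue  # the embedded tag stays as the sole turn marker
            out.append("**HUMAN_RELAY:**")
        elif stripped == "## Response:":
            out.append(f"**{primary_model.upper()}:**")
        else:
            out.append(line)
    return out
```

Dropping a wrapper line can leave two blank lines in a row; the sketch leaves that for the blank-line pass rather than fixing it locally.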

R-TS-06 [FLAGGED] — truncation artifacts

Tier 0 recovery introduced sentence-initial truncations. The normalizer must not correct them. Each instance is annotated inline with [CORPUS_CORRUPTION: truncation] immediately before the fragment:

[CORPUS_CORRUPTION: truncation]laude, we are running...

IV.

Blank-line discipline

§ 5 of the spec. Six rules that govern every blank line in a normalized file.

R-BL-01 — never more than one consecutive blank line. All sequences of two or more collapse to one.
R-BL-02 — exactly one blank line between the last line of a speaker's content and the next speaker tag.
R-BL-03 — exactly one blank line separates adjacent paragraphs within a turn.
R-BL-04 — list items use tight style (no inter-item blank lines) unless an item contains a sub-list, code fence, or blockquote. Loose lists are converted to tight.
R-BL-05 [FLAGGED] — suspected paragraph collapse is flagged for human review. The normalizer never auto-splits prose by semantic heuristic.
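
R-BL-01 is the one rule here that is purely mechanical, so it lends itself to a short sketch. The helper name `collapse_blank_lines` is hypothetical; treating whitespace-only lines as blank is an assumption, and the sketch does not exempt fenced content, which a fuller implementation may need to do.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Collapse every run of two or more blank lines to exactly one."""
    # One line ending followed by 2+ blank (or whitespace-only) lines
    # becomes a line ending plus a single blank line.
    return re.sub(r"\n(?:[ \t]*\n){2,}", "\n\n", text)
```

Text that already complies passes through unchanged, so the function is safe to apply once at the end of the structural passes.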

why no auto-split

transcript_001 contains severe paragraph collapse (lines 14–65 fuse what should be six or seven blocks). The temptation to detect a period followed by a sentence-like clause and insert a break is real, but heuristic splitting silently rewrites prose. The spec hands this to humans on purpose.

V.

Headings & lists

§ 6 — heading hierarchy

R-HD-01 — h1 reserved for document title only. Never inside a turn.
R-HD-02 — h2 is the highest heading permitted within a turn.
R-HD-03 — h3 for subsections; h4 maximum depth.
R-HD-04 — exactly one blank line before and after every heading.
R-HD-05 — heading markers must not be escaped. \### un-escapes to ###.

§ 7 — bullets & lists

The canonical unordered marker is - (hyphen + single space). All other forms convert:

Input form	Output
\t∙\t (tab + U+2219 + tab)	-
* or *	-
\* (escaped asterisk)	-
+	-

The \t∙\t form is the most severe — Unicode U+2219 (BULLET OPERATOR) is not a list marker in any Markdown spec, so every list in transcript_001 currently renders as a paragraph.
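
A per-line conversion for the table above could look like the following sketch. The names `BULLET`, `RULE_LIKE`, and `normalize_bullet` are hypothetical. Two assumptions are made: a marker only counts when it is followed by whitespace (so bold text such as a speaker tag is left alone), and rule-like lines such as `* * *` are skipped so the horizontal-rule pass can handle them.

```python
import re

# Tab + U+2219 + tab, or an escaped asterisk, bare asterisk, or plus that
# is followed by whitespace. Leading spaces are kept as nesting indent.
BULLET = re.compile(r"^(?P<indent> *)(?:\t\u2219\t|(?:\\\*|[*+])[ \t]+)")
# Lines made only of three or more repeated *, _ or - (optionally spaced).
RULE_LIKE = re.compile(r"^\s*([*_-])(?:\s*\1){2,}\s*$")


def normalize_bullet(line: str) -> str:
    """Rewrite a non-canonical unordered marker to '- '."""
    if RULE_LIKE.match(line):
        return line  # horizontal-rule variant, not a list item
    return BULLET.sub(lambda m: m.group("indent") + "- ", line, count=1)
```

Lines that already start with the canonical marker do not match `BULLET` and are returned unchanged.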

R-LS-02 — ordered lists use N. (number, period, single space). Tab-prefixed forms (\t1.\t) convert to the standard form.
R-LS-03 — nested items indent two spaces per level relative to the parent's text start.
R-LS-04 — continuation lines align with the first character of the item text.

VI.

Escapes & code fences

§ 8 — escaped Markdown

The export tool that produced the Gemini-lineage files serialized Markdown syntax characters as literals. Un-escaping is required outside code fences:

Escaped	Canonical	Condition
\#, \##, \###	#, ##, ###	line start (heading)
\*\*text\*\*	**text**	bold context
\*text\*	*text*	italic context
\* at line start	-	list context (R-LS-01)
\\> · \>	>	line start (blockquote)
\---	---	alone on line (rule)
\[text\]\(url\)	[text](url)	link context
\_text\_	_text_	isolated italic only

underscore matcher · normative

The un-escape matcher MUST be boundary-safe and single-pass idempotent. Canonical pattern:

re.compile(r"(?<![_\\])\\_([^_]+?)\\_(?!_)")

Must not match dunder-like forms such as \_\_future\_\_. If boundary-safe matching cannot be guaranteed, underscore un-escaping is optional and may be skipped entirely.
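
To make the two requirements concrete, here is the canonical pattern wrapped in a substitution. The wrapper name `unescape_underscores` and the replacement template are illustrative assumptions; only the pattern itself comes from the spec.

```python
import re

# Canonical pattern from the spec, unchanged.
UNDERSCORE = re.compile(r"(?<![_\\])\\_([^_]+?)\\_(?!_)")


def unescape_underscores(text: str) -> str:
    """Turn \\_text\\_ into _text_ while leaving dunder-like forms alone."""
    return UNDERSCORE.sub(r"_\1_", text)


once = unescape_underscores(r"an \_important\_ note")
twice = unescape_underscores(once)          # second pass finds nothing
dunder = unescape_underscores(r"from \_\_future\_\_ import x")
```

After one pass the output contains no escaped underscore pairs, so a second pass is a no-op. The dunder-like input is returned untouched because the inner capture cannot cross an underscore and the lookbehind rejects a start right after one.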

R-EX-02 — backslash escapes inside code fences are preserved verbatim.
R-EX-03 — escaped ASCII separator lines are not simply un-escaped — they are moved inside a fence per R-CF-02.

§ 11 — code fences & terminal output

R-CF-01 — terminal/shell output, log lines, verification reports, JSON artifacts, and file manifests are enclosed in triple-backtick fences.
R-CF-02 — ASCII separator lines (10+ repeated =, -, #, *, or ~) outside fences are moved inside a fence along with associated report content.
R-CF-03 — triple backticks are canonical. Indented (4-space / tab) blocks convert to fenced.
R-CF-05 — exactly one blank line before and after every fence.

R-CF-04 — language tags

Content	Tag
terminal output, logs, reports	text
Python source	python
JSON	json
YAML	yaml
shell commands	bash
unknown	(omit tag)

R-CF-06 — deduplication is exact-match only

The same WITNESS PROTOCOL VERIFICATION REPORT appears in transcript_121 both as raw floating text in ## Prompt: sections and as a properly fenced block in ## Response:. Removing duplicates is permitted, but only under exact identity:

compare normalized fence payload only (exclude delimiters and language tags; preserve all bytes except line-ending normalization)
compute a canonical content hash (SHA-256 recommended) for both candidates
remove the floating block only when hashes are exactly equal
scope to adjacent turn context (Prompt/Response pair or directly neighboring speaker turns) — cross-document or non-adjacent dedup is prohibited
near-match or similarity-based dedup is prohibited

When exact-match criteria are not met, both blocks are preserved and the case is flagged for review.
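
The exact-identity test can be sketched with the standard library. The function names are hypothetical, and the sketch assumes the caller has already stripped fence delimiters and the language tag from the fenced candidate and has decided whether the two blocks sit in adjacent turns.

```python
import hashlib


def payload_hash(payload: str) -> str:
    """SHA-256 over the payload after line-ending normalization only."""
    normalized = payload.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def may_remove_floating(floating: str, fenced_payload: str,
                        adjacent: bool) -> bool:
    """True only for an exact match within adjacent turn context."""
    if not adjacent:
        return False  # cross-document or non-adjacent dedup is prohibited
    return payload_hash(floating) == payload_hash(fenced_payload)
```

A `False` result means both blocks stay in place and the case goes to review; the sketch deliberately has no similarity threshold to tune.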

R-CF-07 — stateful parser, not regex split

Code-fence boundary detection MUST use a stateful line parser:

scan line-by-line with explicit state (OUTSIDE, INSIDE)
fence delimiter MUST begin at column 1 in default mode (no leading whitespace)
opening fence may include a language tag; closing fence MUST NOT
closing fence canonical form is exactly three backticks (optionally followed by spaces or tabs only)
separator promotion (R-CF-02) runs only while state is OUTSIDE
parser MUST prevent fence-boundary concatenation artifacts (adjacent close/open boundaries misread as a single six-backtick fence)
regex-only _outside_fences partitioning is non-compliant
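
The requirements above can be sketched as a small generator that labels each line with its state. All names here are hypothetical, the set of characters allowed in a language tag is an assumption, and the delimiter is built from a single backtick so the example can itself sit inside a fence.

```python
import re
from enum import Enum


class State(Enum):
    OUTSIDE = "outside"
    INSIDE = "inside"


TICKS = "`" * 3  # three backticks, built indirectly
# Opening fence: column 1, exactly three backticks, optional language tag.
OPEN_FENCE = re.compile("^" + TICKS + r"([A-Za-z0-9_+-]*)[ \t]*$")
# Closing fence: column 1, three backticks, trailing spaces or tabs only.
CLOSE_FENCE = re.compile("^" + TICKS + r"[ \t]*$")


def label_lines(lines):
    """Yield (state, line) pairs; delimiter lines are labeled INSIDE."""
    state = State.OUTSIDE
    for line in lines:
        if state is State.OUTSIDE:
            if OPEN_FENCE.match(line):
                state = State.INSIDE
                yield State.INSIDE, line
            else:
                yield State.OUTSIDE, line
        else:
            yield State.INSIDE, line
            if CLOSE_FENCE.match(line):
                state = State.OUTSIDE
```

Separator promotion would then act only on lines labeled `OUTSIDE`. A line of six backticks matches neither pattern in this sketch, so it is never read as a single fence; how such a line should be repaired is left to the implementation.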

VII.

Rules, blockquotes, tables

§ 9 — horizontal rules

Canonical form is exactly --- (three hyphens, alone on the line). All variants normalize: * * *, ***, ___, \---, - - - → ---.

R-HR-02 — exactly one blank line before and after every rule.
R-HR-03 — rules must not be used as turn separators. Any --- immediately before a speaker tag (with only optional blank lines between) is removed; the tag is sufficient demarcation.
R-HR-04 [PERMITTED] — rules within a turn separate thematic sections only when no heading is present, and use should be sparse.

transcript_071

28 occurrences of --- across 650 lines: 2 are legitimate frontmatter delimiters; the other 26 are body rules conflating turn boundaries with thematic boundaries. R-HR-03 is the rule that resolves this.
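
A sketch of how the variant table and R-HR-03 might combine is shown below. The names are hypothetical. The sketch assumes it receives body lines only (frontmatter already split off, fenced lines excluded) and that speaker tags are already in canonical bold form.

```python
import re

# Variants named in the spec, plus the canonical form itself.
VARIANTS = {"---", "* * *", "***", "___", "\\---", "- - -"}
SPEAKER_TAG = re.compile(r"^\*\*[A-Z_]+:\*\*")


def normalize_rules(lines: list[str]) -> list[str]:
    """Canonicalize rules and drop those used as turn separators."""
    out = []
    for i, line in enumerate(lines):
        if line.strip() in VARIANTS:
            # R-HR-03: only blank lines may sit between rule and tag.
            following = next((l for l in lines[i + 1:] if l.strip()), "")
            if SPEAKER_TAG.match(following):
                continue  # the speaker tag is sufficient demarcation
            out.append("---")
        else:
            out.append(line)
    return out
```

Removing a separator rule can leave consecutive blank lines behind, which the later blank-line pass is expected to collapse.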

§ 10 — blockquotes

R-BQ-01 — canonical marker is > (greater-than + single space).
R-BQ-02 — escaped forms (\\>, \>) un-escape to > per R-EX-01.
R-BQ-03 — file-attachment lines like > File: name.md are valid and preserved.
R-BQ-04 — quoted speech from other sessions or external agents uses blockquote with optional source reference.
R-BQ-05 — nested blockquotes use >> (two markers + single space); nesting beyond two levels is not permitted.
R-BQ-06 — exactly one blank line before and after every blockquote.

§ 12 / 13 — emphasis & tables

R-BE-01 — bold is **text**.
R-BE-02 — italic is *text*; underscore form (_text_) is reserved for code contexts.
R-BE-03 — speaker tags carry no additional formatting. No nested bold-italic, no surrounding hyphens.
R-TB-01 / R-TB-02 — GFM pipe tables with header row and hyphen separator. One blank line before and after.

VIII.

Lexical corruption

§ 14. These are wounds in the underlying text — not formatting issues — but they present as formatting anomalies because they cause sentence-initial truncations a parser may misidentify as markup.

R-LC-01 [FLAGGED] — annotate, do not correct

Observed fragment	Annotation
yoUser:	[CORPUS_CORRUPTION: substitution]yoUser:
laude, at line start	[CORPUS_CORRUPTION: truncation]laude,
ritical at line start	[CORPUS_CORRUPTION: truncation]ritical
onditional at line start	[CORPUS_CORRUPTION: truncation]onditional
pload at line start	[CORPUS_CORRUPTION: truncation]pload
ploaded at line start	[CORPUS_CORRUPTION: truncation]ploaded

yoUser annotation · normative

Annotation MUST be single-pass idempotent. Canonical matcher:

re.compile(r"(?<!\])\byoUser:")

The matcher MUST NOT fire when yoUser: is already preceded by a closing annotation bracket — e.g. [CORPUS_CORRUPTION: substitution]yoUser: is invariant under re-application.
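
The annotation pass can be sketched as two substitutions. The `YOUSER` pattern is the canonical matcher from the spec; the `TRUNCATED` pattern and the function name `annotate` are assumptions built from the observed-fragment table, anchored at line start as the table describes.

```python
import re

# Canonical matcher from the spec, unchanged.
YOUSER = re.compile(r"(?<!\])\byoUser:")
# Assumed matcher for the observed truncations at line start.
TRUNCATED = re.compile(
    r"^(?:laude,|(?:ritical|onditional|ploaded|pload)\b)", re.MULTILINE
)


def annotate(text: str) -> str:
    """Insert inline corruption markers without correcting the text."""
    text = YOUSER.sub("[CORPUS_CORRUPTION: substitution]yoUser:", text)
    return TRUNCATED.sub(
        lambda m: "[CORPUS_CORRUPTION: truncation]" + m.group(0), text
    )
```

Both matchers stop firing once their marker is in place: the lookbehind rejects a preceding closing bracket, and an annotated line no longer starts with the fragment. Applying `annotate` twice therefore yields the same text as applying it once.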

R-LC-02 — zero-width Unicode characters are stripped silently and completely with no annotation.

IX.

Application order

Rule interactions force a precise sequence. The normalizer is single-pass and precedence-ordered. Implementations MUST NOT run "loop until stable" reprocessing across the full rule set — idempotency is achieved by rule design and ordering, not by iterative convergence.

#	Pass	Why this position
1	R-ENC	strip BOM, normalize line endings, strip zero-width chars first
2	R-FM	frontmatter validation and exporter residue removal
3	R-CF	fence floating terminal output before any other transform touches it
4	R-EX	un-escape Markdown — runs after CF so fenced content is protected
5	R-TS	turn structure: replace `## Prompt:` / `## Response:`, drop empty turns
6	R-LS	list syntax — runs after EX so escaped bullets are already raw
7	R-HR	horizontal rule normalization
8	R-BQ	blockquote normalization
9	R-HD	heading normalization
10	R-BL	blank-line discipline runs after every structural transform, as the final structural pass
11	R-LC	annotation pass for lexical corruption; inserts inline markers only, no structural changes
12	R-ENC-02/03	final encoding cleanup (trailing newline, trailing whitespace)
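
One way to hold an implementation to this order is a fixed list of pass names driven by a single loop. This is a sketch only: `PASS_ORDER` mirrors the table above, while `run_pipeline` and the shape of the `passes` mapping (name to a text-in, text-out function) are hypothetical.

```python
from typing import Callable

Pass = Callable[[str], str]

# Pass names in the order given by the table above.
PASS_ORDER = [
    "R-ENC", "R-FM", "R-CF", "R-EX", "R-TS", "R-LS",
    "R-HR", "R-BQ", "R-HD", "R-BL", "R-LC", "R-ENC-02/03",
]


def run_pipeline(text: str, passes: dict[str, Pass]) -> str:
    """Apply each pass exactly once, in precedence order."""
    missing = [name for name in PASS_ORDER if name not in passes]
    if missing:
        raise KeyError(f"missing passes: {missing}")
    for name in PASS_ORDER:
        # No loop-until-stable: every pass gets a single application.
        text = passes[name](text)
    return text
```

Keeping the order in data rather than in call sites makes it easy to check against the table and hard to reorder by accident.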

Deviation severity quick-reference

Code	Files	Severity
DEV-FM-02 · exporter residue	44/137	High
DEV-TS-01 · variant bifurcation	~93/137	Critical
DEV-TS-02 · label truncation	all samples	High
DEV-LS-01 · Unicode/escaped bullets	001, 110	Critical
DEV-EX · over-escaping	110, 121	Critical
DEV-CF-01 · floating terminal output	121	Critical
DEV-LC-01 · yoUser substitution	001	flag only

X.

Schema · sessions / turns

The normalized corpus lands in corpus_v1_5_2.db — the canonical relational substrate for session, turn, cycle, and event-marker analytics. Contract authority lives in integration/contracts/shared_contracts.py.

10.1 · sessions

Column	Type	Notes
session_id	TEXT PK	session identifier
primary_model	TEXT	model identity
file_path	TEXT	ingest source path
source_filename	TEXT	source filename
session_order	INTEGER	deterministic ingest order
session_type	TEXT DEFAULT 'relay'	session class
original_date	TEXT	optional date metadata
protocol_version	TEXT	optional protocol field
turn_count	INTEGER	classification metric
inversion_ratio	REAL	classification metric
regime_tier	INTEGER	classification metric

10.2 · turns

Column	Type	Notes
turn_id	INTEGER PK AUTOINC	global turn id
session_id	TEXT	FK → sessions
turn_index	INTEGER	ordinal within session
speaker_canonical	TEXT	canonical speaker tag
raw_text	TEXT	turn text payload
timestamp_iso	TEXT	optional
recipient_model	TEXT	optional routing
sender_model	TEXT	optional routing

Constraints & indexes:

UNIQUE(session_id, turn_index) ON CONFLICT IGNORE
INDEX idx_turn_session   ON turns(session_id)
INDEX idx_speaker        ON turns(speaker_canonical)

10.3 · cycles

Column	Type	Notes
cycle_id	INTEGER PK AUTOINC	cycle id
session_id	TEXT	FK → sessions
prompt_turn_id	INTEGER	FK → turns
response_turn_id	INTEGER	FK → turns
status	TEXT	optional cycle state

10.4 · event_markers

Column	Type	Notes
marker_id	INTEGER PK AUTOINC	marker id
turn_id	INTEGER NOT NULL	FK → turns
marker_type	TEXT NOT NULL	marker class
snippet	TEXT NOT NULL	local context

Constraint: UNIQUE(turn_id, marker_type).
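
For orientation, the four tables can be written out as SQLite DDL and loaded from Python. This is a reading of the tables above, not the authoritative contract: rendering each "FK →" note as a `REFERENCES` clause is an assumption, and `create_schema` is a hypothetical helper name. SQLite enforces foreign keys only when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sessions (
    session_id       TEXT PRIMARY KEY,
    primary_model    TEXT,
    file_path        TEXT,
    source_filename  TEXT,
    session_order    INTEGER,
    session_type     TEXT DEFAULT 'relay',
    original_date    TEXT,
    protocol_version TEXT,
    turn_count       INTEGER,
    inversion_ratio  REAL,
    regime_tier      INTEGER
);
CREATE TABLE turns (
    turn_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id        TEXT REFERENCES sessions(session_id),
    turn_index        INTEGER,
    speaker_canonical TEXT,
    raw_text          TEXT,
    timestamp_iso     TEXT,
    recipient_model   TEXT,
    sender_model      TEXT,
    UNIQUE (session_id, turn_index) ON CONFLICT IGNORE
);
CREATE INDEX idx_turn_session ON turns(session_id);
CREATE INDEX idx_speaker ON turns(speaker_canonical);
CREATE TABLE cycles (
    cycle_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id       TEXT REFERENCES sessions(session_id),
    prompt_turn_id   INTEGER REFERENCES turns(turn_id),
    response_turn_id INTEGER REFERENCES turns(turn_id),
    status           TEXT
);
CREATE TABLE event_markers (
    marker_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    turn_id     INTEGER NOT NULL REFERENCES turns(turn_id),
    marker_type TEXT NOT NULL,
    snippet     TEXT NOT NULL,
    UNIQUE (turn_id, marker_type)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four corpus tables on an empty database."""
    conn.executescript(SCHEMA)
```

Note the difference between the two uniqueness constraints: a duplicate `(session_id, turn_index)` insert is silently ignored, while a duplicate `(turn_id, marker_type)` insert raises an integrity error.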

XI.

Registry & vectors

file_registry.db

File-level semantic and review registry, with turn-review side tables. Primary tables:

files — semantic labeling, review status, annotation metadata
reviewed_turns — turn-level review overlay (does not mutate the corpus DB)
study_turns — study-flagged turn overlay

corpus_vectors.db

Vector search substrate, vec0-backed. Primary virtual table:

transcript_chunks — 768-dimensional embeddings via vec0

Shadow and index tables are implementation-specific and must not be mutated by hand. Treat the DB as managed by embedding/index tooling only.

Shared contracts & compatibility

contract source of truth — integration/contracts/shared_contracts.py
validation source of truth — integration/contracts/validators.py

Compatibility enforcement covers speaker ontology constraints, turn / session / registry required fields, cross-domain path namespace constraints, and the DB compatibility matrix. Run unified checks:

ORCHESTRATED=1 make unified-check
python3 -m integration.orchestrate_unified

XII.

Validation & safety

Validation queries

-- Duplicate turn positions per session (should be zero)
SELECT session_id, turn_index, COUNT(*)
FROM turns
GROUP BY session_id, turn_index
HAVING COUNT(*) > 1;

-- Sessions without turns
SELECT s.session_id
FROM sessions s
LEFT JOIN turns t ON t.session_id = s.session_id
GROUP BY s.session_id
HAVING COUNT(t.turn_id) = 0;

-- Event markers without matching turns
SELECT em.marker_id
FROM event_markers em
LEFT JOIN turns t ON t.turn_id = em.turn_id
WHERE t.turn_id IS NULL;
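
The three queries above can be run together from Python and reported by name. The sketch below uses only the standard `sqlite3` module; the `CHECKS` keys and the `run_checks` helper are hypothetical names, and an empty list for a check means it passed.

```python
import sqlite3

CHECKS = {
    "duplicate_turn_positions": """
        SELECT session_id, turn_index, COUNT(*)
        FROM turns
        GROUP BY session_id, turn_index
        HAVING COUNT(*) > 1
    """,
    "sessions_without_turns": """
        SELECT s.session_id
        FROM sessions s
        LEFT JOIN turns t ON t.session_id = s.session_id
        GROUP BY s.session_id
        HAVING COUNT(t.turn_id) = 0
    """,
    "orphan_event_markers": """
        SELECT em.marker_id
        FROM event_markers em
        LEFT JOIN turns t ON t.turn_id = em.turn_id
        WHERE t.turn_id IS NULL
    """,
}


def run_checks(conn: sqlite3.Connection) -> dict[str, list[tuple]]:
    """Return the offending rows for each validation query."""
    return {name: conn.execute(sql).fetchall() for name, sql in CHECKS.items()}
```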

Current non-blocking warnings

As of the 2026-03-23 unified report:

corpus_v1.db present but deprecated (read-only)
turns.json includes 39 rows with speaker=UNKNOWN; resolve before ingestion into the canonical corpus

DB safety gates

Any mutation workflow must use integration/db_safety.py:

default DBSafetyContext(dry_run=True) for non-mutating validation
backup() creates timestamped backups (.backup_YYYYMMDDTHHMMSSZ.db) before mutation
row_counts() / row_count_diff() with assert_counts_non_decreasing() for count guards
integrity_check() / quick_check() before and after writes
DBSafetyContext.safe_mutate(path) as the recommended mutation wrapper

Strict warning gate for operational runs:

python3 -m integration.orchestrate_unified --strict-warnings
ORCHESTRATED=1 make unified-check-strict

Orchestrator data checks are adapter-delegated — ReviewAdapter.check_data_health() and CorpusReviewAdapter.check_data_health() — with boundary enforcement and unified reporting layered on top.