Two lineages, one target
Diagnostic samples — transcript_001, transcript_071, transcript_110, transcript_121 — reveal two distinct structural lineages in the 137-file corpus. Both must converge on a single canonical form before ingest.
| Lineage | Range | Markers | Primary corruption source |
|---|---|---|---|
| Early / Claude | 001 – ~070 | speaker-tag-only · no **Exported:** · no h1 · \t∙\t bullets · paragraph collapse | initial ingestion + Tier 0 recovery |
| Later / Gemini | ~090 – 137 | ## Prompt: / ## Response: wrapper · **Exported:** on line 9 · h1 title · escaped Markdown · floating terminal output | export tool from source chat application |
Corruption artifacts (yoUser: substitutions, zero-width characters) are preserved or stripped under explicit rules, never edited away.
Encoding & frontmatter
§ 1 — file encoding & terminus
- R-ENC-01 — UTF-8 with LF-only line endings. No CRLF. No BOM (U+FEFF).
- R-ENC-02 — exactly one trailing newline after the last content line.
- R-ENC-03 — no trailing whitespace on any line.
- R-ENC-04 — strip zero-width characters (U+200B, U+200C, U+200D, U+FEFF, U+00AD) from all positions.
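The four encoding rules compose into a small text transform. The sketch below is one possible reading of them, not the normalizer's actual code: `normalize_encoding` is a hypothetical name, treating a lone CR as a line ending is an assumption, and R-ENC-02/03 are applied in the same call even though the application order runs them as a final pass.

```python
import re

# Code points listed in R-ENC-04; U+FEFF also covers the BOM from R-ENC-01.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff\u00ad]")


def normalize_encoding(raw: bytes) -> bytes:
    """Apply R-ENC-01..04 to the raw bytes of one file (hypothetical helper)."""
    text = raw.decode("utf-8")
    # R-ENC-01: LF-only line endings (lone CR handling is an assumption).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # R-ENC-01 / R-ENC-04: drop the BOM and every zero-width character.
    text = ZERO_WIDTH.sub("", text)
    # R-ENC-03: no trailing spaces or tabs on any line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # R-ENC-02: exactly one trailing newline after the last content line.
    while lines and lines[-1] == "":
        lines.pop()
    return ("\n".join(lines) + "\n").encode("utf-8")
```

Decoding with plain `utf-8` raises on invalid bytes, which seems preferable to silently replacing them in a corpus where corruption must be annotated rather than repaired.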
§ 2 — YAML frontmatter
Every file opens with a frontmatter block. Opening --- on line 1, column 1, no preceding bytes. Three required fields, in order, all string values double-quoted:
    ---
    session_id: "[value]"
    primary_model: "[value]"
    normalization_level: [integer]
    ---
- R-FM-02 — closing `---` immediately follows the last field. No blank lines inside the block.
- R-FM-05 — exactly one blank line follows the closing delimiter before body content.
- R-FM-06 — the `**Exported:** [date]` exporter residue (44 of 137 files) is removed. If the date must be preserved, it moves into frontmatter as `exported_date: "[date]"` after `normalization_level`.
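As a rough illustration of the frontmatter rules, the sketch below renders the block and lifts the `**Exported:**` residue out of the body. Both function names are hypothetical, and the code assumes field values contain no double quotes, since the rules above do not say how to escape them.

```python
import re

# Matches the exporter residue line described in R-FM-06.
EXPORTED = re.compile(r"^\*\*Exported:\*\*[ \t]*(.*)$")


def pop_exported_line(lines):
    """Remove the first **Exported:** line; return (remaining_lines, date_or_None)."""
    for i, line in enumerate(lines):
        match = EXPORTED.match(line)
        if match:
            return lines[:i] + lines[i + 1:], match.group(1).strip()
    return lines, None


def render_frontmatter(session_id, primary_model, normalization_level,
                       exported_date=None):
    """Render the frontmatter block plus the single blank line of R-FM-05."""
    fields = [
        "---",
        f'session_id: "{session_id}"',
        f'primary_model: "{primary_model}"',
        f"normalization_level: {int(normalization_level)}",
    ]
    if exported_date is not None:
        # R-FM-06: the preserved date goes after normalization_level.
        fields.append(f'exported_date: "{exported_date}"')
    fields.append("---")
    return "\n".join(fields) + "\n\n"
```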
Title & turn structure
§ 3 — document title
- R-DT-01 — h1 (`# Title`) is permitted only as the first line of body content. No other h1 may appear.
- R-DT-02 — files without an h1 do not have one added by the normalizer.
- R-DT-03 — when h1 is present, exactly one blank line separates it from the first turn.
§ 4 — speaker tags & turn structure
Canonical speaker-tag form is **SPEAKER_NAME:**. Permitted speakers in the formatting spec: HUMAN_RELAY, CLAUDE, GEMINI, CHATGPT, NOTEBOOKLM. All-caps. Trailing colon mandatory.
- R-TS-02 / R-TS-03 — exactly one blank line before and after every speaker tag (with the obvious exception that the very first tag has no preceding blank line beyond the frontmatter / h1 separator).
- R-TS-05 — empty speaker turns (consecutive tags with no intervening content) are removed entirely.
R-TS-04 — replacing the ## Prompt: / ## Response: wrapper
The wrapper is exporter residue and must be replaced by the appropriate speaker tag:
| Input | Replacement |
|---|---|
| ## Prompt: + embedded **SPEAKER:** | remove ## Prompt:; keep the embedded tag as the sole turn marker |
| ## Prompt: with no embedded tag | replace with **HUMAN_RELAY:** |
| ## Response: | replace with the speaker tag corresponding to primary_model (e.g. "gemini" → **GEMINI:**) |
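The three replacement cases can be sketched as a single scan over body lines. This is an illustration under stated assumptions: `replace_wrappers` is a hypothetical name, the mapping from `primary_model` to a speaker tag is assumed to be a plain upper-casing, and the input is assumed to contain only lines outside code fences. Blank-line cleanup is left to the later R-BL pass.

```python
import re

SPEAKERS = {"HUMAN_RELAY", "CLAUDE", "GEMINI", "CHATGPT", "NOTEBOOKLM"}
TAG = re.compile(r"^\*\*([A-Z_]+):\*\*")


def replace_wrappers(lines, primary_model):
    """Replace ## Prompt: / ## Response: headings with speaker tags (R-TS-04)."""
    response_speaker = primary_model.upper()  # assumed mapping, e.g. "gemini" -> GEMINI
    if response_speaker not in SPEAKERS:
        raise ValueError(f"unknown primary_model: {primary_model!r}")
    out = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if stripped == "## Prompt:":
            # Look past blank lines for an embedded speaker tag.
            j = i + 1
            while j < len(lines) and lines[j].strip() == "":
                j += 1
            match = TAG.match(lines[j]) if j < len(lines) else None
            if match and match.group(1) in SPEAKERS:
                i = j  # drop the wrapper; the embedded tag marks the turn
                continue
            out.append("**HUMAN_RELAY:**")
        elif stripped == "## Response:":
            out.append(f"**{response_speaker}:**")
        else:
            out.append(lines[i])
        i += 1
    return out
```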
R-TS-06 [FLAGGED] — truncation artifacts
Tier 0 recovery introduced sentence-initial truncations. The normalizer must not correct them. Each instance is annotated inline with [CORPUS_CORRUPTION: truncation] immediately before the fragment:
[CORPUS_CORRUPTION: truncation]laude, we are running...
Blank-line discipline
§ 5 of the spec. Six rules that govern every blank line in a normalized file.
- R-BL-01 — never more than one consecutive blank line. All sequences of two or more collapse to one.
- R-BL-02 — exactly one blank line between the last line of a speaker's content and the next speaker tag.
- R-BL-03 — exactly one blank line separates adjacent paragraphs within a turn.
- R-BL-04 — list items use tight style (no inter-item blank lines) unless an item contains a sub-list, code fence, or blockquote. Loose lists are converted to tight.
- R-BL-05 [FLAGGED] — suspected paragraph collapse is flagged for human review. The normalizer never auto-splits prose by semantic heuristic.
The temptation to detect a `.` followed by a sentence-like clause and insert a break is real, but heuristic splitting silently rewrites prose. The spec hands this to humans on purpose.
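R-BL-01 is the one blank-line rule that is purely mechanical, so it is easy to sketch. The helper name below is hypothetical, leaving fenced content untouched is an assumption that the rules above do not state, and the fence toggle is deliberately simplistic compared with the stateful parser required by R-CF-07.

```python
def collapse_blank_lines(lines):
    """Collapse runs of blank lines to one (R-BL-01), leaving fenced content alone."""
    out = []
    in_fence = False
    for line in lines:
        if line.startswith("```"):
            in_fence = not in_fence  # simplified fence tracking (assumption)
        blank = line.strip() == ""
        if blank and not in_fence:
            if out and out[-1] == "":
                continue  # second or later blank line in a run
            out.append("")  # whitespace-only lines become truly empty
        else:
            out.append(line)
    return out
```

Suspected paragraph collapse (R-BL-05) is intentionally absent here: the sketch only removes surplus blank lines and never inserts one.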
Headings & lists
§ 6 — heading hierarchy
- R-HD-01 — h1 reserved for document title only. Never inside a turn.
- R-HD-02 — h2 is the highest heading permitted within a turn.
- R-HD-03 — h3 for subsections; h4 maximum depth.
- R-HD-04 — exactly one blank line before and after every heading.
- R-HD-05 — heading markers must not be escaped. `\###` un-escapes to `###`.
§ 7 — bullets & lists
The canonical unordered marker is - (hyphen + single space). All other forms convert:
| Input form | Output |
|---|---|
| \t∙\t (tab + U+2219 + tab) | - |
| * or * | - |
| \* (escaped asterisk) | - |
| + | - |
The \t∙\t form is the most severe — Unicode U+2219 (BULLET OPERATOR) is not a list marker in any Markdown spec, so every list in transcript_001 currently renders as a paragraph.
- R-LS-02 — ordered lists use `N.`. Tab-prefixed forms (`\t1.\t`) convert to standard.
- R-LS-03 — nested items indent two spaces per level relative to the parent's text start.
- R-LS-04 — continuation lines align with the first character of the item text.
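The marker conversions in the table above can be approximated per line as follows. This is a sketch, not the spec's implementation: `normalize_bullet` is a hypothetical name, the guard that skips `* * *` style rule lines is an added assumption so that horizontal rules are not mistaken for bullets, and the input is assumed to be a line outside any code fence.

```python
import re

# tab + U+2219 (BULLET OPERATOR) + tab, as found in transcript_001.
UNICODE_BULLET = re.compile(r"^\t\u2219\t(.*)$")
# Escaped asterisk, bare asterisk, or plus, followed by whitespace.
ASCII_BULLET = re.compile(r"^( *)(?:\\\*|[*+])[ \t]+(.*)$")
# Lines such as "* * *" or "***" are rules, not list items (assumed guard).
RULE_LIKE = re.compile(r"[ \t]*([*_-])([ \t]*\1){2,}[ \t]*")


def normalize_bullet(line):
    """Rewrite one list line to the canonical '- ' marker, or return it unchanged."""
    if RULE_LIKE.fullmatch(line):
        return line
    match = UNICODE_BULLET.match(line)
    if match:
        return "- " + match.group(1)
    match = ASCII_BULLET.match(line)
    if match:
        return f"{match.group(1)}- {match.group(2)}"
    return line
```

Requiring whitespace after the marker keeps bold text such as `**note**` at the start of a line from being read as a bullet.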
Escapes & code fences
§ 8 — escaped Markdown
The export tool that produced the Gemini-lineage files serialized Markdown syntax characters as literals. Un-escaping is required outside code fences:
| Escaped | Canonical | Condition |
|---|---|---|
| \#, \##, \### | #, ##, ### | line start (heading) |
| \*\*text\*\* | **text** | bold context |
| \*text\* | *text* | italic context |
| \* at line start | - | list context (R-LS-01) |
| \\> · \> | > | line start (blockquote) |
| \--- | --- | alone on line (rule) |
| \[text\]\(url\) | [text](url) | link context |
| \_text\_ | _text_ | isolated italic only |
`re.compile(r"(?<![_\\])\\_([^_]+?)\\_(?!_)")`

Must not match dunder-like forms such as `\_\_future\_\_`. If boundary-safe matching cannot be guaranteed, underscore un-escaping is optional and may be skipped entirely.
- R-EX-02 — backslash escapes inside code fences are preserved verbatim.
- R-EX-03 — escaped ASCII separator lines are not simply un-escaped — they are moved inside a fence per R-CF-02.
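To make the underscore boundary condition concrete, the sketch below applies the pattern quoted in this section to an isolated italic and to a dunder-like form. The wrapper function name is hypothetical, and the caller is assumed to pass only text that lies outside code fences (R-EX-02).

```python
import re

# Pattern quoted in the section above.
UNDERSCORE_ITALIC = re.compile(r"(?<![_\\])\\_([^_]+?)\\_(?!_)")


def unescape_underscore_italics(line):
    """Turn \\_text\\_ into _text_ while leaving \\_\\_dunder\\_\\_ forms alone."""
    return UNDERSCORE_ITALIC.sub(r"_\1_", line)


if __name__ == "__main__":
    print(unescape_underscore_italics(r"an \_isolated\_ word"))
    print(unescape_underscore_italics(r"from \_\_future\_\_ import x"))
```

The dunder form survives because the captured group may not contain an underscore and the lookbehind rejects a match that starts right after one.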
§ 11 — code fences & terminal output
- R-CF-01 — terminal/shell output, log lines, verification reports, JSON artifacts, and file manifests are enclosed in triple-backtick fences.
- R-CF-02 — ASCII separator lines (10+ repeated `=`, `-`, `#`, `*`, or `~`) outside fences are moved inside a fence along with associated report content.
- R-CF-03 — triple backticks are canonical. Indented (4-space / tab) blocks convert to fenced.
- R-CF-05 — exactly one blank line before and after every fence.
R-CF-04 — language tags
| Content | Tag |
|---|---|
| terminal output, logs, reports | text |
| Python source | python |
| JSON | json |
| YAML | yaml |
| shell commands | bash |
| unknown | (omit tag) |
R-CF-06 — deduplication is exact-match only
The same WITNESS PROTOCOL VERIFICATION REPORT appears in transcript_121 both as raw floating text in ## Prompt: sections and as a properly fenced block in ## Response:. Removing duplicates is permitted, but only under exact identity:
- compare normalized fence payload only (exclude delimiters and language tags; preserve all bytes except line-ending normalization)
- compute a canonical content hash (SHA-256 recommended) for both candidates
- remove the floating block only when hashes are exactly equal
- scope to adjacent turn context (`Prompt`/`Response` pair or directly neighboring speaker turns); cross-document or non-adjacent dedup is prohibited
- near-match or similarity-based dedup is prohibited
When exact-match criteria are not met, both blocks are preserved and the case is flagged for review.
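The exact-identity test above reduces to comparing two digests. A minimal sketch, assuming the caller has already stripped fence delimiters and language tags and has already established that the two blocks sit in adjacent turns; both function names are hypothetical.

```python
import hashlib


def payload_hash(payload: str) -> str:
    """SHA-256 of a block payload after line-ending normalization only."""
    normalized = payload.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_exact_duplicate(floating_block: str, fenced_payload: str) -> bool:
    """True only when both payloads are byte-identical apart from line endings."""
    return payload_hash(floating_block) == payload_hash(fenced_payload)
```

Any difference, including a single trailing space, yields a different hash, so the floating block stays in place and the case goes to review.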
R-CF-07 — stateful parser, not regex split
Code-fence boundary detection MUST use a stateful line parser:
- scan line-by-line with explicit state (`OUTSIDE`, `INSIDE`)
- fence delimiter MUST begin at column 1 in default mode (no leading whitespace)
- opening fence may include a language tag; closing fence MUST NOT
- closing fence canonical form is exactly three backticks (optionally followed by spaces or tabs only)
- separator promotion (R-CF-02) runs only while state is `OUTSIDE`
- parser MUST prevent fence-boundary concatenation artifacts (adjacent close/open boundaries misread as a single six-backtick fence)
- regex-only `_outside_fences` partitioning is non-compliant
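A stateful line parser along these lines might look like the sketch below. It is illustrative only: `State`, `label_lines`, and the set of characters accepted in a language tag are assumptions, and delimiter lines are labelled `INSIDE` as a simplification so that later passes skip them.

```python
import re
from enum import Enum


class State(Enum):
    OUTSIDE = "OUTSIDE"
    INSIDE = "INSIDE"


# Column-1 delimiters only; the tag character set is an assumption.
OPEN_FENCE = re.compile(r"^```([A-Za-z0-9_+-]*)[ \t]*$")
CLOSE_FENCE = re.compile(r"^```[ \t]*$")


def label_lines(lines):
    """Return (state, line) pairs; fence delimiter lines are labelled INSIDE."""
    state = State.OUTSIDE
    labelled = []
    for line in lines:
        if state is State.OUTSIDE:
            if OPEN_FENCE.match(line):
                state = State.INSIDE
                labelled.append((State.INSIDE, line))
            else:
                labelled.append((State.OUTSIDE, line))
        else:
            labelled.append((State.INSIDE, line))
            # A closing fence carries no language tag, so a tagged
            # delimiter seen while INSIDE is treated as content.
            if CLOSE_FENCE.match(line):
                state = State.OUTSIDE
    return labelled
```

Because each line is classified on its own, a closing fence and the opening fence that follows it are never merged into one delimiter, and a line of six backticks matches neither pattern.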
Rules, blockquotes, tables
§ 9 — horizontal rules
Canonical form is exactly --- (three hyphens, alone on the line). All variants normalize: * * *, ***, ___, \---, - - - → ---.
- R-HR-02 — exactly one blank line before and after every rule.
- R-HR-03 — rules must not be used as turn separators. Any `---` immediately before a speaker tag (with only optional blank lines between) is removed; the tag is sufficient demarcation.
- R-HR-04 [PERMITTED] — rules within a turn separate thematic sections only when no heading is present, and use should be sparse.
One sample contains 28 occurrences of `---` across 650 lines: 2 are legitimate frontmatter delimiters; the other 26 are body rules conflating turn boundaries with thematic boundaries. R-HR-03 is the rule that resolves this.
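Variant normalization and the R-HR-03 removal can be sketched together. The function name is hypothetical, and the code assumes it receives body lines only (frontmatter already split off, fenced content excluded); otherwise the closing frontmatter delimiter of a file without an h1 would be mistaken for a turn separator.

```python
import re

# "* * *", "***", "___", "\---", "- - -" and similar; optional leading backslash.
RULE_VARIANT = re.compile(r"^\\?([*_-])(?:[ \t]*\1){2,}[ \t]*$")
SPEAKER_TAG = re.compile(r"^\*\*[A-Z_]+:\*\*")


def normalize_rules(body_lines):
    """Canonicalize rules to '---' and drop rules used as turn separators."""
    out = []
    for i, line in enumerate(body_lines):
        if not RULE_VARIANT.match(line):
            out.append(line)
            continue
        # R-HR-03: look past blank lines for a speaker tag.
        j = i + 1
        while j < len(body_lines) and body_lines[j].strip() == "":
            j += 1
        if j < len(body_lines) and SPEAKER_TAG.match(body_lines[j]):
            continue  # the tag alone marks the turn boundary
        out.append("---")
    return out
```

Dropping a rule can leave two blank lines in a row; the sketch relies on the later blank-line pass to collapse them.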
§ 10 — blockquotes
- R-BQ-01 — canonical marker is `> ` (greater-than + single space).
- R-BQ-02 — escaped forms (`\\>`, `\>`) un-escape to `>` per R-EX-01.
- R-BQ-03 — file-attachment lines like `> File: name.md` are valid and preserved.
- R-BQ-04 — quoted speech from other sessions or external agents uses blockquote with optional source reference.
- R-BQ-05 — nested blockquotes use `>>`; nesting beyond two levels is not permitted.
- R-BQ-06 — exactly one blank line before and after every blockquote.
§ 12 / 13 — emphasis & tables
- R-BE-01 — bold is `**text**`.
- R-BE-02 — italic is `*text*`; underscore form (`_text_`) is reserved for code contexts.
- R-BE-03 — speaker tags carry no additional formatting. No nested bold-italic, no surrounding hyphens.
- R-TB-01 / R-TB-02 — GFM pipe tables with header row and hyphen separator. One blank line before and after.
Lexical corruption
§ 14. These are wounds in the underlying text — not formatting issues — but they present as formatting anomalies because they cause sentence-initial truncations a parser may misidentify as markup.
R-LC-01 [FLAGGED] — annotate, do not correct
| Observed fragment | Annotation |
|---|---|
| yoUser: | [CORPUS_CORRUPTION: substitution]yoUser: |
| laude, at line start | [CORPUS_CORRUPTION: truncation]laude, |
| ritical at line start | [CORPUS_CORRUPTION: truncation]ritical |
| onditional at line start | [CORPUS_CORRUPTION: truncation]onditional |
| pload at line start | [CORPUS_CORRUPTION: truncation]pload |
| ploaded at line start | [CORPUS_CORRUPTION: truncation]ploaded |
`re.compile(r"(?<!\])\byoUser:")`

The matcher MUST NOT fire when `yoUser:` is already preceded by a closing annotation bracket; e.g. `[CORPUS_CORRUPTION: substitution]yoUser:` is invariant under re-application.
R-LC-02 — zero-width Unicode characters are stripped silently and completely with no annotation.
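The annotate-only behavior of R-LC-01 and its invariance under re-application can be checked with a short sketch. The `annotate` helper and the truncation pattern are assumptions built from the observed fragments in the table; only the `yoUser:` pattern is taken from the text above.

```python
import re

# Pattern quoted above: skips occurrences already preceded by "]".
YOUSER = re.compile(r"(?<!\])\byoUser:")
# Observed line-start truncations from the table (assumed to be the full set).
TRUNCATED = re.compile(r"^(?:laude,|(?:ritical|onditional|ploaded|pload)\b)")


def annotate(line):
    """Prefix known corruption fragments with an annotation; never repair them."""
    line = YOUSER.sub("[CORPUS_CORRUPTION: substitution]yoUser:", line)
    if TRUNCATED.match(line):
        line = "[CORPUS_CORRUPTION: truncation]" + line
    return line


if __name__ == "__main__":
    once = annotate("laude, we are running yoUser: checks")
    print(once)
    print(annotate(once) == once)
```

Once a truncation is annotated the line starts with `[`, so the line-start pattern cannot match a second time.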
Application order
Rule interactions force a precise sequence. The normalizer is single-pass and precedence-ordered. Implementations MUST NOT run "loop until stable" reprocessing across the full rule set — idempotency is achieved by rule design and ordering, not by iterative convergence.
| # | Pass | Why this position |
|---|---|---|
| 1 | R-ENC | strip BOM, normalize line endings, strip zero-width chars first |
| 2 | R-FM | frontmatter validation and exporter residue removal |
| 3 | R-CF | fence floating terminal output before any other transform touches it |
| 4 | R-EX | un-escape Markdown — runs after CF so fenced content is protected |
| 5 | R-TS | turn structure: replace ## Prompt: / ## Response:, drop empty turns |
| 6 | R-LS | list syntax — runs after EX so escaped bullets are already raw |
| 7 | R-HR | horizontal rule normalization |
| 8 | R-BQ | blockquote normalization |
| 9 | R-HD | heading normalization |
| 10 | R-BL | blank-line discipline runs as the final structural pass over the transformed document |
| 11 | R-LC | annotation pass for lexical corruption — read-only, no structural changes |
| 12 | R-ENC-02/03 | final encoding cleanup (trailing newline, trailing whitespace) |
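One way to enforce the table's ordering is a fixed list of pass names applied exactly once each. The sketch below is an assumption about structure rather than the normalizer's real code: `PASS_ORDER`, `run_once`, `is_idempotent`, and the registry of callables are all hypothetical names.

```python
# Pass names in the order given by the table above.
PASS_ORDER = ["R-ENC", "R-FM", "R-CF", "R-EX", "R-TS", "R-LS",
              "R-HR", "R-BQ", "R-HD", "R-BL", "R-LC", "R-ENC-02/03"]


def run_once(text, registry):
    """Apply each registered pass exactly once, in table order (no looping)."""
    missing = [name for name in PASS_ORDER if name not in registry]
    if missing:
        raise KeyError(f"no implementation registered for: {missing}")
    for name in PASS_ORDER:
        text = registry[name](text)
    return text


def is_idempotent(text, registry):
    """Test helper: a second application must leave the output unchanged."""
    once = run_once(text, registry)
    return run_once(once, registry) == once
```

A check like `is_idempotent` belongs in the test suite only. It verifies that rule design and ordering deliver idempotency without turning the normalizer itself into a loop-until-stable process.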
Deviation severity quick-reference
| Code | Files | Severity |
|---|---|---|
| DEV-FM-02 · exporter residue | 44/137 | High |
| DEV-TS-01 · variant bifurcation | ~93/137 | Critical |
| DEV-TS-02 · label truncation | all samples | High |
| DEV-LS-01 · Unicode/escaped bullets | 001, 110 | Critical |
| DEV-EX · over-escaping | 110, 121 | Critical |
| DEV-CF-01 · floating terminal output | 121 | Critical |
| DEV-LC-01 · yoUser substitution | 001 | flag only |
Schema · sessions / turns
The normalized corpus lands in corpus_v1_5_2.db — the canonical relational substrate for session, turn, cycle, and event-marker analytics. Contract authority lives in integration/contracts/shared_contracts.py.
10.1 · sessions
| Column | Type | Notes |
|---|---|---|
| session_id | TEXT PK | session identifier |
| primary_model | TEXT | model identity |
| file_path | TEXT | ingest source path |
| source_filename | TEXT | source filename |
| session_order | INTEGER | deterministic ingest order |
| session_type | TEXT DEFAULT 'relay' | session class |
| original_date | TEXT | optional date metadata |
| protocol_version | TEXT | optional protocol field |
| turn_count | INTEGER | classification metric |
| inversion_ratio | REAL | classification metric |
| regime_tier | INTEGER | classification metric |
10.2 · turns
| Column | Type | Notes |
|---|---|---|
| turn_id | INTEGER PK AUTOINC | global turn id |
| session_id | TEXT | FK → sessions |
| turn_index | INTEGER | ordinal within session |
| speaker_canonical | TEXT | canonical speaker tag |
| raw_text | TEXT | turn text payload |
| timestamp_iso | TEXT | optional |
| recipient_model | TEXT | optional routing |
| sender_model | TEXT | optional routing |
Constraints & indexes:
- `UNIQUE(session_id, turn_index) ON CONFLICT IGNORE`
- `INDEX idx_turn_session ON turns(session_id)`
- `INDEX idx_speaker ON turns(speaker_canonical)`
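For orientation, the `sessions` and `turns` tables above can be written as SQLite DDL and exercised in memory. This is a sketch: the `REFERENCES` clause is an assumed reading of "FK → sessions", `build_demo_db` and its sample rows are invented for illustration, and the real schema lives in the contract files rather than here.

```python
import sqlite3

DDL = """
CREATE TABLE sessions (
    session_id       TEXT PRIMARY KEY,
    primary_model    TEXT,
    file_path        TEXT,
    source_filename  TEXT,
    session_order    INTEGER,
    session_type     TEXT DEFAULT 'relay',
    original_date    TEXT,
    protocol_version TEXT,
    turn_count       INTEGER,
    inversion_ratio  REAL,
    regime_tier      INTEGER
);
CREATE TABLE turns (
    turn_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id        TEXT REFERENCES sessions(session_id),
    turn_index        INTEGER,
    speaker_canonical TEXT,
    raw_text          TEXT,
    timestamp_iso     TEXT,
    recipient_model   TEXT,
    sender_model      TEXT,
    UNIQUE (session_id, turn_index) ON CONFLICT IGNORE
);
CREATE INDEX idx_turn_session ON turns(session_id);
CREATE INDEX idx_speaker ON turns(speaker_canonical);
"""


def build_demo_db():
    """Create an in-memory database and insert illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute(
        "INSERT INTO sessions (session_id, primary_model) VALUES ('s001', 'claude')")
    rows = [
        ("s001", 0, "HUMAN_RELAY", "first"),
        ("s001", 0, "CLAUDE", "duplicate position, silently ignored"),
        ("s001", 1, "CLAUDE", "second"),
    ]
    conn.executemany(
        "INSERT INTO turns (session_id, turn_index, speaker_canonical, raw_text) "
        "VALUES (?, ?, ?, ?)", rows)
    return conn
```

`ON CONFLICT IGNORE` means a re-ingested turn at an existing position is skipped without an error, so ingest scripts need their own count checks to notice dropped rows.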
10.3 · cycles
| Column | Type | Notes |
|---|---|---|
| cycle_id | INTEGER PK AUTOINC | cycle id |
| session_id | TEXT | FK → sessions |
| prompt_turn_id | INTEGER | FK → turns |
| response_turn_id | INTEGER | FK → turns |
| status | TEXT | optional cycle state |
10.4 · event_markers
| Column | Type | Notes |
|---|---|---|
| marker_id | INTEGER PK AUTOINC | marker id |
| turn_id | INTEGER NOT NULL | FK → turns |
| marker_type | TEXT NOT NULL | marker class |
| snippet | TEXT NOT NULL | local context |
Constraint: UNIQUE(turn_id, marker_type).
Registry & vectors
file_registry.db
File-level semantic and review registry, with turn-review side tables. Primary tables:
- `files` — semantic labeling, review status, annotation metadata
- `reviewed_turns` — turn-level review overlay (does not mutate the corpus DB)
- `study_turns` — study-flagged turn overlay
corpus_vectors.db
Vector search substrate, vec0-backed. Primary virtual table:
- `transcript_chunks` — 768-dimensional embeddings via `vec0`
Shadow and index tables are implementation-specific and must not be mutated by hand. Treat the DB as managed by embedding/index tooling only.
Shared contracts & compatibility
- contract source of truth — `integration/contracts/shared_contracts.py`
- validation source of truth — `integration/contracts/validators.py`
Compatibility enforcement covers speaker ontology constraints, turn / session / registry required fields, cross-domain path namespace constraints, and the DB compatibility matrix. Run unified checks:
    ORCHESTRATED=1 make unified-check
    python3 -m integration.orchestrate_unified
Validation & safety
Validation queries
    -- Duplicate turn positions per session (should be zero)
    SELECT session_id, turn_index, COUNT(*)
    FROM turns
    GROUP BY session_id, turn_index
    HAVING COUNT(*) > 1;

    -- Sessions without turns
    SELECT s.session_id
    FROM sessions s
    LEFT JOIN turns t ON t.session_id = s.session_id
    GROUP BY s.session_id
    HAVING COUNT(t.turn_id) = 0;

    -- Event markers without matching turns
    SELECT em.marker_id
    FROM event_markers em
    LEFT JOIN turns t ON t.turn_id = em.turn_id
    WHERE t.turn_id IS NULL;
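The three queries can be run from Python with the standard `sqlite3` module. The sketch below is a convenience wrapper under assumptions: `CHECKS` and `run_checks` are hypothetical names, and the connection is expected to point at a database with the `sessions`, `turns`, and `event_markers` tables described earlier.

```python
import sqlite3

CHECKS = {
    "duplicate_turn_positions": """
        SELECT session_id, turn_index, COUNT(*)
        FROM turns
        GROUP BY session_id, turn_index
        HAVING COUNT(*) > 1
    """,
    "sessions_without_turns": """
        SELECT s.session_id
        FROM sessions s
        LEFT JOIN turns t ON t.session_id = s.session_id
        GROUP BY s.session_id
        HAVING COUNT(t.turn_id) = 0
    """,
    "orphan_event_markers": """
        SELECT em.marker_id
        FROM event_markers em
        LEFT JOIN turns t ON t.turn_id = em.turn_id
        WHERE t.turn_id IS NULL
    """,
}


def run_checks(conn: sqlite3.Connection) -> dict:
    """Run each read-only validation query and return its offending rows."""
    return {name: conn.execute(sql).fetchall() for name, sql in CHECKS.items()}
```

An empty list for a check means it passed. The queries only read, so the wrapper can be pointed at a backup or read-only copy of the database.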
Current non-blocking warnings
As of the 2026-03-23 unified report:
- `corpus_v1.db` present but deprecated (read-only)
- `turns.json` includes 39 rows with `speaker=UNKNOWN`; resolve before ingestion into the canonical corpus
DB safety gates
Any mutation workflow must use integration/db_safety.py:
- default `DBSafetyContext(dry_run=True)` for non-mutating validation
- `backup()` creates timestamped backups (`.backup_YYYYMMDDTHHMMSSZ.db`) before mutation
- `row_counts()` / `row_count_diff()` with `assert_counts_non_decreasing()` for count guards
- `integrity_check()` / `quick_check()` before and after writes
- `DBSafetyContext.safe_mutate(path)` as the recommended mutation wrapper
Strict warning gate for operational runs:
    python3 -m integration.orchestrate_unified --strict-warnings
    ORCHESTRATED=1 make unified-check-strict
Orchestrator data checks are adapter-delegated — ReviewAdapter.check_data_health() and CorpusReviewAdapter.check_data_health() — with boundary enforcement and unified reporting layered on top.