
Corpus spec & schema.

137 transcript files, two distinct file lineages, one canonical Markdown target. This document pairs the pre-normalization formatting specification with the v1 SQLite schema that holds the result. The formatting rules say what the bytes must look like before ingest; the schema says what shape they take once they land.

Status: Stable
Doc: REF · 07 / 12
Source: CORPUS_FORMATTING_SPEC_v1.md · corpus_schema_v1.md
Spec version: 1.0
Primary DB: corpus_v1_5_2.db

I.

Two lineages, one target

Diagnostic samples — transcript_001, transcript_071, transcript_110, transcript_121 — reveal two distinct structural lineages in the 137-file corpus. Both must converge on a single canonical form before ingest.

| Lineage | Range | Markers | Primary corruption source |
| --- | --- | --- | --- |
| Early / Claude | 001 – ~070 | speaker-tag-only · no **Exported:** · no h1 · \t∙\t bullets · paragraph collapse | initial ingestion + Tier 0 recovery |
| Later / Gemini | ~090 – 137 | ## Prompt: / ## Response: wrapper · **Exported:** on line 9 · h1 title · escaped Markdown · floating terminal output | export tool from source chat application |

single canonical target
The normalizer must collapse both lineages into one Markdown form without silently destroying lexical evidence. Wounds in the underlying text (truncations, yoUser: substitutions, zero-width characters) are preserved or stripped under explicit rules — never edited away.
II.

Encoding & frontmatter

§ 1 — file encoding & terminus

  • R-ENC-01 — UTF-8 with LF-only line endings. No CRLF. No BOM (U+FEFF).
  • R-ENC-02 — exactly one trailing newline after the last content line.
  • R-ENC-03 — no trailing whitespace on any line.
  • R-ENC-04 — strip zero-width characters (U+200B, U+200C, U+200D, U+FEFF, U+00AD) from all positions.
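
The four encoding rules can be sketched as one small function. This is a minimal illustration, not the project's normalizer: the function name `normalize_encoding` and the choice to operate on raw bytes are assumptions made for the example.

```python
import re

# Code points listed in R-ENC-04; U+FEFF doubles as the BOM from R-ENC-01.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff\u00ad"))


def normalize_encoding(raw: bytes) -> str:
    """Apply R-ENC-01 through R-ENC-04 to the bytes of one file (hypothetical helper)."""
    text = raw.decode("utf-8")                                # R-ENC-01: UTF-8 only
    text = text.replace("\r\n", "\n").replace("\r", "\n")     # R-ENC-01: LF-only
    text = text.translate(ZERO_WIDTH)                         # R-ENC-04, also drops the BOM
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)   # R-ENC-03
    return text.rstrip("\n") + "\n"                           # R-ENC-02


print(repr(normalize_encoding(b"\xef\xbb\xbfa \r\nb\xe2\x80\x8b\r\n\r\n")))
```

Decoding strictly as UTF-8 means a file in another encoding raises an error instead of being silently repaired, which seems closer to the spec's intent than a lossy fallback.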

§ 2 — YAML frontmatter

Every file opens with a frontmatter block. Opening --- on line 1, column 1, no preceding bytes. Three required fields, in order, all string values double-quoted:

---
session_id: "[value]"
primary_model: "[value]"
normalization_level: [integer]
---
  • R-FM-02 — closing --- immediately follows the last field. No blank lines inside the block.
  • R-FM-05 — exactly one blank line follows the closing delimiter before body content.
  • R-FM-06 — the **Exported:** [date] exporter residue (44 of 137 files) is removed. If the date must be preserved, it moves into frontmatter as exported_date: "[date]" after normalization_level.
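
A frontmatter block that satisfies these rules could be rendered as follows. This is a sketch under stated assumptions: `render_frontmatter` is a hypothetical helper, the sample values are invented, and `json.dumps` is used only as a convenient way to produce double-quoted strings.

```python
import json


def render_frontmatter(session_id, primary_model, normalization_level,
                       exported_date=None):
    """Return the frontmatter block plus the single blank line of R-FM-05."""
    lines = [
        "---",
        f"session_id: {json.dumps(session_id)}",
        f"primary_model: {json.dumps(primary_model)}",
        f"normalization_level: {int(normalization_level)}",  # integer, unquoted
    ]
    if exported_date is not None:
        # R-FM-06: a preserved export date goes after normalization_level.
        lines.append(f"exported_date: {json.dumps(exported_date)}")
    lines += ["---", ""]  # closing delimiter, then one blank line
    return "\n".join(lines) + "\n"


print(render_frontmatter("transcript_110", "gemini", 1))
```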
III.

Title & turn structure

§ 3 — document title

  • R-DT-01 — h1 (# Title) is permitted only as the first line of body content. No other h1 may appear.
  • R-DT-02 — files without an h1 do not have one added by the normalizer.
  • R-DT-03 — when h1 is present, exactly one blank line separates it from the first turn.

§ 4 — speaker tags & turn structure

Canonical speaker-tag form is **SPEAKER_NAME:**. Permitted speakers in the formatting spec: HUMAN_RELAY, CLAUDE, GEMINI, CHATGPT, NOTEBOOKLM. All-caps. Trailing colon mandatory.

  • R-TS-02 / R-TS-03 — exactly one blank line before and after every speaker tag (with the obvious exception that the very first tag has no preceding blank line beyond the frontmatter / h1 separator).
  • R-TS-05 — empty speaker turns (consecutive tags with no intervening content) are removed entirely.

R-TS-04 — replacing the ## Prompt: / ## Response: wrapper

The wrapper is exporter residue and must be replaced by the appropriate speaker tag:

| Input | Replacement |
| --- | --- |
| ## Prompt: + embedded **SPEAKER:** | remove ## Prompt:; keep the embedded tag as the sole turn marker |
| ## Prompt: with no embedded tag | replace with **HUMAN_RELAY:** |
| ## Response: | replace with the speaker tag corresponding to primary_model (e.g. "gemini" → **GEMINI:**) |
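
The three replacement cases might be implemented along these lines. The sketch assumes each wrapper sits alone on its line, that the speaker tag for a model is simply its upper-cased `primary_model` value, and that fenced content has already been protected. None of these assumptions is stated by the spec, and `replace_wrappers` is a hypothetical name.

```python
import re

# Canonical speaker-tag form at the start of a line, e.g. **CLAUDE:**
TAG = re.compile(r"^\*\*[A-Z_]+:\*\*")


def replace_wrappers(lines, primary_model):
    """Apply the R-TS-04 replacement cases to a list of lines."""
    out = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == "## Prompt:":
            # Look ahead to the next non-blank line for an embedded tag.
            nxt = next((l for l in lines[i + 1:] if l.strip()), "")
            if TAG.match(nxt):
                continue  # the embedded tag stays as the sole turn marker
            out.append("**HUMAN_RELAY:**")
        elif stripped == "## Response:":
            out.append(f"**{primary_model.upper()}:**")  # assumed mapping
        else:
            out.append(line)
    return out
```

Dropping a wrapper can leave a stray blank line behind; the later blank-line pass is expected to clean that up.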

R-TS-06 [FLAGGED] — truncation artifacts

Tier 0 recovery introduced sentence-initial truncations. The normalizer must not correct them. Each instance is annotated inline with [CORPUS_CORRUPTION: truncation] immediately before the fragment:

[CORPUS_CORRUPTION: truncation]laude, we are running...
IV.

Blank-line discipline

§ 5 of the spec. The rules below govern every blank line in a normalized file.

  • R-BL-01 — never more than one consecutive blank line. All sequences of two or more collapse to one.
  • R-BL-02 — exactly one blank line between the last line of a speaker's content and the next speaker tag.
  • R-BL-03 — exactly one blank line separates adjacent paragraphs within a turn.
  • R-BL-04 — list items use tight style (no inter-item blank lines) unless an item contains a sub-list, code fence, or blockquote. Loose lists are converted to tight.
  • R-BL-05 [FLAGGED] — suspected paragraph collapse is flagged for human review. The normalizer never auto-splits prose by semantic heuristic.
why no auto-split
transcript_001 contains severe paragraph collapse (lines 14–65 fuse what should be six or seven blocks). The temptation to detect . followed by a sentence-like clause and insert a break is real, but heuristic splitting silently rewrites prose. The spec hands this to humans on purpose.
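
Of these rules, only the mechanical collapse in R-BL-01 lends itself to a one-line transform. A minimal sketch, assuming whitespace-only lines count as blank (the helper name is hypothetical, and no attempt is made at the flagged R-BL-05 case):

```python
import re


def collapse_blank_lines(text: str) -> str:
    """R-BL-01: any run of two or more blank lines becomes exactly one."""
    return re.sub(r"\n(?:[ \t]*\n){2,}", "\n\n", text)


print(repr(collapse_blank_lines("a\n\n\n\nb\n\nc")))
```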
V.

Headings & lists

§ 6 — heading hierarchy

  • R-HD-01 — h1 reserved for document title only. Never inside a turn.
  • R-HD-02 — h2 is the highest heading permitted within a turn.
  • R-HD-03 — h3 for subsections; h4 maximum depth.
  • R-HD-04 — exactly one blank line before and after every heading.
  • R-HD-05 — heading markers must not be escaped. \### un-escapes to ###.

§ 7 — bullets & lists

The canonical unordered marker is - (hyphen + single space). All other forms convert:

| Input form | Output |
| --- | --- |
| \t∙\t (tab + U+2219 + tab) | - |
| *  or  * | - |
| \* (escaped asterisk) | - |
| + | - |

The \t∙\t form is the most severe — Unicode U+2219 (BULLET OPERATOR) is not a list marker in any Markdown spec, so every list in transcript_001 currently renders as a paragraph.

  • R-LS-02 — ordered lists use the form N. (number, period, single space). Tab-prefixed forms (\t1.\t) convert to standard.
  • R-LS-03 — nested items indent two spaces per level relative to the parent's text start.
  • R-LS-04 — continuation lines align with the first character of the item text.
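
The marker conversions above can be sketched per line. The patterns and the name `normalize_list_marker` are assumptions for illustration. The guard against rule-like lines such as `* * *` is also an addition, included because list syntax runs before horizontal-rule normalization in the application order. Re-indenting nested items (R-LS-03) is not attempted.

```python
import re

# Lines such as "* * *" or "***" are horizontal rules, not list items.
RULE_LINE = re.compile(r"^\s*([*_-])(?:\s*\1){2,}\s*$")
# Tab + U+2219 + tab, an escaped asterisk, or a bare * / + marker.
UNORDERED = re.compile(r"^([ ]*)(?:\t\u2219\t|\\\*[ \t]+|[*+][ \t]+)")
# Ordered markers, optionally tab-prefixed: "\t1.\t" or "1.  ".
ORDERED = re.compile(r"^([ ]*)\t?(\d+)\.[ \t]+")


def normalize_list_marker(line: str) -> str:
    """Rewrite one line's list marker to '- ' or 'N. ' (hypothetical helper)."""
    if RULE_LINE.match(line):
        return line
    line = UNORDERED.sub(r"\1- ", line, count=1)
    return ORDERED.sub(r"\1\2. ", line, count=1)


for sample in ["\t\u2219\tfirst", "\\* second", "+ third", "\t1.\tfourth", "* * *"]:
    print(repr(normalize_list_marker(sample)))
```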
VI.

Escapes & code fences

§ 8 — escaped Markdown

The export tool that produced the Gemini-lineage files serialized Markdown syntax characters as literals. Un-escaping is required outside code fences:

| Escaped | Canonical | Condition |
| --- | --- | --- |
| \#, \##, \### | #, ##, ### | line start (heading) |
| \*\*text\*\* | **text** | bold context |
| \*text\* | *text* | italic context |
| \* at line start | - | list context (R-LS-01) |
| \\> · \> | > | line start (blockquote) |
| \--- | --- | alone on line (rule) |
| \[text\]\(url\) | [text](url) | link context |
| \_text\_ | _text_ | isolated italic only |
underscore matcher · normative
The un-escape matcher MUST be boundary-safe and single-pass idempotent. Canonical pattern:
re.compile(r"(?<![_\\])\\_([^_]+?)\\_(?!_)")
Must not match dunder-like forms such as \_\_future\_\_. If boundary-safe matching cannot be guaranteed, underscore un-escaping is optional and may be skipped entirely.
  • R-EX-02 — backslash escapes inside code fences are preserved verbatim.
  • R-EX-03 — escaped ASCII separator lines are not simply un-escaped — they are moved inside a fence per R-CF-02.
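
The canonical underscore pattern can be exercised directly. The sketch below uses the pattern exactly as given; the function name and the replacement string `_\1_` are assumptions for the example.

```python
import re

# Canonical pattern from the spec: boundary-safe on both sides.
UNDERSCORE = re.compile(r"(?<![_\\])\\_([^_]+?)\\_(?!_)")


def unescape_underscores(text: str) -> str:
    """Un-escape isolated \\_text\\_ spans; leave dunder-like forms alone."""
    return UNDERSCORE.sub(r"_\1_", text)


once = unescape_underscores(r"an \_isolated\_ word, not \_\_future\_\_")
print(once)
print(unescape_underscores(once) == once)  # a second application changes nothing
```

Because the output of a successful match contains no `\_`, applying the function a second time finds nothing left to rewrite.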

§ 11 — code fences & terminal output

  • R-CF-01 — terminal/shell output, log lines, verification reports, JSON artifacts, and file manifests are enclosed in triple-backtick fences.
  • R-CF-02 — ASCII separator lines (10+ repeated =, -, #, *, or ~) outside fences are moved inside a fence along with associated report content.
  • R-CF-03 — triple backticks are canonical. Indented (4-space / tab) blocks convert to fenced.
  • R-CF-05 — exactly one blank line before and after every fence.

R-CF-04 — language tags

| Content | Tag |
| --- | --- |
| terminal output, logs, reports | text |
| Python source | python |
| JSON | json |
| YAML | yaml |
| shell commands | bash |
| unknown | (omit tag) |

R-CF-06 — deduplication is exact-match only

The same WITNESS PROTOCOL VERIFICATION REPORT appears in transcript_121 both as raw floating text in ## Prompt: sections and as a properly fenced block in ## Response:. Removing duplicates is permitted, but only under exact identity:

  • compare normalized fence payload only (exclude delimiters and language tags; preserve all bytes except line-ending normalization)
  • compute a canonical content hash (SHA-256 recommended) for both candidates
  • remove the floating block only when hashes are exactly equal
  • scope to adjacent turn context (Prompt/Response pair or directly neighboring speaker turns) — cross-document or non-adjacent dedup is prohibited
  • near-match or similarity-based dedup is prohibited

When exact-match criteria are not met, both blocks are preserved and the case is flagged for review.
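
The exact-identity test can be sketched with the standard library. Function names and the `adjacent` flag are hypothetical; deciding whether two blocks are in adjacent turn context is left to the caller.

```python
import hashlib


def payload_hash(payload: str) -> str:
    """SHA-256 of the payload after line-ending normalization only."""
    normalized = payload.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def may_remove_floating(floating: str, fenced_payload: str, adjacent: bool) -> bool:
    """True only when both R-CF-06 conditions hold: adjacency and equal hashes."""
    if not adjacent:
        return False  # cross-document or non-adjacent dedup is prohibited
    return payload_hash(floating) == payload_hash(fenced_payload)


print(may_remove_floating("REPORT\r\nok\r\n", "REPORT\nok\n", adjacent=True))
print(may_remove_floating("REPORT\nok \n", "REPORT\nok\n", adjacent=True))
```

A single differing byte, such as the trailing space in the second call, keeps both blocks in place, so the case falls through to review as the rule requires.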

R-CF-07 — stateful parser, not regex split

Code-fence boundary detection MUST use a stateful line parser:

  • scan line-by-line with explicit state (OUTSIDE, INSIDE)
  • fence delimiter MUST begin at column 1 in default mode (no leading whitespace)
  • opening fence may include a language tag; closing fence MUST NOT
  • closing fence canonical form is exactly three backticks (optionally followed by spaces or tabs only)
  • separator promotion (R-CF-02) runs only while state is OUTSIDE
  • parser MUST prevent fence-boundary concatenation artifacts (adjacent close/open boundaries misread as a single six-backtick fence)
  • regex-only _outside_fences partitioning is non-compliant
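
A stateful line scanner in the spirit of these requirements might look like the following. It is a sketch, not the compliant implementation: the `FENCE` label for delimiter lines and the set of characters accepted in a language tag are assumptions added for the example.

```python
import re

# Delimiters must start at column 1; only an opening fence may carry a tag.
OPEN = re.compile(r"^```([A-Za-z0-9_+-]*)[ \t]*$")
CLOSE = re.compile(r"^```[ \t]*$")


def classify_lines(lines):
    """Yield (label, line) pairs; label is OUTSIDE, INSIDE, or FENCE."""
    state = "OUTSIDE"
    for line in lines:
        if state == "OUTSIDE":
            if OPEN.match(line):
                state = "INSIDE"
                yield "FENCE", line
            else:
                yield "OUTSIDE", line  # separator promotion may run here
        elif CLOSE.match(line):
            state = "OUTSIDE"
            yield "FENCE", line
        else:
            yield "INSIDE", line       # preserved verbatim


sample = ["text", "```python", "x = 1", "  ```", "```", "```", "=====", "```"]
for label, line in classify_lines(sample):
    print(label, repr(line))
```

Because each line is classified against the current state, a closing fence directly followed by an opening fence is seen as two delimiters and is never read as one six-backtick run.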
VII.

Rules, blockquotes, tables

§ 9 — horizontal rules

Canonical form is exactly --- (three hyphens, alone on the line). All variants normalize: * * *, ***, ___, \---, - - ----.

  • R-HR-02 — exactly one blank line before and after every rule.
  • R-HR-03 — rules must not be used as turn separators. Any --- immediately before a speaker tag (with only optional blank lines between) is removed; the tag is sufficient demarcation.
  • R-HR-04 [PERMITTED] — rules within a turn separate thematic sections only when no heading is present, and use should be sparse.
transcript_071
28 occurrences of --- across 650 lines: 2 are legitimate frontmatter delimiters; the other 26 are body rules conflating turn boundaries with thematic boundaries. R-HR-03 is the rule that resolves this.
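
R-HR-03 can be sketched as a filter that leaves the frontmatter delimiters alone. The helper name is hypothetical, and the sketch assumes the frontmatter was already validated and that fenced content is handled elsewhere.

```python
import re

SPEAKER_TAG = re.compile(r"^\*\*[A-Z_]+:\*\*")


def drop_turn_separator_rules(lines):
    """Remove any body --- whose next non-blank line is a speaker tag."""
    start = 0
    if lines and lines[0] == "---":
        # Skip the frontmatter block; assumes a closing delimiter exists.
        start = lines.index("---", 1) + 1
    out = list(lines[:start])
    for i in range(start, len(lines)):
        if lines[i].strip() == "---":
            nxt = next((l for l in lines[i + 1:] if l.strip()), "")
            if SPEAKER_TAG.match(nxt):
                continue  # the tag is sufficient demarcation
        out.append(lines[i])
    return out
```

Removing a rule can leave two blank lines in a row, which the blank-line pass later in the sequence collapses.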

§ 10 — blockquotes

  • R-BQ-01 — canonical marker is > (greater-than + single space).
  • R-BQ-02 — escaped forms (\\>, \>) un-escape to > per R-EX-01.
  • R-BQ-03 — file-attachment lines like > File: name.md are valid and preserved.
  • R-BQ-04 — quoted speech from other sessions or external agents uses blockquote with optional source reference.
  • R-BQ-05 — nested blockquotes use >> ; nesting beyond two levels is not permitted.
  • R-BQ-06 — exactly one blank line before and after every blockquote.

§ 12 / 13 — emphasis & tables

  • R-BE-01 — bold is **text**.
  • R-BE-02 — italic is *text*; underscore form (_text_) is reserved for code contexts.
  • R-BE-03 — speaker tags carry no additional formatting. No nested bold-italic, no surrounding hyphens.
  • R-TB-01 / R-TB-02 — GFM pipe tables with header row and hyphen separator. One blank line before and after.
VIII.

Lexical corruption

§ 14. These are wounds in the underlying text — not formatting issues — but they present as formatting anomalies because they cause sentence-initial truncations a parser may misidentify as markup.

R-LC-01 [FLAGGED] — annotate, do not correct

| Observed fragment | Annotation |
| --- | --- |
| yoUser: | [CORPUS_CORRUPTION: substitution]yoUser: |
| laude, at line start | [CORPUS_CORRUPTION: truncation]laude, |
| ritical at line start | [CORPUS_CORRUPTION: truncation]ritical |
| onditional at line start | [CORPUS_CORRUPTION: truncation]onditional |
| pload at line start | [CORPUS_CORRUPTION: truncation]pload |
| ploaded at line start | [CORPUS_CORRUPTION: truncation]ploaded |
yoUser annotation · normative
Annotation MUST be single-pass idempotent. Canonical matcher:
re.compile(r"(?<!\])\byoUser:")
The matcher MUST NOT fire when yoUser: is already preceded by a closing annotation bracket — e.g. [CORPUS_CORRUPTION: substitution]yoUser: is invariant under re-application.
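
An annotation pass built on the canonical yoUser matcher could look like this. Only the `YOUSER` pattern comes from the spec; the `TRUNCATED` pattern for the line-start fragments is an assumption written for this sketch, as is the function name.

```python
import re

# Canonical matcher from the spec.
YOUSER = re.compile(r"(?<!\])\byoUser:")
# Hypothetical matcher for the observed line-start truncations.
TRUNCATED = re.compile(
    r"^(?=laude,|(?:ritical|onditional|ploaded|pload)\b)", re.MULTILINE
)


def annotate_corruption(text: str) -> str:
    """Insert annotations without altering the damaged text itself."""
    text = YOUSER.sub("[CORPUS_CORRUPTION: substitution]yoUser:", text)
    return TRUNCATED.sub("[CORPUS_CORRUPTION: truncation]", text)


sample = "laude, we are running\nas yoUser: said\npload the file"
once = annotate_corruption(sample)
print(once)
print(annotate_corruption(once) == once)
```

Once a line is annotated it begins with `[`, so the line-start lookahead no longer matches, and the lookbehind in the canonical matcher blocks a second substitution annotation.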

R-LC-02 — zero-width Unicode characters are stripped silently and completely with no annotation.

IX.

Application order

Rule interactions force a precise sequence. The normalizer is single-pass and precedence-ordered: each pass listed below runs exactly once, in the order given. Implementations MUST NOT run "loop until stable" reprocessing across the full rule set; idempotency is achieved by rule design and ordering, not by iterative convergence.

| # | Pass | Why this position |
| --- | --- | --- |
| 1 | R-ENC | strip BOM, normalize line endings, strip zero-width chars first |
| 2 | R-FM | frontmatter validation and exporter residue removal |
| 3 | R-CF | fence floating terminal output before any other transform touches it |
| 4 | R-EX | un-escape Markdown; runs after CF so fenced content is protected |
| 5 | R-TS | turn structure: replace ## Prompt: / ## Response:, drop empty turns |
| 6 | R-LS | list syntax; runs after EX so escaped bullets are already raw |
| 7 | R-HR | horizontal rule normalization |
| 8 | R-BQ | blockquote normalization |
| 9 | R-HD | heading normalization |
| 10 | R-BL | blank-line discipline runs as the last structural pass over the transformed document |
| 11 | R-LC | annotation pass for lexical corruption; read-only, no structural changes |
| 12 | R-ENC-02/03 | final encoding cleanup (trailing newline, trailing whitespace) |
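
The ordering constraint can be expressed as a fixed list that a driver walks exactly once. This is a sketch of the control flow only: the `passes` mapping and its callables are hypothetical stand-ins for real rule implementations.

```python
from typing import Callable, Dict, List

# The twelve passes, in the order the spec prescribes.
PASS_ORDER: List[str] = [
    "R-ENC", "R-FM", "R-CF", "R-EX", "R-TS", "R-LS",
    "R-HR", "R-BQ", "R-HD", "R-BL", "R-LC", "R-ENC-02/03",
]


def normalize(text: str, passes: Dict[str, Callable[[str], str]]) -> str:
    """Run every pass once, in precedence order, with no convergence loop."""
    missing = [name for name in PASS_ORDER if name not in passes]
    if missing:
        raise KeyError(f"missing passes: {missing}")
    for name in PASS_ORDER:
        text = passes[name](text)
    return text


# Usage with placeholder passes that only record their own name.
trace = {name: (lambda t, n=name: t + n + ";") for name in PASS_ORDER}
print(normalize("", trace))
```

Keeping the order in data makes it easy to assert in tests that no pass was reordered or run twice.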

Deviation severity quick-reference

| Code | Files | Severity |
| --- | --- | --- |
| DEV-FM-02 · exporter residue | 44/137 | High |
| DEV-TS-01 · variant bifurcation | ~93/137 | Critical |
| DEV-TS-02 · label truncation | all samples | High |
| DEV-LS-01 · Unicode/escaped bullets | 001, 110 | Critical |
| DEV-EX · over-escaping | 110, 121 | Critical |
| DEV-CF-01 · floating terminal output | 121 | Critical |
| DEV-LC-01 · yoUser substitution | 001 | flag only |
X.

Schema · sessions / turns

The normalized corpus lands in corpus_v1_5_2.db — the canonical relational substrate for session, turn, cycle, and event-marker analytics. Contract authority lives in integration/contracts/shared_contracts.py.

10.1 · sessions

| Column | Type | Notes |
| --- | --- | --- |
| session_id | TEXT PK | session identifier |
| primary_model | TEXT | model identity |
| file_path | TEXT | ingest source path |
| source_filename | TEXT | source filename |
| session_order | INTEGER | deterministic ingest order |
| session_type | TEXT DEFAULT 'relay' | session class |
| original_date | TEXT | optional date metadata |
| protocol_version | TEXT | optional protocol field |
| turn_count | INTEGER | classification metric |
| inversion_ratio | REAL | classification metric |
| regime_tier | INTEGER | classification metric |

10.2 · turns

| Column | Type | Notes |
| --- | --- | --- |
| turn_id | INTEGER PK AUTOINC | global turn id |
| session_id | TEXT | FK → sessions |
| turn_index | INTEGER | ordinal within session |
| speaker_canonical | TEXT | canonical speaker tag |
| raw_text | TEXT | turn text payload |
| timestamp_iso | TEXT | optional |
| recipient_model | TEXT | optional routing |
| sender_model | TEXT | optional routing |

Constraints & indexes:

UNIQUE(session_id, turn_index) ON CONFLICT IGNORE
INDEX idx_turn_session   ON turns(session_id)
INDEX idx_speaker        ON turns(speaker_canonical)

10.3 · cycles

| Column | Type | Notes |
| --- | --- | --- |
| cycle_id | INTEGER PK AUTOINC | cycle id |
| session_id | TEXT | FK → sessions |
| prompt_turn_id | INTEGER | FK → turns |
| response_turn_id | INTEGER | FK → turns |
| status | TEXT | optional cycle state |

10.4 · event_markers

| Column | Type | Notes |
| --- | --- | --- |
| marker_id | INTEGER PK AUTOINC | marker id |
| turn_id | INTEGER NOT NULL | FK → turns |
| marker_type | TEXT NOT NULL | marker class |
| snippet | TEXT NOT NULL | local context |

Constraint: UNIQUE(turn_id, marker_type).
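
The four tables can be reconstructed as DDL from the column listings above. The statement text below is an assumption, not the shipped schema: it renders "FK →" as `REFERENCES` clauses, which SQLite enforces only when `PRAGMA foreign_keys = ON` is set. The demonstration uses an in-memory database rather than corpus_v1_5_2.db.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sessions (
    session_id       TEXT PRIMARY KEY,
    primary_model    TEXT,
    file_path        TEXT,
    source_filename  TEXT,
    session_order    INTEGER,
    session_type     TEXT DEFAULT 'relay',
    original_date    TEXT,
    protocol_version TEXT,
    turn_count       INTEGER,
    inversion_ratio  REAL,
    regime_tier      INTEGER
);
CREATE TABLE turns (
    turn_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id        TEXT REFERENCES sessions(session_id),
    turn_index        INTEGER,
    speaker_canonical TEXT,
    raw_text          TEXT,
    timestamp_iso     TEXT,
    recipient_model   TEXT,
    sender_model      TEXT,
    UNIQUE (session_id, turn_index) ON CONFLICT IGNORE
);
CREATE INDEX idx_turn_session ON turns(session_id);
CREATE INDEX idx_speaker ON turns(speaker_canonical);
CREATE TABLE cycles (
    cycle_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id       TEXT REFERENCES sessions(session_id),
    prompt_turn_id   INTEGER REFERENCES turns(turn_id),
    response_turn_id INTEGER REFERENCES turns(turn_id),
    status           TEXT
);
CREATE TABLE event_markers (
    marker_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    turn_id     INTEGER NOT NULL REFERENCES turns(turn_id),
    marker_type TEXT NOT NULL,
    snippet     TEXT NOT NULL,
    UNIQUE (turn_id, marker_type)
);
"""


def create_corpus_schema(conn: sqlite3.Connection) -> None:
    """Create the v1 tables and indexes on an empty database."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_corpus_schema(conn)
conn.execute("INSERT INTO sessions (session_id) VALUES ('s1')")
for text in ("first", "duplicate position"):
    conn.execute(
        "INSERT INTO turns (session_id, turn_index, raw_text) VALUES ('s1', 0, ?)",
        (text,),
    )
print(conn.execute("SELECT COUNT(*) FROM turns").fetchone()[0])
```

The second insert targets an occupied (session_id, turn_index) position and is silently skipped by `ON CONFLICT IGNORE`, so re-running an ingest does not duplicate turns but also does not report the collision.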

XI.

Registry & vectors

file_registry.db

File-level semantic and review registry, with turn-review side tables. Primary tables:

  • files — semantic labeling, review status, annotation metadata
  • reviewed_turns — turn-level review overlay (does not mutate the corpus DB)
  • study_turns — study-flagged turn overlay

corpus_vectors.db

Vector search substrate, vec0-backed. Primary virtual table:

  • transcript_chunks — 768-dimensional embeddings via vec0

Shadow and index tables are implementation-specific and must not be mutated by hand. Treat the DB as managed by embedding/index tooling only.

Shared contracts & compatibility

  • contract source of truth — integration/contracts/shared_contracts.py
  • validation source of truth — integration/contracts/validators.py

Compatibility enforcement covers speaker ontology constraints, turn / session / registry required fields, cross-domain path namespace constraints, and the DB compatibility matrix. Run unified checks:

ORCHESTRATED=1 make unified-check
python3 -m integration.orchestrate_unified
XII.

Validation & safety

Validation queries

-- Duplicate turn positions per session (should be zero)
SELECT session_id, turn_index, COUNT(*)
FROM turns
GROUP BY session_id, turn_index
HAVING COUNT(*) > 1;

-- Sessions without turns
SELECT s.session_id
FROM sessions s
LEFT JOIN turns t ON t.session_id = s.session_id
GROUP BY s.session_id
HAVING COUNT(t.turn_id) = 0;

-- Event markers without matching turns
SELECT em.marker_id
FROM event_markers em
LEFT JOIN turns t ON t.turn_id = em.turn_id
WHERE t.turn_id IS NULL;
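
The three queries can be run together from Python. This is a small sketch: the `CHECKS` mapping, its key names, and `run_checks` are hypothetical, and the caller supplies the connection. An empty list for every key means the database passed.

```python
import sqlite3

CHECKS = {
    "duplicate_turn_positions": """
        SELECT session_id, turn_index, COUNT(*) FROM turns
        GROUP BY session_id, turn_index HAVING COUNT(*) > 1""",
    "sessions_without_turns": """
        SELECT s.session_id FROM sessions s
        LEFT JOIN turns t ON t.session_id = s.session_id
        GROUP BY s.session_id HAVING COUNT(t.turn_id) = 0""",
    "orphan_event_markers": """
        SELECT em.marker_id FROM event_markers em
        LEFT JOIN turns t ON t.turn_id = em.turn_id
        WHERE t.turn_id IS NULL""",
}


def run_checks(conn: sqlite3.Connection) -> dict:
    """Return the offending rows for each validation query."""
    return {name: conn.execute(sql).fetchall() for name, sql in CHECKS.items()}
```

Opening the database read-only, for example with `sqlite3.connect("file:corpus_v1_5_2.db?mode=ro", uri=True)`, keeps the checks from touching the file.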

Current non-blocking warnings

As of the 2026-03-23 unified report:

  • corpus_v1.db present but deprecated (read-only)
  • turns.json includes 39 rows with speaker=UNKNOWN; resolve before ingestion into the canonical corpus

DB safety gates

Any mutation workflow must use integration/db_safety.py:

  • default DBSafetyContext(dry_run=True) for non-mutating validation
  • backup() creates timestamped backups (.backup_YYYYMMDDTHHMMSSZ.db) before mutation
  • row_counts() / row_count_diff() with assert_counts_non_decreasing() for count guards
  • integrity_check() / quick_check() before and after writes
  • DBSafetyContext.safe_mutate(path) as the recommended mutation wrapper

Strict warning gate for operational runs:

python3 -m integration.orchestrate_unified --strict-warnings
ORCHESTRATED=1 make unified-check-strict

Orchestrator data checks are adapter-delegated — ReviewAdapter.check_data_health() and CorpusReviewAdapter.check_data_health() — with boundary enforcement and unified reporting layered on top.

Corpus pipeline: raw .md → normalize (12-pass) → corpus_v1_5_2.db + file_registry.db + corpus_vectors.db · unversioned read = HALT
event → enforce(event) → invariant → PASS | HALT
doc · 07 · build 2026-04-25 · meetLab · 2026