Two contractors who have never worked with Seby before. A foundation customer in exec search where confidentiality is the product. A Microsoft stack Seeda has deliberately not built MCPs for. This page is the contract between Michael, William, and Khoa. Read it once, then deliver against it.
Two things before you read anything else. Nothing gets built until the existing Seeda stack passes a fresh security, resilience, and scalability audit by William and Khoa (Section 00). And testing plus iteration is the work, not a phase that happens at the end. Every artefact assumes three to five loops to reach done. That is budgeted and expected (Section 06).
Customer · Huddle Talent (Cliff Wilson)
Rate · A$150 / hr
Owner · Michael Kingston
Build · William Nguyen + Khoa Do
Speed · TBC by Cliff (default Run)
Kill switch · End of Week 1
This is the stack running daily at Seeda right now. Huddle's build is an adaptation of this anatomy onto Cliff's Microsoft tenancy. Before you touch a line of code, you will know every engine, every worker, every input, every output.
00 · AUDIT BEFORE BUILD
Nothing ships until the foundations are checked.
Michael has built the entire Seeda automation stack as a general business operator, not as a trained engineer, and without professional external advice. The architecture works in production today, but it has never been audited by qualified outside eyes. Huddle is a paying customer in a confidentiality-critical industry. We do not build on top of an unverified foundation.
AUDIT 01 · Security
Map every secret in flight: where it lives, who can read it, how it rotates, audit trail on access
  1Password vaults, .env files, GH Actions secrets, CF tokens, Firebase service accounts
Verify the 5-tier privacy boundary actually holds: penetration test cross-tier reads against the Seeda stack
  Khoa runs the test, documents every finding
Identify any path where outbound (Slack, Gmail, Twilio, Airwallex, Xero) can fire without the send-word gate
  If found: P0, fix before Huddle build starts
Authorisation model review: Firebase auth, Cloudflare Access, App Check, repo-boundary check
  Are they correctly composed? Any bypass paths?
Data-at-rest review: which memory files contain L0/L1 content, where they sync, who has filesystem access
  dotclaude-memory, Google Drive transcripts, ~/seeda-private/
AUDIT 02 · Resilience
Failure-mode map for all 40+ scheduled tasks: what happens if each one silently breaks for 3 days?
  Charlie autonomy, Fathom pipeline, MRR truth, payroll preview
Backup and recovery: can the system be restored if dotclaude-memory, seeda-finance, seeda-ops are lost?
  What is the RTO and RPO? Are they documented?
Dependency graph: which automations break if a single MCP, LaunchAgent, or API goes down
  PostHog, Xero, Chargebee, Airwallex, Slack, Gmail, Calendar
Hook collision risk: do any of the pre-commit, PreToolUse, or Stop hooks contradict each other under load?
  Re-run the May 8 self-bite incident in a sandbox
Monitoring and alerting: who finds out, how fast, when something is broken?
  Is there a single dashboard or are failures invisible?
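The "silently breaks for 3 days" question in the resilience list can be turned into a mechanical check. The following is a minimal sketch, assuming each scheduled task records a last-success timestamp somewhere readable. The task names, the `last_success` mapping, and the per-task `max_age` limits are hypothetical illustrations, not the Seeda stack's actual data.

```python
from datetime import datetime, timedelta, timezone

def stale_tasks(last_success, max_age, now):
    """Return names of tasks whose last success is older than allowed.

    A task with no recorded success at all is treated as stale.
    """
    stale = []
    for name, limit in max_age.items():
        seen = last_success.get(name)
        if seen is None or now - seen > limit:
            stale.append(name)
    return sorted(stale)

# Hypothetical heartbeat data, for illustration only.
now = datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc)
last_success = {
    "fathom-pipeline": now - timedelta(hours=6),
    "mrr-truth": now - timedelta(days=3, hours=1),
}
max_age = {
    "fathom-pipeline": timedelta(days=1),
    "mrr-truth": timedelta(days=1),
    "payroll-preview": timedelta(days=7),
}
print(stale_tasks(last_success, max_age, now))
```

A task that has never reported success is flagged alongside one that stopped reporting, which is the invisible-failure case the monitoring question asks about.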
AUDIT 03 · Scalability
What breaks at 10 customers? At 50? At 200? Identify the first three bottlenecks
  Likely candidates: memory file size, MCP context, scheduled-task collisions
Multi-tenancy posture: can the Huddle stack be cleanly isolated from Seeda or future customers?
  Per-venture working dirs work today; do they scale?
Cost model at 10x usage: Claude credits, CF Pages, Firebase, MCP-host fees
  Project current ~A$1.2k/mo to 10x. Are the unit economics OK?
Code-quality review: identify the highest-risk skills and rules that need refactor before extension
  Not aesthetic. Risk-prioritised.
Documentation gap analysis: what must a new engineer understand before contributing?
  The audit itself produces this list
Deliverable: a single audit report at team.seby.com.au/audit/seeda-stack-2026-05/ with severity-ranked findings, owner, and fix-or-accept decision per item. Michael reviews and accepts each finding. Anything tagged P0 blocks Huddle Pillar 1 ingest. Anything P1 must have a remediation plan before that pillar ships. Audit budget: ~20 hours William, ~30 hours Khoa, ~5 hours Michael across Week 0 (the week before Huddle kickoff).
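The P0 and P1 rules above amount to two gates, which can be sketched as follows. This is an illustration only: the `Finding` fields and the example findings are hypothetical, and treating "resolved" as "fixed, or accepted in writing by Michael" is an assumption about how the fix-or-accept decision would be recorded.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    severity: str                       # "P0", "P1", ... as ranked in the report
    resolved: bool = False              # assumption: fixed, or accepted in writing
    has_remediation_plan: bool = False

def can_start_ingest(findings):
    """Pillar 1 ingest is blocked while any P0 finding is still open."""
    return not any(f.severity == "P0" and not f.resolved for f in findings)

def can_ship_pillar1(findings):
    """Shipping also needs a remediation plan for every open P1 finding."""
    open_p1_without_plan = any(
        f.severity == "P1" and not f.resolved and not f.has_remediation_plan
        for f in findings
    )
    return can_start_ingest(findings) and not open_p1_without_plan

# Hypothetical findings, for illustration only.
findings = [
    Finding("outbound path skips gate", "P0", resolved=True),
    Finding("no documented RTO", "P1"),
]
print(can_start_ingest(findings), can_ship_pillar1(findings))
```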
01 · PILLAR × SPEED
Three pillars, three speeds, one matrix.
Cliff picks the speed. The pillars and the order are fixed. Brain, then CRM, then Boardroom.
Walk · 13 wks
Run · 5 wks DEFAULT
Sprint · 3 wks
Pillar 01 · The Brain
Wks 1–5
Outlook + Teams + OneDrive ingest. 5 privacy tiers. Audit log.
Infra: GitHub Actions, Cloudflare Pages deploy, secrets via 1Password
Cliff · Director · Huddle Talent
Picks the speed. Confirms the pillar order. Names the first data sources.
30-minute weekly steer call (Friday Sydney time)
Signs off the 5 privacy tier definitions in Week 1
Final acceptance on each pillar against the criteria in Section 07
03 · TIMELINE
Six weeks. Audit first. Build second.
Default Run speed. Week 0 is the audit (Section 00) and is non-negotiable. Weeks 1 to 5 are the Huddle build. If Cliff picks Walk or Sprint, the build bars stretch or compress, the audit week stays.
Wk 0 · Audit
Wk 1
Wk 2
Wk 3
Wk 4
Wk 5
Audit · security: Khoa lead · ~12 hrs
Audit · resilience: Khoa + William · ~18 hrs
Audit · scalability: William lead · ~15 hrs
Audit report + sign-off: Michael · ~5 hrs
Audit-driven fixes: P0 only · variable
Audit log + gates: Khoa · ~20 hrs
MS Graph ingest: Khoa · ~25 hrs
Prospect pages: William · ~22 hrs
Voice-memo ingest: William · ~15 hrs
Eval harness + iteration: Khoa · ~20 hrs across all weeks
Boardroom pillar: Khoa + William · ~20 hrs
Michael QA + Cliff call: Every week
Legend: Michael (QA, weekly call, Boardroom) · William (operator surface) · Khoa (engineering substrate)
04 · WEEK 1 · DAILY
Five days. Kill-switch at the end.
If the Day 5 demo does not show Cliff something that materially saves him time, we stop and rescope. The retainer-free promise dies if this week drifts. No real Huddle client data touches the system this week. Sanitised Barrenjoey fixtures only.
DAY 1 · MON
Kickoff + tier sign-off
Cliff sign-off on the 5 privacy tier definitions. Michael walks William and Khoa through the venture plan, the brand kit, and the Cliff-only memory boundary. First three data sources named (default: Outlook, Teams chat, OneDrive design folder).
Owner: Michael · Cliff, William, Khoa on call
DAY 2 · TUE
Audit log + send-word gate live
Khoa ships the audit log, send-word gate, and repo-boundary check against a synthetic fixture. Nothing reads or writes to a real Huddle source until this is green. Michael QAs against the Seeda equivalents.
Owner: Khoa · Michael QA gate
DAY 3 · WED
MS Graph read-only against Cliff's tenant
First real read. Outlook + Teams chat scoped to Cliff-only tier. No write paths. No customer data leaves Cliff's tenancy. Khoa pairs with Michael for the Azure AD app registration.
Owner: Khoa · Michael paired
DAY 4 · THU
Barrenjoey prospect page · old vs new
William restructures the existing Barrenjoey deck into the new prospect-page pattern, side by side with the old. Tech stack, market map, buyer map, blockers, next action. Sanitised inputs only.
Owner: William · Michael QA review
DAY 5 · FRI
Demo + go/no-go
Live demo to Cliff. Cliff runs three queries the brain has to answer correctly. Decision: continue to Week 2 (Run pace), stop and rescope, or pause. Weekly timesheet + invoice to Cliff. Definition-of-done checklist signed off for Pillar 1 scaffolding.
Owner: Michael · Cliff sign-off
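Day 2's audit log and send-word gate can be exercised against a synthetic fixture before anything real is connected. The sketch below assumes the gate means "an outbound action fires only when an explicit approval word is supplied, and every attempt is logged". The function names, the `SEND-IT` word, and the log shape are hypothetical, not the Seeda implementation.

```python
from datetime import datetime, timezone

class SendWordMissing(Exception):
    """Raised when an outbound action is attempted without the send word."""

def gated_send(channel, payload, send_word, expected_word, sender, audit_log):
    """Log every attempt, then call `sender` only if the send word matches."""
    allowed = bool(expected_word) and send_word == expected_word
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "channel": channel,
        "allowed": allowed,
    })
    if not allowed:
        raise SendWordMissing(f"outbound to {channel} blocked: no valid send word")
    return sender(channel, payload)

# Synthetic fixture: a fake sender that records calls instead of sending.
sent, audit_log = [], []

def fake_sender(channel, payload):
    sent.append((channel, payload))
    return "sent"

try:
    gated_send("slack", "draft reply", None, "SEND-IT", fake_sender, audit_log)
except SendWordMissing:
    pass  # blocked, but the attempt is still in the audit log

gated_send("slack", "draft reply", "SEND-IT", "SEND-IT", fake_sender, audit_log)
print(len(sent), [entry["allowed"] for entry in audit_log])
```

Writing the log entry before the check means a blocked attempt still leaves a trace, which is what an audit-log spot check for gate bypasses would need.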
05 · DEFINITION OF DONE
Each pillar, three boxes.
What exists, what Cliff can do, what "broken" looks like. Tick all three or it is not done.
PILLAR 01 · The Brain
Outlook, Teams chat, Teams recordings, OneDrive all ingest into the brain with 5-tier classification on every record
  Verified by Khoa's eval harness, audit-logged
Cliff runs 10 test queries; brain returns ≥ 8 correct, 0 cross-tier leaks
  Test set agreed in Wk 1, frozen for the engagement
Brain refreshes itself nightly without manual intervention for 7 consecutive days
  LaunchAgent or equivalent on Cliff's machine
PILLAR 02 · The CRM
Barrenjoey, ASB, Westpac all live as prospect pages with tech stack, buyer map, deal value, forecast, next action
  One source of truth, fed from the brain
Pre-meeting brief drafted in ≤ 5 minutes from "meeting in 2 hours" prompt, ≤ 30% rejection rate over 10 briefs
  Tracked in eval harness
Post-meeting transcript folds back into the prospect page automatically; page updates within 30 minutes of the meeting ending
  Fathom or Teams recording, both supported
PILLAR 03 · The Boardroom
Every board and management meeting recorded, parsed, and actions extracted with assignee + due date
  Veronica is the first user; her weekly is the proof case
Next deck pre-built from prior deck + actions + new context; Cliff opens at ≥ 90% complete
  Measured: time-to-finished-deck before vs after
Cross-meeting action tracking: no action lost across two consecutive cycles
  Eval harness checks for dropped actions
06 · TESTING + ITERATION IS THE WORK
Three to five loops per artefact. Budgeted, not apologised for.
Building with Claude and AI is not a one-pass write. Every artefact, every skill, every prompt, every brief gets tested, found wanting, iterated, retested. Anyone who quotes a fixed timeline assuming first-pass success has either never shipped AI in production or is lying. We assume the opposite and budget for it.
WHAT · Every artefact, three loops minimum
Brief written, Claude produces v1, human reads, rejects or amends, v2 produced, retested
  Three loops is the floor, not the ceiling
30% first-pass rejection is the expected baseline, not a failure signal
  Sub-10% rejection means the eval is too easy. Above 50% rejection means the brief is too vague.
Every reject logged with the reason: hallucination, wrong tone, wrong tier, missing fact, format break
  The reasons become the eval suite for the next artefact
Regression suite runs nightly: every passed artefact must still pass after any infra change
  Khoa's eval harness owns this
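The rejection-rate bands above can be read as a simple classifier. This sketch assumes the "brief is too vague" signal is meant to trigger above 50% rejection, with roughly 30% as the healthy baseline. The function name and return labels are hypothetical.

```python
def rejection_signal(rejected, total):
    """Classify a first-pass rejection rate into the three bands."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = rejected / total
    if rate < 0.10:
        return "eval too easy"
    if rate > 0.50:
        return "brief too vague"
    return "expected range"

print(rejection_signal(3, 10))  # the 30% baseline
print(rejection_signal(0, 10))
print(rejection_signal(6, 10))
```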
WHO · Iteration stamina is a job requirement
William: same prompt, fifteen variants, still sharp on the sixteenth
  This is in the JD because this is the work
Khoa: builds the eval framework before he builds anything that needs evaluating
  His JMIR paper is literally on evaluation methodology. Use that muscle.
Michael: QA gate weekly, no rubber-stamping
  Anything Michael does not personally test is not signed off
Cliff: 30-minute steer call weekly, points us at the next thing to harden
  His rejections drive the next iteration loop
HOW · The eval-and-iterate loop
Eval set frozen at the start of each pillar. 10 queries minimum, drawn from real Huddle work
  No moving the goalposts mid-week
Every prompt change reruns the full eval set before merge
  If the eval set takes longer than 10 min, parallelise it
Failed evals categorised: data problem, prompt problem, model problem, eval problem
  Each category has a different remediation path
Demo-driven testing: weekly Loom shows Cliff what improved, not what was added
  Diff thinking, not feature thinking
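One way to rerun the frozen set in parallel and tally failures by category is sketched below. It is not Khoa's harness: the query shape, the exact-match comparison, and the stub answer function are all assumptions for illustration, and a real harness would likely need a looser notion of "correct".

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CATEGORIES = {"data", "prompt", "model", "eval"}

def run_eval_set(queries, answer_fn, workers=4):
    """Run every frozen query concurrently; return (id, passed) pairs in order."""
    def check(query):
        return query["id"], answer_fn(query["question"]) == query["expected"]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, queries))

def tally_failures(labelled):
    """Count human-labelled failures per remediation category."""
    unknown = set(labelled.values()) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return Counter(labelled.values())

# Hypothetical frozen set and a stub standing in for the brain.
frozen = [
    {"id": "q1", "question": "alpha?", "expected": "A"},
    {"id": "q2", "question": "beta?", "expected": "B"},
    {"id": "q3", "question": "gamma?", "expected": "C"},
]
stub_answers = {"alpha?": "A", "beta?": "wrong"}

results = run_eval_set(frozen, stub_answers.get)
print(results)
print(tally_failures({"q2": "prompt", "q3": "data"}))
```

`pool.map` returns results in input order, so the output lines up with the frozen set even though the queries run concurrently.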
If iteration time is not budgeted, the timeline lies. Across the 5-week Run pace, expect roughly 40% of total hours to be testing, eval-writing, regression, and re-prompting, not net-new build. That is normal. That is the work. If a contractor reports "done" without showing the iteration trail (the rejected v1s, the eval results, the regression diff), the work is not done.
07 · ACCEPTANCE CRITERIA
Numbers, not vibes.
Tied to a demo. If Khoa's eval harness cannot measure it, it is not a criterion.
| Pillar | Metric | Target | Measured by |
| --- | --- | --- | --- |
| 01 Brain | Correct answers across 10 frozen test queries | ≥ 8 / 10 | Cliff live demo · Wk 1 Fri |
| 01 Brain | Cross-tier leaks in 100 audit-log spot checks | 0 | Khoa eval harness · daily |
| 01 Brain | Hallucinated facts in 20 named-entity queries | 0 | Eval harness · pre-Wk 2 demo |
| 02 CRM | Pre-meeting brief generation time | ≤ 5 min | Stopwatch on 10 trials |
| 02 CRM | First-pass rejection rate by Cliff or Michael | ≤ 30% | Tracked in eval harness |
| 02 CRM | Transcript-to-page update latency | ≤ 30 min | Logged timestamps |
| 03 Board | Deck completeness at Cliff first open | ≥ 90% | Side-by-side vs prior cycle |
| 03 Board | Lost actions across two cycles | 0 | Veronica weekly review |
| All | Send-word gate bypasses | 0 | Audit log spot checks |
| All | Weekly demo recordings (Loom) | 1 per week | William delivery |
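The numeric targets in this section can be encoded so that a demo either passes or names what it missed. The thresholds below come from the criteria above. The metric keys and the `failed_criteria` helper are hypothetical names, and treating an unmeasured metric as a failure is an assumption in the spirit of "numbers, not vibes".

```python
import operator

# (comparison, target) per metric; keys are illustrative names only.
TARGETS = {
    "brain_correct_of_10": (operator.ge, 8),
    "cross_tier_leaks": (operator.eq, 0),
    "hallucinated_facts": (operator.eq, 0),
    "brief_minutes": (operator.le, 5),
    "first_pass_rejection_pct": (operator.le, 30),
    "transcript_latency_minutes": (operator.le, 30),
    "deck_completeness_pct": (operator.ge, 90),
    "lost_actions": (operator.eq, 0),
    "send_word_bypasses": (operator.eq, 0),
}

def failed_criteria(measured, names):
    """Return the named criteria that miss their target or were not measured."""
    failed = []
    for name in names:
        compare, target = TARGETS[name]
        value = measured.get(name)
        if value is None or not compare(value, target):
            failed.append(name)
    return failed

# Hypothetical Pillar 01 measurements, for illustration only.
pillar_01 = ["brain_correct_of_10", "cross_tier_leaks", "hallucinated_facts"]
measured = {"brain_correct_of_10": 8, "cross_tier_leaks": 0, "hallucinated_facts": 1}
print(failed_criteria(measured, pillar_01))
```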
08 · PRIVACY TIERS
The most important page in this doc.
An exec-search firm lives or dies on confidentiality. Mis-tier one record and the engagement is over.
T1 · Cliff-only · Most sensitive
Board pre-reads, candidate financials, deal economics, personal notes. Encrypted at rest, audit log on every read.
Sees: Michael Buckley · Veronica Byrne · James Casino
Writes: Cliff · Veronica
Customer-safe · Shareable
Public-facing decks, marketing collateral, anonymised case studies. The only tier safe to put in a customer email.
Sees: Anyone
Writes: Cliff · Veronica · Michael
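A tier with "sees" and "writes" lists maps naturally onto a small access check with an audit entry on every read. The sketch below is an illustration of that shape only. The `restricted` tier and its `owner` member are placeholders rather than the signed-off tier definitions, and the record layout is hypothetical.

```python
from dataclasses import dataclass

ANYONE = "*"

@dataclass(frozen=True)
class Tier:
    name: str
    sees: frozenset
    writes: frozenset

def can_write(tier, user):
    """Writes are never open to everyone; the user must be listed."""
    return user in tier.writes

def read_record(record, user, tiers, audit_log):
    """Return the body if `user` may see the record's tier; log either way."""
    tier = tiers[record["tier"]]
    allowed = ANYONE in tier.sees or user in tier.sees
    audit_log.append((user, record["id"], tier.name, allowed))
    if not allowed:
        raise PermissionError(f"{user} may not read {tier.name} records")
    return record["body"]

tiers = {
    # Placeholder tier, not a signed-off definition.
    "restricted": Tier("restricted", frozenset({"owner"}), frozenset({"owner"})),
    "customer-safe": Tier("customer-safe", frozenset({ANYONE}),
                          frozenset({"Cliff", "Veronica", "Michael"})),
}
audit_log = []
public = {"id": "r1", "tier": "customer-safe", "body": "anonymised case study"}
private = {"id": "r2", "tier": "restricted", "body": "board pre-read"}

print(read_record(public, "contractor", tiers, audit_log))
try:
    read_record(private, "contractor", tiers, audit_log)
except PermissionError as err:
    print("blocked:", err)
```

Because denied reads are logged as well as allowed ones, the audit log doubles as the evidence source for cross-tier spot checks.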
09 · RISK REGISTER
What can go wrong, and what we do about it.
First-project-with-unknown-contractors risks. Not generic risks.
Cross-tier data leak · High
Mitigation: Khoa ships audit log + send-word gate + repo-boundary check on Day 2, before any real ingest. Week 1 uses sanitised fixtures only. 100-record spot check daily.
Khoa is single point of failure on MS Graph · High
Mitigation: Khoa documents every MS Graph integration as he ships it. Michael shadow-pairs on the Azure AD app registration in Wk 1. If Khoa unavailable, Michael covers personally at A$150/hr.
William over-builds vs Cliff's actual workflow · Med
Mitigation: Cliff's voice memos drive the design. William turns Cliff's two-line ask into the brief, Michael QAs the brief before any build. Weekly Loom check.
Berlin · Sydney TZ latency stalls weekly demo · Med
Mitigation: Both contractors commit to a 2-hour overlap window on Tue + Thu, in the Berlin morning (Sydney late afternoon). Weekly demo recorded async Friday Sydney time.
Quality drift on first-pass output · Med
Mitigation: ≤ 30% first-pass rejection target. Anything Cliff rejects twice triggers a pause-and-rescope conversation with Michael, not a third attempt.
Scope creep into out-of-scope items · Low
Mitigation: Finance integration, Tranquil IT handover, deep voice training, team rollout are all explicitly out for the first 5 weeks. New scope = new proposal.
Premature MS Graph over-build · Low
Mitigation: Decision D-04 (venture plan) holds: only build MS-side integration that has a Cliff use case this week. No generic Outlook MCP.
Communication latency between contractors · Low
Mitigation: Shared Slack channel with Michael. Daily standup async in-channel by 09:00 Berlin. Blockers tagged to Michael directly.
Audit reveals P0 finding that blocks Wk 1 · High
Mitigation: Audit-driven fix budget reserved in Wk 1 timeline. If a P0 lands, Huddle build pauses, Cliff is informed in writing within 24 hrs, remediation plan in 48 hrs. No build over an unverified foundation.
Michael's lack of formal engineering training shows up as a defect · Med
Mitigation: Section 00 audit exists specifically to find these. Each finding ranked, owned, and either fixed or accepted in writing. The audit report is the artefact that converts unknown unknowns into known knowns.
Iteration time underestimated; week budgets blown · Med
Mitigation: 40% iteration overhead baked into all hour estimates (Section 06). If a sprint week needs more, contractor asks first, no silent overruns. Weekly timesheet shows iteration vs net-new hours separately.
Eval suite drifts or becomes too easy · Low
Mitigation: Khoa owns suite integrity. New evals added every week from Cliff's actual rejections. Suite frozen at start of each pillar to prevent goalpost-moving mid-week.
10 · STOP-IF CRITERIA
The lines we draw before we start.
Written now, not negotiated later. Hitting any of these stops the engagement until Michael and Cliff have rescoped together.
Week 1 demo fails the ≥ 8/10 brain query target. We stop, rescope, decide whether to continue at all.
Any cross-tier leak detected in audit log. Build pauses immediately. Root cause and fix before any further ingest. No exceptions.
Cliff has not logged into the brain by Day 10. Means it is not actually saving him time. We pause and ask why.
Send-word gate is bypassed once. Means the safety architecture is not load-bearing. Full audit before any further outbound.
Two consecutive weekly demos missed or skipped. Rhythm break, signals deeper problem. Michael resets directly with the contractor.
Either contractor takes on competing work that breaks the Sydney overlap window. Renegotiate hours or replace.
Section 00 audit is skipped, rushed, or its findings are ignored. The whole engagement rests on a verified foundation. No audit, no build.
Any artefact is shipped without a recorded iteration trail. Means the contractor is claiming one-pass success on AI output, which is either dishonest or unsafe. Rework required.
Eval suite is moved or relaxed mid-pillar to make a target pass. Honest red is better than fake green. Pause and rescope.
11 · WEEKLY RHYTHM
Same shape every week. No surprises.
Two contractors who do not know Seby yet need a predictable rhythm before they need flexibility.
MON
Plan
Michael posts the week's targets in Slack by 10:00 Berlin. Hours estimated per swimlane.
TUE
Overlap call
90 min · Sydney 17:00 · Berlin 09:00. Working session, not a status meeting.
WED
Mid-week QA
Michael reviews work-in-progress, flags anything drifting before Friday demo.
THU
Build day
No meetings. Both contractors heads-down. Async only.
FRI
Demo + invoice
Loom demo from William. Timesheet from both. Invoice to Cliff. Cliff steer call · 30 min.