Live project memory
Last updated 5/3/2026
DataZoom — Live Project Memory
Document ID: TECH-MEM-001
Version: Current
Last Synthesized From: Repository snapshot + commit log through 960f88c
Project Codename: Dataroom / DataZoom (midwestco/datazoom)
1. Project Identity & Current Phase
DataZoom is an enterprise AI-powered document analysis platform built for legal and business due diligence workflows. The product enables natural language querying over uploaded documents using a local-first RAG pipeline, with multi-tenant organization isolation via Clerk Auth. The codebase lives at midwestco/datazoom and ships as a Next.js 15 monorepo (product/) backed by Supabase (PostgreSQL + pgvector) and a Python-based ingest/embedding worker fleet.
Current lifecycle phase: Production-grade, post-alpha. The system has cleared alpha (#109 Refactor/alpha) and staging (#110 Staging) gates and is operating in a live environment. Active development is concentrated on infrastructure reliability (CI/CD pipeline fixes), two major feature tracks (cap-table and e-signature), and a newly launched cloud-worker GPU fleet on RunPod/Fly.
2. What Has Been Built
2.1 Core Platform (Stable)
| Area | Status | Key Files |
|---|---|---|
| Document ingest + chunking | Stable | docker/Dockerfile.worker, docker/docker-compose.module1-ports.yml |
| Vector embeddings (384-dim, all-MiniLM-L6-v2) | Stable | pgvector IVFFlat index, document_chunks.embedding VECTOR(384) |
| RAG retrieval + citation layer | Stable | product/lib/__tests__/rag-retrieval.test.ts, product/lib/__tests__/citation-system.test.ts |
| AI chat interface (/context) | Stable | product/app/(app)/context/ (8 components) |
| Document management (/documents) | Stable | product/app/(app)/documents/ |
| Timeline extraction + visualization | Stable | timeline_events table, idx_timeline_date index |
| Multi-tenant auth (Clerk) | Stable | product/app/api/clerk/proxy/route.ts |
| Analytics (Mixpanel) | Stable | product/app/api/ai/track-interaction/route.ts, archived tracking plan in docs/archive/MIXPANEL_TRACKING_PLAN.md |
| Due diligence checklists | Stable | product/lib/due-diligence/__tests__/, /api/business-types/[typeKey]/checklist |
| Advisor routes (risk memo, strategic options) | Stable | product/app/api/advisor/ (5 routes) |
| Activity feed + unified calendar | Complete | product/app/(app)/activity/, product/app/api/activity/unified/ (4 sub-routes) |
| Admin pipeline dashboard | Stable | product/app/(app)/admin/pipeline/page.tsx, product/app/api/admin/pipeline/ |
2.2 Features Shipped in Recent Sprints
E-Signature System (BR-ESG-001)
- Full action plan in docs/action_plans/e_sign/ (9 steps including database, Resend email, template designer, signature pad, signing portal, stamping engine)
- Status file: docs/action_plans/e_sign/STATUS.md
- Build fixes shipped in #107 Signature build fixes; initial signature PR #106 Abdullah/signatures closed in favor of refactored approach
- Conversion attribution tests in product/lib/signature/__tests__/conversion-attribution.test.ts
- PR #108 (open) extends template management with PDF preview and schema normalization
Cap Table System (BR-CAP-001)
- 14-step action plan fully documented: docs/cap-table/action_plans/ 01 through 14
- Schema, RLS, calculation engine, read APIs, UI shell, extraction pipeline, review queue (approve/reject), Modal endpoint, observability, backfill, E2E smoke, feature flag/rollback, manual transaction controls, and snapshot read-model optimization
- Test coverage: product/lib/cap-table/__tests__/ (calc, feature-gate, fixture-corpus, observability)
- API surface: 10 routes under /api/cap-table/ including extract, auto-populate, review/[id]/approve, review/[id]/reject, transactions/[id]/void
- Model router tested at product/lib/__tests__/model-router.test.ts
Cloud GPU Worker Fleet (TECH-GPU-001)
- RunPod + Fly.io orchestrator: fly/orchestrator/Dockerfile, docker/cloud-worker/Dockerfile, docker/cloud-worker/fly.toml
- Pod warmup status tracking (provisioning → warming → ready): commit 580b80b
- BullMQ job queue with Redis (Upstash TLS): b718932
- Admin UI for cloud pipeline: /api/admin/pipeline/cloud, /api/admin/routing-stats
Activity Tracking System (BR-ACT-001)
- Materialized views, activity logger library, per-event logging for upload/delete/processing/risk/strategic analysis
- Unified calendar API: product/app/api/activity/unified/ (calendar, day, feed, refresh)
- Full implementation in PR #104
Collaboration WebSocket Service
- product/collaboration-ws/Dockerfile: standalone WS service
- product/app/api/collaboration/token/route.ts: token issuance
- Shipped in #111 feat: realtime status, chat improvements, Cloudflare deployment
3. What Is In Progress
3.1 PR #101 — Upload Fixes (OPEN, BLOCKED)
- Status: Open, no merge. Long-standing. Upload pipeline has known issues.
- Impact: Affects document ingestion reliability. Unresolved while other features shipped around it.
- Risk: High. If upload is broken for some file types or sizes, the entire RAG pipeline is upstream-blocked.
3.2 PR #108 — Template Management Enhancement (OPEN)
- Scope: PDF preview in template designer + schema normalization for e-sign templates
- Relates to: docs/action_plans/e_sign/03_template_designer_ui.md
- Blocking: Production rollout of full e-sign flow
3.3 CI/CD Pipeline Stabilization (TECH-CI-001)
- Last 6 commits are exclusively CI fixes:
  - 960f88c: crane copy tag mismatch + retry logic for Fly registry
  - 07c91b9: cloud worker build moved to GitHub Actions, push to ghcr.io
  - a693949: retry on cloud worker image build
  - 2c3145d: docker login for crane auth (Fly + ghcr.io)
- The build pipeline for ghcr.io/midwestco/datazoom-base, datazoom-worker, datazoom-gpu was non-functional and is being repaired incrementally
- Workflow files: .github/workflows/build-images.yml, .github/workflows/knip.yml
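The retry behavior these CI fixes rely on can be illustrated with a minimal sketch. It is not the workflow's actual implementation (the source only says retry logic was added); the function name `retry`, the attempt count, and the backoff schedule are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on any exception with exponential backoff.

    Hypothetical helper: `operation` stands in for a flaky step such as an
    image copy to a registry. `sleep` is injectable so the helper can be
    exercised without real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")
```

Re-raising the final error rather than swallowing it keeps a persistently failing registry push visible as a red build.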
3.4 Business Type Profiles (BR-BTP-001)
- 8-step action plan in docs/action_plans/business_type_fix/
- API routes exist: /api/business-type-profiles/[profileKey], /api/business-types/[typeKey]/checklist
- Final integration and validation steps appear in progress
4. Active Decisions and Rationale
TECH-DEC-001 — Modal (Cloud) as Primary LLM, Ollama as Local Fallback
Decision: Qwen2.5:32B on Modal is the default production LLM. Ollama (ollama/ollama:latest, port 11434) is an optional local fallback configurable via OLLAMA_HOST.
Rationale: Modal provides serverless GPU scaling without managing hardware. Local Ollama path preserves the "100% local" privacy guarantee for self-hosted customers.
Status: Active. Model routing logic is tested in product/lib/__tests__/model-router.test.ts. The OLLAMA_NUM_PARALLEL: "4" and OLLAMA_MAX_LOADED_MODELS: "3" settings are tuned for concurrent chat sessions.
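The primary/fallback ordering in this decision can be sketched as below. This is an illustration only, not the logic covered by model-router.test.ts: the function name `choose_llm_backend`, the backend labels, and the `modal_healthy` flag are hypothetical. Only the Modal-first ordering and the OLLAMA_HOST variable come from the decision record.

```python
from typing import Mapping, Optional


def choose_llm_backend(env: Mapping[str, str], modal_healthy: bool) -> Optional[str]:
    """Return which backend to call, or None if nothing is usable.

    Hypothetical helper: Modal is preferred; Ollama is used only when
    Modal is unavailable and OLLAMA_HOST is configured.
    """
    if modal_healthy:
        return "modal"
    if env.get("OLLAMA_HOST"):
        return "ollama"
    return None
```

Passing the environment in as a mapping (for example `os.environ`) keeps the decision a pure function, which makes it straightforward to unit test.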
TECH-DEC-002 — RunPod + Fly.io for GPU Worker Orchestration
Decision: GPU-intensive embedding/reranking workers run on RunPod fleet orchestrated via a Fly.io service.
Rationale: RunPod provides cost-effective GPU capacity; Fly handles ingress and orchestration lifecycle. BullMQ on Upstash Redis queues work items.
Status: Active but stabilizing. TLS ssl_cert_reqs=None workaround was needed for Upstash (b718932). Pod status is now derived from cloudStatus as single source of truth (commit 088f7b4).
Known issue: Orchestrator requires 2048MB for performance CPU tier (4e7cd14).
TECH-DEC-003 — Clerk for Multi-Tenant Auth, Not Supabase Auth
Decision: Clerk handles authentication and organization (tenant) isolation. A proxy route (/api/clerk/proxy) bridges Clerk to Supabase row-level security.
Rationale: Clerk provides superior multi-org UX and enterprise SSO out of the box. Supabase Auth was evaluated but replaced.
Status: Active. Admin endpoints (/api/admin/pipeline/cloud) use Clerk auth directly (d037000 — withOrgAuth was intentionally not used here).
TECH-DEC-004 — pgvector IVFFlat with 384-Dimensional Embeddings
Decision: all-MiniLM-L6-v2 (sentence-transformers) produces 384-dim vectors; IVFFlat index with lists=100 serves similarity queries.
Rationale: 384 dimensions balances quality vs. storage/query speed for legal document chunking. IVFFlat is appropriate for the current data volume.
Watch: As corpus scales, HNSW indexing may outperform IVFFlat. No migration plan exists yet.
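To make the IVFFlat/HNSW trade-off concrete, here is a small sketch that generates the two index definitions. The `lists` heuristic (rows / 1000 up to 1M rows, sqrt(rows) beyond) is pgvector's documented starting point, and `m = 16`, `ef_construction = 64` are its HNSW defaults. The index names and the cosine operator class are assumptions; the source does not say which distance metric is in use.

```python
import math


def recommended_ivfflat_lists(row_count: int) -> int:
    """pgvector's suggested starting point for the IVFFlat `lists` parameter."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def index_ddl(kind: str, row_count: int = 100_000) -> str:
    """Build CREATE INDEX DDL for document_chunks.embedding.

    Index names and vector_cosine_ops are assumptions for illustration.
    """
    if kind == "ivfflat":
        lists = recommended_ivfflat_lists(row_count)
        return (
            "CREATE INDEX idx_chunks_embedding_ivfflat ON document_chunks "
            f"USING ivfflat (embedding vector_cosine_ops) WITH (lists = {lists});"
        )
    if kind == "hnsw":
        # m and ef_construction are shown at pgvector's default values.
        return (
            "CREATE INDEX idx_chunks_embedding_hnsw ON document_chunks "
            "USING hnsw (embedding vector_cosine_ops) "
            "WITH (m = 16, ef_construction = 64);"
        )
    raise ValueError(f"unknown index kind: {kind}")
```

Under that rule of thumb, lists = 100 corresponds to roughly 100,000 chunks, which may serve as a first reference point when a latency baseline is established.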
TECH-DEC-005 — BullMQ Over Direct HTTP Dispatch for Worker Jobs
Decision: Document processing jobs are queued through BullMQ (Redis-backed) rather than direct HTTP calls to workers.
Rationale: Decouples ingest rate from worker capacity; enables retry semantics and dead-letter queues. Previous direct-dispatch pattern (docs/archive/completed_action_plans/modal_migration_optimization/05_direct_dispatch.md) was superseded.
Status: Active. See product/app/api/advisor/process-queue/route.ts.
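The retry and dead-letter semantics named in the rationale can be shown with an in-memory stand-in. This sketch does not use BullMQ or Redis; `Job`, `drain`, and the attempt limit are hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Job:
    job_id: str
    attempts: int = 0


def drain(
    queue: Deque[Job],
    handler: Callable[[Job], None],
    max_attempts: int = 3,
) -> List[Job]:
    """Process every queued job; return the jobs that were dead-lettered."""
    dead_letter: List[Job] = []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            handler(job)
        except Exception:
            if job.attempts < max_attempts:
                queue.append(job)  # retry later, behind other work
            else:
                dead_letter.append(job)  # give up and park for inspection
    return dead_letter
```

With direct HTTP dispatch, a failed call would have to be retried by the caller; with a queue, the failure is recorded on the job and producers are unaffected.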
TECH-DEC-006 — Cap Table Feature Gating
Decision: Cap table functionality ships behind a feature flag with documented rollback procedures.
Evidence: product/lib/cap-table/__tests__/feature-gate.test.ts, action plan step 12 (docs/cap-table/action_plans/12_rollout_feature_flag_and_rollback.md).
Status: Active.
5. Recent Changes and Their Impact
| Commit | Change | Impact |
|---|---|---|
| 960f88c | Crane tag mismatch fix + Fly registry retry | CI image builds should now succeed reliably |
| b718932 | BullMQ Redis TLS ssl_cert_reqs=None | Fixes silent connection failures to Upstash in production |
| 6c89d22 | Critical TLS + race condition fixes (full pipeline audit) | Broad stability improvement; multiple production bugs closed |
| 580b80b | Pod warmup status tracking | Operators can now see provisioning → warming → ready progression; eliminates blind spots in admin UI |
| 088f7b4 | RunPod status from cloudStatus (single source of truth) | Eliminates race condition where two status sources disagreed |
| d037000 | Cloud proxy uses Clerk auth directly | Admin endpoint security model corrected; withOrgAuth was inappropriate for this context |
| 4e7cd14 | Orchestrator memory floor at 2048MB | Prevents OOM crashes on performance CPU tier |
| #112 (merged) | Add @tiptap/y-tiptap dep + fix TS build error | Unblocks collaboration editor build |
| #111 (merged) | Realtime status + Cloudflare deployment | Live pipeline status in UI; Cloudflare CDN layer added |
| #104 (merged) | Daily activity tracking + API integration | Activity feed and unified calendar fully operational |
6. Open Questions and Unresolved Trade-offs
OQ-001 — Upload Pipeline Reliability (PR #101)
Question: What is the root cause of upload failures? File size limits, MIME type handling, Supabase Storage policy, or the ingest worker handoff?
Impact: High. Core feature.
Status: PR open, no resolution date.
OQ-002 — IVFFlat vs. HNSW as Corpus Grows
Question: At what document count does IVFFlat become a query-latency bottleneck and HNSW migration become necessary?
Impact: Medium-term performance. No baseline established. No migration path documented.
OQ-003 — E-Sign Production Readiness
Question: Is the stamping engine (docs/action_plans/e_sign/06_stamping_engine_and_finalization.md) complete and tested at production quality?
Status: PR #108 still open. Conversion attribution tests exist but stamping finalization is unclear.
OQ-004 — Billing/Wallet Integration Completeness
Question: Is the billing wallet (docs/billing_wallet/) integrated with cap-table and advisor AI usage costs?
Evidence: WALLET_IMPLEMENTATION_STATUS.md exists but current integration state unknown. Modal wallet deduction was planned (docs/archive/completed_action_plans/modal_migration_optimization/06_wallet_deduction.md).
OQ-005 — knip.yml Dead Code Audit Findings
Question: What did the knip dead-code analysis (.github/workflows/knip.yml) find? Are there orphaned routes or components from refactors?
Impact: Codebase hygiene; a codebase of 500 files and 100 components warrants periodic pruning.
OQ-006 — Collaboration WebSocket Scaling
Question: The product/collaboration-ws/Dockerfile is a standalone service. What is its scaling model? Does it have affinity requirements with the main Next.js process?
Not documented.
OQ-007 — Multi-Org Scaling Plan Execution State
Question: docs/MULTI_ORG_SCALING_PLAN.md exists. Which phases have been executed?
Not cross-referenced in commit history.
7. Technical Debt Inventory
| ID | Description | Location | Priority | Notes |
|---|---|---|---|---|
| DEBT-001 | PR #101 upload fixes languishing open | product/app/api/ upload path | P0 | Core feature; blocks new users |
| DEBT-002 | No HNSW migration plan for pgvector | documents DB schema | P1 | Will become urgent at scale |
| DEBT-003 | ssl_cert_reqs=None TLS workaround for Upstash | BullMQ Redis client | P1 | Security posture; proper cert validation should replace |
| DEBT-004 | Hardcoded 2048MB orchestrator memory floor | fly/orchestrator/Dockerfile | P2 | Should be configurable env var |
| DEBT-005 | Superseded docs still in docs/archive/superseded/ (10+ files) | docs/archive/superseded/ | P3 | Cognitive overhead; should be deleted |
| DEBT-006 | .DS_Store committed to root | .DS_Store | P3 | .gitignore entry missing or bypassed |
| DEBT-007 | __pycache__/modal_llm.cpython-313.pyc committed | root __pycache__/ | P3 | Python cache in VCS |
| DEBT-008 | Test coverage: only 35 test files for 500-file codebase (7%) | product/lib/__tests__/ | P2 | Activity tracking, advisor, and analysis routes appear untested |
| DEBT-009 | LARGE_PRODUCT_FILES.md in root — implies files that can't be normally tracked | LARGE_PRODUCT_FILES.md | P2 | Investigate LFS or splitting strategy |
| DEBT-010 | Business type profiles action plan partially executed | docs/action_plans/business_type_fix/ | P2 | 8 steps documented; completion state unclear |
8. Key Learnings From Recent Development Sessions
LEARN-001 — Cloud Infrastructure Requires Explicit Auth Wiring
The Fly registry and ghcr.io have different auth handshakes. crane copy failed silently when using the wrong credential source. Fix: explicit docker login before crane operations. Applied in 2c3145d.
LEARN-002 — Single Source of Truth for Async Status Is Non-Negotiable
RunPod status derived from two places produced a race condition that made the admin dashboard show stale data. Consolidating onto cloudStatus (commit 088f7b4) resolved it. Pattern to enforce: every async resource has exactly one authoritative status field.
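The "one authoritative status field" pattern can be sketched as a small state machine. The three states come from the warmup tracking described in this document; the class names, the strictly forward transition table, and the absence of failure states are simplifying assumptions.

```python
from enum import Enum


class PodStatus(str, Enum):
    PROVISIONING = "provisioning"
    WARMING = "warming"
    READY = "ready"


# Allowed forward transitions; anything else is rejected.
_NEXT = {
    PodStatus.PROVISIONING: PodStatus.WARMING,
    PodStatus.WARMING: PodStatus.READY,
}


class Pod:
    """Holds one authoritative status field; readers never derive it elsewhere."""

    def __init__(self, pod_id: str) -> None:
        self.pod_id = pod_id
        self.cloud_status = PodStatus.PROVISIONING

    def advance(self, new_status: PodStatus) -> None:
        """Move to new_status only if it is the next legal state."""
        if _NEXT.get(self.cloud_status) is not new_status:
            raise ValueError(
                f"illegal transition {self.cloud_status.value} -> {new_status.value}"
            )
        self.cloud_status = new_status
```

Because every writer goes through `advance`, two code paths cannot leave the dashboard with conflicting views of the same pod.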
LEARN-003 — BullMQ + Upstash TLS Requires ssl_cert_reqs=None in Python redis.asyncio
Upstash's managed Redis uses a certificate chain that redis.asyncio rejects by default. The workaround (b718932) disables cert verification. This is a known Upstash community issue; track upstream for a proper fix.
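What disabling certificate verification means at the TLS layer can be shown with the standard library alone. This is a sketch of the trade-off, not the worker's actual client setup: the helper name `make_tls_context` and its parameters are hypothetical, and how the Redis client consumes such settings is not shown in the source.

```python
import ssl
from typing import Optional


def make_tls_context(verify: bool = True, ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client TLS context.

    verify=True  -> certificates and hostnames are checked (the desired end state).
    verify=False -> equivalent in spirit to the ssl_cert_reqs=None workaround:
                    traffic is still encrypted, but the peer is not authenticated.
    ca_file      -> optional CA bundle path, for a provider-supplied chain.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    if not verify:
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

Keeping the choice behind a single flag would make it easy to flip back to full verification once a valid certificate chain is available.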
LEARN-004 — Admin Endpoints Should Not Use withOrgAuth
withOrgAuth enforces organization membership checks. Admin pipeline endpoints are cross-org by definition. Using Clerk auth directly without org scoping is correct for these routes (d037000).
LEARN-005 — Cap Table Requires Calibration Before Backfill
docs/cap-table/calibration_report.md and calibration_threshold_policy.md exist because running the extraction pipeline on real documents without calibrating extraction confidence thresholds produced noisy results. The calibration step gates the backfill step (action plan 10).
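A minimal sketch of how a calibrated confidence threshold might gate what a backfill accepts automatically. The `Extraction` shape, the function name, and the example threshold are hypothetical; the actual policy lives in calibration_threshold_policy.md and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Extraction:
    document_id: str
    field: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction pipeline


def split_by_threshold(
    items: Iterable[Extraction], threshold: float
) -> Tuple[List[Extraction], List[Extraction]]:
    """Return (auto_accept, needs_review) for a calibrated threshold."""
    auto_accept: List[Extraction] = []
    needs_review: List[Extraction] = []
    for item in items:
        target = auto_accept if item.confidence >= threshold else needs_review
        target.append(item)
    return auto_accept, needs_review
```

Items below the threshold would go to a human review queue rather than being written directly, which is the noise problem calibration is meant to contain.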
LEARN-006 — Modal LLM Concurrency Increase Has Wallet Implications
Increasing Modal concurrency limits (docs/archive/completed_action_plans/modal_migration_optimization/07_concurrency_limit_increase.md) directly increases cost per organization. Wallet deduction must be wired before concurrency is opened up.
9. Architectural Evolution
Phase 1 — Single-Tenant RAG MVP
Initial architecture: single Supabase instance, all documents in one namespace, Ollama local LLM, Python ingest script. Documented in docs/archive/superseded/RAG_SYSTEM_OVERVIEW.md and docs/archive/superseded/README.md.
Phase 2 — Multi-Tenant + Modal Migration
Clerk auth added for org isolation. LLM migrated from local Ollama to Modal (Qwen2.5:32B) for production quality. Document type classification added (equity, ip_assignment, financial, healthcare, agreement). Mixpanel analytics wired. Documented in docs/archive/completed_action_plans/modal_migration_optimization/.
Phase 3 — Feature Expansion (Current)
Three parallel feature tracks landed simultaneously:
- Cap Table — 14-step implementation with extraction pipeline, review queue, and manual controls
- E-Signature — Full signing workflow with template designer, signature capture, and PDF stamping
- Activity System — Materialized-view-backed audit log with unified calendar
Cloud GPU worker fleet introduced (RunPod + Fly.io + BullMQ) to handle burst embedding workloads without blocking the Next.js API tier.
Phase 4 — Stabilization (In Progress)
CI/CD pipeline being hardened (last 6 commits). Upload reliability (PR #101) is the last major functional gap. Business type profiles completing. Collaboration WebSocket service added for real-time features.
10. Team Conventions and Patterns
Naming Conventions
- Action plans: docs/action_plans/<feature>/00_overview...NN_step.md, numbered and sequential, with README.md and START_HERE.md entry points
- Completed action plans archived to docs/archive/completed_action_plans/
- API routes follow Next.js App Router file convention: product/app/api/<resource>/route.ts and product/app/api/<resource>/[id]/action/route.ts
- Test files co-located: product/lib/__tests__/ for integration tests, product/lib/<module>/__tests__/ for unit tests
Git Workflow
- Feature branches merge via PR; staging gate (#110) precedes production merges
- Commit prefixes in use: feat:, fix:, ci:, docs:, refactor:
- Husky hooks enforced at pre-commit and pre-push (.husky/pre-commit, .husky/pre-push)
- Knip dead-code analysis runs on CI (.github/workflows/knip.yml)
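The commit-prefix convention could be checked mechanically along these lines. This is a hypothetical sketch: the source does not say that the Husky hooks validate commit messages, and the exact rule (prefix, colon, space, then text) is an assumption.

```python
import re

# Prefixes listed in the Git Workflow conventions.
ALLOWED_PREFIXES = ("feat", "fix", "ci", "docs", "refactor")

# Assumed rule: "<prefix>: " followed by at least one non-space character.
_PATTERN = re.compile(rf"^({'|'.join(ALLOWED_PREFIXES)}): \S")


def has_valid_prefix(commit_subject: str) -> bool:
    """Return True if the subject line starts with an allowed prefix."""
    return _PATTERN.match(commit_subject) is not None
```

For example, `has_valid_prefix("ci: retry cloud worker image build")` passes, while a bare `"update stuff"` does not.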
Docker / Infra Conventions
- COMPOSE_PROFILES pattern: full, gpu, worker, infra; explicit profile activation required
- Shared logging config via YAML anchor x-logging: &default-logging (json-file, 50MB/5 files)
- Shared env via x-env: &default-env pointing to .env.services
- Images published to ghcr.io/midwestco/datazoom-{base,worker,gpu}
Code Patterns
- Dynamic item routing via product/app/(app)/[type]/[id]/ with type-specific handlers (company-handler.tsx, document-handler.tsx, finding-handler.tsx, option-handler.tsx) and views; a clean strategy pattern for polymorphic entity display
- Provider pattern for page-level state: analysis-provider.tsx, documents-provider.tsx, activity-provider.tsx
- Supabase client via @supabase/supabase-js ^2.95.3
- UI components via shadcn/ui + @radix-ui/react-context-menu
11. Integration Points and Current Health
| Integration | Purpose | Health | Notes |
|---|---|---|---|
| Supabase (PostgreSQL + pgvector) | Primary datastore + vector search | ✅ Stable | pgvector IVFFlat index active |
| Supabase Storage | Document file storage | ⚠️ Suspect | PR #101 upload fixes open |
| Clerk | Auth + multi-tenancy | ✅ Stable | Proxy route at /api/clerk/proxy |
| Modal (Qwen2.5:32B) | Production LLM inference | ✅ Stable | Wallet deduction integration unclear |
| Ollama (local) | Fallback LLM | ✅ Available | OLLAMA_HOST env var; not required in production |
| BullMQ + Upstash Redis | Job queue for ingest/embedding | ⚠️ TLS workaround | ssl_cert_reqs=None in use |
| RunPod + Fly.io | GPU worker orchestration | ⚠️ Stabilizing | CI builds for cloud-worker image recently fixed |
| Mixpanel | Product analytics | ✅ Stable | Track-interaction route active |
| Resend (Email) | E-sign notifications | 🔶 In progress | Action plan step 2 documented |
| Cloudflare | CDN / edge | ✅ Recently added | #111 merged |
| Collaboration WebSocket | Real-time editing | 🔶 New | product/collaboration-ws/ — scaling model undocumented |
| Sentry | Error tracking | 🔶 Optional | SENTRY_DSN listed as optional in .env.services.example |
12. Performance Baselines and Optimization Opportunities
Known Baselines
- Embedding dimensions: 384 (all-MiniLM-L6-v2) — fixed by model choice
- IVFFlat lists: 100 — suitable for moderate corpus size
- Ollama parallelism: 4 concurrent requests, max 3 loaded models
- Orchestrator memory: 2048MB minimum on performance CPU tier
Optimization Opportunities
| ID | Opportunity | Effort | Expected Gain |
|---|---|---|---|
| PERF-001 | Migrate pgvector index from IVFFlat to HNSW at scale | Medium | Sub-10ms p99 similarity queries at >100k chunks |
| PERF-002 | Materialized view refresh scheduling for activity metrics (/api/activity/unified/refresh) | Low | Eliminate cold query latency on activity feed |
| PERF-003 | Cap table snapshot read model (docs/cap-table/action_plans/14_snapshot_read_model_optimization.md) | Medium | Avoid full recalculation on every cap table read |
| PERF-004 | BullMQ batch advisor processing (/api/advisor/batch) | Already built | Confirm batch size tuning against Modal concurrency limits |
| PERF-005 | Parallel queue processing (docs/archive/completed_action_plans/modal_migration_optimization/04_parallel_queue_processing.md) | Implemented | Validate against current Modal wallet deduction rate |
| PERF-006 | OLLAMA_MAX_LOADED_MODELS: "3" — evaluate reducing to 2 if memory pressure observed | Low | Reduce RAM usage per worker node |
13. Immediate Priorities (Next Session Checklist)
- Resolve PR #101: Diagnose and merge upload fixes. This is the highest-impact open user-facing bug.
- Merge PR #108: E-sign template PDF preview + schema normalization needed to close the e-sign feature track.
- Audit TLS workaround DEBT-003: Determine if Upstash has released a proper cert bundle; remove ssl_cert_reqs=None.
- Verify CI green: Confirm build-images.yml produces valid image tags to ghcr.io after the crane/Fly fixes.
- Document collaboration-ws scaling model: Before the next load spike, define how many WebSocket instances run and whether they require sticky sessions.
- Close DEBT-006/007: Remove .DS_Store and __pycache__/ from VCS; update .gitignore.
- Cross-reference billing wallet with cap-table and Modal: Ensure cost deduction fires before concurrency limit increases go live.
This document reflects the state of midwestco/datazoom as of commit 960f88c. It should be regenerated after each significant sprint or infrastructure change.