Build a local AI system that writes and speaks like Thota — using LoRA fine-tuning and voice cloning on an M4 Pro Mac Mini
Status: Planning & Research Complete · Next: Data Collection & Setup
1. Project Overview
Two independent AI capabilities — one trained to write in Thota's voice, one to speak with Thota's voice — both running locally on a personal Mac Mini. No cloud services, no subscriptions, no data leaving the house.
| Component | Approach |
|---|---|
| Writing Style Clone | LoRA fine-tuning of Qwen 2.5 7B Instruct on personal emails and WhatsApp messages |
| Voice Clone TTS | OpenVoice V2 instant voice cloning from ~1 hour of reference recordings |
| Inference Platform | Ollama + Metal GPU on M4 Pro Mac Mini 24GB |
| Backend | SvelteKit + Deno + FastAPI (Python) |
2. Writing Style LoRA — Requirements
2.1 Training Data
Requirement: The system shall accept email exports in .mbox and .eml formats from Gmail, Outlook, and Apple Mail, and WhatsApp chat exports in plain-text .txt format.
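As a sketch of the local parsing step (the file name and output shape are illustrative, not from this spec), Python's standard-library email package can extract plain-text bodies from .eml exports entirely offline:

```python
from email import policy
from email.parser import BytesParser

def parse_eml(path):
    """Extract sender, subject, and plain-text body from a .eml file,
    using only the standard library (no network access)."""
    with open(path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        "body": body.get_content().strip() if body else "",
    }
```

.mbox archives can be walked the same way with the standard-library mailbox.mbox class, yielding one message object per email.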
Requirement: Training data shall be parsed locally using Python scripts without any cloud-connected service. No personal data shall leave the Mac Mini during processing.
Requirement: The final curated dataset shall contain 500–1,000 well-formatted instruction pairs derived from emails and WhatsApp messages, with a minimum of 200 samples for a viable first run.
Requirement: Dataset entries shall follow ChatML format (system, user, assistant message structure) and be serialized as JSONL files.
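One JSONL line per sample, each holding the system/user/assistant triple; the example pair below is invented for illustration:

```python
import json

# One training sample: system persona, user task, assistant reply in the target style.
sample = {
    "messages": [
        {"role": "system", "content": "You write as Thota: concise, direct, dry humor."},
        {"role": "user", "content": "Draft a short thank-you note to a colleague."},
        {"role": "assistant", "content": "Thanks for jumping in yesterday. Saved me hours."},
    ]
}

# Append one JSON object per line (the JSONL convention).
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```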
Requirement: The system shall deduplicate near-identical samples using MinHash LSH at a similarity threshold of 0.85 before training begins.
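A minimal pure-Python sketch of the dedup pass; a production run would use a real LSH index (e.g. the datasketch library) instead of the pairwise comparison shown here:

```python
import hashlib

def minhash_signature(text, num_perm=64, shingle_len=5):
    """MinHash signature over character shingles."""
    shingles = {text[i:i + shingle_len] for i in range(max(1, len(text) - shingle_len + 1))}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(samples, threshold=0.85):
    """Keep each sample only if no earlier kept sample is near-identical."""
    kept, signatures = [], []
    for text in samples:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```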
2.2 Model Selection
Selected: Qwen2.5-7B-Instruct with 4-bit QLoRA fine-tuning
7.6B parameters · ~15GB bf16, ~4–5GB Q4 quantized · fits in 24GB with headroom
Best writing quality (9/10) among models that fit comfortably on M4 Pro 24GB
128K context · GQA architecture for efficient long-document processing
Apache 2.0 license — more permissive than Llama's custom license
Runner-up: Llama 3.1 8B Instruct if Qwen has issues
2.3 Training Configuration
| Parameter | Value | Notes |
|---|---|---|
| LoRA Rank | 8–16 | Rank 8 for style-only; cap at 16 for style+task; no higher than 32 |
| Target Modules | q_proj, v_proj | Minimum set; adding k_proj + o_proj is optional |
| Learning Rate | 2e-4 | Cosine scheduler, 5–10% warmup steps |
| Dropout | 0.05 | Mild dropout to prevent overfitting |
| Optimizer | AdamW 8-bit | bitsandbytes for memory savings |
| Batch Size | 4–8 | Per device; use gradient accumulation of 4–8 |
| Sequence Length | 512–1024 tokens | 2048+ risks OOM on 24GB |
| Epochs | 1–3 | Style-only: overtraining causes mimicry, not adaptation |
| Training Steps | 300–600 | Or 1–3 epochs on 500 samples |
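The table above can be captured as plain configuration dictionaries. The key names here are illustrative, loosely following common LoRA-trainer conventions rather than any specific library's API, and lora_alpha is an assumption (the common 2× rank convention, not stated in the table):

```python
# LoRA adapter settings from the table above.
lora_config = {
    "r": 16,                                 # style+task cap; use 8 for style-only
    "lora_alpha": 32,                        # 2x rank convention (assumption)
    "target_modules": ["q_proj", "v_proj"],  # minimum set; k_proj/o_proj optional
    "lora_dropout": 0.05,
}

# Trainer settings from the table above.
training_config = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,            # 5-10% warmup
    "optimizer": "adamw_8bit",       # bitsandbytes
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 1024,          # 2048+ risks OOM on 24GB
    "num_epochs": 2,
    "max_steps": 600,
}

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = (
    training_config["per_device_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
```

With a per-device batch of 4 and accumulation of 8, the effective batch size is 32, which keeps per-step memory low while still smoothing gradients.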
2.4 Inference
Requirement: The writing style LoRA shall be served via Ollama with a custom Modelfile that loads the base Qwen 2.5 7B model and applies the LoRA adapter weights. The service shall run on localhost and respond to OpenAI-compatible API calls.
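A sketch of what such a Modelfile might look like. The model tag, adapter path, and system prompt wording are assumptions; Ollama's ADAPTER directive expects a LoRA adapter built against the same base model:

```
# Modelfile (illustrative values)
FROM qwen2.5:7b-instruct
ADAPTER ./thota-style-lora
PARAMETER temperature 0.7
SYSTEM "You write as Thota: concise, direct, dry humor, no corporate fluff. Return only the draft text."

# Build and run locally:
#   ollama create thota-writer -f Modelfile
#   ollama run thota-writer
```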
Requirement: The API shall accept a writing-task prompt (e.g., "Draft a reply to my colleague thanking them for their help") and return text written in Thota's style — concise, direct, dry humor, no hedging, no corporate fluff.
Requirement: The Ollama server shall make zero outbound network requests during inference. All processing shall happen locally on the Mac Mini.
Requirement: The system prompt shall instruct the model to return only the draft text — no preamble like "Here's a draft:", no explanations, no "[DRAFT]" markers.
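A minimal sketch of building the OpenAI-compatible chat request for the local endpoint. The model name thota-writer and the exact prompt wording are assumptions; the body would be POSTed to Ollama's /v1/chat/completions on localhost:

```python
import json

SYSTEM_PROMPT = (
    "You write as Thota: concise, direct, dry humor, no hedging, no corporate fluff. "
    "Return only the draft text, with no preamble, labels, or explanations."
)

def build_chat_request(task: str) -> bytes:
    """Build the JSON body for POST http://localhost:11434/v1/chat/completions."""
    payload = {
        "model": "thota-writer",  # hypothetical Ollama model name
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task},
        ],
        "stream": False,
    }
    return json.dumps(payload).encode("utf-8")
```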
Requirement: Throughput target: 30–50 tokens/second on the M4 Pro Metal GPU with Q4 quantization.
3. Voice Clone TTS — Requirements
3.1 Training / Cloning Data
Requirement: Thota shall record approximately 1 hour of audio across 6–10 distinct emotional contexts, including: neutral/calm, happy/excited, sad/contemplative, angry/frustrated, surprised/curious, whispered/soft, authoritative/strong, tired/fatigued, and playful/teasing.
Requirement: Audio recordings shall be captured at a 16kHz minimum sample rate (24kHz recommended), in a consistent environment with the same microphone. Files shall be recorded in 5–10 minute segments per emotional context to avoid vocal fatigue.
Requirement: Audio shall be pre-processed: normalize levels, remove long silences and breathing artifacts, and ensure a consistent sample rate across all recordings.
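A stdlib-only sketch of one pre-processing step, peak normalization of a 16-bit WAV; a real pipeline would normally use a dedicated audio library for loudness normalization, silence trimming, and resampling, so treat this as illustrative:

```python
import struct
import wave

def peak_normalize_wav(src, dst, target_peak=0.95):
    """Scale a 16-bit PCM WAV so its loudest sample sits at target_peak
    of full scale (a simple stand-in for a full normalization pass)."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        assert params.sampwidth == 2, "expects 16-bit PCM"
        frames = r.readframes(params.nframes)
    samples = struct.unpack(f"<{params.nframes * params.nchannels}h", frames)
    peak = max(abs(s) for s in samples) or 1
    gain = target_peak * 32767 / peak
    # Clamp to the int16 range in case of rounding at full scale.
    scaled = [max(-32768, min(32767, int(s * gain))) for s in samples]
    with wave.open(dst, "wb") as w:
        w.setparams(params)
        w.writeframes(struct.pack(f"<{len(scaled)}h", *scaled))
```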
Requirement: All reference audio shall be stored locally in a FileVault-encrypted directory on the Mac Mini. No audio data shall be uploaded to any cloud service.
3.2 Voice Model Selection
Selected: OpenVoice V2 (MIT License)
Instant tone color cloning from 10–30 second reference clip — no fine-tuning required for basic cloning
Fine-tuning mode: ~2–4 hours on 1 hour of audio data for enhanced quality
Full Apple Silicon MPS support · runs 100% local on M4 Pro
0.3–0.8x real-time synthesis speed on M4 Pro Metal GPU
Emotion and prosody control via reference audio + style parameters
MIT licensed — free for all use including commercial
Runners-up: XTTS v2 (higher quality ceiling, but the Coqui Public Model License is not fully open source) and Parler-TTS Mini (Apache 2.0, description-driven style control, 880M params)
3.3 TTS Inference
Requirement: The TTS engine shall produce audio at a 22,050 Hz sample rate in WAV format (lossless), with optional MP3 for streaming. Output shall be returned as a file download or streaming response.
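The WAV packaging itself needs no third-party code. A sketch of wrapping synthesized 16-bit PCM into an in-memory WAV at the required sample rate (the function name and the mono-PCM input shape are assumptions):

```python
import io
import struct
import wave

def pcm_to_wav_bytes(pcm_samples, sample_rate=22050):
    """Wrap 16-bit mono PCM samples in a WAV container in memory,
    ready to return as a file download or streaming HTTP response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(pcm_samples)}h", *pcm_samples))
    return buf.getvalue()
```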
Requirement: End-to-end latency shall not exceed 2 seconds for short texts (under 100 characters), including reference audio processing and synthesis.
Requirement: The TTS API shall be served via FastAPI (Python) on the Mac Mini, callable from the SvelteKit backend via a single REST endpoint.
4. Unified API — Requirements
Requirement: A single SvelteKit backend shall expose the following API routes:
| Route | Method | Description |
|---|---|---|
| /api/tts/pipeline | POST | Unified pipeline: LoRA text gen → TTS synthesis in one call |
| /api/tts/clone | POST | TTS clone endpoint with reference audio |
| /api/tts/lora/generate | POST | LoRA text generation with writing style |
| /api/voice/upload | POST | Upload reference audio; returns a voice ID |
Requirement: The pipeline endpoint shall accept a prompt, optional style-reference text, and a reference audio clip, and return styled text plus synthesized speech in a single request.
5. Hardware & Performance Targets
| Metric | Target |
|---|---|
| Target Platform | Mac Mini M4 Pro · 24GB unified RAM · macOS |
| Writing Inference Speed | 30–50 tokens/sec (Q4 quantized) |
| TTS Synthesis Speed | 0.3–0.8× real-time (faster than speech) |
| Memory Usage — Writing | ~4–6GB (Qwen QLoRA + LoRA adapter) |
| Memory Usage — TTS | ~2–4GB (OpenVoice + MeloTTS) |
| Combined RAM Target | Under ~18GB of 24GB (leave headroom for macOS) |
| Training Time (Writing LoRA) | 1.5–6 hours per epoch on 500–1,000 samples |
| TTS Fine-tune Time (1hr audio) | 2–4 hours on M4 Pro Metal |
| Storage Required | 50GB+ free SSD for models, datasets, and outputs |
6. Privacy & Security
Requirement: All training and inference shall run entirely on the Mac Mini. No personal emails, messages, or voice recordings shall be sent to any external service.
Requirement: The TTS API shall be exposed only via Cloudflare Tunnel (outbound-only connection) or Tailscale VPN. No ports shall be opened directly on the router.
Requirement: SSH access shall use key-based authentication only. Password authentication shall be disabled.
Requirement: Training data shall be deduplicated and drawn from a diverse sample pool (per the dataset minimums in 2.1) to prevent the model from memorizing exact phrasing from personal messages.
Requirement: Reference audio storage shall be protected by FileVault full-disk encryption. Optionally, sensitive voice samples may additionally be stored in an encrypted DMG container.
7. Deployment & Remote Access
Requirement: The Mac Mini shall be configured to start automatically after a power failure (Energy Saver setting) and run as an always-on home server.
Requirement: Cloudflare Tunnel shall provide a persistent public URL for API access without opening router ports. The tunnel shall use an outbound-only connection.
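A sketch of the corresponding cloudflared ingress configuration; the tunnel ID, hostname, and credentials path are placeholders:

```yaml
# ~/.cloudflared/config.yml (illustrative values)
tunnel: <TUNNEL-ID>
credentials-file: /Users/thota/.cloudflared/<TUNNEL-ID>.json

ingress:
  - hostname: api.example.com
    service: http://localhost:5173   # SvelteKit backend
  - service: http_status:404         # catch-all for unmatched requests
```

cloudflared makes only outbound connections to Cloudflare's edge, so no router ports need to be opened.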
Requirement: Ollama shall be configured to listen on localhost only (the default). The FastAPI TTS server shall bind to localhost and accept connections only through the SvelteKit backend.
Requirement: Process management shall use launchd (macOS native) or tmux to ensure services restart automatically after a crash or reboot.
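A sketch of a launchd agent for the FastAPI TTS server; the label, uvicorn path, module name, and port are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.thota.tts</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/uvicorn</string>
    <string>tts_server:app</string>
    <string>--host</string>
    <string>127.0.0.1</string>
    <string>--port</string>
    <string>8000</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Saved to ~/Library/LaunchAgents/, the KeepAlive key tells launchd to relaunch the process after a crash, and RunAtLoad starts it at login/boot.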
8. Out of Scope
Browser-based or cloud-connected tools for email/WhatsApp parsing
Commercial voice cloning services or API-dependent TTS
vLLM (not supported on Apple Silicon)
14B+ models for training on 24GB (require gradient offloading; 3–5× slower)
Cross-platform Windows/Linux support (macOS-specific stack)