Build a local AI system that writes and speaks like Thota — using LoRA fine-tuning and voice cloning on an M4 Pro Mac Mini
Status: Planning & Research Complete · Next: Data Collection & Setup
1. Project Overview
Two independent AI capabilities — one trained to write in Thota's voice, one to speak with Thota's voice — both running locally on a personal Mac Mini. No cloud services, no subscriptions, no data leaving the house.
| Component | Approach |
|---|---|
| Writing Style Clone | LoRA fine-tuning of Qwen 2.5 7B Instruct on personal emails and WhatsApp messages |
| Voice Clone TTS | OpenVoice V2 instant voice cloning from ~1 hour of reference recordings |
| Inference Platform | Ollama + Metal GPU on M4 Pro Mac Mini 24GB |
| Backend | SvelteKit + Deno + FastAPI (Python) |
2. Writing Style LoRA — Requirements
2.1 Training Data
Requirement: The system shall accept email exports in .mbox and .eml formats from Gmail, Outlook, and Apple Mail, and WhatsApp chat exports in plain-text .txt format.
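As a sketch of the local parsing step (the file name and output shape are illustrative, not from this spec), Python's standard-library email package can extract plain-text bodies from .eml exports entirely offline:

```python
from email import policy
from email.parser import BytesParser

def parse_eml(path):
    """Extract sender, subject, and plain-text body from a .eml file,
    using only the standard library (no network access)."""
    with open(path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        "body": body.get_content().strip() if body else "",
    }
```

.mbox archives can be walked the same way with the standard-library mailbox.mbox class, yielding one message object per email.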
Requirement: Training data shall be parsed locally using Python scripts without any cloud-connected service. No personal data shall leave the Mac Mini during processing.
Requirement: The final curated dataset shall contain 500–1,000 well-formatted instruction pairs derived from emails and WhatsApp messages, with a minimum of 200 samples for a viable first run.
Requirement: Dataset entries shall follow ChatML format (system, user, assistant message structure) and be serialized as JSONL files.
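One JSONL line per sample, each holding the system/user/assistant triple; the example pair below is invented for illustration:

```python
import json

# One training sample: system persona, user task, assistant reply in the target style.
sample = {
    "messages": [
        {"role": "system", "content": "You write as Thota: concise, direct, dry humor."},
        {"role": "user", "content": "Draft a short thank-you note to a colleague."},
        {"role": "assistant", "content": "Thanks for jumping in yesterday. Saved me hours."},
    ]
}

# Append one JSON object per line (the JSONL convention).
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```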
Requirement: The system shall deduplicate near-identical samples using MinHash LSH at a similarity threshold of 0.85 before training begins.
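A minimal pure-Python sketch of the dedup pass; a production run would use a real LSH index (e.g. the datasketch library) instead of the pairwise comparison shown here:

```python
import hashlib

def minhash_signature(text, num_perm=64, shingle_len=5):
    """MinHash signature over character shingles."""
    shingles = {text[i:i + shingle_len] for i in range(max(1, len(text) - shingle_len + 1))}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(samples, threshold=0.85):
    """Keep each sample only if no earlier kept sample is near-identical."""
    kept, signatures = [], []
    for text in samples:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```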
2.2 Model Selection
Selected: Qwen2.5-7B-Instruct with 4-bit QLoRA fine-tuning
7.6B parameters · ~15GB bf16, ~4–5GB Q4 quantized · fits in 24GB with headroom
Best writing quality (9/10) among models that fit comfortably on M4 Pro 24GB
128K context · GQA architecture for efficient long-document processing
Apache 2.0 license — more permissive than Llama's custom license
Runner-up: Llama 3.1 8B Instruct if Qwen has issues
2.3 Training Configuration
| Parameter | Value | Notes |
|---|---|---|
| LoRA Rank | 8–16 | Rank 8 for style-only; cap at 16 for style+task; no higher than 32 |
| Target Modules | q_proj, v_proj | Minimum set; adding k_proj + o_proj is optional |
| Learning Rate | 2e-4 | Cosine scheduler, 5–10% warmup steps |
| Dropout | 0.05 | Mild dropout to prevent overfitting |
| Optimizer | AdamW 8-bit | bitsandbytes for memory savings |
| Batch Size | 4–8 | Per device; use gradient accumulation of 4–8 |
| Sequence Length | 512–1024 tokens | 2048+ risks OOM on 24GB |
| Epochs | 1–3 | Style-only: overtraining causes mimicry, not adaptation |
| Training Steps | 300–600 | Or 1–3 epochs on 500 samples |
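The table above can be captured as plain configuration dictionaries. The key names here are illustrative, loosely following common LoRA-trainer conventions rather than any specific library's API, and lora_alpha is an assumption (the common 2× rank convention, not stated in the table):

```python
# LoRA adapter settings from the table above.
lora_config = {
    "r": 16,                                 # style+task cap; use 8 for style-only
    "lora_alpha": 32,                        # 2x rank convention (assumption)
    "target_modules": ["q_proj", "v_proj"],  # minimum set; k_proj/o_proj optional
    "lora_dropout": 0.05,
}

# Trainer settings from the table above.
training_config = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,            # 5-10% warmup
    "optimizer": "adamw_8bit",       # bitsandbytes
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 1024,          # 2048+ risks OOM on 24GB
    "num_epochs": 2,
    "max_steps": 600,
}

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = (
    training_config["per_device_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
```

With a per-device batch of 4 and accumulation of 8, the effective batch size is 32, which keeps per-step memory low while still smoothing gradients.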
2.4 Inference
Requirement: The writing style LoRA shall be served via Ollama with a custom Modelfile that loads the base Qwen 2.5 7B model and applies the LoRA adapter weights. The service shall run on localhost and respond to OpenAI-compatible API calls.
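A sketch of what such a Modelfile might look like. The model tag, adapter path, and system prompt wording are assumptions; Ollama's ADAPTER directive expects a LoRA adapter built against the same base model:

```
# Modelfile (illustrative values)
FROM qwen2.5:7b-instruct
ADAPTER ./thota-style-lora
PARAMETER temperature 0.7
SYSTEM "You write as Thota: concise, direct, dry humor, no corporate fluff. Return only the draft text."

# Build and run locally:
#   ollama create thota-writer -f Modelfile
#   ollama run thota-writer
```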
Requirement: The API shall accept a writing-task prompt (e.g., "Draft a reply to my colleague thanking them for their help") and return text written in Thota's style — concise, direct, dry humor, no hedging, no corporate fluff.
Requirement: The Ollama server shall make zero outbound network requests during inference. All processing shall happen locally on the Mac Mini.
Requirement: The system prompt shall instruct the model to return only the draft text — no preamble like "Here's a draft:", no explanations, no "[DRAFT]" markers.
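A minimal sketch of building the OpenAI-compatible chat request for the local endpoint. The model name thota-writer and the exact prompt wording are assumptions; the body would be POSTed to Ollama's /v1/chat/completions on localhost:

```python
import json

SYSTEM_PROMPT = (
    "You write as Thota: concise, direct, dry humor, no hedging, no corporate fluff. "
    "Return only the draft text, with no preamble, labels, or explanations."
)

def build_chat_request(task: str) -> bytes:
    """Build the JSON body for POST http://localhost:11434/v1/chat/completions."""
    payload = {
        "model": "thota-writer",  # hypothetical Ollama model name
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task},
        ],
        "stream": False,
    }
    return json.dumps(payload).encode("utf-8")
```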
Requirement: Throughput target: 30–50 tokens/second on the M4 Pro Metal GPU with Q4 quantization.
3. Voice Clone TTS — Requirements
3.1 Training / Cloning Data
Requirement: Thota shall record approximately 1 hour of audio across 6–10 distinct emotional contexts, including: neutral/calm, happy/excited, sad/contemplative, angry/frustrated, surprised/curious, whispered/soft, authoritative/strong, tired/fatigued, and playful/teasing.
Requirement: Audio recordings shall be captured at a 16kHz minimum sample rate (24kHz recommended), in a consistent environment with the same microphone. Files shall be recorded in 5–10 minute segments per emotional context to avoid vocal fatigue.
Requirement: Audio shall be pre-processed: normalize levels, remove long silences and breathing artifacts, and ensure a consistent sample rate across all recordings.
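A stdlib-only sketch of one pre-processing step, peak normalization of a 16-bit WAV; a real pipeline would normally use a dedicated audio library for loudness normalization, silence trimming, and resampling, so treat this as illustrative:

```python
import struct
import wave

def peak_normalize_wav(src, dst, target_peak=0.95):
    """Scale a 16-bit PCM WAV so its loudest sample sits at target_peak
    of full scale (a simple stand-in for a full normalization pass)."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        assert params.sampwidth == 2, "expects 16-bit PCM"
        frames = r.readframes(params.nframes)
    samples = struct.unpack(f"<{params.nframes * params.nchannels}h", frames)
    peak = max(abs(s) for s in samples) or 1
    gain = target_peak * 32767 / peak
    # Clamp to the int16 range in case of rounding at full scale.
    scaled = [max(-32768, min(32767, int(s * gain))) for s in samples]
    with wave.open(dst, "wb") as w:
        w.setparams(params)
        w.writeframes(struct.pack(f"<{len(scaled)}h", *scaled))
```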
Requirement: All reference audio shall be stored locally in a FileVault-encrypted directory on the Mac Mini. No audio data shall be uploaded to any cloud service.
3.2 Voice Model Selection
Selected: OpenVoice V2 (MIT License)
Instant tone color cloning from 10–30 second reference clip — no fine-tuning required for basic cloning
Fine-tuning mode: ~2–4 hours on 1 hour of audio data for enhanced quality
Full Apple Silicon MPS support · runs 100% local on M4 Pro
0.3–0.8x real-time synthesis speed on M4 Pro Metal GPU
Emotion and prosody control via reference audio + style parameters
MIT licensed — free for all use including commercial
Runners-up: XTTS v2 (higher quality ceiling, but the Coqui Public Model License is not fully open source) and Parler-TTS Mini (Apache 2.0, description-driven style control, 880M params)
3.3 TTS Inference
Requirement: The TTS engine shall produce audio at a 22,050 Hz sample rate in WAV format (lossless), with optional MP3 for streaming. Output shall be returned as a file download or streaming response.
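The WAV packaging itself needs no third-party code. A sketch of wrapping synthesized 16-bit PCM into an in-memory WAV at the required sample rate (the function name and the mono-PCM input shape are assumptions):

```python
import io
import struct
import wave

def pcm_to_wav_bytes(pcm_samples, sample_rate=22050):
    """Wrap 16-bit mono PCM samples in a WAV container in memory,
    ready to return as a file download or streaming HTTP response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(pcm_samples)}h", *pcm_samples))
    return buf.getvalue()
```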
Requirement: End-to-end latency shall not exceed 2 seconds for short texts (under 100 characters), including reference audio processing and synthesis.
Requirement: The TTS API shall be served via FastAPI (Python) on the Mac Mini, callable from the SvelteKit backend via a single REST endpoint.
4. Unified API — Requirements
Requirement: A single SvelteKit backend shall expose the following API routes:
| Route | Method | Description |
|---|---|---|
| /api/tts/pipeline | POST | Unified pipeline: LoRA text gen → TTS synthesis in one call |
| /api/tts/clone | POST | TTS clone endpoint with reference audio |
| /api/tts/lora/generate | POST | LoRA text generation with writing style |
| /api/voice/upload | POST | Upload reference audio; returns a voice ID |
Requirement: The pipeline endpoint shall accept a prompt, optional style-reference text, and a reference audio clip, and return styled text plus synthesized speech in a single request.
5. Hardware & Performance Targets
| Metric | Target |
|---|---|
| Target Platform | Mac Mini M4 Pro · 24GB unified RAM · macOS |
| Writing Inference Speed | 30–50 tokens/sec (Q4 quantized) |
| TTS Synthesis Speed | 0.3–0.8× real-time (faster than speech) |
| Memory Usage — Writing | ~4–6GB (Qwen QLoRA + LoRA adapter) |
| Memory Usage — TTS | ~2–4GB (OpenVoice + MeloTTS) |
| Combined RAM Target | Under ~18GB of 24GB (leave headroom for macOS) |
| Training Time (Writing LoRA) | 1.5–6 hours per epoch on 500–1,000 samples |
| TTS Fine-tune Time (1hr audio) | 2–4 hours on M4 Pro Metal |
| Storage Required | 50GB+ free SSD for models, datasets, and outputs |
6. Privacy & Security
Requirement: All training and inference shall run entirely on the Mac Mini. No personal emails, messages, or voice recordings shall be sent to any external service.
Requirement: The TTS API shall be exposed only via Cloudflare Tunnel (outbound-only connection) or Tailscale VPN. No ports shall be opened directly on the router.
Requirement: SSH access shall use key-based authentication only. Password authentication shall be disabled.
Requirement: Training data shall be deduplicated and drawn from a diverse sample pool (per the dataset minimums in 2.1) to prevent the model from memorizing exact phrasing from personal messages.
Requirement: Reference audio storage shall be protected by FileVault full-disk encryption. Optionally, sensitive voice samples may additionally be stored in an encrypted DMG container.
7. Deployment & Remote Access
Requirement: The Mac Mini shall be configured to start automatically after a power failure (Energy Saver setting) and run as an always-on home server.
Requirement: Cloudflare Tunnel shall provide a persistent public URL for API access without opening router ports. The tunnel shall use an outbound-only connection.
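A sketch of the corresponding cloudflared ingress configuration; the tunnel ID, hostname, and credentials path are placeholders:

```yaml
# ~/.cloudflared/config.yml (illustrative values)
tunnel: <TUNNEL-ID>
credentials-file: /Users/thota/.cloudflared/<TUNNEL-ID>.json

ingress:
  - hostname: api.example.com
    service: http://localhost:5173   # SvelteKit backend
  - service: http_status:404         # catch-all for unmatched requests
```

cloudflared makes only outbound connections to Cloudflare's edge, so no router ports need to be opened.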
Requirement: Ollama shall be configured to listen on localhost only (the default). The FastAPI TTS server shall bind to localhost and accept connections only through the SvelteKit backend.
Requirement: Process management shall use launchd (macOS native) or tmux to ensure services restart automatically after a crash or reboot.
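A sketch of a launchd agent for the FastAPI TTS server; the label, uvicorn path, module name, and port are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.thota.tts</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/uvicorn</string>
    <string>tts_server:app</string>
    <string>--host</string>
    <string>127.0.0.1</string>
    <string>--port</string>
    <string>8000</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Saved to ~/Library/LaunchAgents/, the KeepAlive key tells launchd to relaunch the process after a crash, and RunAtLoad starts it at login/boot.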
8. Out of Scope
Browser-based or cloud-connected tools for email/WhatsApp parsing
Commercial voice cloning services or API-dependent TTS
vLLM (not supported on Apple Silicon)
14B+ models for training on 24GB (require gradient offloading; 3–5× slower)
Cross-platform Windows/Linux support (macOS-specific stack)