marcos committed
Commit bd4f893 · 0 Parent(s)

Initial commit

README.md ADDED
@@ -0,0 +1,209 @@
# PARLE Speech-to-Speech

Complete Speech-to-Speech pipeline for teaching Portuguese, with automatic adaptation to the student's CEFR level.

**HuggingFace:** [marcosremar2/parle-speech-to-speech](https://huggingface.co/marcosremar2/parle-speech-to-speech)

**Hardware:** TensorDock RTX 3090 (24GB VRAM)

## Pipeline

```
Audio -> Whisper (STT) -> CEFR Classifier -> Gemma 3 4B vLLM (LLM) -> Kokoro (TTS) -> Audio

Adapts the prompt to the student's level (A1-C1)
```

## Adaptive CEFR

The system automatically classifies the student's CEFR level every few user messages (every `CEFR_CLASSIFY_EVERY` messages, 2 by default) and adapts the avatar's responses:

| Level | Avatar behavior |
|-------|-----------------|
| **A1** | Very short sentences, basic vocabulary, slow speech |
| **A2** | Simple sentences, basic connectives, gentle corrections |
| **B1** | Varied vocabulary, a range of verb tenses |
| **B2** | Abstract discussions, idiomatic expressions |
| **C1** | Native-level language, cultural nuances |

**CEFR model:** `marcosremar2/cefr-classifier-pt-mdeberta-v3-enem` (96.43% accuracy)
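For quick experiments outside the server, the classifier can also be loaded directly with `transformers`. The sketch below mirrors what `classify_cefr()` in `app.py` does (same model ID and A1-C1 label order); running it on CPU is an assumption made for simplicity.

```python
# Minimal sketch: classify a Portuguese sentence with the published CEFR model.
# Mirrors classify_cefr() in app.py; assumes the A1-C1 label order used there.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem"
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

text = "Eu gosto de estudar português porque é uma língua bonita."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

print(LEVELS[int(probs.argmax())],
      {level: round(p.item(), 2) for level, p in zip(LEVELS, probs)})
```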
## HuggingFace Models

| Component | Model | Role |
|-----------|-------|------|
| STT | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | Audio transcription |
| LLM | [RedHatAI/gemma-3-4b-it-quantized.w4a16](https://huggingface.co/RedHatAI/gemma-3-4b-it-quantized.w4a16) | Response generation |
| TTS | [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) | Speech synthesis |
| CEFR | [marcosremar2/cefr-classifier-pt-mdeberta-v3-enem](https://huggingface.co/marcosremar2/cefr-classifier-pt-mdeberta-v3-enem) | Level classification |

## Endpoints

### Frontend-Compatible (JSON)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model and CEFR status |
| `/api/audio` | POST | Processes audio (STT → LLM → TTS) |
| `/api/text` | POST | Processes text (LLM → TTS) |
| `/api/reset` | POST | Clears conversation history and resets CEFR |

### CEFR Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/cefr/status` | GET | Current CEFR status (level, counter) |
| `/api/cefr/classify` | POST | Classifies a text manually |
| `/api/cefr/reset` | POST | Resets the level to B1 |
| `/api/cefr/set` | POST | Sets the level manually |

### WebSocket

| Endpoint | Description |
|----------|-------------|
| `/ws/stream` | Bidirectional audio streaming |
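The CEFR endpoints are easy to drive from a script. Note that `/api/cefr/set` expects a form field rather than a JSON body; the base URL below is a placeholder for this example.

```python
# Sketch: inspect and override the session's CEFR level. The URL is an example value.
import requests

BASE = "http://localhost:8000"

# Current level, message counter, and how far the buffer is from the next classification
print(requests.get(f"{BASE}/api/cefr/status").json())

# Manually pin the level to A2 (/api/cefr/set is form-encoded, not JSON)
print(requests.post(f"{BASE}/api/cefr/set", data={"level": "A2"}).json())

# Classify a sample text without touching the session level
print(requests.post(f"{BASE}/api/cefr/classify",
                    json={"text": "Eu gosto de estudar português."}).json())
```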
## Request Format

### POST /api/audio
```json
{
  "audio": "<base64 WAV>",
  "language": "pt",
  "voice": "pf_dora",
  "mode": "default"
}
```

### POST /api/text
```json
{
  "text": "Olá, como você está?",
  "language": "pt",
  "voice": "pf_dora",
  "mode": "default"
}
```

## Response Format

```json
{
  "transcription": {
    "text": "transcribed text",
    "language": "pt",
    "confidence": 1.0
  },
  "response": {
    "text": "LLM response",
    "emotion": "neutral",
    "language": "pt"
  },
  "speech": {
    "audio": "<base64 WAV>",
    "visemes": [],
    "duration": 1.5,
    "sample_rate": 24000,
    "format": "wav"
  },
  "timing": {
    "stt_ms": 100,
    "llm_ms": 200,
    "tts_ms": 150,
    "total_ms": 450
  },
  "cefr": {
    "current_level": "B1",
    "messages_until_classify": 3
  }
}
```

### POST /api/cefr/classify

```json
{
  "text": "Eu gosto de estudar português porque é uma língua bonita."
}
```

**Response:**
```json
{
  "level": "B1",
  "confidence": 0.87,
  "probabilities": {
    "A1": 0.02,
    "A2": 0.08,
    "B1": 0.87,
    "B2": 0.02,
    "C1": 0.01
  },
  "text_length": 58
}
```
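A minimal `/api/audio` client in Python looks like the sketch below. It assumes the server is reachable at `http://localhost:8000` and uses `input.wav` / `reply.wav` as illustrative file names.

```python
# Sketch of an /api/audio client: send a base64 WAV, save and inspect the reply.
# The base URL and file names are example values, not part of the project config.
import base64
import requests

with open("input.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8000/api/audio",
    json={"audio": audio_b64, "language": "pt", "voice": "pf_dora", "mode": "default"},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

print("Transcription:", data["transcription"]["text"])
print("Reply:        ", data["response"]["text"])
print("CEFR level:   ", data["cefr"]["current_level"])

with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(data["speech"]["audio"]))
```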
## Deploying on TensorDock

### 1. Create an RTX 3090 (24GB) instance

```bash
# SSH into the instance
ssh user@YOUR_TENSORDOCK_IP
```

### 2. Install dependencies

```bash
pip install fastapi uvicorn torch transformers vllm kokoro soundfile librosa
```

### 3. Set environment variables

```bash
export IDLE_TIMEOUT_SECONDS=300  # 5 minutes
export TENSORDOCK_API_TOKEN="your_token"
export TENSORDOCK_INSTANCE_ID="your_instance_id"
```

### 4. Start the server

```bash
python app.py
# or
uvicorn app:app --host 0.0.0.0 --port 8000
```

### 5. Configure the frontend

In the Next.js project's `.env` file:

```bash
NEXT_PUBLIC_CABECAO_BACKEND_URL="http://YOUR_TENSORDOCK_IP:8000"
```

## Auto-Stop

The server stops the instance automatically after 60 seconds of inactivity (configurable via `IDLE_TIMEOUT_SECONDS`).

To keep it alive, the frontend pings `/health` periodically.
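Any client can do the same keep-alive from a script; the sketch below polls `/health` well inside the idle timeout. The base URL and the 30-second interval are example values.

```python
# Sketch: keep the instance alive by polling /health more often than IDLE_TIMEOUT_SECONDS.
# BACKEND_URL and the interval are example values, not part of the project config.
import time
import requests

BACKEND_URL = "http://localhost:8000"

while True:
    try:
        status = requests.get(f"{BACKEND_URL}/health", timeout=10).json()
        auto_stop = status["auto_stop"]
        print(f"idle {auto_stop['idle_seconds']}s, stops in {auto_stop['stop_in_seconds']}s")
    except requests.RequestException as exc:
        print("health check failed:", exc)
    time.sleep(30)
```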
## WebSocket Streaming

```javascript
const ws = new WebSocket('ws://YOUR_IP:8000/ws/stream');

ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // WAV audio chunk - play it right away
    playAudioChunk(event.data);
  } else {
    // JSON with metrics or status
    const data = JSON.parse(event.data);
    console.log('Status:', data);
  }
};

// Send the recorded audio
ws.send(audioBlob);
```
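The same protocol can be exercised from Python. The sketch below uses the third-party `websockets` package (an assumption, not a project dependency); the stop condition is also an assumption, since the server sends its final metrics JSON after the audio chunks.

```python
# Sketch: drive /ws/stream from Python with the `websockets` package (assumed installed).
# Binary frames are WAV audio chunks; text frames are JSON status/metrics messages.
import asyncio
import json
import websockets

async def talk(wav_path: str, url: str = "ws://localhost:8000/ws/stream"):
    audio = b""
    async with websockets.connect(url) as ws:
        with open(wav_path, "rb") as f:
            await ws.send(f.read())  # one binary frame with the recorded audio
        while True:
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=30)
            except asyncio.TimeoutError:
                break  # assumption: no frame for 30 s means the turn is over
            if isinstance(msg, bytes):
                audio += msg  # WAV chunk
            else:
                print(json.loads(msg))  # status update or final metrics
    with open("reply.wav", "wb") as f:
        f.write(audio)

asyncio.run(talk("input.wav"))
```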
app.py ADDED
@@ -0,0 +1,1497 @@
1
+ """
2
+ DumontTalker Inference Server - Full Pipeline with WebSocket Streaming
3
+ TensorDock RTX 3090 (24GB VRAM)
4
+
5
+ Pipeline: Whisper (STT) → Gemma 3 4B vLLM (LLM) → Kokoro (TTS)
6
+ + CEFR Classifier: Classifica o nível do aluno a cada CEFR_CLASSIFY_EVERY mensagens (2 por padrão)
7
+
8
+ Auto-stop: Para a instância após 60s de inatividade
9
+
10
+ WebSocket: /ws/stream - Streaming bidirecional de áudio
11
+ - Cliente envia: áudio (binary)
12
+ - Servidor envia: chunks de áudio (binary) + métricas (JSON)
13
+ """
14
+
15
+ import base64
16
+ import io
17
+ import os
18
+ import re
19
+ import time
20
+ import json
21
+ import asyncio
22
+ import threading
23
+ import requests as http_requests
24
+ from datetime import datetime
25
+ from typing import Optional
26
+ from dataclasses import dataclass
27
+
28
+ from fastapi import FastAPI, File, Form, UploadFile, HTTPException, WebSocket, WebSocketDisconnect
29
+ from fastapi.responses import JSONResponse
30
+ from fastapi.middleware.cors import CORSMiddleware
31
+ from pydantic import BaseModel
32
+ from typing import List, Optional
33
+
34
+ # ============================================================================
35
+ # PYDANTIC MODELS - Compatible with frontend
36
+ # ============================================================================
37
+ class AudioRequest(BaseModel):
38
+ """Request format expected by frontend"""
39
+ audio: str # base64 encoded WAV
40
+ language: str = "pt" # Idioma para STT (forçar transcrição neste idioma)
41
+ voice: str = "pf_dora"
42
+ mode: str = "default"
43
+ conversation_history: List[dict] = []
44
+ student_name: str = "Aluno" # Nome do aluno para personalizar prompts
45
+ # Novo: system_prompt opcional enviado pelo frontend (para adaptar ao nível CEFR)
46
+ system_prompt: Optional[str] = None
47
+ max_tokens: Optional[int] = None # Max tokens para resposta (opcional)
48
+ temperature: Optional[float] = None # Temperature para LLM (opcional)
49
+ speed_rate: Optional[float] = None # Velocidade manual da fala (0.5-1.5, None = automático)
50
+
51
+ class TextRequest(BaseModel):
52
+ """Text request format expected by frontend"""
53
+ text: str
54
+ language: str = "pt"
55
+ voice: str = "pf_dora"
56
+ mode: str = "default"
57
+ stream: bool = False
58
+ student_name: str = "Aluno" # Nome do aluno para personalizar prompts
59
+ # Novo: system_prompt opcional enviado pelo frontend (para adaptar ao nível CEFR)
60
+ system_prompt: Optional[str] = None
61
+ max_tokens: Optional[int] = None # Max tokens para resposta (opcional)
62
+ temperature: Optional[float] = None # Temperature para LLM (opcional)
63
+
64
+ # ============================================================================
65
+ # AUTO-STOP CONFIGURATION
66
+ # ============================================================================
67
+ IDLE_TIMEOUT_SECONDS = int(os.environ.get("IDLE_TIMEOUT_SECONDS", "60"))
68
+ TENSORDOCK_API_TOKEN = os.environ.get("TENSORDOCK_API_TOKEN", "")
69
+ TENSORDOCK_INSTANCE_ID = os.environ.get("TENSORDOCK_INSTANCE_ID", "")
70
+
71
+ # Email alerts configuration
72
+ RESEND_API_KEY = os.environ.get("RESEND_API_KEY", "")
73
+ ALERT_EMAIL = os.environ.get("ALERT_EMAIL", "marcos@marcosrp.com") # Email to receive alerts
74
+
75
+ # Try to get instance ID from hostname if not set
76
+ if not TENSORDOCK_INSTANCE_ID:
77
+ try:
78
+ import socket
79
+ TENSORDOCK_INSTANCE_ID = socket.gethostname()
80
+ except:
81
+ pass
82
+
83
+ # Global state
84
+ last_activity = datetime.now()
85
+ auto_stop_enabled = True
86
+
87
+ def send_alert_email(subject: str, message: str):
88
+ """Send alert email via Resend API"""
89
+ if not RESEND_API_KEY:
90
+ print(f"[ALERT] No RESEND_API_KEY set, cannot send email: {subject}")
91
+ return False
92
+
93
+ try:
94
+ resp = http_requests.post(
95
+ "https://api.resend.com/emails",
96
+ headers={
97
+ "Authorization": f"Bearer {RESEND_API_KEY}",
98
+ "Content-Type": "application/json"
99
+ },
100
+ json={
101
+ "from": "PARLE Backend <alerts@parle.marcosrp.com>",
102
+ "to": [ALERT_EMAIL],
103
+ "subject": f"[PARLE ALERT] {subject}",
104
+ "html": f"""
105
+ <h2>🚨 PARLE Backend Alert</h2>
106
+ <p><strong>Instance:</strong> {TENSORDOCK_INSTANCE_ID or 'unknown'}</p>
107
+ <p><strong>Time:</strong> {datetime.now().isoformat()}</p>
108
+ <hr/>
109
+ <p>{message}</p>
110
+ <hr/>
111
+ <p style="color: #666; font-size: 12px;">
112
+ This is an automated alert from the PARLE TensorDock backend.
113
+ </p>
114
+ """
115
+ },
116
+ timeout=10
117
+ )
118
+ if resp.status_code == 200:
119
+ print(f"[ALERT] Email sent successfully: {subject}")
120
+ return True
121
+ else:
122
+ print(f"[ALERT] Failed to send email: {resp.status_code} {resp.text}")
123
+ return False
124
+ except Exception as e:
125
+ print(f"[ALERT] Error sending email: {e}")
126
+ return False
127
+
128
+ def touch_activity():
129
+ """Register activity (reset idle timer)"""
130
+ global last_activity
131
+ last_activity = datetime.now()
132
+
133
+ def stop_instance():
134
+ """Stop this TensorDock instance via API"""
135
+ if not TENSORDOCK_API_TOKEN or not TENSORDOCK_INSTANCE_ID:
136
+ error_msg = "Missing API token or instance ID, cannot stop"
137
+ print(f"[AUTO-STOP] {error_msg}")
138
+ send_alert_email(
139
+ "Auto-Stop FAILED - Missing Credentials",
140
+ f"""
141
+ <p><strong>Error:</strong> {error_msg}</p>
142
+ <p><strong>TENSORDOCK_API_TOKEN:</strong> {'SET' if TENSORDOCK_API_TOKEN else 'NOT SET'}</p>
143
+ <p><strong>TENSORDOCK_INSTANCE_ID:</strong> {TENSORDOCK_INSTANCE_ID or 'NOT SET'}</p>
144
+ <p style="color: red;"><strong>⚠️ The instance is still running and costing money!</strong></p>
145
+ <p>Please SSH into the VM and set the environment variables, or stop the instance manually.</p>
146
+ """
147
+ )
148
+ return False
149
+
150
+ try:
151
+ print(f"[AUTO-STOP] Stopping instance {TENSORDOCK_INSTANCE_ID}...")
152
+ resp = http_requests.post(
153
+ f"https://dashboard.tensordock.com/api/v2/instances/{TENSORDOCK_INSTANCE_ID}/stop",
154
+ headers={"Authorization": f"Bearer {TENSORDOCK_API_TOKEN}"},
155
+ timeout=30
156
+ )
157
+ if resp.status_code == 200:
158
+ print("[AUTO-STOP] Instance stopped successfully!")
159
+ return True
160
+ else:
161
+ error_msg = f"API returned {resp.status_code}: {resp.text}"
162
+ print(f"[AUTO-STOP] Failed to stop: {error_msg}")
163
+ send_alert_email(
164
+ "Auto-Stop FAILED - API Error",
165
+ f"""
166
+ <p><strong>Error:</strong> {error_msg}</p>
167
+ <p style="color: red;"><strong>⚠️ The instance is still running and costing money!</strong></p>
168
+ <p>Please stop the instance manually via TensorDock dashboard.</p>
169
+ """
170
+ )
171
+ return False
172
+ except Exception as e:
173
+ error_msg = str(e)
174
+ print(f"[AUTO-STOP] Error stopping instance: {error_msg}")
175
+ send_alert_email(
176
+ "Auto-Stop FAILED - Exception",
177
+ f"""
178
+ <p><strong>Exception:</strong> {error_msg}</p>
179
+ <p style="color: red;"><strong>⚠️ The instance is still running and costing money!</strong></p>
180
+ <p>Please stop the instance manually via TensorDock dashboard.</p>
181
+ """
182
+ )
183
+ return False
184
+
185
+ def idle_monitor():
186
+ """Background thread that monitors idle time and stops instance"""
187
+ global last_activity, auto_stop_enabled
188
+
189
+ print(f"[AUTO-STOP] Monitor started. Timeout: {IDLE_TIMEOUT_SECONDS}s")
190
+
191
+ while auto_stop_enabled:
192
+ time.sleep(10) # Check every 10 seconds
193
+
194
+ elapsed = (datetime.now() - last_activity).total_seconds()
195
+ remaining = max(0, IDLE_TIMEOUT_SECONDS - elapsed)
196
+
197
+ if elapsed >= IDLE_TIMEOUT_SECONDS:
198
+ print(f"[AUTO-STOP] Idle for {elapsed:.0f}s, stopping instance...")
199
+ success = stop_instance()
200
+ if not success:
201
+ # Alert already sent by stop_instance, but log the failure
202
+ print("[AUTO-STOP] CRITICAL: Failed to stop instance! Will keep trying every 60s...")
203
+ # Keep trying every 60 seconds instead of giving up
204
+ while auto_stop_enabled:
205
+ time.sleep(60)
206
+ if stop_instance():
207
+ break
208
+ break
209
+ elif remaining <= 30:
210
+ print(f"[AUTO-STOP] Warning: stopping in {remaining:.0f}s if no activity")
211
+
212
+ # Start idle monitor thread
213
+ monitor_thread = threading.Thread(target=idle_monitor, daemon=True)
214
+
215
+ # ============================================================================
216
+ # TEXT CHUNKER - Divide texto em chunks para TTS streaming
217
+ # ============================================================================
218
+ @dataclass
219
+ class ChunkConfig:
220
+ """Configuração do chunker"""
221
+ min_words: int = 3
222
+ max_words: int = 15
223
+ filler_words: list = None
224
+
225
+ def __post_init__(self):
226
+ if self.filler_words is None:
227
+ self.filler_words = ["hmm,", "bem,", "então,", "bom,", "olha,"]
228
+
229
+
230
+ class TextChunker:
231
+ """
232
+ Divide streaming de texto em chunks para TTS.
233
+
234
+ Prioridades de quebra:
235
+ 5: Fim de frase (. ! ?)
236
+ 4: Quebras semânticas fortes (porém, entretanto, ; :)
237
+ 3: Quebras médias (enquanto, embora, ,)
238
+ 2: Conectivos (e, mas, porque)
239
+ 1: Fallback por contagem de palavras
240
+ """
241
+
242
+ def __init__(self, config: ChunkConfig = None):
243
+ self.config = config or ChunkConfig()
244
+ self.buffer = ""
245
+ self.word_count = 0
246
+
247
+ # Padrões de quebra com prioridades
248
+ self.break_patterns = {
249
+ 5: [r'[.!?](?:\s|$)'], # Fim de frase
250
+ 4: [r'[;:](?:\s|$)', r'\b(porém|entretanto|contudo|todavia|portanto)\b'],
251
+ 3: [r',(?:\s|$)', r'\b(enquanto|embora|desde)\b'],
252
+ 2: [r'\b(e|mas|porque|então|ou)\b'],
253
+ }
254
+
255
+ def add_token(self, token: str) -> Optional[str]:
256
+ """
257
+ Adiciona token ao buffer e retorna chunk se pronto.
258
+
259
+ Returns:
260
+ Chunk de texto pronto para TTS, ou None se ainda acumulando.
261
+ """
262
+ self.buffer += token
263
+ self.word_count = len(self.buffer.split())
264
+
265
+ # Verificar quebras por prioridade
266
+ for priority in [5, 4, 3, 2]:
267
+ for pattern in self.break_patterns.get(priority, []):
268
+ match = re.search(pattern, self.buffer, re.IGNORECASE)
269
+ if match and self.word_count >= self.config.min_words:
270
+ # Encontrou ponto de quebra
271
+ split_pos = match.end()
272
+ chunk = self.buffer[:split_pos].strip()
273
+ self.buffer = self.buffer[split_pos:].strip()
274
+ self.word_count = len(self.buffer.split())
275
+ return chunk
276
+
277
+ # Fallback: quebrar por contagem de palavras
278
+ if self.word_count >= self.config.max_words:
279
+ words = self.buffer.split()
280
+ chunk = " ".join(words[:self.config.max_words])
281
+ self.buffer = " ".join(words[self.config.max_words:])
282
+ self.word_count = len(self.buffer.split())
283
+ return chunk
284
+
285
+ return None
286
+
287
+ def flush(self) -> Optional[str]:
288
+ """Retorna qualquer texto restante no buffer."""
289
+ if self.buffer.strip():
290
+ chunk = self.buffer.strip()
291
+ self.buffer = ""
292
+ self.word_count = 0
293
+ return chunk
294
+ return None
295
+
296
+
297
+ # ============================================================================
298
+ # MODELS
299
+ # ============================================================================
300
+ whisper_model = None
301
+ whisper_processor = None
302
+ vllm_engine = None
303
+ kokoro_pipeline = None
304
+ conversation_history = []
305
+
306
+ # CEFR Classifier
307
+ cefr_model = None
308
+ cefr_tokenizer = None
309
+ CEFR_MODEL = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem"
310
+ CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1"]
311
+
312
+ # CEFR tracking per session
313
+ user_message_buffer = [] # Buffer das mensagens do usuário
314
+ user_message_count = 0 # Contador de mensagens
315
+ current_cefr_level = "B1" # Nível padrão inicial
316
+ CEFR_CLASSIFY_EVERY = 2 # Classificar a cada N mensagens (reduzido para adaptação mais rápida)
317
+ CEFR_MIN_CHARS = 50 # Mínimo de caracteres para classificar (reduzido para A1)
318
+ CEFR_FIRST_MESSAGE_CLASSIFY = True # Classificar já na primeira mensagem se tiver chars suficientes
319
+
320
+ # Adaptive Speed System - Velocidade baseada em CEFR + espelhamento do aluno
321
+ CEFR_SPEED_MAP = {
322
+ "A1": 0.70, # Muito lento para iniciantes
323
+ "A2": 0.85, # Lento, bem articulado
324
+ "B1": 1.00, # Normal
325
+ "B2": 1.10, # Um pouco mais rápido
326
+ "C1": 1.25, # Fluente, rápido
327
+ }
328
+
329
+ CEFR_EXPECTED_WPM = {
330
+ "A1": 90, # Iniciantes falam ~90 palavras/min
331
+ "A2": 115,
332
+ "B1": 145,
333
+ "B2": 175,
334
+ "C1": 200, # Avançados falam ~200 palavras/min
335
+ }
336
+
337
+ # Estado da velocidade adaptativa
338
+ last_student_wpm = 0.0
339
+ suggested_avatar_speed = 1.0
340
+
341
+ LLM_MODEL = "RedHatAI/gemma-3-4b-it-quantized.w4a16"
342
+ WHISPER_MODEL = "openai/whisper-small"
343
+
344
+ # System prompts adaptados por nível CEFR
345
+ # {student_name} será substituído pelo nome do aluno
346
+ CEFR_SYSTEM_PROMPTS = {
347
+ "A1": """Você é Emma, professora de português para iniciantes.
348
+ O aluno se chama {student_name} e está no nível A1 (iniciante absoluto).
349
+
350
+ REGRAS OBRIGATÓRIAS:
351
+ - RESPONDA SEMPRE E SOMENTE EM PORTUGUÊS. NUNCA use inglês, francês ou outras línguas.
352
+ - Use 1-2 frases MUITO curtas (5-10 palavras cada).
353
+ - Vocabulário MUITO básico: saudações, números, cores, família, comida.
354
+ - Frases simples: sujeito + verbo + objeto. Ex: "Eu gosto de pizza."
355
+ - SEMPRE CORRIJA os erros do aluno de forma gentil, mostrando a forma correta.
356
+ Exemplo: Se o aluno disser "Eu gostar pizza", responda: "Muito bem! Em português dizemos 'Eu gosto de pizza'. Você gosta de pizza! 🍕"
357
+ - Se o aluno usar palavras em inglês/francês, ensine a palavra em português.
358
+ Exemplo: Se disser "happy", responda: "Feliz! Você está feliz! 😊"
359
+ - Celebre cada tentativa com entusiasmo.
360
+ - Faça perguntas simples: "Você gosta de...?" "O que é isso?"
361
+ - Use muitos emojis para tornar a conversa visual e amigável.""",
362
+
363
+ "A2": """Você é Emma, professora de português para nível básico.
364
+ O aluno se chama {student_name} e está no nível A2 (elementar).
365
+
366
+ REGRAS OBRIGATÓRIAS:
367
+ - RESPONDA SEMPRE E SOMENTE EM PORTUGUÊS. NUNCA use outras línguas.
368
+ - Use 2-3 frases curtas (10-15 palavras cada).
369
+ - Vocabulário do dia-a-dia: rotina, trabalho, hobbies, viagens.
370
+ - Conectivos básicos: e, mas, porque, quando, depois.
371
+ - Tempos verbais: presente, passado simples, "vou + infinitivo".
372
+ - CORRIJA erros importantes de forma natural e encorajadora.
373
+ Exemplo: "Ótimo! Só uma dica: dizemos 'fui ao cinema' em vez de 'fui no cinema'. Continue assim!"
374
+ - Se o aluno errar preposições ou conjugações, corrija gentilmente.
375
+ - Pergunte sobre rotina, família, hobbies, fins de semana.
376
+ - Seja paciente e encorajadora, mas ensine a forma correta.""",
377
+
378
+ "B1": """Você é Emma, professora de português para nível intermediário.
379
+ O aluno se chama {student_name} e está no nível B1 (intermediário).
380
+
381
+ REGRAS OBRIGATÓRIAS:
382
+ - RESPONDA SEMPRE E SOMENTE EM PORTUGUÊS.
383
+ - Use 2-3 frases de tamanho médio (15-25 palavras cada).
384
+ - Vocabulário variado com expressões comuns do português.
385
+ - Use diferentes tempos verbais naturalmente (presente, passado, futuro, condicional).
386
+ - Introduza o subjuntivo em contextos comuns: "Espero que você goste", "Talvez seja bom".
387
+ - Corrija erros de forma natural, integrada à conversa.
388
+ Exemplo: "Interessante! Eu também acho que seja importante... aliás, nesse caso dizemos 'é importante' no indicativo."
389
+ - Encoraje o aluno a elaborar mais: "Me conta mais sobre isso!"
390
+ - Tópicos: opiniões, experiências, planos, notícias, cultura.
391
+ - Faça perguntas que estimulem respostas mais longas.""",
392
+
393
+ "B2": """Você é Emma, professora de português para nível intermediário-avançado.
394
+ O aluno se chama {student_name} e está no nível B2 (intermediário superior).
395
+
396
+ REGRAS OBRIGATÓRIAS:
397
+ - RESPONDA SEMPRE EM PORTUGUÊS com naturalidade.
398
+ - Use 3-4 frases elaboradas (25-40 palavras cada).
399
+ - Vocabulário rico: expressões idiomáticas, phrasal verbs, colocações.
400
+ - Todas as estruturas gramaticais: subjuntivo, condicional, voz passiva.
401
+ - Discussões mais abstratas: política, sociedade, filosofia, arte.
402
+ - Correções sutis focando em nuances e estilo.
403
+ Exemplo: "Sua ideia está clara! Só um detalhe: em contextos mais formais, seria melhor usar 'embora' em vez de 'apesar que'."
404
+ - Desafie com perguntas argumentativas: "O que você pensa sobre...?" "Como você defende essa posição?"
405
+ - Use expressões coloquiais brasileiras naturalmente.
406
+ - Estimule debates e análises críticas.""",
407
+
408
+ "C1": """Você é Emma, professora de português para nível avançado.
409
+ O aluno se chama {student_name} e está no nível C1 (avançado/proficiente).
410
+
411
+ REGRAS OBRIGATÓRIAS:
412
+ - RESPONDA EM PORTUGUÊS com fluência nativa.
413
+ - Use 4-5 frases elaboradas e sofisticadas (40-60 palavras cada).
414
+ - Linguagem natural de falante nativo culto brasileiro.
415
+ - Vocabulário sofisticado: termos técnicos, acadêmicos, literários.
416
+ - Gírias, regionalismos, humor, ironia quando apropriado.
417
+ - Discussões complexas: filosofia, ciência, política internacional, arte, literatura.
418
+ - Correções apenas para refinamento estilístico ou nuances culturais.
419
+ Exemplo: "Argumento interessante! Talvez a expressão 'no que tange a' soe um pouco formal demais nesse contexto coloquial."
420
+ - Desafie intelectualmente: "Mas você não acha que há uma contradição entre...?"
421
+ - Explore nuances culturais brasileiras vs. portuguesas.
422
+ - Engaje em debates profundos e análises sofisticadas.
423
+ - Trate o aluno como um interlocutor intelectual.""",
424
+ }
425
+
426
+ # Configuração de max_tokens por nível CEFR
427
+ # Níveis mais baixos = respostas mais curtas, níveis altos = mais elaboradas
428
+ CEFR_MAX_TOKENS = {
429
+ "A1": 50, # 1-2 frases muito curtas
430
+ "A2": 70, # 2-3 frases curtas
431
+ "B1": 100, # 2-3 frases médias
432
+ "B2": 130, # 3-4 frases elaboradas
433
+ "C1": 180, # 4-5 frases sofisticadas
434
+ }
435
+
436
+ # Fallback prompts (mantidos para compatibilidade)
437
+ SYSTEM_PROMPTS = {
438
+ "chat": """Você é Emma, professora de idiomas. Ajude o usuário a praticar com conversação natural. Seja encorajadora, corrija erros gentilmente, mantenha respostas MUITO curtas (1-2 frases).""",
439
+ "default": """Você é Emma, uma professora de idiomas amigável e encorajadora. Ajude o usuário a aprender e praticar português. Mantenha respostas curtas e claras.""",
440
+ }
441
+
442
+ # ============================================================================
443
+ # FASTAPI APP
444
+ # ============================================================================
445
+ app = FastAPI(title="DumontTalker - Full Pipeline with WebSocket")
446
+
447
+ app.add_middleware(
448
+ CORSMiddleware,
449
+ allow_origins=["*"],
450
+ allow_credentials=True,
451
+ allow_methods=["*"],
452
+ allow_headers=["*"],
453
+ )
454
+
455
+ @app.on_event("startup")
456
+ async def load_models():
457
+ """Load all models on startup"""
458
+ global whisper_model, whisper_processor, vllm_engine, kokoro_pipeline
459
+ global cefr_model, cefr_tokenizer
460
+ import torch
461
+ import numpy as np
462
+
463
+ print("=" * 60)
464
+ print("Loading DumontTalker Full Pipeline + WebSocket + CEFR")
465
+ print(f"Auto-stop after {IDLE_TIMEOUT_SECONDS}s of inactivity")
466
+ print("=" * 60)
467
+
468
+ # 1. Load vLLM FIRST (needs contiguous memory)
469
+ print(f"[1/4] Loading vLLM: {LLM_MODEL}...")
470
+ from vllm import LLM
471
+
472
+ vllm_engine = LLM(
473
+ model=LLM_MODEL,
474
+ dtype="auto",
475
+ gpu_memory_utilization=0.40, # 40% of 24GB = ~9.6GB for vLLM (increased for longer prompts)
476
+ max_model_len=2048, # Increased to handle C1 level prompts
477
+ trust_remote_code=True,
478
+ )
479
+ print(f"[1/4] vLLM loaded!")
480
+
481
+ # 2. Load Whisper
482
+ print(f"[2/4] Loading Whisper: {WHISPER_MODEL}...")
483
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
484
+
485
+ whisper_processor = AutoProcessor.from_pretrained(WHISPER_MODEL)
486
+ whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
487
+ WHISPER_MODEL,
488
+ torch_dtype=torch.float16,
489
+ low_cpu_mem_usage=True,
490
+ ).to("cuda")
491
+ print(f"[2/4] Whisper loaded!")
492
+
493
+ # 3. Load CEFR Classifier (FP16 - ~0.6GB VRAM)
494
+ print(f"[3/4] Loading CEFR Classifier: {CEFR_MODEL}...")
495
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
496
+
497
+ cefr_tokenizer = AutoTokenizer.from_pretrained(CEFR_MODEL)
498
+ cefr_model = AutoModelForSequenceClassification.from_pretrained(
499
+ CEFR_MODEL,
500
+ torch_dtype=torch.float16,
501
+ low_cpu_mem_usage=True,
502
+ ).to("cuda")
503
+ cefr_model.eval() # Set to evaluation mode
504
+ print(f"[3/4] CEFR Classifier loaded! (FP16)")
505
+
506
+ # 4. Load Kokoro TTS
507
+ print(f"[4/4] Loading Kokoro TTS...")
508
+ from kokoro import KPipeline
509
+
510
+ kokoro_pipeline = KPipeline(lang_code='p', device='cuda')
511
+ print(f"[4/4] Kokoro loaded!")
512
+
513
+ # Memory status
514
+ allocated = torch.cuda.memory_allocated(0) / 1024**3
515
+ total = torch.cuda.get_device_properties(0).total_memory / 1024**3
516
+ print("=" * 60)
517
+ print(f"All models loaded! VRAM: {allocated:.1f}GB / {total:.1f}GB")
518
+ print(f"CEFR Classifier: {CEFR_MODEL}")
519
+ print(f"CEFR classification every {CEFR_CLASSIFY_EVERY} messages")
520
+ print("WebSocket endpoint: ws://host:8000/ws/stream")
521
+ print("=" * 60)
522
+
523
+ # Start idle monitor AFTER models are loaded
524
+ touch_activity() # Reset timer
525
+ monitor_thread.start()
526
+ print("[AUTO-STOP] Idle monitor started")
527
+
528
+
529
+ # ============================================================================
530
+ # HELPER FUNCTIONS
531
+ # ============================================================================
532
+
533
+ def classify_cefr(text: str) -> tuple:
534
+ """
535
+ Classifica o nível CEFR de um texto.
536
+
537
+ Returns:
538
+ tuple: (level, confidence, all_probs)
539
+ """
540
+ global cefr_model, cefr_tokenizer
541
+ import torch
542
+
543
+ if cefr_model is None or cefr_tokenizer is None:
544
+ print("[CEFR] Model not loaded, returning default B1")
545
+ return "B1", 0.0, {}
546
+
547
+ start = time.time()
548
+
549
+ # Tokenize
550
+ inputs = cefr_tokenizer(
551
+ text,
552
+ return_tensors="pt",
553
+ truncation=True,
554
+ max_length=512,
555
+ padding=True
556
+ )
557
+ inputs = {k: v.to("cuda") for k, v in inputs.items()}
558
+
559
+ # Inference
560
+ with torch.no_grad():
561
+ outputs = cefr_model(**inputs)
562
+ probs = torch.softmax(outputs.logits, dim=-1)
563
+ pred_idx = torch.argmax(probs, dim=-1).item()
564
+ confidence = probs[0][pred_idx].item()
565
+
566
+ level = CEFR_LEVELS[pred_idx]
567
+ all_probs = {CEFR_LEVELS[i]: probs[0][i].item() for i in range(len(CEFR_LEVELS))}
568
+
569
+ elapsed_ms = int((time.time() - start) * 1000)
570
+ print(f"[CEFR] {elapsed_ms}ms | Level: {level} ({confidence:.0%}) | Probs: {all_probs}")
571
+
572
+ return level, confidence, all_probs
573
+
574
+
575
+ def update_cefr_level(user_text: str) -> str:
576
+ """
577
+ Atualiza o nível CEFR baseado nas mensagens do usuário.
578
+ Classifica quando atingir CEFR_CLASSIFY_EVERY mensagens E CEFR_MIN_CHARS caracteres.
579
+
580
+ Se CEFR_FIRST_MESSAGE_CLASSIFY=True, também classifica na primeira mensagem
581
+ se ela tiver caracteres suficientes (importante para adaptação imediata).
582
+
583
+ Returns:
584
+ str: Nível CEFR atual (pode ter sido atualizado ou não)
585
+ """
586
+ global user_message_buffer, user_message_count, current_cefr_level
587
+
588
+ # Adiciona mensagem ao buffer
589
+ user_message_buffer.append(user_text)
590
+ user_message_count += 1
591
+
592
+ # Calcula tamanho total do buffer
593
+ combined_text = " ".join(user_message_buffer)
594
+ total_chars = len(combined_text)
595
+
596
+ print(f"[CEFR] Message {user_message_count}/{CEFR_CLASSIFY_EVERY} buffered | {total_chars}/{CEFR_MIN_CHARS} chars")
597
+
598
+ # Verifica se deve classificar:
599
+ # 1. Na primeira mensagem se CEFR_FIRST_MESSAGE_CLASSIFY=True e tiver chars suficientes
600
+ # 2. A cada CEFR_CLASSIFY_EVERY mensagens com chars suficientes
601
+ should_classify = False
602
+
603
+ if CEFR_FIRST_MESSAGE_CLASSIFY and user_message_count == 1 and total_chars >= CEFR_MIN_CHARS:
604
+ print(f"[CEFR] First message classification triggered ({total_chars} chars)")
605
+ should_classify = True
606
+ elif user_message_count >= CEFR_CLASSIFY_EVERY and total_chars >= CEFR_MIN_CHARS:
607
+ print(f"[CEFR] Periodic classification triggered ({total_chars} chars)")
608
+ should_classify = True
609
+
610
+ if should_classify:
611
+ print(f"[CEFR] Classifying combined text ({total_chars} chars)...")
612
+
613
+ # Classifica
614
+ new_level, confidence, probs = classify_cefr(combined_text)
615
+
616
+ # Atualiza nível se confiança > 50% (reduzido de 60% para melhor adaptação)
617
+ if confidence > 0.5:
618
+ old_level = current_cefr_level
619
+ current_cefr_level = new_level
620
+ if old_level != new_level:
621
+ print(f"[CEFR] Level changed: {old_level} → {new_level} (confidence: {confidence:.0%})")
622
+ else:
623
+ print(f"[CEFR] Level confirmed: {new_level} (confidence: {confidence:.0%})")
624
+ else:
625
+ print(f"[CEFR] Low confidence ({confidence:.0%}), keeping level: {current_cefr_level}")
626
+
627
+ # Reset buffer e contador
628
+ user_message_buffer = []
629
+ user_message_count = 0
630
+
631
+ elif user_message_count >= CEFR_CLASSIFY_EVERY:
632
+ # Atingiu mensagens mas não caracteres - continua acumulando
633
+ print(f"[CEFR] Need more text ({total_chars}/{CEFR_MIN_CHARS} chars), continuing to buffer...")
634
+
635
+ return current_cefr_level
636
+
637
+
638
+ def calculate_speech_metrics(audio_array, sample_rate: int, transcript: str) -> dict:
639
+ """
640
+ Calcula métricas de fala do aluno.
641
+
642
+ Args:
643
+ audio_array: Array de áudio (numpy)
644
+ sample_rate: Taxa de amostragem
645
+ transcript: Texto transcrito
646
+
647
+ Returns:
648
+ dict com métricas: audio_duration_sec, word_count, wpm
649
+ """
650
+ import numpy as np
651
+
652
+ # Duração do áudio em segundos
653
+ audio_duration_sec = len(audio_array) / sample_rate
654
+
655
+ # Contar palavras (simples, baseado em espaços)
656
+ words = transcript.strip().split()
657
+ word_count = len(words)
658
+
659
+ # Calcular WPM (palavras por minuto)
660
+ if audio_duration_sec > 0:
661
+ wpm = (word_count / audio_duration_sec) * 60
662
+ else:
663
+ wpm = 0
664
+
665
+ return {
666
+ "audio_duration_sec": round(audio_duration_sec, 2),
667
+ "word_count": word_count,
668
+ "wpm": round(wpm, 1)
669
+ }
670
+
671
+
672
+ def calculate_suggested_speed(cefr_level: str, student_wpm: float, manual_speed: Optional[float] = None) -> float:
673
+ """
674
+ Calcula a velocidade sugerida para o avatar baseada em:
675
+ 1. Nível CEFR do aluno
676
+ 2. WPM do aluno (espelhamento)
677
+ 3. Preferência manual (tem prioridade)
678
+
679
+ Args:
680
+ cefr_level: Nível CEFR atual do aluno (A1-C1)
681
+ student_wpm: Palavras por minuto do aluno
682
+ manual_speed: Velocidade definida manualmente (None = automático)
683
+
684
+ Returns:
685
+ float: Velocidade sugerida (0.5 a 1.5)
686
+ """
687
+ global last_student_wpm, suggested_avatar_speed
688
+
689
+ # Se velocidade manual foi definida, ela tem prioridade
690
+ if manual_speed is not None:
691
+ return max(0.5, min(1.5, manual_speed))
692
+
693
+ # Velocidade base do nível CEFR
694
+ base_speed = CEFR_SPEED_MAP.get(cefr_level, 1.0)
695
+
696
+ # Espelhamento: ajusta baseado na diferença entre WPM do aluno e esperado
697
+ if student_wpm > 0:
698
+ expected_wpm = CEFR_EXPECTED_WPM.get(cefr_level, 145)
699
+
700
+ # Razão entre WPM real e esperado
701
+ # Se aluno fala mais devagar, ratio < 1, avatar desacelera
702
+ # Se aluno fala mais rápido, ratio > 1, avatar acelera (até o limite)
703
+ ratio = student_wpm / expected_wpm
704
+
705
+ # Limita o fator de espelhamento entre 0.7 e 1.3
706
+ mirror_factor = max(0.7, min(1.3, ratio))
707
+
708
+ # Velocidade sugerida = base * espelhamento
709
+ suggested = base_speed * mirror_factor
710
+
711
+ # Atualiza estado global
712
+ last_student_wpm = student_wpm
713
+ else:
714
+ # Sem dados de WPM, usa apenas a velocidade base do CEFR
715
+ suggested = base_speed
716
+
717
+ # Limita entre 0.5 e 1.5
718
+ suggested_avatar_speed = max(0.5, min(1.5, suggested))
719
+
720
+ print(f"[SPEED] CEFR={cefr_level}, WPM={student_wpm:.0f}, base={base_speed:.2f}, suggested={suggested_avatar_speed:.2f}")
721
+
722
+ return round(suggested_avatar_speed, 2)
723
+
724
+
725
+ def transcribe_audio(audio_data: bytes, language: str = "pt") -> dict:
726
+ """
727
+ Transcreve áudio usando Whisper e calcula métricas de fala.
728
+
729
+ Args:
730
+ audio_data: Dados do áudio em bytes (WAV)
731
+ language: Código do idioma para forçar transcrição (pt, en, fr, es, etc.)
732
+ Default: "pt" (português)
733
+
734
+ Returns:
735
+ dict com: transcript, stt_ms, speech_metrics (audio_duration_sec, word_count, wpm)
736
+ """
737
+ global whisper_model, whisper_processor
738
+ import torch
739
+ import soundfile as sf
740
+ import librosa
741
+
742
+ start = time.time()
743
+
744
+ audio_array, sr = sf.read(io.BytesIO(audio_data))
745
+
746
+ if sr != 16000:
747
+ audio_array = librosa.resample(audio_array, orig_sr=sr, target_sr=16000)
748
+ sr = 16000
749
+
750
+ inputs = whisper_processor(audio_array, sampling_rate=16000, return_tensors="pt")
751
+ inputs = {k: v.to("cuda", dtype=torch.float16) if v.dtype == torch.float32 else v.to("cuda")
752
+ for k, v in inputs.items()}
753
+
754
+ # Forçar idioma na transcrição para evitar confusão
755
+ # Whisper usa tokens especiais para idioma: <|pt|>, <|en|>, etc.
756
+ forced_decoder_ids = whisper_processor.get_decoder_prompt_ids(language=language, task="transcribe")
757
+
758
+ with torch.no_grad():
759
+ output_ids = whisper_model.generate(
760
+ **inputs,
761
+ max_new_tokens=128,
762
+ forced_decoder_ids=forced_decoder_ids
763
+ )
764
+
765
+ transcript = whisper_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
766
+ elapsed_ms = int((time.time() - start) * 1000)
767
+
768
+ if not transcript.strip():
769
+ transcript = "Olá"
770
+
771
+ # Calcular métricas de fala (WPM, duração, etc.)
772
+ speech_metrics = calculate_speech_metrics(audio_array, sr, transcript)
773
+
774
+ print(f"[STT] {elapsed_ms}ms | lang={language} | WPM={speech_metrics['wpm']:.0f} | '{transcript}'")
775
+
776
+ return {
777
+ "transcript": transcript,
778
+ "stt_ms": elapsed_ms,
779
+ "speech_metrics": speech_metrics
780
+ }
781
+
782
+
783
+ def generate_response(
784
+ transcript: str,
785
+ mode: str = "chat",
786
+ student_name: str = "Aluno",
787
+ custom_system_prompt: Optional[str] = None,
788
+ custom_max_tokens: Optional[int] = None,
789
+ custom_temperature: Optional[float] = None
790
+ ) -> tuple:
791
+ """
792
+ Gera resposta com vLLM, adaptada ao nível CEFR do aluno.
793
+
794
+ O nível CEFR é atualizado a cada 2 mensagens do usuário.
795
+ O system prompt pode ser:
796
+ 1. Enviado pelo frontend (custom_system_prompt) - PREFERIDO
797
+ 2. Detectado automaticamente pelo backend (fallback)
798
+
799
+ Args:
800
+ transcript: Texto do usuário
801
+ mode: Modo de conversação (chat, default, cefr_adaptive)
802
+ student_name: Nome do aluno para personalizar o prompt
803
+ custom_system_prompt: System prompt customizado enviado pelo frontend (opcional)
804
+ custom_max_tokens: Max tokens customizado enviado pelo frontend (opcional)
805
+ custom_temperature: Temperature customizada enviada pelo frontend (opcional)
806
+ """
807
+ global vllm_engine, conversation_history, current_cefr_level
808
+ from vllm import SamplingParams
809
+ from transformers import AutoTokenizer
810
+
811
+ start = time.time()
812
+
813
+ # 1. Atualiza nível CEFR baseado na mensagem do usuário (para tracking)
814
+ cefr_level = update_cefr_level(transcript)
815
+
816
+ # 2. Seleciona system prompt
817
+ # PRIORIDADE: custom_system_prompt do frontend > prompt interno baseado em CEFR
818
+ if custom_system_prompt:
819
+ # Usa o prompt enviado pelo frontend (já adaptado ao nível CEFR)
820
+ system = custom_system_prompt
821
+ print(f"[LLM] Using CUSTOM system prompt from frontend (length: {len(system)})")
822
+ elif mode in ["chat", "default", "cefr_adaptive"]:
823
+ # Fallback: usa prompt interno baseado no nível detectado
824
+ system_template = CEFR_SYSTEM_PROMPTS.get(cefr_level, CEFR_SYSTEM_PROMPTS["B1"])
825
+ system = system_template.format(student_name=student_name)
826
+ print(f"[LLM] Using INTERNAL CEFR prompt for level: {cefr_level}, student: {student_name}")
827
+ else:
828
+ # Modo específico (fallback)
829
+ system = SYSTEM_PROMPTS.get(mode, SYSTEM_PROMPTS["default"])
830
+
831
+ messages = [{"role": "system", "content": system}]
832
+ messages.extend(conversation_history[-10:])
833
+ messages.append({"role": "user", "content": transcript})
834
+
835
+ tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)
836
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
837
+
838
+ # Usa parâmetros customizados se fornecidos, senão usa baseado no nível CEFR
839
+ max_tokens = custom_max_tokens if custom_max_tokens else CEFR_MAX_TOKENS.get(cefr_level, 100)
840
+ temperature = custom_temperature if custom_temperature else 0.7
841
+
842
+ params = SamplingParams(temperature=temperature, top_p=0.8, max_tokens=max_tokens)
843
+ outputs = vllm_engine.generate(prompt, params)
844
+ response = outputs[0].outputs[0].text.strip()
845
+
846
+ print(f"[LLM] max_tokens={max_tokens}, temp={temperature}, CEFR={cefr_level}")
847
+
848
+ conversation_history.append({"role": "user", "content": transcript})
849
+ conversation_history.append({"role": "assistant", "content": response})
850
+ if len(conversation_history) > 20:
851
+ conversation_history = conversation_history[-20:]
852
+
853
+ elapsed_ms = int((time.time() - start) * 1000)
854
+ print(f"[LLM] {elapsed_ms}ms | CEFR:{cefr_level} | '{response}'")
855
+
856
+ return response, elapsed_ms
857
+
858
+
859
+ def remove_emojis(text: str) -> str:
860
+ """Remove emojis e caracteres especiais do texto antes do TTS"""
861
+ # Pattern para remover emojis
862
+ emoji_pattern = re.compile("["
863
+ u"\U0001F600-\U0001F64F" # emoticons
864
+ u"\U0001F300-\U0001F5FF" # symbols & pictographs
865
+ u"\U0001F680-\U0001F6FF" # transport & map symbols
866
+ u"\U0001F1E0-\U0001F1FF" # flags
867
+ u"\U00002702-\U000027B0" # dingbats
868
+ u"\U000024C2-\U0001F251" # enclosed characters
869
+ u"\U0001f926-\U0001f937" # gestures
870
+ u"\U00010000-\U0010ffff" # supplementary
871
+ u"\u2640-\u2642" # gender symbols
872
+ u"\u2600-\u2B55" # misc symbols
873
+ u"\u200d" # zero width joiner
874
+ u"\u23cf" # eject symbol
875
+ u"\u23e9" # fast forward
876
+ u"\u231a" # watch
877
+ u"\ufe0f" # variation selector
878
+ u"\u3030" # wavy dash
879
+ "]+", flags=re.UNICODE)
880
+ return emoji_pattern.sub('', text).strip()
881
+
882
+
883
+ def synthesize_audio(text: str, voice: str = "af_bella") -> tuple:
884
+ """Sintetiza áudio com Kokoro TTS"""
885
+ global kokoro_pipeline
886
+ import numpy as np
887
+ import soundfile as sf
888
+
889
+ start = time.time()
890
+
891
+ # Remove emojis antes do TTS
892
+ clean_text = remove_emojis(text)
893
+ print(f"[TTS] Original: '{text}' -> Clean: '{clean_text}'")
894
+
895
+ audio_chunks = []
896
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_text, voice=voice):
897
+ if audio_chunk is not None and len(audio_chunk) > 0:
898
+ audio_chunks.append(audio_chunk)
899
+
900
+ if not audio_chunks:
901
+ raise Exception("TTS failed to generate audio")
902
+
903
+ audio_output = np.concatenate(audio_chunks)
904
+
905
+ buffer = io.BytesIO()
906
+ sf.write(buffer, audio_output, 24000, format='WAV')
907
+ audio_bytes = buffer.getvalue()
908
+
909
+ elapsed_ms = int((time.time() - start) * 1000)
910
+ print(f"[TTS] {elapsed_ms}ms | {len(audio_bytes)} bytes")
911
+
912
+ return audio_bytes, elapsed_ms
913
+
914
+
915
+ # ============================================================================
916
+ # HTTP ENDPOINTS
917
+ # ============================================================================
918
+
919
+ @app.get("/health")
920
+ def health():
921
+ import torch
922
+ global last_activity, current_cefr_level, user_message_count
923
+
924
+ elapsed = (datetime.now() - last_activity).total_seconds()
925
+ remaining = max(0, IDLE_TIMEOUT_SECONDS - elapsed)
926
+
927
+ allocated = torch.cuda.memory_allocated(0) / 1024**3 if torch.cuda.is_available() else 0
928
+ return {
929
+ "status": "healthy",
930
+ # Frontend compatibility fields
931
+ "whisper_loaded": whisper_model is not None,
932
+ "vllm_loaded": vllm_engine is not None,
933
+ "kokoro_loaded": kokoro_pipeline is not None,
934
+ "cefr_loaded": cefr_model is not None,
935
+ # Additional info
936
+ "models": {
937
+ "stt": WHISPER_MODEL,
938
+ "llm": LLM_MODEL,
939
+ "tts": "kokoro",
940
+ "cefr": CEFR_MODEL,
941
+ },
942
+ "cefr": {
943
+ "current_level": current_cefr_level,
944
+ "messages_until_classify": max(0, CEFR_CLASSIFY_EVERY - user_message_count),
945
+ "classify_every": CEFR_CLASSIFY_EVERY,
946
+ "min_chars": CEFR_MIN_CHARS,
947
+ "current_chars": len(" ".join(user_message_buffer)),
948
+ },
949
+ "vram_gb": f"{allocated:.1f}",
950
+ "websocket": "/ws/stream",
951
+ "auto_stop": {
952
+ "enabled": auto_stop_enabled,
953
+ "timeout_seconds": IDLE_TIMEOUT_SECONDS,
954
+ "idle_seconds": int(elapsed),
955
+ "stop_in_seconds": int(remaining),
956
+ }
957
+ }
958
+
959
+
960
+ class CEFRClassifyRequest(BaseModel):
961
+ """Request para classificação CEFR manual"""
962
+ text: str
963
+
964
+
965
+ @app.post("/api/cefr/classify")
966
+ async def api_cefr_classify(request: CEFRClassifyRequest):
967
+ """
968
+ Classifica manualmente o nível CEFR de um texto.
969
+ Não afeta o nível atual da sessão.
970
+ """
971
+ touch_activity()
972
+
973
+ level, confidence, probs = classify_cefr(request.text)
974
+
975
+ return {
976
+ "level": level,
977
+ "confidence": confidence,
978
+ "probabilities": probs,
979
+ "text_length": len(request.text),
980
+ }
981
+
982
+
983
+ @app.get("/api/cefr/status")
984
+ async def api_cefr_status():
985
+ """Retorna o status atual do CEFR"""
986
+ global current_cefr_level, user_message_count, user_message_buffer
987
+
988
+ current_chars = len(" ".join(user_message_buffer))
989
+ return {
990
+ "current_level": current_cefr_level,
991
+ "message_count": user_message_count,
992
+ "messages_until_classify": max(0, CEFR_CLASSIFY_EVERY - user_message_count),
993
+ "buffer_size": len(user_message_buffer),
994
+ "current_chars": current_chars,
995
+ "min_chars": CEFR_MIN_CHARS,
996
+ "chars_until_classify": max(0, CEFR_MIN_CHARS - current_chars),
997
+ "ready_to_classify": user_message_count >= CEFR_CLASSIFY_EVERY and current_chars >= CEFR_MIN_CHARS,
998
+ }
999
+
1000
+
1001
+ @app.post("/api/cefr/reset")
1002
+ async def api_cefr_reset():
1003
+ """Reseta o nível CEFR para o padrão (B1)"""
1004
+ global current_cefr_level, user_message_count, user_message_buffer
1005
+
1006
+ old_level = current_cefr_level
1007
+ current_cefr_level = "B1"
1008
+ user_message_count = 0
1009
+ user_message_buffer = []
1010
+
1011
+ return {
1012
+ "status": "reset",
1013
+ "old_level": old_level,
1014
+ "new_level": current_cefr_level,
1015
+ }
1016
+
1017
+
1018
+ @app.post("/api/cefr/set")
1019
+ async def api_cefr_set(level: str = Form(...)):
1020
+ """Define manualmente o nível CEFR"""
1021
+ global current_cefr_level
1022
+
1023
+ if level not in CEFR_LEVELS:
1024
+ raise HTTPException(status_code=400, detail=f"Invalid level. Must be one of: {CEFR_LEVELS}")
1025
+
1026
+ old_level = current_cefr_level
1027
+ current_cefr_level = level
1028
+
1029
+ return {
1030
+ "status": "set",
1031
+ "old_level": old_level,
1032
+ "new_level": current_cefr_level,
1033
+ }
1034
+
1035
+
1036
+ @app.post("/chat")
1037
+ async def chat(
1038
+ message: str = Form(...),
1039
+ mode: str = Form("chat"),
1040
+ ):
1041
+ """Text-only chat"""
1042
+ touch_activity()
1043
+
1044
+ response, llm_ms = generate_response(message, mode)
1045
+
1046
+ return {
1047
+ "response": response,
1048
+ "provider": "tensordock",
1049
+ "model": LLM_MODEL,
1050
+ "inference_ms": llm_ms,
1051
+ }
1052
+
1053
+
1054
+ @app.post("/process-audio")
1055
+ async def process_audio(
1056
+ audio: UploadFile = File(...),
1057
+ mode: str = Form("chat"),
1058
+ ):
1059
+ """Full pipeline: Audio -> STT -> LLM -> TTS -> Audio"""
1060
+ touch_activity()
1061
+
1062
+ overall_start = time.time()
1063
+
1064
+ try:
1065
+ audio_data = await audio.read()
1066
+
1067
+ # 1. STT (retorna dict com métricas)
1068
+ stt_result = transcribe_audio(audio_data)
1069
+ transcript = stt_result["transcript"]
1070
+ stt_ms = stt_result["stt_ms"]
1071
+
1072
+ # 2. LLM
1073
+ response, llm_ms = generate_response(transcript, mode)
1074
+
1075
+ # 3. TTS
1076
+ audio_bytes, tts_ms = synthesize_audio(response)
1077
+
1078
+ total_ms = int((time.time() - overall_start) * 1000)
1079
+
1080
+ return JSONResponse({
1081
+ "transcript": transcript,
1082
+ "response": response,
1083
+ "audio": base64.b64encode(audio_bytes).decode('utf-8'),
1084
+ "timing": {
1085
+ "stt_ms": stt_ms,
1086
+ "llm_ms": llm_ms,
1087
+ "tts_ms": tts_ms,
1088
+ "total_ms": total_ms,
1089
+ },
1090
+ "model": LLM_MODEL,
1091
+ "speech_metrics": stt_result["speech_metrics"],
1092
+ })
1093
+
1094
+ except Exception as e:
1095
+ print(f"[ERROR] {e}")
1096
+ import traceback
1097
+ traceback.print_exc()
1098
+ raise HTTPException(status_code=500, detail=str(e))
1099
+
1100
+
1101
+ @app.post("/keep-alive")
1102
+ def keep_alive():
1103
+ """Reset idle timer without doing inference"""
1104
+ touch_activity()
1105
+ return {"status": "ok", "message": "Timer reset"}
1106
+
1107
+
1108
+ @app.post("/clear")
1109
+ def clear_history():
1110
+ """Clear conversation history and reset CEFR"""
1111
+ global conversation_history, current_cefr_level, user_message_count, user_message_buffer
1112
+
1113
+ conversation_history = []
1114
+ current_cefr_level = "B1" # Reset to default
1115
+ user_message_count = 0
1116
+ user_message_buffer = []
1117
+
1118
+ touch_activity()
1119
+ return {"status": "cleared", "cefr_reset": True, "cefr_level": current_cefr_level}
1120
+
1121
+
1122
+ # ============================================================================
1123
+ # FRONTEND-COMPATIBLE API ENDPOINTS
1124
+ # ============================================================================
1125
+
1126
+ @app.post("/api/audio")
1127
+ async def api_audio(request: AudioRequest):
1128
+ """
1129
+ Frontend-compatible audio endpoint.
1130
+ Accepts JSON with base64 audio, returns response in expected format.
1131
+ Includes speech metrics and suggested speed for adaptive avatar.
1132
+ """
1133
+ touch_activity()
1134
+ overall_start = time.time()
1135
+
1136
+ try:
1137
+ # Decode base64 audio
1138
+ audio_data = base64.b64decode(request.audio)
1139
+ print(f"[API] Received audio: {len(audio_data)} bytes, mode: {request.mode}, lang: {request.language}, student: {request.student_name}")
1140
+
1141
+ # 1. STT - Forçar idioma para evitar confusão (retorna dict com métricas)
1142
+ stt_result = transcribe_audio(audio_data, language=request.language)
1143
+ transcript = stt_result["transcript"]
1144
+ stt_ms = stt_result["stt_ms"]
1145
+ speech_metrics = stt_result["speech_metrics"]
1146
+
1147
+ # 2. Calcular velocidade sugerida baseada em CEFR + WPM do aluno
1148
+ student_wpm = speech_metrics.get("wpm", 0)
1149
+ suggested_speed = calculate_suggested_speed(
1150
+ current_cefr_level,
1151
+ student_wpm,
1152
+ manual_speed=request.speed_rate # None se não definido manualmente
1153
+ )
1154
+
1155
+ # 3. LLM - Passar parâmetros customizados se fornecidos pelo frontend
1156
+ response_text, llm_ms = generate_response(
1157
+ transcript,
1158
+ request.mode,
1159
+ student_name=request.student_name,
1160
+ custom_system_prompt=request.system_prompt,
1161
+ custom_max_tokens=request.max_tokens,
1162
+ custom_temperature=request.temperature
1163
+ )
1164
+
1165
+ # 4. TTS
1166
+ audio_bytes, tts_ms = synthesize_audio(response_text)
1167
+ audio_duration = len(audio_bytes) / (24000 * 2) # Approximate duration
1168
+
1169
+ total_ms = int((time.time() - overall_start) * 1000)
1170
+
1171
+ # Return in frontend-expected format with speech metrics and suggested speed
1172
+ return JSONResponse({
1173
+ "transcription": {
1174
+ "text": transcript,
1175
+ "language": request.language,
1176
+ "confidence": 1.0,
1177
+ },
1178
+ "response": {
1179
+ "text": response_text,
1180
+ "emotion": "neutral",
1181
+ "language": request.language,
1182
+ },
1183
+ "speech": {
1184
+ "audio": base64.b64encode(audio_bytes).decode('utf-8'),
1185
+ "visemes": [], # Visemes not implemented yet
1186
+ "duration": audio_duration,
1187
+ "sample_rate": 24000,
1188
+ "format": "wav",
1189
+ },
1190
+ "timing": {
1191
+ "stt_ms": stt_ms,
1192
+ "llm_ms": llm_ms,
1193
+ "tts_ms": tts_ms,
1194
+ "total_ms": total_ms,
1195
+ },
1196
+ "cefr": {
1197
+ "current_level": current_cefr_level,
1198
+ "messages_until_classify": CEFR_CLASSIFY_EVERY - user_message_count,
1199
+ },
1200
+ # Novas métricas para velocidade adaptativa
1201
+ "speech_metrics": speech_metrics,
1202
+ "adaptive_speed": {
1203
+ "suggested_speed": suggested_speed,
1204
+ "student_wpm": student_wpm,
1205
+ "speed_mode": "manual" if request.speed_rate else "auto",
1206
+ },
1207
+ })
1208
+
1209
+ except Exception as e:
1210
+ print(f"[API ERROR] {e}")
1211
+ import traceback
1212
+ traceback.print_exc()
1213
+ raise HTTPException(status_code=500, detail=str(e))
1214
+
1215
+
1216
+ @app.post("/api/text")
1217
+ async def api_text(request: TextRequest):
1218
+ """
1219
+ Frontend-compatible text endpoint.
1220
+ Accepts JSON with text, returns LLM response with TTS audio.
1221
+ """
1222
+ touch_activity()
1223
+ overall_start = time.time()
1224
+
1225
+ try:
1226
+ print(f"[API] Received text: '{request.text[:50]}...', mode: {request.mode}, student: {request.student_name}")
1227
+
1228
+ # 1. LLM - Passar nome do aluno e parâmetros customizados do frontend
1229
+ response_text, llm_ms = generate_response(
1230
+ request.text,
1231
+ request.mode,
1232
+ student_name=request.student_name,
1233
+ custom_system_prompt=request.system_prompt,
1234
+ custom_max_tokens=request.max_tokens,
1235
+ custom_temperature=request.temperature
1236
+ )
1237
+
1238
+ # 2. TTS
1239
+ audio_bytes, tts_ms = synthesize_audio(response_text)
1240
+ audio_duration = len(audio_bytes) / (24000 * 2) # Approximate duration
1241
+
1242
+ total_ms = int((time.time() - overall_start) * 1000)
1243
+
1244
+ # Return in frontend-expected format
1245
+ return JSONResponse({
1246
+ "response": {
1247
+ "text": response_text,
1248
+ "emotion": "neutral",
1249
+ "language": request.language,
1250
+ },
1251
+ "speech": {
1252
+ "audio": base64.b64encode(audio_bytes).decode('utf-8'),
1253
+ "visemes": [], # Visemes not implemented yet
1254
+ "duration": audio_duration,
1255
+ "sample_rate": 24000,
1256
+ "format": "wav",
1257
+ },
1258
+ "timing": {
1259
+ "llm_ms": llm_ms,
1260
+ "tts_ms": tts_ms,
1261
+ "total_ms": total_ms,
1262
+ },
1263
+ "cefr": {
1264
+ "current_level": current_cefr_level,
1265
+ "messages_until_classify": CEFR_CLASSIFY_EVERY - user_message_count,
1266
+ },
1267
+ })
1268
+
1269
+ except Exception as e:
1270
+ print(f"[API ERROR] {e}")
1271
+ import traceback
1272
+ traceback.print_exc()
1273
+ raise HTTPException(status_code=500, detail=str(e))
1274
+
1275
+
1276
+ @app.post("/api/reset")
1277
+ async def api_reset():
1278
+ """Reset conversation history and CEFR - frontend compatible"""
1279
+ global conversation_history, current_cefr_level, user_message_count, user_message_buffer
1280
+
1281
+ conversation_history = []
1282
+ current_cefr_level = "B1"
1283
+ user_message_count = 0
1284
+ user_message_buffer = []
1285
+
1286
+ touch_activity()
1287
+ return {"status": "ok", "cefr_level": current_cefr_level}
1288
+
1289
+
1290
+ # ============================================================================
1291
+ # WEBSOCKET ENDPOINT - Streaming Audio
1292
+ # ============================================================================
1293
+
1294
+ @app.websocket("/ws/stream")
1295
+ async def websocket_stream(websocket: WebSocket):
1296
+ """
1297
+ WebSocket para streaming de áudio bidirecional.
1298
+
1299
+ Protocolo:
1300
+ 1. Cliente envia áudio (binary) ou JSON com config
1301
+ 2. Servidor envia chunks de áudio de resposta (binary)
1302
+ 3. Servidor envia métricas no final (JSON)
1303
+
1304
+ Exemplo JavaScript:
1305
+ ```javascript
1306
+ const ws = new WebSocket('ws://host:8000/ws/stream');
1307
+
1308
+ ws.onmessage = (event) => {
1309
+ if (event.data instanceof Blob) {
1310
+ // Chunk de áudio WAV - tocar imediatamente
1311
+ playAudioChunk(event.data);
1312
+ } else {
1313
+ // JSON com métricas ou status
1314
+ const data = JSON.parse(event.data);
1315
+ console.log('Metrics:', data);
1316
+ }
1317
+ };
1318
+
1319
+ // Enviar áudio gravado
1320
+ ws.send(audioBlob);
1321
+ ```
1322
+ """
1323
+ await websocket.accept()
1324
+ print("[WS] Client connected")
1325
+
1326
+ try:
1327
+ while True:
1328
+ touch_activity()
1329
+
1330
+ # Receber dados do cliente
1331
+ data = await websocket.receive()
1332
+
1333
+ if "bytes" in data:
1334
+ # Áudio binary
1335
+ audio_data = data["bytes"]
1336
+ overall_start = time.time()
1337
+
1338
+ # Enviar status de processamento
1339
+ await websocket.send_json({"status": "processing", "stage": "stt"})
1340
+
1341
+ # 1. STT
1342
+ transcript, stt_ms = transcribe_audio(audio_data)
1343
+ await websocket.send_json({
1344
+ "status": "processing",
1345
+ "stage": "llm",
1346
+ "transcript": transcript,
1347
+ "stt_ms": stt_ms
1348
+ })
1349
+
1350
+ # 2. LLM
1351
+ response, llm_ms = generate_response(transcript)
1352
+ await websocket.send_json({
1353
+ "status": "processing",
1354
+ "stage": "tts",
1355
+ "response": response,
1356
+ "llm_ms": llm_ms
1357
+ })
1358
+
1359
+ # 3. TTS - enviar áudio
1360
+ tts_start = time.time()
1361
+
1362
+ import numpy as np
1363
+ import soundfile as sf
1364
+
1365
+ audio_chunks = []
1366
+ chunk_count = 0
1367
+
1368
+ # Remove emojis antes do TTS
1369
+ clean_response = remove_emojis(response)
1370
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_response, voice='af_bella'):
1371
+ if audio_chunk is not None and len(audio_chunk) > 0:
1372
+ audio_chunks.append(audio_chunk)
1373
+ chunk_count += 1
1374
+
1375
+ # Enviar cada chunk como WAV
1376
+ buffer = io.BytesIO()
1377
+ sf.write(buffer, audio_chunk, 24000, format='WAV')
1378
+ await websocket.send_bytes(buffer.getvalue())
1379
+
1380
+ tts_ms = int((time.time() - tts_start) * 1000)
1381
+ total_ms = int((time.time() - overall_start) * 1000)
1382
+
1383
+ # Enviar métricas finais
1384
+ await websocket.send_json({
1385
+ "status": "complete",
1386
+ "transcript": transcript,
1387
+ "response": response,
1388
+ "timing": {
1389
+ "stt_ms": stt_ms,
1390
+ "llm_ms": llm_ms,
1391
+ "tts_ms": tts_ms,
1392
+ "total_ms": total_ms,
1393
+ },
1394
+ "chunks_sent": chunk_count,
1395
+ "model": LLM_MODEL,
1396
+ })
1397
+
1398
+ print(f"[WS] Complete: STT={stt_ms}ms, LLM={llm_ms}ms, TTS={tts_ms}ms, Total={total_ms}ms")
1399
+
1400
+ elif "text" in data:
1401
+ # JSON text (config ou texto para TTS)
1402
+ try:
1403
+ msg = json.loads(data["text"])
1404
+
1405
+ if msg.get("type") == "ping":
1406
+ await websocket.send_json({"type": "pong"})
1407
+
1408
+ elif msg.get("type") == "text":
1409
+ # Chat de texto com TTS
1410
+ text = msg.get("message", "")
1411
+ mode = msg.get("mode", "chat")
1412
+
1413
+ overall_start = time.time()
1414
+
1415
+ # LLM
1416
+ response, llm_ms = generate_response(text, mode)
1417
+
1418
+ # TTS streaming
1419
+ tts_start = time.time()
1420
+ chunk_count = 0
1421
+
1422
+ import numpy as np
1423
+ import soundfile as sf
1424
+
1425
+ # Remove emojis antes do TTS
1426
+ clean_response = remove_emojis(response)
1427
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_response, voice='af_bella'):
1428
+ if audio_chunk is not None and len(audio_chunk) > 0:
1429
+ chunk_count += 1
1430
+ buffer = io.BytesIO()
1431
+ sf.write(buffer, audio_chunk, 24000, format='WAV')
1432
+ await websocket.send_bytes(buffer.getvalue())
1433
+
1434
+ tts_ms = int((time.time() - tts_start) * 1000)
1435
+ total_ms = int((time.time() - overall_start) * 1000)
1436
+
1437
+ await websocket.send_json({
1438
+ "status": "complete",
1439
+ "response": response,
1440
+ "timing": {
1441
+ "llm_ms": llm_ms,
1442
+ "tts_ms": tts_ms,
1443
+ "total_ms": total_ms,
1444
+ },
1445
+ "chunks_sent": chunk_count,
1446
+ })
1447
+
1448
+ elif msg.get("type") == "tts":
1449
+ # TTS apenas (sem LLM)
1450
+ text = msg.get("text", "")
1451
+
1452
+ tts_start = time.time()
1453
+ chunk_count = 0
1454
+
1455
+ import numpy as np
1456
+ import soundfile as sf
1457
+
1458
+ # Remove emojis antes do TTS
1459
+ clean_text = remove_emojis(text)
1460
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_text, voice='af_bella'):
1461
+ if audio_chunk is not None and len(audio_chunk) > 0:
1462
+ chunk_count += 1
1463
+ buffer = io.BytesIO()
1464
+ sf.write(buffer, audio_chunk, 24000, format='WAV')
1465
+ await websocket.send_bytes(buffer.getvalue())
1466
+
1467
+ tts_ms = int((time.time() - tts_start) * 1000)
1468
+
1469
+ await websocket.send_json({
1470
+ "status": "complete",
1471
+ "timing": {"tts_ms": tts_ms},
1472
+ "chunks_sent": chunk_count,
1473
+ })
1474
+
1475
+ except json.JSONDecodeError:
1476
+ await websocket.send_json({"error": "Invalid JSON"})
1477
+
1478
+ except WebSocketDisconnect:
1479
+ print("[WS] Client disconnected")
1480
+ except Exception as e:
1481
+ print(f"[WS] Error: {e}")
1482
+ import traceback
1483
+ traceback.print_exc()
1484
+ try:
1485
+ await websocket.send_json({"error": str(e)})
1486
+ except:
1487
+ pass
1488
+ finally:
1489
+ try:
1490
+ await websocket.close()
1491
+ except:
1492
+ pass
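A Python counterpart to the JavaScript example in the docstring above, as a rough sketch. It assumes the `websockets` package is installed, the server is reachable at `ws://localhost:8000/ws/stream`, and a recording exists at `recording.wav`; per the protocol described above, binary frames carry WAV chunks and text frames carry the JSON status/metrics messages.

```python
# Hypothetical streaming client (server URL and recording.wav are assumptions).
import asyncio
import json
import websockets

async def talk():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        with open("recording.wav", "rb") as f:
            await ws.send(f.read())  # binary frame -> STT -> LLM -> TTS on the server

        chunks = []
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                chunks.append(msg)  # one WAV chunk of the reply; play or buffer it here
            else:
                data = json.loads(msg)  # status updates, then the final metrics message
                print(data)
                if data.get("status") == "complete":
                    break

asyncio.run(talk())
```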
1493
+
1494
+
1495
+ if __name__ == "__main__":
1496
+ import uvicorn
1497
+ uvicorn.run(app, host="0.0.0.0", port=8000)
cefr/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # CEFR Classifier Model
2
+ # HuggingFace: marcosremar2/cefr-classifier-pt-mdeberta-v3-enem
checkpoint.sh ADDED
@@ -0,0 +1,126 @@
1
+ #!/bin/bash
2
+ # Create checkpoint of PARLE backend with all models loaded
3
+ # Requires: patched CRIU (criu-patched), io_uring disabled
4
+ #
5
+ # Usage: ./checkpoint.sh [--stop]
6
+ # --stop: Stop the process after checkpoint (default: keep running)
7
+
8
+ set -e
9
+
10
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
11
+ CHECKPOINT_NAME="parle-$(date +%Y%m%d-%H%M%S)"
12
+ CHECKPOINT_PATH="$CHECKPOINT_DIR/$CHECKPOINT_NAME"
13
+ LATEST_LINK="$CHECKPOINT_DIR/latest"
14
+ LEAVE_RUNNING="--leave-running"
15
+
16
+ # Parse arguments
17
+ if [ "$1" = "--stop" ]; then
18
+ LEAVE_RUNNING=""
19
+ echo "Will STOP process after checkpoint"
20
+ fi
21
+
22
+ echo "=============================================="
23
+ echo "PARLE Backend Checkpoint"
24
+ echo "=============================================="
25
+
26
+ # Find Python process
27
+ PYTHON_PID=$(pgrep -f "python.*app.py" | head -1)
28
+ if [ -z "$PYTHON_PID" ]; then
29
+ echo "ERROR: No Python backend process found"
30
+ echo "Start the backend first with: ./start.sh"
31
+ exit 1
32
+ fi
33
+ echo "Found backend process: PID $PYTHON_PID"
34
+
35
+ # Check health
36
+ echo ""
37
+ echo "[1/3] Checking backend health..."
38
+ HEALTH=$(curl -s --max-time 5 localhost:8000/health 2>/dev/null)
39
+ if [ -z "$HEALTH" ]; then
40
+ echo "ERROR: Backend not responding to health check"
41
+ exit 1
42
+ fi
43
+
44
+ VLLM=$(echo "$HEALTH" | grep -o '"vllm_loaded":true' || true)
45
+ WHISPER=$(echo "$HEALTH" | grep -o '"whisper_loaded":true' || true)
46
+ KOKORO=$(echo "$HEALTH" | grep -o '"kokoro_loaded":true' || true)
47
+
48
+ if [ -z "$VLLM" ] || [ -z "$WHISPER" ] || [ -z "$KOKORO" ]; then
49
+ echo "ERROR: Not all models are loaded yet"
50
+ echo "Wait for all models to load before checkpointing"
51
+ echo "Health: $HEALTH"
52
+ exit 1
53
+ fi
54
+ echo "All models loaded!"
55
+
56
+ # Check io_uring is disabled
57
+ echo ""
58
+ echo "[2/3] Checking system configuration..."
59
+ IO_URING=$(cat /proc/sys/kernel/io_uring_disabled 2>/dev/null || echo "unknown")
60
+ if [ "$IO_URING" != "2" ]; then
61
+ echo "WARNING: io_uring not disabled (value: $IO_URING)"
62
+ echo "Run: sudo sysctl -w kernel.io_uring_disabled=2"
63
+ echo "Continuing anyway..."
64
+ fi
65
+
66
+ # Check CRIU
67
+ if [ ! -f /usr/local/bin/criu-patched ]; then
68
+ echo "ERROR: Patched CRIU not found at /usr/local/bin/criu-patched"
69
+ echo "Run setup-criu-patched.sh first"
70
+ exit 1
71
+ fi
72
+ echo "Patched CRIU found"
73
+
74
+ # Create checkpoint directory
75
+ mkdir -p "$CHECKPOINT_PATH"
76
+
77
+ echo ""
78
+ echo "[3/3] Creating checkpoint..."
79
+ echo "Path: $CHECKPOINT_PATH"
80
+ echo "This may take 30-60 seconds..."
81
+ echo ""
82
+
83
+ START_TIME=$(date +%s)
84
+
85
+ # Run CRIU checkpoint
86
+ CRIU_PLUGINS_DIR=/usr/lib/criu /usr/local/bin/criu-patched dump \
87
+ -t $PYTHON_PID \
88
+ -D "$CHECKPOINT_PATH" \
89
+ --shell-job \
90
+ --tcp-established \
91
+ --file-locks \
92
+ --ext-unix-sk \
93
+ $LEAVE_RUNNING \
94
+ -v2 \
95
+ -o "$CHECKPOINT_PATH/dump.log" 2>&1 || {
96
+ echo ""
97
+ echo "ERROR: CRIU dump failed"
98
+ echo "Check log: $CHECKPOINT_PATH/dump.log"
99
+ tail -20 "$CHECKPOINT_PATH/dump.log"
100
+ exit 1
101
+ }
102
+
103
+ END_TIME=$(date +%s)
104
+ DURATION=$((END_TIME - START_TIME))
105
+
106
+ # Update latest symlink
107
+ rm -f "$LATEST_LINK"
108
+ ln -s "$CHECKPOINT_PATH" "$LATEST_LINK"
109
+
110
+ # Get checkpoint size
111
+ SIZE=$(du -sh "$CHECKPOINT_PATH" | cut -f1)
112
+
113
+ echo ""
114
+ echo "=============================================="
115
+ echo "Checkpoint created successfully!"
116
+ echo "=============================================="
117
+ echo "Path: $CHECKPOINT_PATH"
118
+ echo "Size: $SIZE"
119
+ echo "Time: ${DURATION}s"
120
+ echo "Symlink: $LATEST_LINK"
121
+ echo ""
122
+ if [ -z "$LEAVE_RUNNING" ]; then
123
+ echo "Process was STOPPED. To restore: ./restore.sh"
124
+ else
125
+ echo "Process is still running. To restore later: ./restore.sh"
126
+ fi
deploy-criu.sh ADDED
@@ -0,0 +1,69 @@
1
+ #!/bin/bash
2
+ # Deploy CRIU + cuda-checkpoint to TensorDock
3
+ # This script copies the necessary files to the server and sets up CRIU
4
+
5
+ set -e
6
+
7
+ # Configuration
8
+ SERVER="8.17.147.158"
9
+ SSH_PORT="10038"
10
+ SSH_USER="root"
11
+ REMOTE_DIR="/home/user/parle-backend"
12
+
13
+ echo "=================================================="
14
+ echo "Deploying CRIU + cuda-checkpoint to TensorDock"
15
+ echo "=================================================="
16
+ echo "Server: $SERVER:$SSH_PORT"
17
+ echo ""
18
+
19
+ # Check if we can connect
20
+ echo "[1/4] Testing SSH connection..."
21
+ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 -p $SSH_PORT $SSH_USER@$SERVER "echo 'SSH connection OK'" || {
22
+ echo "ERROR: Cannot connect to server via SSH"
23
+ echo ""
24
+ echo "Manual steps:"
25
+ echo "1. SSH into the server: ssh -p $SSH_PORT $SSH_USER@$SERVER"
26
+ echo "2. Copy these scripts to /home/user/parle-backend/"
27
+ echo "3. Run: sudo ./setup-criu.sh"
28
+ echo "4. Test: ./start-smart.sh"
29
+ exit 1
30
+ }
31
+
32
+ # Copy scripts
33
+ echo ""
34
+ echo "[2/4] Copying scripts to server..."
35
+ SCRIPT_DIR="$(dirname "$0")"
36
+ scp -P $SSH_PORT \
37
+ "$SCRIPT_DIR/setup-criu.sh" \
38
+ "$SCRIPT_DIR/checkpoint.sh" \
39
+ "$SCRIPT_DIR/restore.sh" \
40
+ "$SCRIPT_DIR/start-smart.sh" \
41
+ "$SCRIPT_DIR/start.sh" \
42
+ "$SCRIPT_DIR/app.py" \
43
+ $SSH_USER@$SERVER:$REMOTE_DIR/
44
+
45
+ echo "Scripts copied successfully"
46
+
47
+ # Run setup
48
+ echo ""
49
+ echo "[3/4] Running CRIU setup on server..."
50
+ ssh -p $SSH_PORT $SSH_USER@$SERVER "cd $REMOTE_DIR && chmod +x *.sh && sudo ./setup-criu.sh"
51
+
52
+ # Test
53
+ echo ""
54
+ echo "[4/4] Testing installation..."
55
+ ssh -p $SSH_PORT $SSH_USER@$SERVER "cuda-checkpoint --help > /dev/null && echo 'cuda-checkpoint: OK' || echo 'cuda-checkpoint: FAILED'"
56
+ ssh -p $SSH_PORT $SSH_USER@$SERVER "criu --version"
57
+
58
+ echo ""
59
+ echo "=================================================="
60
+ echo "Deployment complete!"
61
+ echo "=================================================="
62
+ echo ""
63
+ echo "Next steps:"
64
+ echo "1. SSH into server: ssh -p $SSH_PORT $SSH_USER@$SERVER"
65
+ echo "2. Start backend: cd $REMOTE_DIR && ./start.sh"
66
+ echo "3. Wait for models to load (~2 min)"
67
+ echo "4. Create checkpoint: ./checkpoint.sh"
68
+ echo "5. Test restore: ./restore.sh"
69
+ echo ""
fast_startup.py ADDED
@@ -0,0 +1,409 @@
1
+ """
2
+ Fast Startup Module - Otimizacoes para Cold Start Rapido
3
+
4
+ Estrategias implementadas:
5
+ 1. fastsafetensors - Loading 4.8x-7.5x mais rapido
6
+ 2. CUDA Graph caching - Economiza ~54s
7
+ 3. Parallel model loading - Carrega modelos simultaneamente
8
+ 4. Lazy loading - CEFR classifier carrega depois
9
+ 5. Pre-download models - Cache local no SSD
10
+
11
+ Target: Cold start de ~487s para ~60s
12
+ """
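To make the intended flow concrete, a usage sketch (hypothetical wiring, e.g. from app.py during startup; `FastModelLoader` and its methods are defined below in this module):

```python
# Sketch: reach "ready to answer" as fast as possible; the CEFR classifier loads in background.
from fast_startup import FastModelLoader

loader = FastModelLoader(on_progress=lambda msg, pct: print(f"[{pct:.0f}%] {msg}"))
metrics = loader.load_essential_only()  # vLLM first, then Whisper + Kokoro in parallel

print(f"Essential models ready in {metrics.total_ms / 1000:.1f}s")
print(f"CEFR ready yet? {loader.is_fully_ready()}")  # flips to True once the background load ends
```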
13
+
14
+ import os
15
+ import sys
16
+ import time
17
+ import asyncio
18
+ import threading
19
+ from concurrent.futures import ThreadPoolExecutor
20
+ from typing import Optional, Callable
21
+ from dataclasses import dataclass
22
+
23
+ # Environment variables para otimizacao
24
+ os.environ["USE_FASTSAFETENSOR"] = "true" # Enable fastsafetensors
25
+ os.environ["VLLM_USE_MODELSCOPE"] = "false"
26
+ os.environ["TOKENIZERS_PARALLELISM"] = "false" # Avoid warnings
27
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" # Better memory
28
+
29
+ # Cache directories
30
+ CACHE_DIR = "/var/cache/parle-models"
31
+ VLLM_CACHE_DIR = f"{CACHE_DIR}/vllm"
32
+ HF_CACHE_DIR = f"{CACHE_DIR}/huggingface"
33
+
34
+ os.environ["HF_HOME"] = HF_CACHE_DIR
35
+ os.environ["VLLM_CACHE_DIR"] = VLLM_CACHE_DIR
36
+
37
+
38
+ @dataclass
39
+ class LoadingMetrics:
40
+ """Metricas de carregamento"""
41
+ vllm_ms: int = 0
42
+ whisper_ms: int = 0
43
+ cefr_ms: int = 0
44
+ kokoro_ms: int = 0
45
+ total_ms: int = 0
46
+ parallel: bool = False
47
+
48
+
49
+ class FastModelLoader:
50
+ """
51
+ Carregador otimizado de modelos com:
52
+ - Parallel loading
53
+ - Progress callbacks
54
+ - Lazy loading para modelos secundarios
55
+ """
56
+
57
+ def __init__(
58
+ self,
59
+ vllm_model: str = "RedHatAI/gemma-3-4b-it-quantized.w4a16",
60
+ whisper_model: str = "openai/whisper-small",
61
+ cefr_model: str = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem",
62
+ gpu_memory_utilization: float = 0.40,
63
+ on_progress: Optional[Callable[[str, float], None]] = None,
64
+ ):
65
+ self.vllm_model = vllm_model
66
+ self.whisper_model = whisper_model
67
+ self.cefr_model = cefr_model
68
+ self.gpu_memory_utilization = gpu_memory_utilization
69
+ self.on_progress = on_progress
70
+
71
+ # Model instances
72
+ self.vllm_engine = None
73
+ self.whisper_model_instance = None
74
+ self.whisper_processor = None
75
+ self.cefr_model_instance = None
76
+ self.cefr_tokenizer = None
77
+ self.kokoro_pipeline = None
78
+
79
+ # Loading state
80
+ self.metrics = LoadingMetrics()
81
+ self._loading_lock = threading.Lock()
82
+
83
+ def _progress(self, message: str, percentage: float):
84
+ """Report progress"""
85
+ print(f"[{percentage:.0f}%] {message}")
86
+ if self.on_progress:
87
+ self.on_progress(message, percentage)
88
+
89
+ def _ensure_cache_dirs(self):
90
+ """Criar diretorios de cache"""
91
+ os.makedirs(CACHE_DIR, exist_ok=True)
92
+ os.makedirs(VLLM_CACHE_DIR, exist_ok=True)
93
+ os.makedirs(HF_CACHE_DIR, exist_ok=True)
94
+
95
+ def load_vllm(self) -> int:
96
+ """
97
+ Carrega vLLM com otimizacoes:
98
+ - fastsafetensors (se disponivel)
99
+ - load_format="auto" (detecta melhor formato)
100
+ - CUDA graph caching
101
+ """
102
+ start = time.time()
103
+ self._progress("Loading vLLM (optimized)...", 10)
104
+
105
+ from vllm import LLM
106
+
107
+ # Check if fastsafetensors is available
108
+ try:
109
+ import fastsafetensors
110
+ load_format = "fastsafetensors"
111
+ self._progress("Using fastsafetensors (4-7x faster)", 12)
112
+ except ImportError:
113
+ load_format = "auto"
114
+ self._progress("fastsafetensors not found, using auto", 12)
115
+
116
+ self.vllm_engine = LLM(
117
+ model=self.vllm_model,
118
+ dtype="auto",
119
+ gpu_memory_utilization=self.gpu_memory_utilization,
120
+ max_model_len=2048,
121
+ trust_remote_code=True,
122
+ # Otimizacoes de loading
123
+ load_format=load_format,
124
+ # CUDA graph optimization
125
+ enforce_eager=False, # Enable CUDA graphs
126
+ # Disable unnecessary features for faster startup
127
+ enable_prefix_caching=False,
128
+ disable_custom_all_reduce=True,
129
+ )
130
+
131
+ elapsed_ms = int((time.time() - start) * 1000)
132
+ self.metrics.vllm_ms = elapsed_ms
133
+ self._progress(f"vLLM loaded in {elapsed_ms/1000:.1f}s", 40)
134
+
135
+ return elapsed_ms
136
+
137
+ def load_whisper(self) -> int:
138
+ """Carrega Whisper STT"""
139
+ start = time.time()
140
+ self._progress("Loading Whisper STT...", 45)
141
+
142
+ import torch
143
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
144
+
145
+ self.whisper_processor = AutoProcessor.from_pretrained(
146
+ self.whisper_model,
147
+ cache_dir=HF_CACHE_DIR,
148
+ )
149
+
150
+ self.whisper_model_instance = AutoModelForSpeechSeq2Seq.from_pretrained(
151
+ self.whisper_model,
152
+ torch_dtype=torch.float16,
153
+ low_cpu_mem_usage=True,
154
+ cache_dir=HF_CACHE_DIR,
155
+ ).to("cuda")
156
+
157
+ elapsed_ms = int((time.time() - start) * 1000)
158
+ self.metrics.whisper_ms = elapsed_ms
159
+ self._progress(f"Whisper loaded in {elapsed_ms/1000:.1f}s", 60)
160
+
161
+ return elapsed_ms
162
+
163
+ def load_kokoro(self) -> int:
164
+ """Carrega Kokoro TTS"""
165
+ start = time.time()
166
+ self._progress("Loading Kokoro TTS...", 65)
167
+
168
+ from kokoro import KPipeline
169
+
170
+ self.kokoro_pipeline = KPipeline(lang_code='p', device='cuda')
171
+
172
+ elapsed_ms = int((time.time() - start) * 1000)
173
+ self.metrics.kokoro_ms = elapsed_ms
174
+ self._progress(f"Kokoro loaded in {elapsed_ms/1000:.1f}s", 80)
175
+
176
+ return elapsed_ms
177
+
178
+ def load_cefr(self) -> int:
179
+ """Carrega CEFR Classifier (pode ser lazy)"""
180
+ start = time.time()
181
+ self._progress("Loading CEFR Classifier...", 85)
182
+
183
+ import torch
184
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
185
+
186
+ self.cefr_tokenizer = AutoTokenizer.from_pretrained(
187
+ self.cefr_model,
188
+ cache_dir=HF_CACHE_DIR,
189
+ )
190
+
191
+ self.cefr_model_instance = AutoModelForSequenceClassification.from_pretrained(
192
+ self.cefr_model,
193
+ torch_dtype=torch.float16,
194
+ low_cpu_mem_usage=True,
195
+ cache_dir=HF_CACHE_DIR,
196
+ ).to("cuda")
197
+ self.cefr_model_instance.eval()
198
+
199
+ elapsed_ms = int((time.time() - start) * 1000)
200
+ self.metrics.cefr_ms = elapsed_ms
201
+ self._progress(f"CEFR loaded in {elapsed_ms/1000:.1f}s", 95)
202
+
203
+ return elapsed_ms
204
+
205
+ def load_all_sequential(self) -> LoadingMetrics:
206
+ """Carrega todos os modelos sequencialmente"""
207
+ overall_start = time.time()
208
+ self._ensure_cache_dirs()
209
+
210
+ self._progress("Starting sequential model loading...", 0)
211
+
212
+ # Order: vLLM first (needs contiguous memory)
213
+ self.load_vllm()
214
+ self.load_whisper()
215
+ self.load_kokoro()
216
+ self.load_cefr()
217
+
218
+ self.metrics.total_ms = int((time.time() - overall_start) * 1000)
219
+ self.metrics.parallel = False
220
+
221
+ self._progress(f"All models loaded in {self.metrics.total_ms/1000:.1f}s", 100)
222
+
223
+ return self.metrics
224
+
225
+ def load_all_parallel(self) -> LoadingMetrics:
226
+ """
227
+ Carrega modelos em paralelo onde possivel.
228
+
229
+ Ordem otimizada:
230
+ 1. vLLM primeiro (precisa de memoria contigua)
231
+ 2. Whisper + Kokoro em paralelo
232
+ 3. CEFR lazy (carrega em background depois)
233
+ """
234
+ overall_start = time.time()
235
+ self._ensure_cache_dirs()
236
+
237
+ self._progress("Starting optimized parallel loading...", 0)
238
+
239
+ # Step 1: vLLM first (needs contiguous GPU memory)
240
+ self.load_vllm()
241
+
242
+ # Step 2: Whisper + Kokoro in parallel
243
+ self._progress("Loading Whisper + Kokoro in parallel...", 45)
244
+
245
+ with ThreadPoolExecutor(max_workers=2) as executor:
246
+ whisper_future = executor.submit(self.load_whisper)
247
+ kokoro_future = executor.submit(self.load_kokoro)
248
+
249
+ whisper_future.result()
250
+ kokoro_future.result()
251
+
252
+ # Step 3: CEFR (can be lazy loaded later)
253
+ self.load_cefr()
254
+
255
+ self.metrics.total_ms = int((time.time() - overall_start) * 1000)
256
+ self.metrics.parallel = True
257
+
258
+ self._progress(f"All models loaded in {self.metrics.total_ms/1000:.1f}s (parallel)", 100)
259
+
260
+ return self.metrics
261
+
262
+ def load_essential_only(self) -> LoadingMetrics:
263
+ """
264
+ Carrega apenas modelos essenciais para responder rapidamente.
265
+ CEFR eh carregado em background.
266
+
267
+ Tempo estimado: ~50-70% do tempo total
268
+ """
269
+ overall_start = time.time()
270
+ self._ensure_cache_dirs()
271
+
272
+ self._progress("Loading essential models only...", 0)
273
+
274
+ # Essential models
275
+ self.load_vllm()
276
+
277
+ with ThreadPoolExecutor(max_workers=2) as executor:
278
+ whisper_future = executor.submit(self.load_whisper)
279
+ kokoro_future = executor.submit(self.load_kokoro)
280
+
281
+ whisper_future.result()
282
+ kokoro_future.result()
283
+
284
+ self.metrics.total_ms = int((time.time() - overall_start) * 1000)
285
+ self.metrics.parallel = True
286
+
287
+ self._progress(f"Essential models loaded in {self.metrics.total_ms/1000:.1f}s", 90)
288
+
289
+ # Start CEFR loading in background
290
+ self._progress("Starting CEFR background loading...", 92)
291
+ threading.Thread(target=self._load_cefr_background, daemon=True).start()
292
+
293
+ return self.metrics
294
+
295
+ def _load_cefr_background(self):
296
+ """Carrega CEFR em background"""
297
+ try:
298
+ self.load_cefr()
299
+ print("[BACKGROUND] CEFR classifier loaded!")
300
+ except Exception as e:
301
+ print(f"[BACKGROUND] Failed to load CEFR: {e}")
302
+
303
+ def is_ready(self) -> bool:
304
+ """Verifica se modelos essenciais estao prontos"""
305
+ return (
306
+ self.vllm_engine is not None and
307
+ self.whisper_model_instance is not None and
308
+ self.kokoro_pipeline is not None
309
+ )
310
+
311
+ def is_fully_ready(self) -> bool:
312
+ """Verifica se todos os modelos estao prontos"""
313
+ return self.is_ready() and self.cefr_model_instance is not None
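As an illustration of how the lazy CEFR load could be surfaced to callers, a small hedged sketch of a readiness payload (for example, for a health handler); this helper is not part of the module:

```python
# Hypothetical helper: report readiness while the CEFR classifier is still loading in background.
def readiness_payload(loader: "FastModelLoader") -> dict:
    return {
        "status": "ok" if loader.is_ready() else "loading",
        "essential_ready": loader.is_ready(),    # vLLM + Whisper + Kokoro
        "cefr_ready": loader.is_fully_ready(),   # True once the background thread finishes
    }
```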
314
+
315
+
316
+ def predownload_models(
317
+ vllm_model: str = "RedHatAI/gemma-3-4b-it-quantized.w4a16",
318
+ whisper_model: str = "openai/whisper-small",
319
+ cefr_model: str = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem",
320
+ ):
321
+ """
322
+ Pre-download models to local cache.
323
+ Run this during VM setup, not during cold start.
324
+ """
325
+ print("=" * 60)
326
+ print("Pre-downloading models to local cache...")
327
+ print("=" * 60)
328
+
329
+ os.makedirs(HF_CACHE_DIR, exist_ok=True)
330
+
331
+ from huggingface_hub import snapshot_download
332
+
333
+ models = [
334
+ (vllm_model, "vLLM"),
335
+ (whisper_model, "Whisper"),
336
+ (cefr_model, "CEFR"),
337
+ ]
338
+
339
+ for model_id, name in models:
340
+ print(f"\n[{name}] Downloading {model_id}...")
341
+ start = time.time()
342
+
343
+ try:
344
+ snapshot_download(
345
+ model_id,
346
+ cache_dir=HF_CACHE_DIR,
347
+ local_dir_use_symlinks=False,
348
+ )
349
+ elapsed = time.time() - start
350
+ print(f"[{name}] Downloaded in {elapsed:.1f}s")
351
+ except Exception as e:
352
+ print(f"[{name}] Error: {e}")
353
+
354
+ print("\n" + "=" * 60)
355
+ print("Pre-download complete!")
356
+ print("=" * 60)
357
+
358
+
359
+ def install_fastsafetensors():
360
+ """Instala fastsafetensors para loading 4-7x mais rapido"""
361
+ import subprocess
362
+
363
+ print("Installing fastsafetensors...")
364
+ result = subprocess.run(
365
+ [sys.executable, "-m", "pip", "install", "fastsafetensors"],
366
+ capture_output=True,
367
+ text=True,
368
+ )
369
+
370
+ if result.returncode == 0:
371
+ print("fastsafetensors installed successfully!")
372
+ else:
373
+ print(f"Failed to install fastsafetensors: {result.stderr}")
374
+
375
+
376
+ if __name__ == "__main__":
377
+ import argparse
378
+
379
+ parser = argparse.ArgumentParser(description="Fast Model Loader")
380
+ parser.add_argument("--predownload", action="store_true", help="Pre-download models")
381
+ parser.add_argument("--install-fast", action="store_true", help="Install fastsafetensors")
382
+ parser.add_argument("--test-load", action="store_true", help="Test model loading")
383
+ parser.add_argument("--parallel", action="store_true", help="Use parallel loading")
384
+
385
+ args = parser.parse_args()
386
+
387
+ if args.install_fast:
388
+ install_fastsafetensors()
389
+
390
+ if args.predownload:
391
+ predownload_models()
392
+
393
+ if args.test_load:
394
+ loader = FastModelLoader()
395
+
396
+ if args.parallel:
397
+ metrics = loader.load_all_parallel()
398
+ else:
399
+ metrics = loader.load_all_sequential()
400
+
401
+ print("\n" + "=" * 60)
402
+ print("Loading Metrics:")
403
+ print(f" vLLM: {metrics.vllm_ms/1000:.1f}s")
404
+ print(f" Whisper: {metrics.whisper_ms/1000:.1f}s")
405
+ print(f" Kokoro: {metrics.kokoro_ms/1000:.1f}s")
406
+ print(f" CEFR: {metrics.cefr_ms/1000:.1f}s")
407
+ print(f" TOTAL: {metrics.total_ms/1000:.1f}s")
408
+ print(f" Parallel: {metrics.parallel}")
409
+ print("=" * 60)
llm/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Gemma LLM Model
2
+ # HuggingFace: RedHatAI/gemma-3-4b-it-quantized.w4a16
models/cefr/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # CEFR Classifier Model
2
+ # HuggingFace: marcosremar2/cefr-classifier-pt-mdeberta-v3-enem
models/llm/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Gemma LLM Model
2
+ # HuggingFace: RedHatAI/gemma-3-4b-it-quantized.w4a16
models/stt/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Whisper STT Model
2
+ # HuggingFace: openai/whisper-small
models/tts/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Kokoro TTS Model
2
+ # HuggingFace: hexgrad/Kokoro-82M
requirements.txt ADDED
@@ -0,0 +1,27 @@
1
+ # v5-tensordock-websocket requirements
2
+ # Para RTX 3090 (24GB VRAM)
3
+
4
+ # Web framework
5
+ fastapi>=0.104.0
6
+ uvicorn[standard]>=0.24.0
7
+ pydantic>=2.0.0
8
+ websockets>=12.0
9
+
10
+ # ML/AI
11
+ torch>=2.1.0
12
+ transformers>=4.36.0
13
+ vllm>=0.2.7
14
+
15
+ # Audio processing
16
+ soundfile>=0.12.0
17
+ librosa>=0.10.0
18
+ numpy>=1.24.0
19
+
20
+ # Voice Activity Detection for WPM calculation
21
+ pyannote-audio>=3.1.0
22
+
23
+ # TTS
24
+ kokoro>=0.1.0
25
+
26
+ # HTTP client (for TensorDock API)
27
+ requests>=2.31.0
restore.sh ADDED
@@ -0,0 +1,108 @@
1
+ #!/bin/bash
2
+ # Restore PARLE backend from checkpoint
3
+ # Fast startup path - restores pre-loaded models from checkpoint
4
+ #
5
+ # Requires: patched CRIU (criu-patched), io_uring disabled
6
+ # Usage: ./restore.sh [checkpoint-path]
7
+
8
+ set -e
9
+
10
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
11
+ CHECKPOINT_PATH="${1:-$CHECKPOINT_DIR/latest}"
12
+
13
+ echo "=============================================="
14
+ echo "PARLE Backend Restore"
15
+ echo "=============================================="
16
+
17
+ # Check if checkpoint exists
18
+ if [ ! -d "$CHECKPOINT_PATH" ] && [ ! -L "$CHECKPOINT_PATH" ]; then
19
+ echo "ERROR: Checkpoint not found: $CHECKPOINT_PATH"
20
+ echo ""
21
+ echo "Available checkpoints:"
22
+ ls -la "$CHECKPOINT_DIR" 2>/dev/null || echo " (none)"
23
+ echo ""
24
+ echo "To create a checkpoint:"
25
+ echo " 1. Start normally: ./start.sh"
26
+ echo " 2. Wait for models to load (~45s)"
27
+ echo " 3. Create checkpoint: ./checkpoint.sh"
28
+ exit 1
29
+ fi
30
+
31
+ # Resolve symlink if needed
32
+ if [ -L "$CHECKPOINT_PATH" ]; then
33
+ CHECKPOINT_PATH=$(readlink -f "$CHECKPOINT_PATH")
34
+ fi
35
+
36
+ echo "Checkpoint: $CHECKPOINT_PATH"
37
+ echo "Size: $(du -sh "$CHECKPOINT_PATH" | cut -f1)"
38
+
39
+ # Check if another instance is running
40
+ if pgrep -f "python.*app.py" > /dev/null; then
41
+ echo ""
42
+ echo "WARNING: Backend already running"
43
+ echo "Kill it first: pkill -9 -f 'python.*app.py'"
44
+ exit 1
45
+ fi
46
+
47
+ # Check CRIU
48
+ if [ ! -f /usr/local/bin/criu-patched ]; then
49
+ echo "ERROR: Patched CRIU not found at /usr/local/bin/criu-patched"
50
+ echo "Run setup-criu-patched.sh first"
51
+ exit 1
52
+ fi
53
+
54
+ # Change to the correct directory
55
+ cd /home/user
56
+
57
+ echo ""
58
+ echo "Restoring from checkpoint..."
59
+ START_TIME=$(date +%s)
60
+
61
+ # Restore with patched CRIU (runs in background)
62
+ CRIU_PLUGINS_DIR=/usr/lib/criu /usr/local/bin/criu-patched restore \
63
+ -D "$CHECKPOINT_PATH" \
64
+ --shell-job \
65
+ --tcp-established \
66
+ --file-locks \
67
+ --ext-unix-sk \
68
+ -v0 \
69
+ -o "$CHECKPOINT_PATH/restore.log" 2>/dev/null &
70
+
71
+ RESTORE_PID=$!
72
+
73
+ # Wait for backend to be ready
74
+ echo "Waiting for backend health..."
75
+ for i in {1..60}; do
76
+ HEALTH=$(curl -s --max-time 2 http://localhost:8000/health 2>/dev/null)
77
+ if [ ! -z "$HEALTH" ]; then
78
+ END_TIME=$(date +%s)
79
+ DURATION=$((END_TIME - START_TIME))
80
+
81
+ # Get process info
82
+ PYTHON_PID=$(pgrep -f "python.*app.py" | head -1)
83
+
84
+ echo ""
85
+ echo "=============================================="
86
+ echo "Backend restored successfully!"
87
+ echo "=============================================="
88
+ echo "Restore time: ${DURATION}s"
89
+ echo "Process PID: $PYTHON_PID"
90
+ echo ""
91
+ echo "Health check:"
92
+ echo "$HEALTH" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f' Status: {d[\"status\"]}'); print(f' vLLM: {d[\"vllm_loaded\"]}'); print(f' Whisper: {d[\"whisper_loaded\"]}'); print(f' Kokoro: {d[\"kokoro_loaded\"]}')" 2>/dev/null || echo "$HEALTH"
93
+ echo ""
94
+ echo "Backend ready at http://localhost:8000"
95
+ exit 0
96
+ fi
97
+
98
+ if [ $((i % 10)) -eq 0 ]; then
99
+ echo " Still waiting... ($i/60s)"
100
+ fi
101
+ sleep 1
102
+ done
103
+
104
+ echo ""
105
+ echo "ERROR: Backend did not respond within 60 seconds"
106
+ echo "Check restore log: $CHECKPOINT_PATH/restore.log"
107
+ tail -20 "$CHECKPOINT_PATH/restore.log" 2>/dev/null
108
+ exit 1
setup-criu-patched.sh ADDED
@@ -0,0 +1,57 @@
1
+ #!/bin/bash
2
+ # Setup patched CRIU for PyTorch checkpoint/restore on TensorDock
3
+ # This script compiles CRIU with patches to skip unsupported nvidia device FDs
4
+
5
+ set -e
6
+
7
+ echo "=============================================="
8
+ echo "Setting up patched CRIU for PyTorch C/R"
9
+ echo "=============================================="
10
+
11
+ # Install dependencies
12
+ echo "[1/5] Installing build dependencies..."
13
+ apt-get update
14
+ apt-get install -y build-essential pkg-config libprotobuf-dev libprotobuf-c-dev \
15
+ protobuf-c-compiler protobuf-compiler python3-protobuf libbsd-dev \
16
+ libcap-dev libnl-3-dev libnet1-dev libaio-dev libgnutls28-dev \
17
+ python3-future asciidoc xmlto git
18
+
19
+ # Clone CRIU
20
+ echo "[2/5] Cloning CRIU..."
21
+ cd /tmp
22
+ rm -rf criu-patched
23
+ git clone --depth 1 https://github.com/checkpoint-restore/criu.git criu-patched
24
+ cd criu-patched
25
+
26
+ # Apply patch to files-ext.c (skip unsupported FDs during dump)
27
+ echo "[3/5] Applying dump patch..."
28
+ perl -i -0pe 's/(int dump_unsupp_fd.*?if \(ret == -ENOTSUP\))\s*pr_err\("Can.t dump file.*?\n\s*return -1;/$1 {\n\t\tpr_warn("Skipping file %d of that type [%o] (%s %s)\\n", p->fd, p->stat.st_mode, more, info);\n\t\treturn 0; \/\/ PATCHED: skip unsupported FDs\n\t}\n\treturn -1;/s' criu/files-ext.c
29
+
30
+ # Apply patch to files.c (skip missing FDs during restore)
31
+ echo "[4/5] Applying restore patch..."
32
+ perl -i -0pe 's/(fdesc = find_file_desc\(e\);\s*if \(fdesc == NULL\) \{)\s*pr_err\("No file for fd.*?\n\s*return -1;/$1\n\t\tpr_warn("No file for fd %d id %#x, skipping (PATCHED)\\n", e->fd, e->id);\n\t\treturn 0; \/\/ PATCHED: skip missing FDs/s' criu/files.c
33
+
34
+ # Build
35
+ echo "[5/5] Building patched CRIU..."
36
+ make -j$(nproc)
37
+
38
+ # Install
39
+ cp criu/criu /usr/local/bin/criu-patched
40
+ mkdir -p /usr/lib/criu
41
+ cp plugins/cuda/cuda_plugin.so /usr/lib/criu/
42
+
43
+ # Verify
44
+ echo ""
45
+ echo "=============================================="
46
+ echo "Patched CRIU installed!"
47
+ echo "=============================================="
48
+ /usr/local/bin/criu-patched --version
49
+
50
+ # Setup io_uring disable (persists across reboots)
51
+ echo ""
52
+ echo "Disabling io_uring at kernel level..."
53
+ sysctl -w kernel.io_uring_disabled=2
54
+ echo "kernel.io_uring_disabled=2" >> /etc/sysctl.conf
55
+
56
+ echo ""
57
+ echo "Setup complete! Run checkpoint.sh after models are loaded."
setup-criu.sh ADDED
@@ -0,0 +1,91 @@
1
+ #!/bin/bash
2
+ # CRIU + cuda-checkpoint Setup Script for TensorDock
3
+ # Run this once on a fresh VM to install all dependencies
4
+
5
+ set -e
6
+
7
+ echo "=================================================="
8
+ echo "Setting up CRIU + cuda-checkpoint for fast restore"
9
+ echo "=================================================="
10
+
11
+ # Check if running as root
12
+ if [ "$EUID" -ne 0 ]; then
13
+ echo "Please run as root (sudo ./setup-criu.sh)"
14
+ exit 1
15
+ fi
16
+
17
+ # Check NVIDIA driver version (needs 550+)
18
+ echo ""
19
+ echo "[1/5] Checking NVIDIA driver version..."
20
+ DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
21
+ MAJOR_VERSION=$(echo $DRIVER_VERSION | cut -d'.' -f1)
22
+
23
+ echo "Driver version: $DRIVER_VERSION"
24
+
25
+ if [ "$MAJOR_VERSION" -lt 550 ]; then
26
+ echo "ERROR: NVIDIA driver 550+ required for cuda-checkpoint"
27
+ echo "Current version: $DRIVER_VERSION"
28
+ echo ""
29
+ echo "To upgrade driver:"
30
+ echo " sudo apt-get update"
31
+ echo " sudo apt-get install nvidia-driver-550"
32
+ exit 1
33
+ fi
34
+
35
+ echo "Driver version OK!"
36
+
37
+ # Install CRIU
38
+ echo ""
39
+ echo "[2/5] Installing CRIU..."
40
+ apt-get update
41
+ apt-get install -y criu
42
+
43
+ # Verify CRIU installation
44
+ CRIU_VERSION=$(criu --version | head -1)
45
+ echo "CRIU installed: $CRIU_VERSION"
46
+
47
+ # Clone cuda-checkpoint
48
+ echo ""
49
+ echo "[3/5] Setting up cuda-checkpoint..."
50
+ CUDA_CHECKPOINT_DIR="/opt/cuda-checkpoint"
51
+
52
+ if [ -d "$CUDA_CHECKPOINT_DIR" ]; then
53
+ echo "cuda-checkpoint already exists, updating..."
54
+ cd "$CUDA_CHECKPOINT_DIR"
55
+ git pull
56
+ else
57
+ git clone https://github.com/NVIDIA/cuda-checkpoint.git "$CUDA_CHECKPOINT_DIR"
58
+ fi
59
+
60
+ # Create symlink for easy access
61
+ ln -sf "$CUDA_CHECKPOINT_DIR/bin/cuda-checkpoint" /usr/local/bin/cuda-checkpoint
62
+ chmod +x /usr/local/bin/cuda-checkpoint
63
+
64
+ echo "cuda-checkpoint installed at /usr/local/bin/cuda-checkpoint"
65
+
66
+ # Create checkpoint directory
67
+ echo ""
68
+ echo "[4/5] Creating checkpoint directory..."
69
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
70
+ mkdir -p "$CHECKPOINT_DIR"
71
+ chmod 755 "$CHECKPOINT_DIR"
72
+
73
+ echo "Checkpoint directory: $CHECKPOINT_DIR"
74
+
75
+ # Test cuda-checkpoint
76
+ echo ""
77
+ echo "[5/5] Testing cuda-checkpoint..."
78
+ cuda-checkpoint --help > /dev/null 2>&1 && echo "cuda-checkpoint: OK" || echo "cuda-checkpoint: FAILED"
79
+ criu check > /dev/null 2>&1 && echo "CRIU check: OK" || echo "CRIU check: WARNING (some features may not work)"
80
+
81
+ echo ""
82
+ echo "=================================================="
83
+ echo "Setup complete!"
84
+ echo "=================================================="
85
+ echo ""
86
+ echo "Next steps:"
87
+ echo "1. Start the backend normally: ./start.sh"
88
+ echo "2. Wait for models to load (~2 min)"
89
+ echo "3. Create checkpoint: ./checkpoint.sh"
90
+ echo "4. Next time, restore: ./restore.sh (should be ~5-10s)"
91
+ echo ""
setup-fast-coldstart.sh ADDED
@@ -0,0 +1,131 @@
1
+ #!/bin/bash
2
+ # =============================================================================
3
+ # FAST COLD START SETUP
4
+ # =============================================================================
5
+ # Este script prepara a VM TensorDock para cold starts rapidos (~60s vs ~487s)
6
+ #
7
+ # Otimizacoes:
8
+ # 1. Pre-download models para SSD local
9
+ # 2. Instala fastsafetensors (loading 4-7x mais rapido)
10
+ # 3. Configura CUDA graph caching
11
+ # 4. Configura environment variables otimizados
12
+ #
13
+ # Uso: ./setup-fast-coldstart.sh
14
+ # =============================================================================
15
+
16
+ set -e
17
+
18
+ echo "=============================================="
19
+ echo "FAST COLD START SETUP"
20
+ echo "=============================================="
21
+
22
+ # Directories
23
+ CACHE_DIR="/var/cache/parle-models"
24
+ HF_CACHE="$CACHE_DIR/huggingface"
25
+ VLLM_CACHE="$CACHE_DIR/vllm"
26
+ CUDA_CACHE="$CACHE_DIR/cuda-cache"
27
+
28
+ # Create directories
29
+ echo "[1/5] Creating cache directories..."
30
+ sudo mkdir -p $CACHE_DIR
31
+ sudo mkdir -p $HF_CACHE
32
+ sudo mkdir -p $VLLM_CACHE
33
+ sudo mkdir -p $CUDA_CACHE
34
+ sudo chmod -R 777 $CACHE_DIR
35
+
36
+ # Set environment variables permanently
37
+ echo "[2/5] Setting environment variables..."
38
+ cat >> ~/.bashrc << 'EOF'
39
+
40
+ # PARLE Fast Cold Start Environment
41
+ export HF_HOME=/var/cache/parle-models/huggingface
42
+ export VLLM_CACHE_DIR=/var/cache/parle-models/vllm
43
+ export CUDA_CACHE_PATH=/var/cache/parle-models/cuda-cache
44
+ export USE_FASTSAFETENSOR=true
45
+ export TOKENIZERS_PARALLELISM=false
46
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
47
+
48
+ # vLLM optimizations
49
+ export VLLM_ATTENTION_BACKEND=FLASH_ATTN
50
+ export VLLM_USE_TRITON_FLASH_ATTN=1
51
+ EOF
52
+
53
+ # Source the new environment
54
+ source ~/.bashrc
55
+
56
+ # Install fastsafetensors
57
+ echo "[3/5] Installing fastsafetensors (4-7x faster loading)..."
58
+ pip install fastsafetensors 2>/dev/null || {
59
+ echo "Warning: fastsafetensors installation failed, will use default loader"
60
+ }
61
+
62
+ # Install NVIDIA Model Streamer (optional, for S3 loading)
63
+ echo "[4/5] Installing nvidia-model-streamer (optional)..."
64
+ pip install nvidia-model-streamer 2>/dev/null || {
65
+ echo "Warning: nvidia-model-streamer not available"
66
+ }
67
+
68
+ # Pre-download models
69
+ echo "[5/5] Pre-downloading models to local cache..."
70
+ echo "This may take 10-30 minutes depending on network speed..."
71
+
72
+ python3 << 'PYTHON_SCRIPT'
73
+ import os
74
+ import time
75
+
76
+ os.environ["HF_HOME"] = "/var/cache/parle-models/huggingface"
77
+
78
+ from huggingface_hub import snapshot_download
79
+
80
+ models = [
81
+ ("RedHatAI/gemma-3-4b-it-quantized.w4a16", "vLLM (Gemma 4B)"),
82
+ ("openai/whisper-small", "Whisper STT"),
83
+ ("marcosremar2/cefr-classifier-pt-mdeberta-v3-enem", "CEFR Classifier"),
84
+ ]
85
+
86
+ print("\n" + "=" * 50)
87
+ for model_id, name in models:
88
+ print(f"\nDownloading {name}: {model_id}")
89
+ start = time.time()
90
+
91
+ try:
92
+ path = snapshot_download(
93
+ model_id,
94
+ cache_dir="/var/cache/parle-models/huggingface",
95
+ )
96
+ elapsed = time.time() - start
97
+ print(f" Downloaded to {path} in {elapsed:.1f}s")
98
+ except Exception as e:
99
+ print(f" ERROR: {e}")
100
+
101
+ # Also download Kokoro voices
102
+ print("\nDownloading Kokoro TTS voices...")
103
+ try:
104
+ from kokoro import KPipeline
105
+ pipeline = KPipeline(lang_code='p', device='cpu') # Just to trigger download
106
+ print(" Kokoro voices downloaded!")
107
+ except Exception as e:
108
+ print(f" Kokoro download skipped: {e}")
109
+
110
+ print("\n" + "=" * 50)
111
+ print("Pre-download complete!")
112
+ print("=" * 50)
113
+ PYTHON_SCRIPT
114
+
115
+ echo ""
116
+ echo "=============================================="
117
+ echo "SETUP COMPLETE!"
118
+ echo "=============================================="
119
+ echo ""
120
+ echo "Expected cold start improvement:"
121
+ echo " Before: ~487s (8 min)"
122
+ echo " After: ~60-90s (1-1.5 min)"
123
+ echo ""
124
+ echo "Optimizations applied:"
125
+ echo " - Models cached locally on SSD"
126
+ echo " - fastsafetensors for 4-7x faster loading"
127
+ echo " - CUDA graph caching enabled"
128
+ echo " - Environment variables optimized"
129
+ echo ""
130
+ echo "To test: ./start-smart.sh"
131
+ echo "=============================================="
start-optimized.sh ADDED
@@ -0,0 +1,198 @@
1
+ #!/bin/bash
2
+ # =============================================================================
3
+ # OPTIMIZED PARLE BACKEND STARTUP
4
+ # =============================================================================
5
+ # Startup script com medicoes de tempo para cada fase
6
+ #
7
+ # Fases:
8
+ # 1. Environment setup
9
+ # 2. Check/restore from checkpoint (se disponivel)
10
+ # 3. Fast model loading (otimizado)
11
+ # 4. Health check
12
+ # =============================================================================
13
+
14
+ set -e
15
+
16
+ SCRIPT_DIR="$(dirname "$0")"
17
+ LOG_FILE="$SCRIPT_DIR/startup.log"
18
+
19
+ # Timing function
20
+ timestamp() {
21
+ date +%s.%N
22
+ }
23
+
24
+ log() {
25
+ echo "[$(date '+%H:%M:%S')] $1" | tee -a "$LOG_FILE"
26
+ }
27
+
28
+ # Start timing
29
+ TOTAL_START=$(timestamp)
30
+
31
+ echo "=============================================="
32
+ echo "PARLE Backend - Optimized Startup"
33
+ echo "=============================================="
34
+ echo "" > "$LOG_FILE"
35
+
36
+ # =============================================================================
37
+ # PHASE 1: Environment Setup
38
+ # =============================================================================
39
+ PHASE1_START=$(timestamp)
40
+ log "PHASE 1: Environment Setup"
41
+
42
+ # Set optimized environment
43
+ export HF_HOME=/var/cache/parle-models/huggingface
44
+ export VLLM_CACHE_DIR=/var/cache/parle-models/vllm
45
+ export CUDA_CACHE_PATH=/var/cache/parle-models/cuda-cache
46
+ export USE_FASTSAFETENSOR=true
47
+ export TOKENIZERS_PARALLELISM=false
48
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
49
+ export VLLM_ATTENTION_BACKEND=FLASH_ATTN
50
+
51
+ # Check if models are pre-cached
52
+ if [ -d "/var/cache/parle-models/huggingface" ]; then
53
+ CACHE_SIZE=$(du -sh /var/cache/parle-models/huggingface 2>/dev/null | cut -f1)
54
+ log " Model cache found: $CACHE_SIZE"
55
+ else
56
+ log " WARNING: No model cache found. First run will be slow."
57
+ fi
58
+
59
+ PHASE1_END=$(timestamp)
60
+ PHASE1_TIME=$(echo "$PHASE1_END - $PHASE1_START" | bc)
61
+ log " Phase 1 complete: ${PHASE1_TIME}s"
62
+
63
+ # =============================================================================
64
+ # PHASE 2: Checkpoint Restore (if available)
65
+ # =============================================================================
66
+ PHASE2_START=$(timestamp)
67
+ log "PHASE 2: Checkpoint Check"
68
+
69
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
70
+ CHECKPOINT_PATH="$CHECKPOINT_DIR/latest"
71
+
72
+ if [ -d "$CHECKPOINT_PATH" ] || [ -L "$CHECKPOINT_PATH" ]; then
73
+ log " Checkpoint found! Attempting restore..."
74
+
75
+ if "$SCRIPT_DIR/restore.sh" "$CHECKPOINT_PATH" 2>/dev/null; then
76
+ PHASE2_END=$(timestamp)
77
+ PHASE2_TIME=$(echo "$PHASE2_END - $PHASE2_START" | bc)
78
+ TOTAL_TIME=$(echo "$PHASE2_END - $TOTAL_START" | bc)
79
+
80
+ log " Restored from checkpoint!"
81
+ log ""
82
+ log "=============================================="
83
+ log "STARTUP COMPLETE (from checkpoint)"
84
+ log " Phase 1 (env): ${PHASE1_TIME}s"
85
+ log " Phase 2 (restore): ${PHASE2_TIME}s"
86
+ log " TOTAL: ${TOTAL_TIME}s"
87
+ log "=============================================="
88
+ exit 0
89
+ else
90
+ log " Checkpoint restore failed, continuing with cold start"
91
+ fi
92
+ else
93
+ log " No checkpoint found, proceeding with cold start"
94
+ fi
95
+
96
+ PHASE2_END=$(timestamp)
97
+ PHASE2_TIME=$(echo "$PHASE2_END - $PHASE2_START" | bc)
98
+ log " Phase 2 complete: ${PHASE2_TIME}s"
99
+
100
+ # =============================================================================
101
+ # PHASE 3: Model Loading (Optimized)
102
+ # =============================================================================
103
+ PHASE3_START=$(timestamp)
104
+ log "PHASE 3: Model Loading"
105
+
106
+ # Start the server with optimized loading
107
+ cd "$SCRIPT_DIR"
108
+
109
+ # Create a Python script for optimized loading
110
+ python3 << 'PYTHON_SCRIPT' &
111
+ import os
112
+ import sys
113
+ import time
114
+
115
+ # Ensure environment
116
+ os.environ["HF_HOME"] = "/var/cache/parle-models/huggingface"
117
+ os.environ["USE_FASTSAFETENSOR"] = "true"
118
+
119
+ print("[STARTUP] Starting optimized model loading...")
120
+ start = time.time()
121
+
122
+ # Import app module (will trigger load_models on startup)
123
+ import uvicorn
124
+
125
+ # Run server
126
+ uvicorn.run(
127
+ "app:app",
128
+ host="0.0.0.0",
129
+ port=8000,
130
+ log_level="info",
131
+ )
132
+ PYTHON_SCRIPT
133
+
134
+ SERVER_PID=$!
135
+ log " Server started (PID: $SERVER_PID)"
136
+
137
+ # =============================================================================
138
+ # PHASE 4: Health Check
139
+ # =============================================================================
140
+ PHASE4_START=$(timestamp)
141
+ log "PHASE 4: Waiting for health..."
142
+
143
+ # Wait for backend to be healthy
144
+ MAX_WAIT=300 # 5 minutes max
145
+ WAIT_INTERVAL=2
146
+
147
+ for i in $(seq 1 $((MAX_WAIT / WAIT_INTERVAL))); do
148
+ HEALTH=$(curl -s --max-time 2 http://localhost:8000/health 2>/dev/null || echo "")
149
+
150
+ if [ ! -z "$HEALTH" ]; then
151
+ # Check if all models are loaded
152
+ WHISPER=$(echo "$HEALTH" | grep -o '"whisper_loaded":true' || true)
153
+ VLLM=$(echo "$HEALTH" | grep -o '"vllm_loaded":true' || true)
154
+ KOKORO=$(echo "$HEALTH" | grep -o '"kokoro_loaded":true' || true)
155
+
156
+ if [ ! -z "$WHISPER" ] && [ ! -z "$VLLM" ] && [ ! -z "$KOKORO" ]; then
157
+ PHASE4_END=$(timestamp)
158
+ PHASE3_TIME=$(echo "$PHASE4_START - $PHASE3_START" | bc)
159
+ PHASE4_TIME=$(echo "$PHASE4_END - $PHASE4_START" | bc)
160
+ TOTAL_TIME=$(echo "$PHASE4_END - $TOTAL_START" | bc)
161
+
162
+ echo ""
163
+ log "=============================================="
164
+ log "STARTUP COMPLETE (cold start)"
165
+ log " Phase 1 (env): ${PHASE1_TIME}s"
166
+ log " Phase 2 (checkpoint): ${PHASE2_TIME}s"
167
+ log " Phase 3 (loading): ${PHASE3_TIME}s"
168
+ log " Phase 4 (health): ${PHASE4_TIME}s"
169
+ log " TOTAL: ${TOTAL_TIME}s"
170
+ log "=============================================="
171
+ log ""
172
+ log "Server running at http://localhost:8000"
173
+ log "Health endpoint: http://localhost:8000/health"
174
+ log ""
175
+
176
+ # Create checkpoint for faster next startup
177
+ if [ ! -d "$CHECKPOINT_PATH" ]; then
178
+ log "TIP: Create checkpoint for faster startup:"
179
+ log " ./checkpoint.sh"
180
+ fi
181
+
182
+ # Keep script running
183
+ wait $SERVER_PID
184
+ exit 0
185
+ fi
186
+ fi
187
+
188
+ # Progress update every 10 seconds
189
+ if [ $((i % 5)) -eq 0 ]; then
190
+ ELAPSED=$((i * WAIT_INTERVAL))
191
+ log " Still loading... (${ELAPSED}s)"
192
+ fi
193
+
194
+ sleep $WAIT_INTERVAL
195
+ done
196
+
197
+ log "ERROR: Timeout waiting for backend (${MAX_WAIT}s)"
198
+ exit 1
start-smart.sh ADDED
@@ -0,0 +1,91 @@
1
+ #!/bin/bash
2
+ # Smart PARLE Backend Startup Script
3
+ # Attempts restore from checkpoint first, falls back to cold start
4
+ #
5
+ # Usage: ./start-smart.sh
6
+
7
+ set -e
8
+
9
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
10
+ CHECKPOINT_PATH="$CHECKPOINT_DIR/latest"
11
+ SCRIPT_DIR="$(dirname "$0")"
12
+
13
+ echo "=============================================="
14
+ echo "PARLE Backend Smart Startup"
15
+ echo "=============================================="
16
+
17
+ # Check if checkpoint exists
18
+ if [ -d "$CHECKPOINT_PATH" ] || [ -L "$CHECKPOINT_PATH" ]; then
19
+ echo "Checkpoint found! Attempting fast restore..."
20
+ echo ""
21
+
22
+ START_TIME=$(date +%s)
23
+
24
+ # Try to restore
25
+ if "$SCRIPT_DIR/restore.sh" "$CHECKPOINT_PATH"; then
26
+ END_TIME=$(date +%s)
27
+ DURATION=$((END_TIME - START_TIME))
28
+ echo ""
29
+ echo "Fast restore completed in ${DURATION}s!"
30
+ exit 0
31
+ else
32
+ echo ""
33
+ echo "Restore failed, falling back to cold start..."
34
+ echo ""
35
+ fi
36
+ else
37
+ echo "No checkpoint found at $CHECKPOINT_PATH"
38
+ echo "Performing cold start..."
39
+ echo ""
40
+ fi
41
+
42
+ # Cold start fallback
43
+ echo "=============================================="
44
+ echo "Cold Start Mode"
45
+ echo "=============================================="
46
+
47
+ START_TIME=$(date +%s)
48
+
49
+ # Run the normal start script
50
+ "$SCRIPT_DIR/start.sh" &
51
+ SERVER_PID=$!
52
+
53
+ # Wait for backend to be healthy
54
+ echo "Waiting for backend to be ready..."
55
+ for i in {1..180}; do
56
+ HEALTH=$(curl -s --max-time 2 http://localhost:8000/health 2>/dev/null)
57
+ if [ ! -z "$HEALTH" ]; then
58
+ # Check if all models are loaded
59
+ WHISPER=$(echo "$HEALTH" | grep -o '"whisper_loaded":true' || true)
60
+ VLLM=$(echo "$HEALTH" | grep -o '"vllm_loaded":true' || true)
61
+ KOKORO=$(echo "$HEALTH" | grep -o '"kokoro_loaded":true' || true)
62
+
63
+ if [ ! -z "$WHISPER" ] && [ ! -z "$VLLM" ] && [ ! -z "$KOKORO" ]; then
64
+ END_TIME=$(date +%s)
65
+ DURATION=$((END_TIME - START_TIME))
66
+
67
+ echo ""
68
+ echo "=============================================="
69
+ echo "Backend ready! (cold start: ${DURATION}s)"
70
+ echo "=============================================="
71
+ echo ""
72
+
73
+ # Offer to create checkpoint
74
+ echo "TIP: Create a checkpoint now for faster startup next time:"
75
+ echo " ./checkpoint.sh"
76
+ echo ""
77
+
78
+ # Keep the script running to maintain the server
79
+ wait $SERVER_PID
80
+ exit 0
81
+ fi
82
+ fi
83
+
84
+ if [ $((i % 10)) -eq 0 ]; then
85
+ echo " Still loading... ($i/180)"
86
+ fi
87
+ sleep 1
88
+ done
89
+
90
+ echo "ERROR: Timeout waiting for backend to be ready"
91
+ exit 1
start.sh ADDED
@@ -0,0 +1,40 @@
1
+ #!/bin/bash
2
+ # PARLE Backend Startup Script
3
+ # This script sets up environment variables and starts the FastAPI server
4
+
5
+ # ============================================================================
6
+ # CONFIGURATION - Edit these values before deploying
7
+ # ============================================================================
8
+
9
+ # TensorDock Auto-Stop Configuration
10
+ export TENSORDOCK_API_TOKEN="WBE5UPHOC6Ed1HeLYL2TjqbBqVEwn5MF"
11
+ export TENSORDOCK_INSTANCE_ID="befc5b17-7516-4ccd-a0ff-da2d4ecdb874"
12
+ export IDLE_TIMEOUT_SECONDS="120" # 2 minutes
13
+
14
+ # Email Alerts (get key from https://resend.com)
15
+ export RESEND_API_KEY="" # Set this to receive email alerts
16
+ export ALERT_EMAIL="marcos@marcosrp.com"
17
+
18
+ # ============================================================================
19
+ # STARTUP
20
+ # ============================================================================
21
+
22
+ echo "=================================================="
23
+ echo "PARLE Backend Starting..."
24
+ echo "=================================================="
25
+ echo "Instance ID: $TENSORDOCK_INSTANCE_ID"
26
+ echo "Idle Timeout: ${IDLE_TIMEOUT_SECONDS}s"
27
+ echo "Alert Email: $ALERT_EMAIL"
28
+ echo "Resend Key: $([ -n "$RESEND_API_KEY" ] && echo "SET" || echo "NOT SET")"
29
+ echo "=================================================="
30
+
31
+ # Change to script directory
32
+ cd "$(dirname "$0")"
33
+
34
+ # Activate virtual environment if exists
35
+ if [ -f "/home/user/venv/bin/activate" ]; then
36
+ source /home/user/venv/bin/activate
37
+ fi
38
+
39
+ # Start the server
40
+ exec uvicorn app:app --host 0.0.0.0 --port 8000 --workers 1
stt/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Whisper STT Model
2
+ # HuggingFace: openai/whisper-small
tts/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Kokoro TTS Model
2
+ # HuggingFace: hexgrad/Kokoro-82M