marcos committed
Commit bd4f893 · 0 Parent(s)

Initial commit

README.md ADDED
@@ -0,0 +1,209 @@
# PARLE Speech-to-Speech

Complete Speech-to-Speech pipeline for teaching Portuguese, with automatic adaptation to the student's CEFR level.

**HuggingFace:** [marcosremar2/parle-speech-to-speech](https://huggingface.co/marcosremar2/parle-speech-to-speech)

**Hardware:** TensorDock RTX 3090 (24GB VRAM)

## Pipeline

```
Audio -> Whisper (STT) -> CEFR Classifier -> Gemma 3 4B vLLM (LLM) -> Kokoro (TTS) -> Audio

Adapts the prompt to the student's level (A1-C1)
```

## Adaptive CEFR

The system automatically classifies the student's CEFR level every few user messages (every `CEFR_CLASSIFY_EVERY` messages, 2 by default) and adapts the avatar's responses:

| Level | Avatar behavior |
|-------|-----------------|
| **A1** | Very short sentences, basic vocabulary, slow speech |
| **A2** | Simple sentences, basic connectives, gentle corrections |
| **B1** | Varied vocabulary, a range of verb tenses |
| **B2** | Abstract discussions, idiomatic expressions |
| **C1** | Native-level language, cultural nuances |

**CEFR model:** `marcosremar2/cefr-classifier-pt-mdeberta-v3-enem` (96.43% accuracy)
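For quick experiments outside the server, the classifier can also be loaded directly with `transformers`. The sketch below mirrors what `classify_cefr()` in `app.py` does (same model ID and A1-C1 label order); running it on CPU is an assumption made for simplicity.

```python
# Minimal sketch: classify a Portuguese sentence with the published CEFR model.
# Mirrors classify_cefr() in app.py; assumes the A1-C1 label order used there.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem"
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

text = "Eu gosto de estudar português porque é uma língua bonita."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

print(LEVELS[int(probs.argmax())],
      {level: round(p.item(), 2) for level, p in zip(LEVELS, probs)})
```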
## HuggingFace Models

| Component | Model | Role |
|-----------|-------|------|
| STT | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | Audio transcription |
| LLM | [RedHatAI/gemma-3-4b-it-quantized.w4a16](https://huggingface.co/RedHatAI/gemma-3-4b-it-quantized.w4a16) | Response generation |
| TTS | [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) | Speech synthesis |
| CEFR | [marcosremar2/cefr-classifier-pt-mdeberta-v3-enem](https://huggingface.co/marcosremar2/cefr-classifier-pt-mdeberta-v3-enem) | Level classification |

## Endpoints

### Frontend-Compatible (JSON)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model and CEFR status |
| `/api/audio` | POST | Processes audio (STT → LLM → TTS) |
| `/api/text` | POST | Processes text (LLM → TTS) |
| `/api/reset` | POST | Clears conversation history and resets CEFR |

### CEFR Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/cefr/status` | GET | Current CEFR status (level, counter) |
| `/api/cefr/classify` | POST | Classifies a text manually |
| `/api/cefr/reset` | POST | Resets the level to B1 |
| `/api/cefr/set` | POST | Sets the level manually |

### WebSocket

| Endpoint | Description |
|----------|-------------|
| `/ws/stream` | Bidirectional audio streaming |
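The CEFR endpoints are easy to drive from a script. Note that `/api/cefr/set` expects a form field rather than a JSON body; the base URL below is a placeholder for this example.

```python
# Sketch: inspect and override the session's CEFR level. The URL is an example value.
import requests

BASE = "http://localhost:8000"

# Current level, message counter, and how far the buffer is from the next classification
print(requests.get(f"{BASE}/api/cefr/status").json())

# Manually pin the level to A2 (/api/cefr/set is form-encoded, not JSON)
print(requests.post(f"{BASE}/api/cefr/set", data={"level": "A2"}).json())

# Classify a sample text without touching the session level
print(requests.post(f"{BASE}/api/cefr/classify",
                    json={"text": "Eu gosto de estudar português."}).json())
```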
## Request Format

### POST /api/audio
```json
{
  "audio": "<base64 WAV>",
  "language": "pt",
  "voice": "pf_dora",
  "mode": "default"
}
```

### POST /api/text
```json
{
  "text": "Olá, como você está?",
  "language": "pt",
  "voice": "pf_dora",
  "mode": "default"
}
```

## Response Format

```json
{
  "transcription": {
    "text": "transcribed text",
    "language": "pt",
    "confidence": 1.0
  },
  "response": {
    "text": "LLM response",
    "emotion": "neutral",
    "language": "pt"
  },
  "speech": {
    "audio": "<base64 WAV>",
    "visemes": [],
    "duration": 1.5,
    "sample_rate": 24000,
    "format": "wav"
  },
  "timing": {
    "stt_ms": 100,
    "llm_ms": 200,
    "tts_ms": 150,
    "total_ms": 450
  },
  "cefr": {
    "current_level": "B1",
    "messages_until_classify": 3
  }
}
```

### POST /api/cefr/classify

```json
{
  "text": "Eu gosto de estudar português porque é uma língua bonita."
}
```

**Response:**
```json
{
  "level": "B1",
  "confidence": 0.87,
  "probabilities": {
    "A1": 0.02,
    "A2": 0.08,
    "B1": 0.87,
    "B2": 0.02,
    "C1": 0.01
  },
  "text_length": 58
}
```
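A minimal `/api/audio` client in Python looks like the sketch below. It assumes the server is reachable at `http://localhost:8000` and uses `input.wav` / `reply.wav` as illustrative file names.

```python
# Sketch of an /api/audio client: send a base64 WAV, save and inspect the reply.
# The base URL and file names are example values, not part of the project config.
import base64
import requests

with open("input.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8000/api/audio",
    json={"audio": audio_b64, "language": "pt", "voice": "pf_dora", "mode": "default"},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

print("Transcription:", data["transcription"]["text"])
print("Reply:        ", data["response"]["text"])
print("CEFR level:   ", data["cefr"]["current_level"])

with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(data["speech"]["audio"]))
```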
## Deploying on TensorDock

### 1. Create an RTX 3090 (24GB) instance

```bash
# SSH into the instance
ssh user@YOUR_TENSORDOCK_IP
```

### 2. Install dependencies

```bash
pip install fastapi uvicorn torch transformers vllm kokoro soundfile librosa
```

### 3. Set environment variables

```bash
export IDLE_TIMEOUT_SECONDS=300  # 5 minutes
export TENSORDOCK_API_TOKEN="your_token"
export TENSORDOCK_INSTANCE_ID="your_instance_id"
```

### 4. Start the server

```bash
python app.py
# or
uvicorn app:app --host 0.0.0.0 --port 8000
```

### 5. Configure the frontend

In the Next.js project's `.env` file:

```bash
NEXT_PUBLIC_CABECAO_BACKEND_URL="http://YOUR_TENSORDOCK_IP:8000"
```

## Auto-Stop

The server stops the instance automatically after 60 seconds of inactivity (configurable via `IDLE_TIMEOUT_SECONDS`).

To keep it alive, the frontend pings `/health` periodically.
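Any client can do the same keep-alive from a script; the sketch below polls `/health` well inside the idle timeout. The base URL and the 30-second interval are example values.

```python
# Sketch: keep the instance alive by polling /health more often than IDLE_TIMEOUT_SECONDS.
# BACKEND_URL and the interval are example values, not part of the project config.
import time
import requests

BACKEND_URL = "http://localhost:8000"

while True:
    try:
        status = requests.get(f"{BACKEND_URL}/health", timeout=10).json()
        auto_stop = status["auto_stop"]
        print(f"idle {auto_stop['idle_seconds']}s, stops in {auto_stop['stop_in_seconds']}s")
    except requests.RequestException as exc:
        print("health check failed:", exc)
    time.sleep(30)
```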
## WebSocket Streaming

```javascript
const ws = new WebSocket('ws://YOUR_IP:8000/ws/stream');

ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // WAV audio chunk - play it right away
    playAudioChunk(event.data);
  } else {
    // JSON with metrics or status
    const data = JSON.parse(event.data);
    console.log('Status:', data);
  }
};

// Send the recorded audio
ws.send(audioBlob);
```
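The same protocol can be exercised from Python. The sketch below uses the third-party `websockets` package (an assumption, not a project dependency); the stop condition is also an assumption, since the server sends its final metrics JSON after the audio chunks.

```python
# Sketch: drive /ws/stream from Python with the `websockets` package (assumed installed).
# Binary frames are WAV audio chunks; text frames are JSON status/metrics messages.
import asyncio
import json
import websockets

async def talk(wav_path: str, url: str = "ws://localhost:8000/ws/stream"):
    audio = b""
    async with websockets.connect(url) as ws:
        with open(wav_path, "rb") as f:
            await ws.send(f.read())  # one binary frame with the recorded audio
        while True:
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=30)
            except asyncio.TimeoutError:
                break  # assumption: no frame for 30 s means the turn is over
            if isinstance(msg, bytes):
                audio += msg  # WAV chunk
            else:
                print(json.loads(msg))  # status update or final metrics
    with open("reply.wav", "wb") as f:
        f.write(audio)

asyncio.run(talk("input.wav"))
```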
app.py ADDED
@@ -0,0 +1,1497 @@
1
+ """
2
+ DumontTalker Inference Server - Full Pipeline with WebSocket Streaming
3
+ TensorDock RTX 3090 (24GB VRAM)
4
+
5
+ Pipeline: Whisper (STT) → Gemma 3 4B vLLM (LLM) → Kokoro (TTS)
6
+ + CEFR Classifier: Classifica o nível do aluno a cada CEFR_CLASSIFY_EVERY mensagens (2 por padrão)
7
+
8
+ Auto-stop: Para a instância após 60s de inatividade
9
+
10
+ WebSocket: /ws/stream - Streaming bidirecional de áudio
11
+ - Cliente envia: áudio (binary)
12
+ - Servidor envia: chunks de áudio (binary) + métricas (JSON)
13
+ """
14
+
15
+ import base64
16
+ import io
17
+ import os
18
+ import re
19
+ import time
20
+ import json
21
+ import asyncio
22
+ import threading
23
+ import requests as http_requests
24
+ from datetime import datetime
25
+ from typing import Optional
26
+ from dataclasses import dataclass
27
+
28
+ from fastapi import FastAPI, File, Form, UploadFile, HTTPException, WebSocket, WebSocketDisconnect
29
+ from fastapi.responses import JSONResponse
30
+ from fastapi.middleware.cors import CORSMiddleware
31
+ from pydantic import BaseModel
32
+ from typing import List, Optional
33
+
34
+ # ============================================================================
35
+ # PYDANTIC MODELS - Compatible with frontend
36
+ # ============================================================================
37
+ class AudioRequest(BaseModel):
38
+ """Request format expected by frontend"""
39
+ audio: str # base64 encoded WAV
40
+ language: str = "pt" # Idioma para STT (forçar transcrição neste idioma)
41
+ voice: str = "pf_dora"
42
+ mode: str = "default"
43
+ conversation_history: List[dict] = []
44
+ student_name: str = "Aluno" # Nome do aluno para personalizar prompts
45
+ # Novo: system_prompt opcional enviado pelo frontend (para adaptar ao nível CEFR)
46
+ system_prompt: Optional[str] = None
47
+ max_tokens: Optional[int] = None # Max tokens para resposta (opcional)
48
+ temperature: Optional[float] = None # Temperature para LLM (opcional)
49
+ speed_rate: Optional[float] = None # Velocidade manual da fala (0.5-1.5, None = automático)
50
+
51
+ class TextRequest(BaseModel):
52
+ """Text request format expected by frontend"""
53
+ text: str
54
+ language: str = "pt"
55
+ voice: str = "pf_dora"
56
+ mode: str = "default"
57
+ stream: bool = False
58
+ student_name: str = "Aluno" # Nome do aluno para personalizar prompts
59
+ # Novo: system_prompt opcional enviado pelo frontend (para adaptar ao nível CEFR)
60
+ system_prompt: Optional[str] = None
61
+ max_tokens: Optional[int] = None # Max tokens para resposta (opcional)
62
+ temperature: Optional[float] = None # Temperature para LLM (opcional)
63
+
64
+ # ============================================================================
65
+ # AUTO-STOP CONFIGURATION
66
+ # ============================================================================
67
+ IDLE_TIMEOUT_SECONDS = int(os.environ.get("IDLE_TIMEOUT_SECONDS", "60"))
68
+ TENSORDOCK_API_TOKEN = os.environ.get("TENSORDOCK_API_TOKEN", "")
69
+ TENSORDOCK_INSTANCE_ID = os.environ.get("TENSORDOCK_INSTANCE_ID", "")
70
+
71
+ # Email alerts configuration
72
+ RESEND_API_KEY = os.environ.get("RESEND_API_KEY", "")
73
+ ALERT_EMAIL = os.environ.get("ALERT_EMAIL", "marcos@marcosrp.com") # Email to receive alerts
74
+
75
+ # Try to get instance ID from hostname if not set
76
+ if not TENSORDOCK_INSTANCE_ID:
77
+ try:
78
+ import socket
79
+ TENSORDOCK_INSTANCE_ID = socket.gethostname()
80
+ except:
81
+ pass
82
+
83
+ # Global state
84
+ last_activity = datetime.now()
85
+ auto_stop_enabled = True
86
+
87
+ def send_alert_email(subject: str, message: str):
88
+ """Send alert email via Resend API"""
89
+ if not RESEND_API_KEY:
90
+ print(f"[ALERT] No RESEND_API_KEY set, cannot send email: {subject}")
91
+ return False
92
+
93
+ try:
94
+ resp = http_requests.post(
95
+ "https://api.resend.com/emails",
96
+ headers={
97
+ "Authorization": f"Bearer {RESEND_API_KEY}",
98
+ "Content-Type": "application/json"
99
+ },
100
+ json={
101
+ "from": "PARLE Backend <alerts@parle.marcosrp.com>",
102
+ "to": [ALERT_EMAIL],
103
+ "subject": f"[PARLE ALERT] {subject}",
104
+ "html": f"""
105
+ <h2>🚨 PARLE Backend Alert</h2>
106
+ <p><strong>Instance:</strong> {TENSORDOCK_INSTANCE_ID or 'unknown'}</p>
107
+ <p><strong>Time:</strong> {datetime.now().isoformat()}</p>
108
+ <hr/>
109
+ <p>{message}</p>
110
+ <hr/>
111
+ <p style="color: #666; font-size: 12px;">
112
+ This is an automated alert from the PARLE TensorDock backend.
113
+ </p>
114
+ """
115
+ },
116
+ timeout=10
117
+ )
118
+ if resp.status_code == 200:
119
+ print(f"[ALERT] Email sent successfully: {subject}")
120
+ return True
121
+ else:
122
+ print(f"[ALERT] Failed to send email: {resp.status_code} {resp.text}")
123
+ return False
124
+ except Exception as e:
125
+ print(f"[ALERT] Error sending email: {e}")
126
+ return False
127
+
128
+ def touch_activity():
129
+ """Register activity (reset idle timer)"""
130
+ global last_activity
131
+ last_activity = datetime.now()
132
+
133
+ def stop_instance():
134
+ """Stop this TensorDock instance via API"""
135
+ if not TENSORDOCK_API_TOKEN or not TENSORDOCK_INSTANCE_ID:
136
+ error_msg = "Missing API token or instance ID, cannot stop"
137
+ print(f"[AUTO-STOP] {error_msg}")
138
+ send_alert_email(
139
+ "Auto-Stop FAILED - Missing Credentials",
140
+ f"""
141
+ <p><strong>Error:</strong> {error_msg}</p>
142
+ <p><strong>TENSORDOCK_API_TOKEN:</strong> {'SET' if TENSORDOCK_API_TOKEN else 'NOT SET'}</p>
143
+ <p><strong>TENSORDOCK_INSTANCE_ID:</strong> {TENSORDOCK_INSTANCE_ID or 'NOT SET'}</p>
144
+ <p style="color: red;"><strong>⚠️ The instance is still running and costing money!</strong></p>
145
+ <p>Please SSH into the VM and set the environment variables, or stop the instance manually.</p>
146
+ """
147
+ )
148
+ return False
149
+
150
+ try:
151
+ print(f"[AUTO-STOP] Stopping instance {TENSORDOCK_INSTANCE_ID}...")
152
+ resp = http_requests.post(
153
+ f"https://dashboard.tensordock.com/api/v2/instances/{TENSORDOCK_INSTANCE_ID}/stop",
154
+ headers={"Authorization": f"Bearer {TENSORDOCK_API_TOKEN}"},
155
+ timeout=30
156
+ )
157
+ if resp.status_code == 200:
158
+ print("[AUTO-STOP] Instance stopped successfully!")
159
+ return True
160
+ else:
161
+ error_msg = f"API returned {resp.status_code}: {resp.text}"
162
+ print(f"[AUTO-STOP] Failed to stop: {error_msg}")
163
+ send_alert_email(
164
+ "Auto-Stop FAILED - API Error",
165
+ f"""
166
+ <p><strong>Error:</strong> {error_msg}</p>
167
+ <p style="color: red;"><strong>⚠️ The instance is still running and costing money!</strong></p>
168
+ <p>Please stop the instance manually via TensorDock dashboard.</p>
169
+ """
170
+ )
171
+ return False
172
+ except Exception as e:
173
+ error_msg = str(e)
174
+ print(f"[AUTO-STOP] Error stopping instance: {error_msg}")
175
+ send_alert_email(
176
+ "Auto-Stop FAILED - Exception",
177
+ f"""
178
+ <p><strong>Exception:</strong> {error_msg}</p>
179
+ <p style="color: red;"><strong>⚠️ The instance is still running and costing money!</strong></p>
180
+ <p>Please stop the instance manually via TensorDock dashboard.</p>
181
+ """
182
+ )
183
+ return False
184
+
185
+ def idle_monitor():
186
+ """Background thread that monitors idle time and stops instance"""
187
+ global last_activity, auto_stop_enabled
188
+
189
+ print(f"[AUTO-STOP] Monitor started. Timeout: {IDLE_TIMEOUT_SECONDS}s")
190
+
191
+ while auto_stop_enabled:
192
+ time.sleep(10) # Check every 10 seconds
193
+
194
+ elapsed = (datetime.now() - last_activity).total_seconds()
195
+ remaining = max(0, IDLE_TIMEOUT_SECONDS - elapsed)
196
+
197
+ if elapsed >= IDLE_TIMEOUT_SECONDS:
198
+ print(f"[AUTO-STOP] Idle for {elapsed:.0f}s, stopping instance...")
199
+ success = stop_instance()
200
+ if not success:
201
+ # Alert already sent by stop_instance, but log the failure
202
+ print("[AUTO-STOP] CRITICAL: Failed to stop instance! Will keep trying every 60s...")
203
+ # Keep trying every 60 seconds instead of giving up
204
+ while auto_stop_enabled:
205
+ time.sleep(60)
206
+ if stop_instance():
207
+ break
208
+ break
209
+ elif remaining <= 30:
210
+ print(f"[AUTO-STOP] Warning: stopping in {remaining:.0f}s if no activity")
211
+
212
+ # Start idle monitor thread
213
+ monitor_thread = threading.Thread(target=idle_monitor, daemon=True)
214
+
215
+ # ============================================================================
216
+ # TEXT CHUNKER - Divide texto em chunks para TTS streaming
217
+ # ============================================================================
218
+ @dataclass
219
+ class ChunkConfig:
220
+ """Configuração do chunker"""
221
+ min_words: int = 3
222
+ max_words: int = 15
223
+ filler_words: list = None
224
+
225
+ def __post_init__(self):
226
+ if self.filler_words is None:
227
+ self.filler_words = ["hmm,", "bem,", "então,", "bom,", "olha,"]
228
+
229
+
230
+ class TextChunker:
231
+ """
232
+ Divide streaming de texto em chunks para TTS.
233
+
234
+ Prioridades de quebra:
235
+ 5: Fim de frase (. ! ?)
236
+ 4: Quebras semânticas fortes (porém, entretanto, ; :)
237
+ 3: Quebras médias (enquanto, embora, ,)
238
+ 2: Conectivos (e, mas, porque)
239
+ 1: Fallback por contagem de palavras
240
+ """
241
+
242
+ def __init__(self, config: ChunkConfig = None):
243
+ self.config = config or ChunkConfig()
244
+ self.buffer = ""
245
+ self.word_count = 0
246
+
247
+ # Padrões de quebra com prioridades
248
+ self.break_patterns = {
249
+ 5: [r'[.!?](?:\s|$)'], # Fim de frase
250
+ 4: [r'[;:](?:\s|$)', r'\b(porém|entretanto|contudo|todavia|portanto)\b'],
251
+ 3: [r',(?:\s|$)', r'\b(enquanto|embora|desde)\b'],
252
+ 2: [r'\b(e|mas|porque|então|ou)\b'],
253
+ }
254
+
255
+ def add_token(self, token: str) -> Optional[str]:
256
+ """
257
+ Adiciona token ao buffer e retorna chunk se pronto.
258
+
259
+ Returns:
260
+ Chunk de texto pronto para TTS, ou None se ainda acumulando.
261
+ """
262
+ self.buffer += token
263
+ self.word_count = len(self.buffer.split())
264
+
265
+ # Verificar quebras por prioridade
266
+ for priority in [5, 4, 3, 2]:
267
+ for pattern in self.break_patterns.get(priority, []):
268
+ match = re.search(pattern, self.buffer, re.IGNORECASE)
269
+ if match and self.word_count >= self.config.min_words:
270
+ # Encontrou ponto de quebra
271
+ split_pos = match.end()
272
+ chunk = self.buffer[:split_pos].strip()
273
+ self.buffer = self.buffer[split_pos:].strip()
274
+ self.word_count = len(self.buffer.split())
275
+ return chunk
276
+
277
+ # Fallback: quebrar por contagem de palavras
278
+ if self.word_count >= self.config.max_words:
279
+ words = self.buffer.split()
280
+ chunk = " ".join(words[:self.config.max_words])
281
+ self.buffer = " ".join(words[self.config.max_words:])
282
+ self.word_count = len(self.buffer.split())
283
+ return chunk
284
+
285
+ return None
286
+
287
+ def flush(self) -> Optional[str]:
288
+ """Retorna qualquer texto restante no buffer."""
289
+ if self.buffer.strip():
290
+ chunk = self.buffer.strip()
291
+ self.buffer = ""
292
+ self.word_count = 0
293
+ return chunk
294
+ return None
295
+
296
+
297
+ # ============================================================================
298
+ # MODELS
299
+ # ============================================================================
300
+ whisper_model = None
301
+ whisper_processor = None
302
+ vllm_engine = None
303
+ kokoro_pipeline = None
304
+ conversation_history = []
305
+
306
+ # CEFR Classifier
307
+ cefr_model = None
308
+ cefr_tokenizer = None
309
+ CEFR_MODEL = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem"
310
+ CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1"]
311
+
312
+ # CEFR tracking per session
313
+ user_message_buffer = [] # Buffer das mensagens do usuário
314
+ user_message_count = 0 # Contador de mensagens
315
+ current_cefr_level = "B1" # Nível padrão inicial
316
+ CEFR_CLASSIFY_EVERY = 2 # Classificar a cada N mensagens (reduzido para adaptação mais rápida)
317
+ CEFR_MIN_CHARS = 50 # Mínimo de caracteres para classificar (reduzido para A1)
318
+ CEFR_FIRST_MESSAGE_CLASSIFY = True # Classificar já na primeira mensagem se tiver chars suficientes
319
+
320
+ # Adaptive Speed System - Velocidade baseada em CEFR + espelhamento do aluno
321
+ CEFR_SPEED_MAP = {
322
+ "A1": 0.70, # Muito lento para iniciantes
323
+ "A2": 0.85, # Lento, bem articulado
324
+ "B1": 1.00, # Normal
325
+ "B2": 1.10, # Um pouco mais rápido
326
+ "C1": 1.25, # Fluente, rápido
327
+ }
328
+
329
+ CEFR_EXPECTED_WPM = {
330
+ "A1": 90, # Iniciantes falam ~90 palavras/min
331
+ "A2": 115,
332
+ "B1": 145,
333
+ "B2": 175,
334
+ "C1": 200, # Avançados falam ~200 palavras/min
335
+ }
336
+
337
+ # Estado da velocidade adaptativa
338
+ last_student_wpm = 0.0
339
+ suggested_avatar_speed = 1.0
340
+
341
+ LLM_MODEL = "RedHatAI/gemma-3-4b-it-quantized.w4a16"
342
+ WHISPER_MODEL = "openai/whisper-small"
343
+
344
+ # System prompts adaptados por nível CEFR
345
+ # {student_name} será substituído pelo nome do aluno
346
+ CEFR_SYSTEM_PROMPTS = {
347
+ "A1": """Você é Emma, professora de português para iniciantes.
348
+ O aluno se chama {student_name} e está no nível A1 (iniciante absoluto).
349
+
350
+ REGRAS OBRIGATÓRIAS:
351
+ - RESPONDA SEMPRE E SOMENTE EM PORTUGUÊS. NUNCA use inglês, francês ou outras línguas.
352
+ - Use 1-2 frases MUITO curtas (5-10 palavras cada).
353
+ - Vocabulário MUITO básico: saudações, números, cores, família, comida.
354
+ - Frases simples: sujeito + verbo + objeto. Ex: "Eu gosto de pizza."
355
+ - SEMPRE CORRIJA os erros do aluno de forma gentil, mostrando a forma correta.
356
+ Exemplo: Se o aluno disser "Eu gostar pizza", responda: "Muito bem! Em português dizemos 'Eu gosto de pizza'. Você gosta de pizza! 🍕"
357
+ - Se o aluno usar palavras em inglês/francês, ensine a palavra em português.
358
+ Exemplo: Se disser "happy", responda: "Feliz! Você está feliz! 😊"
359
+ - Celebre cada tentativa com entusiasmo.
360
+ - Faça perguntas simples: "Você gosta de...?" "O que é isso?"
361
+ - Use muitos emojis para tornar a conversa visual e amigável.""",
362
+
363
+ "A2": """Você é Emma, professora de português para nível básico.
364
+ O aluno se chama {student_name} e está no nível A2 (elementar).
365
+
366
+ REGRAS OBRIGATÓRIAS:
367
+ - RESPONDA SEMPRE E SOMENTE EM PORTUGUÊS. NUNCA use outras línguas.
368
+ - Use 2-3 frases curtas (10-15 palavras cada).
369
+ - Vocabulário do dia-a-dia: rotina, trabalho, hobbies, viagens.
370
+ - Conectivos básicos: e, mas, porque, quando, depois.
371
+ - Tempos verbais: presente, passado simples, "vou + infinitivo".
372
+ - CORRIJA erros importantes de forma natural e encorajadora.
373
+ Exemplo: "Ótimo! Só uma dica: dizemos 'fui ao cinema' em vez de 'fui no cinema'. Continue assim!"
374
+ - Se o aluno errar preposições ou conjugações, corrija gentilmente.
375
+ - Pergunte sobre rotina, família, hobbies, fins de semana.
376
+ - Seja paciente e encorajadora, mas ensine a forma correta.""",
377
+
378
+ "B1": """Você é Emma, professora de português para nível intermediário.
379
+ O aluno se chama {student_name} e está no nível B1 (intermediário).
380
+
381
+ REGRAS OBRIGATÓRIAS:
382
+ - RESPONDA SEMPRE E SOMENTE EM PORTUGUÊS.
383
+ - Use 2-3 frases de tamanho médio (15-25 palavras cada).
384
+ - Vocabulário variado com expressões comuns do português.
385
+ - Use diferentes tempos verbais naturalmente (presente, passado, futuro, condicional).
386
+ - Introduza o subjuntivo em contextos comuns: "Espero que você goste", "Talvez seja bom".
387
+ - Corrija erros de forma natural, integrada à conversa.
388
+ Exemplo: "Interessante! Eu também acho que seja importante... aliás, nesse caso dizemos 'é importante' no indicativo."
389
+ - Encoraje o aluno a elaborar mais: "Me conta mais sobre isso!"
390
+ - Tópicos: opiniões, experiências, planos, notícias, cultura.
391
+ - Faça perguntas que estimulem respostas mais longas.""",
392
+
393
+ "B2": """Você é Emma, professora de português para nível intermediário-avançado.
394
+ O aluno se chama {student_name} e está no nível B2 (intermediário superior).
395
+
396
+ REGRAS OBRIGATÓRIAS:
397
+ - RESPONDA SEMPRE EM PORTUGUÊS com naturalidade.
398
+ - Use 3-4 frases elaboradas (25-40 palavras cada).
399
+ - Vocabulário rico: expressões idiomáticas, phrasal verbs, colocações.
400
+ - Todas as estruturas gramaticais: subjuntivo, condicional, voz passiva.
401
+ - Discussões mais abstratas: política, sociedade, filosofia, arte.
402
+ - Correções sutis focando em nuances e estilo.
403
+ Exemplo: "Sua ideia está clara! Só um detalhe: em contextos mais formais, seria melhor usar 'embora' em vez de 'apesar que'."
404
+ - Desafie com perguntas argumentativas: "O que você pensa sobre...?" "Como você defende essa posição?"
405
+ - Use expressões coloquiais brasileiras naturalmente.
406
+ - Estimule debates e análises críticas.""",
407
+
408
+ "C1": """Você é Emma, professora de português para nível avançado.
409
+ O aluno se chama {student_name} e está no nível C1 (avançado/proficiente).
410
+
411
+ REGRAS OBRIGATÓRIAS:
412
+ - RESPONDA EM PORTUGUÊS com fluência nativa.
413
+ - Use 4-5 frases elaboradas e sofisticadas (40-60 palavras cada).
414
+ - Linguagem natural de falante nativo culto brasileiro.
415
+ - Vocabulário sofisticado: termos técnicos, acadêmicos, literários.
416
+ - Gírias, regionalismos, humor, ironia quando apropriado.
417
+ - Discussões complexas: filosofia, ciência, política internacional, arte, literatura.
418
+ - Correções apenas para refinamento estilístico ou nuances culturais.
419
+ Exemplo: "Argumento interessante! Talvez a expressão 'no que tange a' soe um pouco formal demais nesse contexto coloquial."
420
+ - Desafie intelectualmente: "Mas você não acha que há uma contradição entre...?"
421
+ - Explore nuances culturais brasileiras vs. portuguesas.
422
+ - Engaje em debates profundos e análises sofisticadas.
423
+ - Trate o aluno como um interlocutor intelectual.""",
424
+ }
425
+
426
+ # Configuração de max_tokens por nível CEFR
427
+ # Níveis mais baixos = respostas mais curtas, níveis altos = mais elaboradas
428
+ CEFR_MAX_TOKENS = {
429
+ "A1": 50, # 1-2 frases muito curtas
430
+ "A2": 70, # 2-3 frases curtas
431
+ "B1": 100, # 2-3 frases médias
432
+ "B2": 130, # 3-4 frases elaboradas
433
+ "C1": 180, # 4-5 frases sofisticadas
434
+ }
435
+
436
+ # Fallback prompts (mantidos para compatibilidade)
437
+ SYSTEM_PROMPTS = {
438
+ "chat": """Você é Emma, professora de idiomas. Ajude o usuário a praticar com conversação natural. Seja encorajadora, corrija erros gentilmente, mantenha respostas MUITO curtas (1-2 frases).""",
439
+ "default": """Você é Emma, uma professora de idiomas amigável e encorajadora. Ajude o usuário a aprender e praticar português. Mantenha respostas curtas e claras.""",
440
+ }
441
+
442
+ # ============================================================================
443
+ # FASTAPI APP
444
+ # ============================================================================
445
+ app = FastAPI(title="DumontTalker - Full Pipeline with WebSocket")
446
+
447
+ app.add_middleware(
448
+ CORSMiddleware,
449
+ allow_origins=["*"],
450
+ allow_credentials=True,
451
+ allow_methods=["*"],
452
+ allow_headers=["*"],
453
+ )
454
+
455
+ @app.on_event("startup")
456
+ async def load_models():
457
+ """Load all models on startup"""
458
+ global whisper_model, whisper_processor, vllm_engine, kokoro_pipeline
459
+ global cefr_model, cefr_tokenizer
460
+ import torch
461
+ import numpy as np
462
+
463
+ print("=" * 60)
464
+ print("Loading DumontTalker Full Pipeline + WebSocket + CEFR")
465
+ print(f"Auto-stop after {IDLE_TIMEOUT_SECONDS}s of inactivity")
466
+ print("=" * 60)
467
+
468
+ # 1. Load vLLM FIRST (needs contiguous memory)
469
+ print(f"[1/4] Loading vLLM: {LLM_MODEL}...")
470
+ from vllm import LLM
471
+
472
+ vllm_engine = LLM(
473
+ model=LLM_MODEL,
474
+ dtype="auto",
475
+ gpu_memory_utilization=0.40, # 40% of 24GB = ~9.6GB for vLLM (increased for longer prompts)
476
+ max_model_len=2048, # Increased to handle C1 level prompts
477
+ trust_remote_code=True,
478
+ )
479
+ print(f"[1/4] vLLM loaded!")
480
+
481
+ # 2. Load Whisper
482
+ print(f"[2/4] Loading Whisper: {WHISPER_MODEL}...")
483
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
484
+
485
+ whisper_processor = AutoProcessor.from_pretrained(WHISPER_MODEL)
486
+ whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
487
+ WHISPER_MODEL,
488
+ torch_dtype=torch.float16,
489
+ low_cpu_mem_usage=True,
490
+ ).to("cuda")
491
+ print(f"[2/4] Whisper loaded!")
492
+
493
+ # 3. Load CEFR Classifier (FP16 - ~0.6GB VRAM)
494
+ print(f"[3/4] Loading CEFR Classifier: {CEFR_MODEL}...")
495
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
496
+
497
+ cefr_tokenizer = AutoTokenizer.from_pretrained(CEFR_MODEL)
498
+ cefr_model = AutoModelForSequenceClassification.from_pretrained(
499
+ CEFR_MODEL,
500
+ torch_dtype=torch.float16,
501
+ low_cpu_mem_usage=True,
502
+ ).to("cuda")
503
+ cefr_model.eval() # Set to evaluation mode
504
+ print(f"[3/4] CEFR Classifier loaded! (FP16)")
505
+
506
+ # 4. Load Kokoro TTS
507
+ print(f"[4/4] Loading Kokoro TTS...")
508
+ from kokoro import KPipeline
509
+
510
+ kokoro_pipeline = KPipeline(lang_code='p', device='cuda')
511
+ print(f"[4/4] Kokoro loaded!")
512
+
513
+ # Memory status
514
+ allocated = torch.cuda.memory_allocated(0) / 1024**3
515
+ total = torch.cuda.get_device_properties(0).total_memory / 1024**3
516
+ print("=" * 60)
517
+ print(f"All models loaded! VRAM: {allocated:.1f}GB / {total:.1f}GB")
518
+ print(f"CEFR Classifier: {CEFR_MODEL}")
519
+ print(f"CEFR classification every {CEFR_CLASSIFY_EVERY} messages")
520
+ print("WebSocket endpoint: ws://host:8000/ws/stream")
521
+ print("=" * 60)
522
+
523
+ # Start idle monitor AFTER models are loaded
524
+ touch_activity() # Reset timer
525
+ monitor_thread.start()
526
+ print("[AUTO-STOP] Idle monitor started")
527
+
528
+
529
+ # ============================================================================
530
+ # HELPER FUNCTIONS
531
+ # ============================================================================
532
+
533
+ def classify_cefr(text: str) -> tuple:
534
+ """
535
+ Classifica o nível CEFR de um texto.
536
+
537
+ Returns:
538
+ tuple: (level, confidence, all_probs)
539
+ """
540
+ global cefr_model, cefr_tokenizer
541
+ import torch
542
+
543
+ if cefr_model is None or cefr_tokenizer is None:
544
+ print("[CEFR] Model not loaded, returning default B1")
545
+ return "B1", 0.0, {}
546
+
547
+ start = time.time()
548
+
549
+ # Tokenize
550
+ inputs = cefr_tokenizer(
551
+ text,
552
+ return_tensors="pt",
553
+ truncation=True,
554
+ max_length=512,
555
+ padding=True
556
+ )
557
+ inputs = {k: v.to("cuda") for k, v in inputs.items()}
558
+
559
+ # Inference
560
+ with torch.no_grad():
561
+ outputs = cefr_model(**inputs)
562
+ probs = torch.softmax(outputs.logits, dim=-1)
563
+ pred_idx = torch.argmax(probs, dim=-1).item()
564
+ confidence = probs[0][pred_idx].item()
565
+
566
+ level = CEFR_LEVELS[pred_idx]
567
+ all_probs = {CEFR_LEVELS[i]: probs[0][i].item() for i in range(len(CEFR_LEVELS))}
568
+
569
+ elapsed_ms = int((time.time() - start) * 1000)
570
+ print(f"[CEFR] {elapsed_ms}ms | Level: {level} ({confidence:.0%}) | Probs: {all_probs}")
571
+
572
+ return level, confidence, all_probs
573
+
574
+
575
+ def update_cefr_level(user_text: str) -> str:
576
+ """
577
+ Atualiza o nível CEFR baseado nas mensagens do usuário.
578
+ Classifica quando atingir CEFR_CLASSIFY_EVERY mensagens E CEFR_MIN_CHARS caracteres.
579
+
580
+ Se CEFR_FIRST_MESSAGE_CLASSIFY=True, também classifica na primeira mensagem
581
+ se ela tiver caracteres suficientes (importante para adaptação imediata).
582
+
583
+ Returns:
584
+ str: Nível CEFR atual (pode ter sido atualizado ou não)
585
+ """
586
+ global user_message_buffer, user_message_count, current_cefr_level
587
+
588
+ # Adiciona mensagem ao buffer
589
+ user_message_buffer.append(user_text)
590
+ user_message_count += 1
591
+
592
+ # Calcula tamanho total do buffer
593
+ combined_text = " ".join(user_message_buffer)
594
+ total_chars = len(combined_text)
595
+
596
+ print(f"[CEFR] Message {user_message_count}/{CEFR_CLASSIFY_EVERY} buffered | {total_chars}/{CEFR_MIN_CHARS} chars")
597
+
598
+ # Verifica se deve classificar:
599
+ # 1. Na primeira mensagem se CEFR_FIRST_MESSAGE_CLASSIFY=True e tiver chars suficientes
600
+ # 2. A cada CEFR_CLASSIFY_EVERY mensagens com chars suficientes
601
+ should_classify = False
602
+
603
+ if CEFR_FIRST_MESSAGE_CLASSIFY and user_message_count == 1 and total_chars >= CEFR_MIN_CHARS:
604
+ print(f"[CEFR] First message classification triggered ({total_chars} chars)")
605
+ should_classify = True
606
+ elif user_message_count >= CEFR_CLASSIFY_EVERY and total_chars >= CEFR_MIN_CHARS:
607
+ print(f"[CEFR] Periodic classification triggered ({total_chars} chars)")
608
+ should_classify = True
609
+
610
+ if should_classify:
611
+ print(f"[CEFR] Classifying combined text ({total_chars} chars)...")
612
+
613
+ # Classifica
614
+ new_level, confidence, probs = classify_cefr(combined_text)
615
+
616
+ # Atualiza nível se confiança > 50% (reduzido de 60% para melhor adaptação)
617
+ if confidence > 0.5:
618
+ old_level = current_cefr_level
619
+ current_cefr_level = new_level
620
+ if old_level != new_level:
621
+ print(f"[CEFR] Level changed: {old_level} → {new_level} (confidence: {confidence:.0%})")
622
+ else:
623
+ print(f"[CEFR] Level confirmed: {new_level} (confidence: {confidence:.0%})")
624
+ else:
625
+ print(f"[CEFR] Low confidence ({confidence:.0%}), keeping level: {current_cefr_level}")
626
+
627
+ # Reset buffer e contador
628
+ user_message_buffer = []
629
+ user_message_count = 0
630
+
631
+ elif user_message_count >= CEFR_CLASSIFY_EVERY:
632
+ # Atingiu mensagens mas não caracteres - continua acumulando
633
+ print(f"[CEFR] Need more text ({total_chars}/{CEFR_MIN_CHARS} chars), continuing to buffer...")
634
+
635
+ return current_cefr_level
636
+
637
+
638
+ def calculate_speech_metrics(audio_array, sample_rate: int, transcript: str) -> dict:
639
+ """
640
+ Calcula métricas de fala do aluno.
641
+
642
+ Args:
643
+ audio_array: Array de áudio (numpy)
644
+ sample_rate: Taxa de amostragem
645
+ transcript: Texto transcrito
646
+
647
+ Returns:
648
+ dict com métricas: audio_duration_sec, word_count, wpm
649
+ """
650
+ import numpy as np
651
+
652
+ # Duração do áudio em segundos
653
+ audio_duration_sec = len(audio_array) / sample_rate
654
+
655
+ # Contar palavras (simples, baseado em espaços)
656
+ words = transcript.strip().split()
657
+ word_count = len(words)
658
+
659
+ # Calcular WPM (palavras por minuto)
660
+ if audio_duration_sec > 0:
661
+ wpm = (word_count / audio_duration_sec) * 60
662
+ else:
663
+ wpm = 0
664
+
665
+ return {
666
+ "audio_duration_sec": round(audio_duration_sec, 2),
667
+ "word_count": word_count,
668
+ "wpm": round(wpm, 1)
669
+ }
670
+
671
+
672
+ def calculate_suggested_speed(cefr_level: str, student_wpm: float, manual_speed: Optional[float] = None) -> float:
673
+ """
674
+ Calcula a velocidade sugerida para o avatar baseada em:
675
+ 1. Nível CEFR do aluno
676
+ 2. WPM do aluno (espelhamento)
677
+ 3. Preferência manual (tem prioridade)
678
+
679
+ Args:
680
+ cefr_level: Nível CEFR atual do aluno (A1-C1)
681
+ student_wpm: Palavras por minuto do aluno
682
+ manual_speed: Velocidade definida manualmente (None = automático)
683
+
684
+ Returns:
685
+ float: Velocidade sugerida (0.5 a 1.5)
686
+ """
687
+ global last_student_wpm, suggested_avatar_speed
688
+
689
+ # Se velocidade manual foi definida, ela tem prioridade
690
+ if manual_speed is not None:
691
+ return max(0.5, min(1.5, manual_speed))
692
+
693
+ # Velocidade base do nível CEFR
694
+ base_speed = CEFR_SPEED_MAP.get(cefr_level, 1.0)
695
+
696
+ # Espelhamento: ajusta baseado na diferença entre WPM do aluno e esperado
697
+ if student_wpm > 0:
698
+ expected_wpm = CEFR_EXPECTED_WPM.get(cefr_level, 145)
699
+
700
+ # Razão entre WPM real e esperado
701
+ # Se aluno fala mais devagar, ratio < 1, avatar desacelera
702
+ # Se aluno fala mais rápido, ratio > 1, avatar acelera (até o limite)
703
+ ratio = student_wpm / expected_wpm
704
+
705
+ # Limita o fator de espelhamento entre 0.7 e 1.3
706
+ mirror_factor = max(0.7, min(1.3, ratio))
707
+
708
+ # Velocidade sugerida = base * espelhamento
709
+ suggested = base_speed * mirror_factor
710
+
711
+ # Atualiza estado global
712
+ last_student_wpm = student_wpm
713
+ else:
714
+ # Sem dados de WPM, usa apenas a velocidade base do CEFR
715
+ suggested = base_speed
716
+
717
+ # Limita entre 0.5 e 1.5
718
+ suggested_avatar_speed = max(0.5, min(1.5, suggested))
719
+
720
+ print(f"[SPEED] CEFR={cefr_level}, WPM={student_wpm:.0f}, base={base_speed:.2f}, suggested={suggested_avatar_speed:.2f}")
721
+
722
+ return round(suggested_avatar_speed, 2)
723
+
724
+
725
+ def transcribe_audio(audio_data: bytes, language: str = "pt") -> dict:
726
+ """
727
+ Transcreve áudio usando Whisper e calcula métricas de fala.
728
+
729
+ Args:
730
+ audio_data: Dados do áudio em bytes (WAV)
731
+ language: Código do idioma para forçar transcrição (pt, en, fr, es, etc.)
732
+ Default: "pt" (português)
733
+
734
+ Returns:
735
+ dict com: transcript, stt_ms, speech_metrics (audio_duration_sec, word_count, wpm)
736
+ """
737
+ global whisper_model, whisper_processor
738
+ import torch
739
+ import soundfile as sf
740
+ import librosa
741
+
742
+ start = time.time()
743
+
744
+ audio_array, sr = sf.read(io.BytesIO(audio_data))
745
+
746
+ if sr != 16000:
747
+ audio_array = librosa.resample(audio_array, orig_sr=sr, target_sr=16000)
748
+ sr = 16000
749
+
750
+ inputs = whisper_processor(audio_array, sampling_rate=16000, return_tensors="pt")
751
+ inputs = {k: v.to("cuda", dtype=torch.float16) if v.dtype == torch.float32 else v.to("cuda")
752
+ for k, v in inputs.items()}
753
+
754
+ # Forçar idioma na transcrição para evitar confusão
755
+ # Whisper usa tokens especiais para idioma: <|pt|>, <|en|>, etc.
756
+ forced_decoder_ids = whisper_processor.get_decoder_prompt_ids(language=language, task="transcribe")
757
+
758
+ with torch.no_grad():
759
+ output_ids = whisper_model.generate(
760
+ **inputs,
761
+ max_new_tokens=128,
762
+ forced_decoder_ids=forced_decoder_ids
763
+ )
764
+
765
+ transcript = whisper_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
766
+ elapsed_ms = int((time.time() - start) * 1000)
767
+
768
+ if not transcript.strip():
769
+ transcript = "Olá"
770
+
771
+ # Calcular métricas de fala (WPM, duração, etc.)
772
+ speech_metrics = calculate_speech_metrics(audio_array, sr, transcript)
773
+
774
+ print(f"[STT] {elapsed_ms}ms | lang={language} | WPM={speech_metrics['wpm']:.0f} | '{transcript}'")
775
+
776
+ return {
777
+ "transcript": transcript,
778
+ "stt_ms": elapsed_ms,
779
+ "speech_metrics": speech_metrics
780
+ }
781
+
782
+
783
+ def generate_response(
784
+ transcript: str,
785
+ mode: str = "chat",
786
+ student_name: str = "Aluno",
787
+ custom_system_prompt: Optional[str] = None,
788
+ custom_max_tokens: Optional[int] = None,
789
+ custom_temperature: Optional[float] = None
790
+ ) -> tuple:
791
+ """
792
+ Gera resposta com vLLM, adaptada ao nível CEFR do aluno.
793
+
794
+ O nível CEFR é atualizado a cada 2 mensagens do usuário.
795
+ O system prompt pode ser:
796
+ 1. Enviado pelo frontend (custom_system_prompt) - PREFERIDO
797
+ 2. Detectado automaticamente pelo backend (fallback)
798
+
799
+ Args:
800
+ transcript: Texto do usuário
801
+ mode: Modo de conversação (chat, default, cefr_adaptive)
802
+ student_name: Nome do aluno para personalizar o prompt
803
+ custom_system_prompt: System prompt customizado enviado pelo frontend (opcional)
804
+ custom_max_tokens: Max tokens customizado enviado pelo frontend (opcional)
805
+ custom_temperature: Temperature customizada enviada pelo frontend (opcional)
806
+ """
807
+ global vllm_engine, conversation_history, current_cefr_level
808
+ from vllm import SamplingParams
809
+ from transformers import AutoTokenizer
810
+
811
+ start = time.time()
812
+
813
+ # 1. Atualiza nível CEFR baseado na mensagem do usuário (para tracking)
814
+ cefr_level = update_cefr_level(transcript)
815
+
816
+ # 2. Seleciona system prompt
817
+ # PRIORIDADE: custom_system_prompt do frontend > prompt interno baseado em CEFR
818
+ if custom_system_prompt:
819
+ # Usa o prompt enviado pelo frontend (já adaptado ao nível CEFR)
820
+ system = custom_system_prompt
821
+ print(f"[LLM] Using CUSTOM system prompt from frontend (length: {len(system)})")
822
+ elif mode in ["chat", "default", "cefr_adaptive"]:
823
+ # Fallback: usa prompt interno baseado no nível detectado
824
+ system_template = CEFR_SYSTEM_PROMPTS.get(cefr_level, CEFR_SYSTEM_PROMPTS["B1"])
825
+ system = system_template.format(student_name=student_name)
826
+ print(f"[LLM] Using INTERNAL CEFR prompt for level: {cefr_level}, student: {student_name}")
827
+ else:
828
+ # Modo específico (fallback)
829
+ system = SYSTEM_PROMPTS.get(mode, SYSTEM_PROMPTS["default"])
830
+
831
+ messages = [{"role": "system", "content": system}]
832
+ messages.extend(conversation_history[-10:])
833
+ messages.append({"role": "user", "content": transcript})
834
+
835
+ tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)
836
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
837
+
838
+ # Usa parâmetros customizados se fornecidos, senão usa baseado no nível CEFR
839
+ max_tokens = custom_max_tokens if custom_max_tokens else CEFR_MAX_TOKENS.get(cefr_level, 100)
840
+ temperature = custom_temperature if custom_temperature else 0.7
841
+
842
+ params = SamplingParams(temperature=temperature, top_p=0.8, max_tokens=max_tokens)
843
+ outputs = vllm_engine.generate(prompt, params)
844
+ response = outputs[0].outputs[0].text.strip()
845
+
846
+ print(f"[LLM] max_tokens={max_tokens}, temp={temperature}, CEFR={cefr_level}")
847
+
848
+ conversation_history.append({"role": "user", "content": transcript})
849
+ conversation_history.append({"role": "assistant", "content": response})
850
+ if len(conversation_history) > 20:
851
+ conversation_history = conversation_history[-20:]
852
+
853
+ elapsed_ms = int((time.time() - start) * 1000)
854
+ print(f"[LLM] {elapsed_ms}ms | CEFR:{cefr_level} | '{response}'")
855
+
856
+ return response, elapsed_ms
857
+
858
+
859
+ def remove_emojis(text: str) -> str:
860
+ """Remove emojis e caracteres especiais do texto antes do TTS"""
861
+ # Pattern para remover emojis
862
+ emoji_pattern = re.compile("["
863
+ u"\U0001F600-\U0001F64F" # emoticons
864
+ u"\U0001F300-\U0001F5FF" # symbols & pictographs
865
+ u"\U0001F680-\U0001F6FF" # transport & map symbols
866
+ u"\U0001F1E0-\U0001F1FF" # flags
867
+ u"\U00002702-\U000027B0" # dingbats
868
+ u"\U000024C2-\U0001F251" # enclosed characters
869
+ u"\U0001f926-\U0001f937" # gestures
870
+ u"\U00010000-\U0010ffff" # supplementary
871
+ u"\u2640-\u2642" # gender symbols
872
+ u"\u2600-\u2B55" # misc symbols
873
+ u"\u200d" # zero width joiner
874
+ u"\u23cf" # eject symbol
875
+ u"\u23e9" # fast forward
876
+ u"\u231a" # watch
877
+ u"\ufe0f" # variation selector
878
+ u"\u3030" # wavy dash
879
+ "]+", flags=re.UNICODE)
880
+ return emoji_pattern.sub('', text).strip()
881
+
882
+
883
+ def synthesize_audio(text: str, voice: str = "af_bella") -> tuple:
884
+ """Sintetiza áudio com Kokoro TTS"""
885
+ global kokoro_pipeline
886
+ import numpy as np
887
+ import soundfile as sf
888
+
889
+ start = time.time()
890
+
891
+ # Remove emojis antes do TTS
892
+ clean_text = remove_emojis(text)
893
+ print(f"[TTS] Original: '{text}' -> Clean: '{clean_text}'")
894
+
895
+ audio_chunks = []
896
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_text, voice=voice):
897
+ if audio_chunk is not None and len(audio_chunk) > 0:
898
+ audio_chunks.append(audio_chunk)
899
+
900
+ if not audio_chunks:
901
+ raise Exception("TTS failed to generate audio")
902
+
903
+ audio_output = np.concatenate(audio_chunks)
904
+
905
+ buffer = io.BytesIO()
906
+ sf.write(buffer, audio_output, 24000, format='WAV')
907
+ audio_bytes = buffer.getvalue()
908
+
909
+ elapsed_ms = int((time.time() - start) * 1000)
910
+ print(f"[TTS] {elapsed_ms}ms | {len(audio_bytes)} bytes")
911
+
912
+ return audio_bytes, elapsed_ms
913
+
914
+
915
+ # ============================================================================
916
+ # HTTP ENDPOINTS
917
+ # ============================================================================
918
+
919
+ @app.get("/health")
920
+ def health():
921
+ import torch
922
+ global last_activity, current_cefr_level, user_message_count
923
+
924
+ elapsed = (datetime.now() - last_activity).total_seconds()
925
+ remaining = max(0, IDLE_TIMEOUT_SECONDS - elapsed)
926
+
927
+ allocated = torch.cuda.memory_allocated(0) / 1024**3 if torch.cuda.is_available() else 0
928
+ return {
929
+ "status": "healthy",
930
+ # Frontend compatibility fields
931
+ "whisper_loaded": whisper_model is not None,
932
+ "vllm_loaded": vllm_engine is not None,
933
+ "kokoro_loaded": kokoro_pipeline is not None,
934
+ "cefr_loaded": cefr_model is not None,
935
+ # Additional info
936
+ "models": {
937
+ "stt": WHISPER_MODEL,
938
+ "llm": LLM_MODEL,
939
+ "tts": "kokoro",
940
+ "cefr": CEFR_MODEL,
941
+ },
942
+ "cefr": {
943
+ "current_level": current_cefr_level,
944
+ "messages_until_classify": max(0, CEFR_CLASSIFY_EVERY - user_message_count),
945
+ "classify_every": CEFR_CLASSIFY_EVERY,
946
+ "min_chars": CEFR_MIN_CHARS,
947
+ "current_chars": len(" ".join(user_message_buffer)),
948
+ },
949
+ "vram_gb": f"{allocated:.1f}",
950
+ "websocket": "/ws/stream",
951
+ "auto_stop": {
952
+ "enabled": auto_stop_enabled,
953
+ "timeout_seconds": IDLE_TIMEOUT_SECONDS,
954
+ "idle_seconds": int(elapsed),
955
+ "stop_in_seconds": int(remaining),
956
+ }
957
+ }
958
+
959
+
960
+ class CEFRClassifyRequest(BaseModel):
961
+ """Request para classificação CEFR manual"""
962
+ text: str
963
+
964
+
965
+ @app.post("/api/cefr/classify")
966
+ async def api_cefr_classify(request: CEFRClassifyRequest):
967
+ """
968
+ Classifica manualmente o nível CEFR de um texto.
969
+ Não afeta o nível atual da sessão.
970
+ """
971
+ touch_activity()
972
+
973
+ level, confidence, probs = classify_cefr(request.text)
974
+
975
+ return {
976
+ "level": level,
977
+ "confidence": confidence,
978
+ "probabilities": probs,
979
+ "text_length": len(request.text),
980
+ }
981
+
982
+
983
+ @app.get("/api/cefr/status")
984
+ async def api_cefr_status():
985
+ """Retorna o status atual do CEFR"""
986
+ global current_cefr_level, user_message_count, user_message_buffer
987
+
988
+ current_chars = len(" ".join(user_message_buffer))
989
+ return {
990
+ "current_level": current_cefr_level,
991
+ "message_count": user_message_count,
992
+ "messages_until_classify": max(0, CEFR_CLASSIFY_EVERY - user_message_count),
993
+ "buffer_size": len(user_message_buffer),
994
+ "current_chars": current_chars,
995
+ "min_chars": CEFR_MIN_CHARS,
996
+ "chars_until_classify": max(0, CEFR_MIN_CHARS - current_chars),
997
+ "ready_to_classify": user_message_count >= CEFR_CLASSIFY_EVERY and current_chars >= CEFR_MIN_CHARS,
998
+ }
999
+
1000
+
1001
+ @app.post("/api/cefr/reset")
1002
+ async def api_cefr_reset():
1003
+ """Reseta o nível CEFR para o padrão (B1)"""
1004
+ global current_cefr_level, user_message_count, user_message_buffer
1005
+
1006
+ old_level = current_cefr_level
1007
+ current_cefr_level = "B1"
1008
+ user_message_count = 0
1009
+ user_message_buffer = []
1010
+
1011
+ return {
1012
+ "status": "reset",
1013
+ "old_level": old_level,
1014
+ "new_level": current_cefr_level,
1015
+ }
1016
+
1017
+
1018
+ @app.post("/api/cefr/set")
1019
+ async def api_cefr_set(level: str = Form(...)):
1020
+ """Define manualmente o nível CEFR"""
1021
+ global current_cefr_level
1022
+
1023
+ if level not in CEFR_LEVELS:
1024
+ raise HTTPException(status_code=400, detail=f"Invalid level. Must be one of: {CEFR_LEVELS}")
1025
+
1026
+ old_level = current_cefr_level
1027
+ current_cefr_level = level
1028
+
1029
+ return {
1030
+ "status": "set",
1031
+ "old_level": old_level,
1032
+ "new_level": current_cefr_level,
1033
+ }
1034
+
1035
+
1036
+ @app.post("/chat")
1037
+ async def chat(
1038
+ message: str = Form(...),
1039
+ mode: str = Form("chat"),
1040
+ ):
1041
+ """Text-only chat"""
1042
+ touch_activity()
1043
+
1044
+ response, llm_ms = generate_response(message, mode)
1045
+
1046
+ return {
1047
+ "response": response,
1048
+ "provider": "tensordock",
1049
+ "model": LLM_MODEL,
1050
+ "inference_ms": llm_ms,
1051
+ }
1052
+
1053
+
1054
+ @app.post("/process-audio")
1055
+ async def process_audio(
1056
+ audio: UploadFile = File(...),
1057
+ mode: str = Form("chat"),
1058
+ ):
1059
+ """Full pipeline: Audio -> STT -> LLM -> TTS -> Audio"""
1060
+ touch_activity()
1061
+
1062
+ overall_start = time.time()
1063
+
1064
+ try:
1065
+ audio_data = await audio.read()
1066
+
1067
+ # 1. STT (retorna dict com métricas)
1068
+ stt_result = transcribe_audio(audio_data)
1069
+ transcript = stt_result["transcript"]
1070
+ stt_ms = stt_result["stt_ms"]
1071
+
1072
+ # 2. LLM
1073
+ response, llm_ms = generate_response(transcript, mode)
1074
+
1075
+ # 3. TTS
1076
+ audio_bytes, tts_ms = synthesize_audio(response)
1077
+
1078
+ total_ms = int((time.time() - overall_start) * 1000)
1079
+
1080
+ return JSONResponse({
1081
+ "transcript": transcript,
1082
+ "response": response,
1083
+ "audio": base64.b64encode(audio_bytes).decode('utf-8'),
1084
+ "timing": {
1085
+ "stt_ms": stt_ms,
1086
+ "llm_ms": llm_ms,
1087
+ "tts_ms": tts_ms,
1088
+ "total_ms": total_ms,
1089
+ },
1090
+ "model": LLM_MODEL,
1091
+ "speech_metrics": stt_result["speech_metrics"],
1092
+ })
1093
+
1094
+ except Exception as e:
1095
+ print(f"[ERROR] {e}")
1096
+ import traceback
1097
+ traceback.print_exc()
1098
+ raise HTTPException(status_code=500, detail=str(e))
1099
+
1100
+
1101
+ @app.post("/keep-alive")
1102
+ def keep_alive():
1103
+ """Reset idle timer without doing inference"""
1104
+ touch_activity()
1105
+ return {"status": "ok", "message": "Timer reset"}
1106
+
1107
+
1108
+ @app.post("/clear")
1109
+ def clear_history():
1110
+ """Clear conversation history and reset CEFR"""
1111
+ global conversation_history, current_cefr_level, user_message_count, user_message_buffer
1112
+
1113
+ conversation_history = []
1114
+ current_cefr_level = "B1" # Reset to default
1115
+ user_message_count = 0
1116
+ user_message_buffer = []
1117
+
1118
+ touch_activity()
1119
+ return {"status": "cleared", "cefr_reset": True, "cefr_level": current_cefr_level}
1120
+
1121
+
1122
+ # ============================================================================
1123
+ # FRONTEND-COMPATIBLE API ENDPOINTS
1124
+ # ============================================================================
1125
+
1126
+ @app.post("/api/audio")
1127
+ async def api_audio(request: AudioRequest):
1128
+ """
1129
+ Frontend-compatible audio endpoint.
1130
+ Accepts JSON with base64 audio, returns response in expected format.
1131
+ Includes speech metrics and suggested speed for adaptive avatar.
1132
+ """
1133
+ touch_activity()
1134
+ overall_start = time.time()
1135
+
1136
+ try:
1137
+ # Decode base64 audio
1138
+ audio_data = base64.b64decode(request.audio)
1139
+ print(f"[API] Received audio: {len(audio_data)} bytes, mode: {request.mode}, lang: {request.language}, student: {request.student_name}")
1140
+
1141
+ # 1. STT - Forçar idioma para evitar confusão (retorna dict com métricas)
1142
+ stt_result = transcribe_audio(audio_data, language=request.language)
1143
+ transcript = stt_result["transcript"]
1144
+ stt_ms = stt_result["stt_ms"]
1145
+ speech_metrics = stt_result["speech_metrics"]
1146
+
1147
+ # 2. Calcular velocidade sugerida baseada em CEFR + WPM do aluno
1148
+ student_wpm = speech_metrics.get("wpm", 0)
1149
+ suggested_speed = calculate_suggested_speed(
1150
+ current_cefr_level,
1151
+ student_wpm,
1152
+ manual_speed=request.speed_rate # None se não definido manualmente
1153
+ )
1154
+
1155
+ # 3. LLM - Passar parâmetros customizados se fornecidos pelo frontend
1156
+ response_text, llm_ms = generate_response(
1157
+ transcript,
1158
+ request.mode,
1159
+ student_name=request.student_name,
1160
+ custom_system_prompt=request.system_prompt,
1161
+ custom_max_tokens=request.max_tokens,
1162
+ custom_temperature=request.temperature
1163
+ )
1164
+
1165
+ # 4. TTS
1166
+ audio_bytes, tts_ms = synthesize_audio(response_text)
1167
+ audio_duration = len(audio_bytes) / (24000 * 2) # Approximate duration
1168
+
1169
+ total_ms = int((time.time() - overall_start) * 1000)
1170
+
1171
+ # Return in frontend-expected format with speech metrics and suggested speed
1172
+ return JSONResponse({
1173
+ "transcription": {
1174
+ "text": transcript,
1175
+ "language": request.language,
1176
+ "confidence": 1.0,
1177
+ },
1178
+ "response": {
1179
+ "text": response_text,
1180
+ "emotion": "neutral",
1181
+ "language": request.language,
1182
+ },
1183
+ "speech": {
1184
+ "audio": base64.b64encode(audio_bytes).decode('utf-8'),
1185
+ "visemes": [], # Visemes not implemented yet
1186
+ "duration": audio_duration,
1187
+ "sample_rate": 24000,
1188
+ "format": "wav",
1189
+ },
1190
+ "timing": {
1191
+ "stt_ms": stt_ms,
1192
+ "llm_ms": llm_ms,
1193
+ "tts_ms": tts_ms,
1194
+ "total_ms": total_ms,
1195
+ },
1196
+ "cefr": {
1197
+ "current_level": current_cefr_level,
1198
+ "messages_until_classify": CEFR_CLASSIFY_EVERY - user_message_count,
1199
+ },
1200
+ # Novas métricas para velocidade adaptativa
1201
+ "speech_metrics": speech_metrics,
1202
+ "adaptive_speed": {
1203
+ "suggested_speed": suggested_speed,
1204
+ "student_wpm": student_wpm,
1205
+ "speed_mode": "manual" if request.speed_rate else "auto",
1206
+ },
1207
+ })
1208
+
1209
+ except Exception as e:
1210
+ print(f"[API ERROR] {e}")
1211
+ import traceback
1212
+ traceback.print_exc()
1213
+ raise HTTPException(status_code=500, detail=str(e))
1214
+
1215
+
1216
+ @app.post("/api/text")
1217
+ async def api_text(request: TextRequest):
1218
+ """
1219
+ Frontend-compatible text endpoint.
1220
+ Accepts JSON with text, returns LLM response with TTS audio.
1221
+ """
1222
+ touch_activity()
1223
+ overall_start = time.time()
1224
+
1225
+ try:
1226
+ print(f"[API] Received text: '{request.text[:50]}...', mode: {request.mode}, student: {request.student_name}")
1227
+
1228
+ # 1. LLM - Passar nome do aluno e parâmetros customizados do frontend
1229
+ response_text, llm_ms = generate_response(
1230
+ request.text,
1231
+ request.mode,
1232
+ student_name=request.student_name,
1233
+ custom_system_prompt=request.system_prompt,
1234
+ custom_max_tokens=request.max_tokens,
1235
+ custom_temperature=request.temperature
1236
+ )
1237
+
1238
+ # 2. TTS
1239
+ audio_bytes, tts_ms = synthesize_audio(response_text)
1240
+ audio_duration = len(audio_bytes) / (24000 * 2) # Approximate duration
1241
+
1242
+ total_ms = int((time.time() - overall_start) * 1000)
1243
+
1244
+ # Return in frontend-expected format
1245
+ return JSONResponse({
1246
+ "response": {
1247
+ "text": response_text,
1248
+ "emotion": "neutral",
1249
+ "language": request.language,
1250
+ },
1251
+ "speech": {
1252
+ "audio": base64.b64encode(audio_bytes).decode('utf-8'),
1253
+ "visemes": [], # Visemes not implemented yet
1254
+ "duration": audio_duration,
1255
+ "sample_rate": 24000,
1256
+ "format": "wav",
1257
+ },
1258
+ "timing": {
1259
+ "llm_ms": llm_ms,
1260
+ "tts_ms": tts_ms,
1261
+ "total_ms": total_ms,
1262
+ },
1263
+ "cefr": {
1264
+ "current_level": current_cefr_level,
1265
+ "messages_until_classify": CEFR_CLASSIFY_EVERY - user_message_count,
1266
+ },
1267
+ })
1268
+
1269
+ except Exception as e:
1270
+ print(f"[API ERROR] {e}")
1271
+ import traceback
1272
+ traceback.print_exc()
1273
+ raise HTTPException(status_code=500, detail=str(e))
1274
+
1275
+
1276
+ @app.post("/api/reset")
1277
+ async def api_reset():
1278
+ """Reset conversation history and CEFR - frontend compatible"""
1279
+ global conversation_history, current_cefr_level, user_message_count, user_message_buffer
1280
+
1281
+ conversation_history = []
1282
+ current_cefr_level = "B1"
1283
+ user_message_count = 0
1284
+ user_message_buffer = []
1285
+
1286
+ touch_activity()
1287
+ return {"status": "ok", "cefr_level": current_cefr_level}
1288
+
1289
+
1290
+ # ============================================================================
1291
+ # WEBSOCKET ENDPOINT - Streaming Audio
1292
+ # ============================================================================
1293
+
1294
+ @app.websocket("/ws/stream")
1295
+ async def websocket_stream(websocket: WebSocket):
1296
+ """
1297
+ WebSocket para streaming de áudio bidirecional.
1298
+
1299
+ Protocolo:
1300
+ 1. Cliente envia áudio (binary) ou JSON com config
1301
+ 2. Servidor envia chunks de áudio de resposta (binary)
1302
+ 3. Servidor envia métricas no final (JSON)
1303
+
1304
+ Exemplo JavaScript:
1305
+ ```javascript
1306
+ const ws = new WebSocket('ws://host:8000/ws/stream');
1307
+
1308
+ ws.onmessage = (event) => {
1309
+ if (event.data instanceof Blob) {
1310
+ // Chunk de áudio WAV - tocar imediatamente
1311
+ playAudioChunk(event.data);
1312
+ } else {
1313
+ // JSON com métricas ou status
1314
+ const data = JSON.parse(event.data);
1315
+ console.log('Metrics:', data);
1316
+ }
1317
+ };
1318
+
1319
+ // Enviar áudio gravado
1320
+ ws.send(audioBlob);
1321
+ ```
1322
+ """
1323
+ await websocket.accept()
1324
+ print("[WS] Client connected")
1325
+
1326
+ try:
1327
+ while True:
1328
+ touch_activity()
1329
+
1330
+ # Receber dados do cliente
1331
+ data = await websocket.receive()
1332
+
1333
+ if "bytes" in data:
1334
+ # Áudio binary
1335
+ audio_data = data["bytes"]
1336
+ overall_start = time.time()
1337
+
1338
+ # Enviar status de processamento
1339
+ await websocket.send_json({"status": "processing", "stage": "stt"})
1340
+
1341
+ # 1. STT
1342
+ transcript, stt_ms = transcribe_audio(audio_data)
1343
+ await websocket.send_json({
1344
+ "status": "processing",
1345
+ "stage": "llm",
1346
+ "transcript": transcript,
1347
+ "stt_ms": stt_ms
1348
+ })
1349
+
1350
+ # 2. LLM
1351
+ response, llm_ms = generate_response(transcript)
1352
+ await websocket.send_json({
1353
+ "status": "processing",
1354
+ "stage": "tts",
1355
+ "response": response,
1356
+ "llm_ms": llm_ms
1357
+ })
1358
+
1359
+ # 3. TTS - enviar áudio
1360
+ tts_start = time.time()
1361
+
1362
+ import numpy as np
1363
+ import soundfile as sf
1364
+
1365
+ audio_chunks = []
1366
+ chunk_count = 0
1367
+
1368
+ # Remove emojis antes do TTS
1369
+ clean_response = remove_emojis(response)
1370
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_response, voice='af_bella'):
1371
+ if audio_chunk is not None and len(audio_chunk) > 0:
1372
+ audio_chunks.append(audio_chunk)
1373
+ chunk_count += 1
1374
+
1375
+ # Enviar cada chunk como WAV
1376
+ buffer = io.BytesIO()
1377
+ sf.write(buffer, audio_chunk, 24000, format='WAV')
1378
+ await websocket.send_bytes(buffer.getvalue())
1379
+
1380
+ tts_ms = int((time.time() - tts_start) * 1000)
1381
+ total_ms = int((time.time() - overall_start) * 1000)
1382
+
1383
+ # Enviar métricas finais
1384
+ await websocket.send_json({
1385
+ "status": "complete",
1386
+ "transcript": transcript,
1387
+ "response": response,
1388
+ "timing": {
1389
+ "stt_ms": stt_ms,
1390
+ "llm_ms": llm_ms,
1391
+ "tts_ms": tts_ms,
1392
+ "total_ms": total_ms,
1393
+ },
1394
+ "chunks_sent": chunk_count,
1395
+ "model": LLM_MODEL,
1396
+ })
1397
+
1398
+ print(f"[WS] Complete: STT={stt_ms}ms, LLM={llm_ms}ms, TTS={tts_ms}ms, Total={total_ms}ms")
1399
+
1400
+ elif "text" in data:
1401
+ # JSON text (config ou texto para TTS)
1402
+ try:
1403
+ msg = json.loads(data["text"])
1404
+
1405
+ if msg.get("type") == "ping":
1406
+ await websocket.send_json({"type": "pong"})
1407
+
1408
+ elif msg.get("type") == "text":
1409
+ # Chat de texto com TTS
1410
+ text = msg.get("message", "")
1411
+ mode = msg.get("mode", "chat")
1412
+
1413
+ overall_start = time.time()
1414
+
1415
+ # LLM
1416
+ response, llm_ms = generate_response(text, mode)
1417
+
1418
+ # TTS streaming
1419
+ tts_start = time.time()
1420
+ chunk_count = 0
1421
+
1422
+ import numpy as np
1423
+ import soundfile as sf
1424
+
1425
+ # Remove emojis antes do TTS
1426
+ clean_response = remove_emojis(response)
1427
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_response, voice='af_bella'):
1428
+ if audio_chunk is not None and len(audio_chunk) > 0:
1429
+ chunk_count += 1
1430
+ buffer = io.BytesIO()
1431
+ sf.write(buffer, audio_chunk, 24000, format='WAV')
1432
+ await websocket.send_bytes(buffer.getvalue())
1433
+
1434
+ tts_ms = int((time.time() - tts_start) * 1000)
1435
+ total_ms = int((time.time() - overall_start) * 1000)
1436
+
1437
+ await websocket.send_json({
1438
+ "status": "complete",
1439
+ "response": response,
1440
+ "timing": {
1441
+ "llm_ms": llm_ms,
1442
+ "tts_ms": tts_ms,
1443
+ "total_ms": total_ms,
1444
+ },
1445
+ "chunks_sent": chunk_count,
1446
+ })
1447
+
1448
+ elif msg.get("type") == "tts":
1449
+ # TTS apenas (sem LLM)
1450
+ text = msg.get("text", "")
1451
+
1452
+ tts_start = time.time()
1453
+ chunk_count = 0
1454
+
1455
+ import numpy as np
1456
+ import soundfile as sf
1457
+
1458
+ # Remove emojis antes do TTS
1459
+ clean_text = remove_emojis(text)
1460
+ for gs, ps, audio_chunk in kokoro_pipeline(clean_text, voice='af_bella'):
1461
+ if audio_chunk is not None and len(audio_chunk) > 0:
1462
+ chunk_count += 1
1463
+ buffer = io.BytesIO()
1464
+ sf.write(buffer, audio_chunk, 24000, format='WAV')
1465
+ await websocket.send_bytes(buffer.getvalue())
1466
+
1467
+ tts_ms = int((time.time() - tts_start) * 1000)
1468
+
1469
+ await websocket.send_json({
1470
+ "status": "complete",
1471
+ "timing": {"tts_ms": tts_ms},
1472
+ "chunks_sent": chunk_count,
1473
+ })
1474
+
1475
+ except json.JSONDecodeError:
1476
+ await websocket.send_json({"error": "Invalid JSON"})
1477
+
1478
+ except WebSocketDisconnect:
1479
+ print("[WS] Client disconnected")
1480
+ except Exception as e:
1481
+ print(f"[WS] Error: {e}")
1482
+ import traceback
1483
+ traceback.print_exc()
1484
+ try:
1485
+ await websocket.send_json({"error": str(e)})
1486
+ except:
1487
+ pass
1488
+ finally:
1489
+ try:
1490
+ await websocket.close()
1491
+ except:
1492
+ pass
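A Python counterpart to the JavaScript example in the docstring above, as a rough sketch. It assumes the `websockets` package is installed, the server is reachable at `ws://localhost:8000/ws/stream`, and a recording exists at `recording.wav`; per the protocol described above, binary frames carry WAV chunks and text frames carry the JSON status/metrics messages.

```python
# Hypothetical streaming client (server URL and recording.wav are assumptions).
import asyncio
import json
import websockets

async def talk():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        with open("recording.wav", "rb") as f:
            await ws.send(f.read())  # binary frame -> STT -> LLM -> TTS on the server

        chunks = []
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                chunks.append(msg)  # one WAV chunk of the reply; play or buffer it here
            else:
                data = json.loads(msg)  # status updates, then the final metrics message
                print(data)
                if data.get("status") == "complete":
                    break

asyncio.run(talk())
```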
1493
+
1494
+
1495
+ if __name__ == "__main__":
1496
+ import uvicorn
1497
+ uvicorn.run(app, host="0.0.0.0", port=8000)
cefr/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # CEFR Classifier Model
2
+ # HuggingFace: marcosremar2/cefr-classifier-pt-mdeberta-v3-enem
checkpoint.sh ADDED
@@ -0,0 +1,126 @@
1
+ #!/bin/bash
2
+ # Create checkpoint of PARLE backend with all models loaded
3
+ # Requires: patched CRIU (criu-patched), io_uring disabled
4
+ #
5
+ # Usage: ./checkpoint.sh [--stop]
6
+ # --stop: Stop the process after checkpoint (default: keep running)
7
+
8
+ set -e
9
+
10
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
11
+ CHECKPOINT_NAME="parle-$(date +%Y%m%d-%H%M%S)"
12
+ CHECKPOINT_PATH="$CHECKPOINT_DIR/$CHECKPOINT_NAME"
13
+ LATEST_LINK="$CHECKPOINT_DIR/latest"
14
+ LEAVE_RUNNING="--leave-running"
15
+
16
+ # Parse arguments
17
+ if [ "$1" = "--stop" ]; then
18
+ LEAVE_RUNNING=""
19
+ echo "Will STOP process after checkpoint"
20
+ fi
21
+
22
+ echo "=============================================="
23
+ echo "PARLE Backend Checkpoint"
24
+ echo "=============================================="
25
+
26
+ # Find Python process
27
+ PYTHON_PID=$(pgrep -f "python.*app.py" | head -1)
28
+ if [ -z "$PYTHON_PID" ]; then
29
+ echo "ERROR: No Python backend process found"
30
+ echo "Start the backend first with: ./start.sh"
31
+ exit 1
32
+ fi
33
+ echo "Found backend process: PID $PYTHON_PID"
34
+
35
+ # Check health
36
+ echo ""
37
+ echo "[1/3] Checking backend health..."
38
+ HEALTH=$(curl -s --max-time 5 localhost:8000/health 2>/dev/null)
39
+ if [ -z "$HEALTH" ]; then
40
+ echo "ERROR: Backend not responding to health check"
41
+ exit 1
42
+ fi
43
+
44
+ VLLM=$(echo "$HEALTH" | grep -o '"vllm_loaded":true' || true)
45
+ WHISPER=$(echo "$HEALTH" | grep -o '"whisper_loaded":true' || true)
46
+ KOKORO=$(echo "$HEALTH" | grep -o '"kokoro_loaded":true' || true)
47
+
48
+ if [ -z "$VLLM" ] || [ -z "$WHISPER" ] || [ -z "$KOKORO" ]; then
49
+ echo "ERROR: Not all models are loaded yet"
50
+ echo "Wait for all models to load before checkpointing"
51
+ echo "Health: $HEALTH"
52
+ exit 1
53
+ fi
54
+ echo "All models loaded!"
55
+
56
+ # Check io_uring is disabled
57
+ echo ""
58
+ echo "[2/3] Checking system configuration..."
59
+ IO_URING=$(cat /proc/sys/kernel/io_uring_disabled 2>/dev/null || echo "unknown")
60
+ if [ "$IO_URING" != "2" ]; then
61
+ echo "WARNING: io_uring not disabled (value: $IO_URING)"
62
+ echo "Run: sudo sysctl -w kernel.io_uring_disabled=2"
63
+ echo "Continuing anyway..."
64
+ fi
65
+
66
+ # Check CRIU
67
+ if [ ! -f /usr/local/bin/criu-patched ]; then
68
+ echo "ERROR: Patched CRIU not found at /usr/local/bin/criu-patched"
69
+ echo "Run setup-criu-patched.sh first"
70
+ exit 1
71
+ fi
72
+ echo "Patched CRIU found"
73
+
74
+ # Create checkpoint directory
75
+ mkdir -p "$CHECKPOINT_PATH"
76
+
77
+ echo ""
78
+ echo "[3/3] Creating checkpoint..."
79
+ echo "Path: $CHECKPOINT_PATH"
80
+ echo "This may take 30-60 seconds..."
81
+ echo ""
82
+
83
+ START_TIME=$(date +%s)
84
+
85
+ # Run CRIU checkpoint
86
+ CRIU_PLUGINS_DIR=/usr/lib/criu /usr/local/bin/criu-patched dump \
87
+ -t $PYTHON_PID \
88
+ -D "$CHECKPOINT_PATH" \
89
+ --shell-job \
90
+ --tcp-established \
91
+ --file-locks \
92
+ --ext-unix-sk \
93
+ $LEAVE_RUNNING \
94
+ -v2 \
95
+ -o "$CHECKPOINT_PATH/dump.log" 2>&1 || {
96
+ echo ""
97
+ echo "ERROR: CRIU dump failed"
98
+ echo "Check log: $CHECKPOINT_PATH/dump.log"
99
+ tail -20 "$CHECKPOINT_PATH/dump.log"
100
+ exit 1
101
+ }
102
+
103
+ END_TIME=$(date +%s)
104
+ DURATION=$((END_TIME - START_TIME))
105
+
106
+ # Update latest symlink
107
+ rm -f "$LATEST_LINK"
108
+ ln -s "$CHECKPOINT_PATH" "$LATEST_LINK"
109
+
110
+ # Get checkpoint size
111
+ SIZE=$(du -sh "$CHECKPOINT_PATH" | cut -f1)
112
+
113
+ echo ""
114
+ echo "=============================================="
115
+ echo "Checkpoint created successfully!"
116
+ echo "=============================================="
117
+ echo "Path: $CHECKPOINT_PATH"
118
+ echo "Size: $SIZE"
119
+ echo "Time: ${DURATION}s"
120
+ echo "Symlink: $LATEST_LINK"
121
+ echo ""
122
+ if [ -z "$LEAVE_RUNNING" ]; then
123
+ echo "Process was STOPPED. To restore: ./restore.sh"
124
+ else
125
+ echo "Process is still running. To restore later: ./restore.sh"
126
+ fi
deploy-criu.sh ADDED
@@ -0,0 +1,69 @@
1
+ #!/bin/bash
2
+ # Deploy CRIU + cuda-checkpoint to TensorDock
3
+ # This script copies the necessary files to the server and sets up CRIU
4
+
5
+ set -e
6
+
7
+ # Configuration
8
+ SERVER="8.17.147.158"
9
+ SSH_PORT="10038"
10
+ SSH_USER="root"
11
+ REMOTE_DIR="/home/user/parle-backend"
12
+
13
+ echo "=================================================="
14
+ echo "Deploying CRIU + cuda-checkpoint to TensorDock"
15
+ echo "=================================================="
16
+ echo "Server: $SERVER:$SSH_PORT"
17
+ echo ""
18
+
19
+ # Check if we can connect
20
+ echo "[1/4] Testing SSH connection..."
21
+ ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 -p $SSH_PORT $SSH_USER@$SERVER "echo 'SSH connection OK'" || {
22
+ echo "ERROR: Cannot connect to server via SSH"
23
+ echo ""
24
+ echo "Manual steps:"
25
+ echo "1. SSH into the server: ssh -p $SSH_PORT $SSH_USER@$SERVER"
26
+ echo "2. Copy these scripts to /home/user/parle-backend/"
27
+ echo "3. Run: sudo ./setup-criu.sh"
28
+ echo "4. Test: ./start-smart.sh"
29
+ exit 1
30
+ }
31
+
32
+ # Copy scripts
33
+ echo ""
34
+ echo "[2/4] Copying scripts to server..."
35
+ SCRIPT_DIR="$(dirname "$0")"
36
+ scp -P $SSH_PORT \
37
+ "$SCRIPT_DIR/setup-criu.sh" \
38
+ "$SCRIPT_DIR/checkpoint.sh" \
39
+ "$SCRIPT_DIR/restore.sh" \
40
+ "$SCRIPT_DIR/start-smart.sh" \
41
+ "$SCRIPT_DIR/start.sh" \
42
+ "$SCRIPT_DIR/app.py" \
43
+ $SSH_USER@$SERVER:$REMOTE_DIR/
44
+
45
+ echo "Scripts copied successfully"
46
+
47
+ # Run setup
48
+ echo ""
49
+ echo "[3/4] Running CRIU setup on server..."
50
+ ssh -p $SSH_PORT $SSH_USER@$SERVER "cd $REMOTE_DIR && chmod +x *.sh && sudo ./setup-criu.sh"
51
+
52
+ # Test
53
+ echo ""
54
+ echo "[4/4] Testing installation..."
55
+ ssh -p $SSH_PORT $SSH_USER@$SERVER "cuda-checkpoint --help > /dev/null && echo 'cuda-checkpoint: OK' || echo 'cuda-checkpoint: FAILED'"
56
+ ssh -p $SSH_PORT $SSH_USER@$SERVER "criu --version"
57
+
58
+ echo ""
59
+ echo "=================================================="
60
+ echo "Deployment complete!"
61
+ echo "=================================================="
62
+ echo ""
63
+ echo "Next steps:"
64
+ echo "1. SSH into server: ssh -p $SSH_PORT $SSH_USER@$SERVER"
65
+ echo "2. Start backend: cd $REMOTE_DIR && ./start.sh"
66
+ echo "3. Wait for models to load (~2 min)"
67
+ echo "4. Create checkpoint: ./checkpoint.sh"
68
+ echo "5. Test restore: ./restore.sh"
69
+ echo ""
fast_startup.py ADDED
@@ -0,0 +1,409 @@
1
+ """
2
+ Fast Startup Module - Otimizacoes para Cold Start Rapido
3
+
4
+ Estrategias implementadas:
5
+ 1. fastsafetensors - Loading 4.8x-7.5x mais rapido
6
+ 2. CUDA Graph caching - Economiza ~54s
7
+ 3. Parallel model loading - Carrega modelos simultaneamente
8
+ 4. Lazy loading - CEFR classifier carrega depois
9
+ 5. Pre-download models - Cache local no SSD
10
+
11
+ Target: Cold start de ~487s para ~60s
12
+ """
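To make the intended flow concrete, a usage sketch (hypothetical wiring, e.g. from app.py during startup; `FastModelLoader` and its methods are defined below in this module):

```python
# Sketch: reach "ready to answer" as fast as possible; the CEFR classifier loads in background.
from fast_startup import FastModelLoader

loader = FastModelLoader(on_progress=lambda msg, pct: print(f"[{pct:.0f}%] {msg}"))
metrics = loader.load_essential_only()  # vLLM first, then Whisper + Kokoro in parallel

print(f"Essential models ready in {metrics.total_ms / 1000:.1f}s")
print(f"CEFR ready yet? {loader.is_fully_ready()}")  # flips to True once the background load ends
```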
13
+
14
+ import os
15
+ import sys
16
+ import time
17
+ import asyncio
18
+ import threading
19
+ from concurrent.futures import ThreadPoolExecutor
20
+ from typing import Optional, Callable
21
+ from dataclasses import dataclass
22
+
23
+ # Environment variables para otimizacao
24
+ os.environ["USE_FASTSAFETENSOR"] = "true" # Enable fastsafetensors
25
+ os.environ["VLLM_USE_MODELSCOPE"] = "false"
26
+ os.environ["TOKENIZERS_PARALLELISM"] = "false" # Avoid warnings
27
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" # Better memory
28
+
29
+ # Cache directories
30
+ CACHE_DIR = "/var/cache/parle-models"
31
+ VLLM_CACHE_DIR = f"{CACHE_DIR}/vllm"
32
+ HF_CACHE_DIR = f"{CACHE_DIR}/huggingface"
33
+
34
+ os.environ["HF_HOME"] = HF_CACHE_DIR
35
+ os.environ["VLLM_CACHE_DIR"] = VLLM_CACHE_DIR
36
+
37
+
38
+ @dataclass
39
+ class LoadingMetrics:
40
+ """Metricas de carregamento"""
41
+ vllm_ms: int = 0
42
+ whisper_ms: int = 0
43
+ cefr_ms: int = 0
44
+ kokoro_ms: int = 0
45
+ total_ms: int = 0
46
+ parallel: bool = False
47
+
48
+
49
+ class FastModelLoader:
50
+ """
51
+ Carregador otimizado de modelos com:
52
+ - Parallel loading
53
+ - Progress callbacks
54
+ - Lazy loading para modelos secundarios
55
+ """
56
+
57
+ def __init__(
58
+ self,
59
+ vllm_model: str = "RedHatAI/gemma-3-4b-it-quantized.w4a16",
60
+ whisper_model: str = "openai/whisper-small",
61
+ cefr_model: str = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem",
62
+ gpu_memory_utilization: float = 0.40,
63
+ on_progress: Optional[Callable[[str, float], None]] = None,
64
+ ):
65
+ self.vllm_model = vllm_model
66
+ self.whisper_model = whisper_model
67
+ self.cefr_model = cefr_model
68
+ self.gpu_memory_utilization = gpu_memory_utilization
69
+ self.on_progress = on_progress
70
+
71
+ # Model instances
72
+ self.vllm_engine = None
73
+ self.whisper_model_instance = None
74
+ self.whisper_processor = None
75
+ self.cefr_model_instance = None
76
+ self.cefr_tokenizer = None
77
+ self.kokoro_pipeline = None
78
+
79
+ # Loading state
80
+ self.metrics = LoadingMetrics()
81
+ self._loading_lock = threading.Lock()
82
+
83
+ def _progress(self, message: str, percentage: float):
84
+ """Report progress"""
85
+ print(f"[{percentage:.0f}%] {message}")
86
+ if self.on_progress:
87
+ self.on_progress(message, percentage)
88
+
89
+ def _ensure_cache_dirs(self):
90
+ """Criar diretorios de cache"""
91
+ os.makedirs(CACHE_DIR, exist_ok=True)
92
+ os.makedirs(VLLM_CACHE_DIR, exist_ok=True)
93
+ os.makedirs(HF_CACHE_DIR, exist_ok=True)
94
+
95
+ def load_vllm(self) -> int:
96
+ """
97
+ Carrega vLLM com otimizacoes:
98
+ - fastsafetensors (se disponivel)
99
+ - load_format="auto" (detecta melhor formato)
100
+ - CUDA graph caching
101
+ """
102
+ start = time.time()
103
+ self._progress("Loading vLLM (optimized)...", 10)
104
+
105
+ from vllm import LLM
106
+
107
+ # Check if fastsafetensors is available
108
+ try:
109
+ import fastsafetensors
110
+ load_format = "fastsafetensors"
111
+ self._progress("Using fastsafetensors (4-7x faster)", 12)
112
+ except ImportError:
113
+ load_format = "auto"
114
+ self._progress("fastsafetensors not found, using auto", 12)
115
+
116
+ self.vllm_engine = LLM(
117
+ model=self.vllm_model,
118
+ dtype="auto",
119
+ gpu_memory_utilization=self.gpu_memory_utilization,
120
+ max_model_len=2048,
121
+ trust_remote_code=True,
122
+ # Otimizacoes de loading
123
+ load_format=load_format,
124
+ # CUDA graph optimization
125
+ enforce_eager=False, # Enable CUDA graphs
126
+ # Disable unnecessary features for faster startup
127
+ enable_prefix_caching=False,
128
+ disable_custom_all_reduce=True,
129
+ )
130
+
131
+ elapsed_ms = int((time.time() - start) * 1000)
132
+ self.metrics.vllm_ms = elapsed_ms
133
+ self._progress(f"vLLM loaded in {elapsed_ms/1000:.1f}s", 40)
134
+
135
+ return elapsed_ms
136
+
137
+ def load_whisper(self) -> int:
138
+ """Carrega Whisper STT"""
139
+ start = time.time()
140
+ self._progress("Loading Whisper STT...", 45)
141
+
142
+ import torch
143
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
144
+
145
+ self.whisper_processor = AutoProcessor.from_pretrained(
146
+ self.whisper_model,
147
+ cache_dir=HF_CACHE_DIR,
148
+ )
149
+
150
+ self.whisper_model_instance = AutoModelForSpeechSeq2Seq.from_pretrained(
151
+ self.whisper_model,
152
+ torch_dtype=torch.float16,
153
+ low_cpu_mem_usage=True,
154
+ cache_dir=HF_CACHE_DIR,
155
+ ).to("cuda")
156
+
157
+ elapsed_ms = int((time.time() - start) * 1000)
158
+ self.metrics.whisper_ms = elapsed_ms
159
+ self._progress(f"Whisper loaded in {elapsed_ms/1000:.1f}s", 60)
160
+
161
+ return elapsed_ms
162
+
163
+ def load_kokoro(self) -> int:
164
+ """Carrega Kokoro TTS"""
165
+ start = time.time()
166
+ self._progress("Loading Kokoro TTS...", 65)
167
+
168
+ from kokoro import KPipeline
169
+
170
+ self.kokoro_pipeline = KPipeline(lang_code='p', device='cuda')
171
+
172
+ elapsed_ms = int((time.time() - start) * 1000)
173
+ self.metrics.kokoro_ms = elapsed_ms
174
+ self._progress(f"Kokoro loaded in {elapsed_ms/1000:.1f}s", 80)
175
+
176
+ return elapsed_ms
177
+
178
+ def load_cefr(self) -> int:
179
+ """Carrega CEFR Classifier (pode ser lazy)"""
180
+ start = time.time()
181
+ self._progress("Loading CEFR Classifier...", 85)
182
+
183
+ import torch
184
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
185
+
186
+ self.cefr_tokenizer = AutoTokenizer.from_pretrained(
187
+ self.cefr_model,
188
+ cache_dir=HF_CACHE_DIR,
189
+ )
190
+
191
+ self.cefr_model_instance = AutoModelForSequenceClassification.from_pretrained(
192
+ self.cefr_model,
193
+ torch_dtype=torch.float16,
194
+ low_cpu_mem_usage=True,
195
+ cache_dir=HF_CACHE_DIR,
196
+ ).to("cuda")
197
+ self.cefr_model_instance.eval()
198
+
199
+ elapsed_ms = int((time.time() - start) * 1000)
200
+ self.metrics.cefr_ms = elapsed_ms
201
+ self._progress(f"CEFR loaded in {elapsed_ms/1000:.1f}s", 95)
202
+
203
+ return elapsed_ms
204
+
205
+ def load_all_sequential(self) -> LoadingMetrics:
206
+ """Carrega todos os modelos sequencialmente"""
207
+ overall_start = time.time()
208
+ self._ensure_cache_dirs()
209
+
210
+ self._progress("Starting sequential model loading...", 0)
211
+
212
+ # Order: vLLM first (needs contiguous memory)
213
+ self.load_vllm()
214
+ self.load_whisper()
215
+ self.load_kokoro()
216
+ self.load_cefr()
217
+
218
+ self.metrics.total_ms = int((time.time() - overall_start) * 1000)
219
+ self.metrics.parallel = False
220
+
221
+ self._progress(f"All models loaded in {self.metrics.total_ms/1000:.1f}s", 100)
222
+
223
+ return self.metrics
224
+
225
+ def load_all_parallel(self) -> LoadingMetrics:
226
+ """
227
+ Carrega modelos em paralelo onde possivel.
228
+
229
+ Ordem otimizada:
230
+ 1. vLLM primeiro (precisa de memoria contigua)
231
+ 2. Whisper + Kokoro em paralelo
232
+ 3. CEFR lazy (carrega em background depois)
233
+ """
234
+ overall_start = time.time()
235
+ self._ensure_cache_dirs()
236
+
237
+ self._progress("Starting optimized parallel loading...", 0)
238
+
239
+ # Step 1: vLLM first (needs contiguous GPU memory)
240
+ self.load_vllm()
241
+
242
+ # Step 2: Whisper + Kokoro in parallel
243
+ self._progress("Loading Whisper + Kokoro in parallel...", 45)
244
+
245
+ with ThreadPoolExecutor(max_workers=2) as executor:
246
+ whisper_future = executor.submit(self.load_whisper)
247
+ kokoro_future = executor.submit(self.load_kokoro)
248
+
249
+ whisper_future.result()
250
+ kokoro_future.result()
251
+
252
+ # Step 3: CEFR (can be lazy loaded later)
253
+ self.load_cefr()
254
+
255
+ self.metrics.total_ms = int((time.time() - overall_start) * 1000)
256
+ self.metrics.parallel = True
257
+
258
+ self._progress(f"All models loaded in {self.metrics.total_ms/1000:.1f}s (parallel)", 100)
259
+
260
+ return self.metrics
261
+
262
+ def load_essential_only(self) -> LoadingMetrics:
263
+ """
264
+ Carrega apenas modelos essenciais para responder rapidamente.
265
+ CEFR eh carregado em background.
266
+
267
+ Tempo estimado: ~50-70% do tempo total
268
+ """
269
+ overall_start = time.time()
270
+ self._ensure_cache_dirs()
271
+
272
+ self._progress("Loading essential models only...", 0)
273
+
274
+ # Essential models
275
+ self.load_vllm()
276
+
277
+ with ThreadPoolExecutor(max_workers=2) as executor:
278
+ whisper_future = executor.submit(self.load_whisper)
279
+ kokoro_future = executor.submit(self.load_kokoro)
280
+
281
+ whisper_future.result()
282
+ kokoro_future.result()
283
+
284
+ self.metrics.total_ms = int((time.time() - overall_start) * 1000)
285
+ self.metrics.parallel = True
286
+
287
+ self._progress(f"Essential models loaded in {self.metrics.total_ms/1000:.1f}s", 90)
288
+
289
+ # Start CEFR loading in background
290
+ self._progress("Starting CEFR background loading...", 92)
291
+ threading.Thread(target=self._load_cefr_background, daemon=True).start()
292
+
293
+ return self.metrics
294
+
295
+ def _load_cefr_background(self):
296
+ """Carrega CEFR em background"""
297
+ try:
298
+ self.load_cefr()
299
+ print("[BACKGROUND] CEFR classifier loaded!")
300
+ except Exception as e:
301
+ print(f"[BACKGROUND] Failed to load CEFR: {e}")
302
+
303
+ def is_ready(self) -> bool:
304
+ """Verifica se modelos essenciais estao prontos"""
305
+ return (
306
+ self.vllm_engine is not None and
307
+ self.whisper_model_instance is not None and
308
+ self.kokoro_pipeline is not None
309
+ )
310
+
311
+ def is_fully_ready(self) -> bool:
312
+ """Verifica se todos os modelos estao prontos"""
313
+ return self.is_ready() and self.cefr_model_instance is not None
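As an illustration of how the lazy CEFR load could be surfaced to callers, a small hedged sketch of a readiness payload (for example, for a health handler); this helper is not part of the module:

```python
# Hypothetical helper: report readiness while the CEFR classifier is still loading in background.
def readiness_payload(loader: "FastModelLoader") -> dict:
    return {
        "status": "ok" if loader.is_ready() else "loading",
        "essential_ready": loader.is_ready(),    # vLLM + Whisper + Kokoro
        "cefr_ready": loader.is_fully_ready(),   # True once the background thread finishes
    }
```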
314
+
315
+
316
+ def predownload_models(
317
+ vllm_model: str = "RedHatAI/gemma-3-4b-it-quantized.w4a16",
318
+ whisper_model: str = "openai/whisper-small",
319
+ cefr_model: str = "marcosremar2/cefr-classifier-pt-mdeberta-v3-enem",
320
+ ):
321
+ """
322
+ Pre-download models to local cache.
323
+ Run this during VM setup, not during cold start.
324
+ """
325
+ print("=" * 60)
326
+ print("Pre-downloading models to local cache...")
327
+ print("=" * 60)
328
+
329
+ os.makedirs(HF_CACHE_DIR, exist_ok=True)
330
+
331
+ from huggingface_hub import snapshot_download
332
+
333
+ models = [
334
+ (vllm_model, "vLLM"),
335
+ (whisper_model, "Whisper"),
336
+ (cefr_model, "CEFR"),
337
+ ]
338
+
339
+ for model_id, name in models:
340
+ print(f"\n[{name}] Downloading {model_id}...")
341
+ start = time.time()
342
+
343
+ try:
344
+ snapshot_download(
345
+ model_id,
346
+ cache_dir=HF_CACHE_DIR,
347
+ local_dir_use_symlinks=False,
348
+ )
349
+ elapsed = time.time() - start
350
+ print(f"[{name}] Downloaded in {elapsed:.1f}s")
351
+ except Exception as e:
352
+ print(f"[{name}] Error: {e}")
353
+
354
+ print("\n" + "=" * 60)
355
+ print("Pre-download complete!")
356
+ print("=" * 60)
357
+
358
+
359
+ def install_fastsafetensors():
360
+ """Instala fastsafetensors para loading 4-7x mais rapido"""
361
+ import subprocess
362
+
363
+ print("Installing fastsafetensors...")
364
+ result = subprocess.run(
365
+ [sys.executable, "-m", "pip", "install", "fastsafetensors"],
366
+ capture_output=True,
367
+ text=True,
368
+ )
369
+
370
+ if result.returncode == 0:
371
+ print("fastsafetensors installed successfully!")
372
+ else:
373
+ print(f"Failed to install fastsafetensors: {result.stderr}")
374
+
375
+
376
+ if __name__ == "__main__":
377
+ import argparse
378
+
379
+ parser = argparse.ArgumentParser(description="Fast Model Loader")
380
+ parser.add_argument("--predownload", action="store_true", help="Pre-download models")
381
+ parser.add_argument("--install-fast", action="store_true", help="Install fastsafetensors")
382
+ parser.add_argument("--test-load", action="store_true", help="Test model loading")
383
+ parser.add_argument("--parallel", action="store_true", help="Use parallel loading")
384
+
385
+ args = parser.parse_args()
386
+
387
+ if args.install_fast:
388
+ install_fastsafetensors()
389
+
390
+ if args.predownload:
391
+ predownload_models()
392
+
393
+ if args.test_load:
394
+ loader = FastModelLoader()
395
+
396
+ if args.parallel:
397
+ metrics = loader.load_all_parallel()
398
+ else:
399
+ metrics = loader.load_all_sequential()
400
+
401
+ print("\n" + "=" * 60)
402
+ print("Loading Metrics:")
403
+ print(f" vLLM: {metrics.vllm_ms/1000:.1f}s")
404
+ print(f" Whisper: {metrics.whisper_ms/1000:.1f}s")
405
+ print(f" Kokoro: {metrics.kokoro_ms/1000:.1f}s")
406
+ print(f" CEFR: {metrics.cefr_ms/1000:.1f}s")
407
+ print(f" TOTAL: {metrics.total_ms/1000:.1f}s")
408
+ print(f" Parallel: {metrics.parallel}")
409
+ print("=" * 60)
llm/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Gemma LLM Model
2
+ # HuggingFace: RedHatAI/gemma-3-4b-it-quantized.w4a16
models/cefr/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # CEFR Classifier Model
2
+ # HuggingFace: marcosremar2/cefr-classifier-pt-mdeberta-v3-enem
models/llm/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Gemma LLM Model
2
+ # HuggingFace: RedHatAI/gemma-3-4b-it-quantized.w4a16
models/stt/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Whisper STT Model
2
+ # HuggingFace: openai/whisper-small
models/tts/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Kokoro TTS Model
2
+ # HuggingFace: hexgrad/Kokoro-82M
requirements.txt ADDED
@@ -0,0 +1,27 @@
1
+ # v5-tensordock-websocket requirements
2
+ # Para RTX 3090 (24GB VRAM)
3
+
4
+ # Web framework
5
+ fastapi>=0.104.0
6
+ uvicorn[standard]>=0.24.0
7
+ pydantic>=2.0.0
8
+ websockets>=12.0
9
+
10
+ # ML/AI
11
+ torch>=2.1.0
12
+ transformers>=4.36.0
13
+ vllm>=0.2.7
14
+
15
+ # Audio processing
16
+ soundfile>=0.12.0
17
+ librosa>=0.10.0
18
+ numpy>=1.24.0
19
+
20
+ # Voice Activity Detection for WPM calculation
21
+ pyannote-audio>=3.1.0
22
+
23
+ # TTS
24
+ kokoro>=0.1.0
25
+
26
+ # HTTP client (for TensorDock API)
27
+ requests>=2.31.0
restore.sh ADDED
@@ -0,0 +1,108 @@
1
+ #!/bin/bash
2
+ # Restore PARLE backend from checkpoint
3
+ # Fast startup path - restores pre-loaded models from checkpoint
4
+ #
5
+ # Requires: patched CRIU (criu-patched), io_uring disabled
6
+ # Usage: ./restore.sh [checkpoint-path]
7
+
8
+ set -e
9
+
10
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
11
+ CHECKPOINT_PATH="${1:-$CHECKPOINT_DIR/latest}"
12
+
13
+ echo "=============================================="
14
+ echo "PARLE Backend Restore"
15
+ echo "=============================================="
16
+
17
+ # Check if checkpoint exists
18
+ if [ ! -d "$CHECKPOINT_PATH" ] && [ ! -L "$CHECKPOINT_PATH" ]; then
19
+ echo "ERROR: Checkpoint not found: $CHECKPOINT_PATH"
20
+ echo ""
21
+ echo "Available checkpoints:"
22
+ ls -la "$CHECKPOINT_DIR" 2>/dev/null || echo " (none)"
23
+ echo ""
24
+ echo "To create a checkpoint:"
25
+ echo " 1. Start normally: ./start.sh"
26
+ echo " 2. Wait for models to load (~45s)"
27
+ echo " 3. Create checkpoint: ./checkpoint.sh"
28
+ exit 1
29
+ fi
30
+
31
+ # Resolve symlink if needed
32
+ if [ -L "$CHECKPOINT_PATH" ]; then
33
+ CHECKPOINT_PATH=$(readlink -f "$CHECKPOINT_PATH")
34
+ fi
35
+
36
+ echo "Checkpoint: $CHECKPOINT_PATH"
37
+ echo "Size: $(du -sh "$CHECKPOINT_PATH" | cut -f1)"
38
+
39
+ # Check if another instance is running
40
+ if pgrep -f "python.*app.py" > /dev/null; then
41
+ echo ""
42
+ echo "WARNING: Backend already running"
43
+ echo "Kill it first: pkill -9 -f 'python.*app.py'"
44
+ exit 1
45
+ fi
46
+
47
+ # Check CRIU
48
+ if [ ! -f /usr/local/bin/criu-patched ]; then
49
+ echo "ERROR: Patched CRIU not found at /usr/local/bin/criu-patched"
50
+ echo "Run setup-criu-patched.sh first"
51
+ exit 1
52
+ fi
53
+
54
+ # Change to the correct directory
55
+ cd /home/user
56
+
57
+ echo ""
58
+ echo "Restoring from checkpoint..."
59
+ START_TIME=$(date +%s)
60
+
61
+ # Restore with patched CRIU (runs in background)
62
+ CRIU_PLUGINS_DIR=/usr/lib/criu /usr/local/bin/criu-patched restore \
63
+ -D "$CHECKPOINT_PATH" \
64
+ --shell-job \
65
+ --tcp-established \
66
+ --file-locks \
67
+ --ext-unix-sk \
68
+ -v0 \
69
+ -o "$CHECKPOINT_PATH/restore.log" 2>/dev/null &
70
+
71
+ RESTORE_PID=$!
72
+
73
+ # Wait for backend to be ready
74
+ echo "Waiting for backend health..."
75
+ for i in {1..60}; do
76
+ HEALTH=$(curl -s --max-time 2 http://localhost:8000/health 2>/dev/null)
77
+ if [ ! -z "$HEALTH" ]; then
78
+ END_TIME=$(date +%s)
79
+ DURATION=$((END_TIME - START_TIME))
80
+
81
+ # Get process info
82
+ PYTHON_PID=$(pgrep -f "python.*app.py" | head -1)
83
+
84
+ echo ""
85
+ echo "=============================================="
86
+ echo "Backend restored successfully!"
87
+ echo "=============================================="
88
+ echo "Restore time: ${DURATION}s"
89
+ echo "Process PID: $PYTHON_PID"
90
+ echo ""
91
+ echo "Health check:"
92
+ echo "$HEALTH" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f' Status: {d[\"status\"]}'); print(f' vLLM: {d[\"vllm_loaded\"]}'); print(f' Whisper: {d[\"whisper_loaded\"]}'); print(f' Kokoro: {d[\"kokoro_loaded\"]}')" 2>/dev/null || echo "$HEALTH"
93
+ echo ""
94
+ echo "Backend ready at http://localhost:8000"
95
+ exit 0
96
+ fi
97
+
98
+ if [ $((i % 10)) -eq 0 ]; then
99
+ echo " Still waiting... ($i/60s)"
100
+ fi
101
+ sleep 1
102
+ done
103
+
104
+ echo ""
105
+ echo "ERROR: Backend did not respond within 60 seconds"
106
+ echo "Check restore log: $CHECKPOINT_PATH/restore.log"
107
+ tail -20 "$CHECKPOINT_PATH/restore.log" 2>/dev/null
108
+ exit 1
setup-criu-patched.sh ADDED
@@ -0,0 +1,57 @@
1
+ #!/bin/bash
2
+ # Setup patched CRIU for PyTorch checkpoint/restore on TensorDock
3
+ # This script compiles CRIU with patches to skip unsupported nvidia device FDs
4
+
5
+ set -e
6
+
7
+ echo "=============================================="
8
+ echo "Setting up patched CRIU for PyTorch C/R"
9
+ echo "=============================================="
10
+
11
+ # Install dependencies
12
+ echo "[1/5] Installing build dependencies..."
13
+ apt-get update
14
+ apt-get install -y build-essential pkg-config libprotobuf-dev libprotobuf-c-dev \
15
+ protobuf-c-compiler protobuf-compiler python3-protobuf libbsd-dev \
16
+ libcap-dev libnl-3-dev libnet1-dev libaio-dev libgnutls28-dev \
17
+ python3-future asciidoc xmlto git
18
+
19
+ # Clone CRIU
20
+ echo "[2/5] Cloning CRIU..."
21
+ cd /tmp
22
+ rm -rf criu-patched
23
+ git clone --depth 1 https://github.com/checkpoint-restore/criu.git criu-patched
24
+ cd criu-patched
25
+
26
+ # Apply patch to files-ext.c (skip unsupported FDs during dump)
27
+ echo "[3/5] Applying dump patch..."
28
+ perl -i -0pe 's/(int dump_unsupp_fd.*?if \(ret == -ENOTSUP\))\s*pr_err\("Can.t dump file.*?\n\s*return -1;/$1 {\n\t\tpr_warn("Skipping file %d of that type [%o] (%s %s)\\n", p->fd, p->stat.st_mode, more, info);\n\t\treturn 0; \/\/ PATCHED: skip unsupported FDs\n\t}\n\treturn -1;/s' criu/files-ext.c
29
+
30
+ # Apply patch to files.c (skip missing FDs during restore)
31
+ echo "[4/5] Applying restore patch..."
32
+ perl -i -0pe 's/(fdesc = find_file_desc\(e\);\s*if \(fdesc == NULL\) \{)\s*pr_err\("No file for fd.*?\n\s*return -1;/$1\n\t\tpr_warn("No file for fd %d id %#x, skipping (PATCHED)\\n", e->fd, e->id);\n\t\treturn 0; \/\/ PATCHED: skip missing FDs/s' criu/files.c
33
+
34
+ # Build
35
+ echo "[5/5] Building patched CRIU..."
36
+ make -j$(nproc)
37
+
38
+ # Install
39
+ cp criu/criu /usr/local/bin/criu-patched
40
+ mkdir -p /usr/lib/criu
41
+ cp plugins/cuda/cuda_plugin.so /usr/lib/criu/
42
+
43
+ # Verify
44
+ echo ""
45
+ echo "=============================================="
46
+ echo "Patched CRIU installed!"
47
+ echo "=============================================="
48
+ /usr/local/bin/criu-patched --version
49
+
50
+ # Setup io_uring disable (persists across reboots)
51
+ echo ""
52
+ echo "Disabling io_uring at kernel level..."
53
+ sysctl -w kernel.io_uring_disabled=2
54
+ echo "kernel.io_uring_disabled=2" >> /etc/sysctl.conf
55
+
56
+ echo ""
57
+ echo "Setup complete! Run checkpoint.sh after models are loaded."
setup-criu.sh ADDED
@@ -0,0 +1,91 @@
1
+ #!/bin/bash
2
+ # CRIU + cuda-checkpoint Setup Script for TensorDock
3
+ # Run this once on a fresh VM to install all dependencies
4
+
5
+ set -e
6
+
7
+ echo "=================================================="
8
+ echo "Setting up CRIU + cuda-checkpoint for fast restore"
9
+ echo "=================================================="
10
+
11
+ # Check if running as root
12
+ if [ "$EUID" -ne 0 ]; then
13
+ echo "Please run as root (sudo ./setup-criu.sh)"
14
+ exit 1
15
+ fi
16
+
17
+ # Check NVIDIA driver version (needs 550+)
18
+ echo ""
19
+ echo "[1/5] Checking NVIDIA driver version..."
20
+ DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
21
+ MAJOR_VERSION=$(echo $DRIVER_VERSION | cut -d'.' -f1)
22
+
23
+ echo "Driver version: $DRIVER_VERSION"
24
+
25
+ if [ "$MAJOR_VERSION" -lt 550 ]; then
26
+ echo "ERROR: NVIDIA driver 550+ required for cuda-checkpoint"
27
+ echo "Current version: $DRIVER_VERSION"
28
+ echo ""
29
+ echo "To upgrade driver:"
30
+ echo " sudo apt-get update"
31
+ echo " sudo apt-get install nvidia-driver-550"
32
+ exit 1
33
+ fi
34
+
35
+ echo "Driver version OK!"
36
+
37
+ # Install CRIU
38
+ echo ""
39
+ echo "[2/5] Installing CRIU..."
40
+ apt-get update
41
+ apt-get install -y criu
42
+
43
+ # Verify CRIU installation
44
+ CRIU_VERSION=$(criu --version | head -1)
45
+ echo "CRIU installed: $CRIU_VERSION"
46
+
47
+ # Clone cuda-checkpoint
48
+ echo ""
49
+ echo "[3/5] Setting up cuda-checkpoint..."
50
+ CUDA_CHECKPOINT_DIR="/opt/cuda-checkpoint"
51
+
52
+ if [ -d "$CUDA_CHECKPOINT_DIR" ]; then
53
+ echo "cuda-checkpoint already exists, updating..."
54
+ cd "$CUDA_CHECKPOINT_DIR"
55
+ git pull
56
+ else
57
+ git clone https://github.com/NVIDIA/cuda-checkpoint.git "$CUDA_CHECKPOINT_DIR"
58
+ fi
59
+
60
+ # Create symlink for easy access
61
+ ln -sf "$CUDA_CHECKPOINT_DIR/bin/cuda-checkpoint" /usr/local/bin/cuda-checkpoint
62
+ chmod +x /usr/local/bin/cuda-checkpoint
63
+
64
+ echo "cuda-checkpoint installed at /usr/local/bin/cuda-checkpoint"
65
+
66
+ # Create checkpoint directory
67
+ echo ""
68
+ echo "[4/5] Creating checkpoint directory..."
69
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
70
+ mkdir -p "$CHECKPOINT_DIR"
71
+ chmod 755 "$CHECKPOINT_DIR"
72
+
73
+ echo "Checkpoint directory: $CHECKPOINT_DIR"
74
+
75
+ # Test cuda-checkpoint
76
+ echo ""
77
+ echo "[5/5] Testing cuda-checkpoint..."
78
+ cuda-checkpoint --help > /dev/null 2>&1 && echo "cuda-checkpoint: OK" || echo "cuda-checkpoint: FAILED"
79
+ criu check > /dev/null 2>&1 && echo "CRIU check: OK" || echo "CRIU check: WARNING (some features may not work)"
80
+
81
+ echo ""
82
+ echo "=================================================="
83
+ echo "Setup complete!"
84
+ echo "=================================================="
85
+ echo ""
86
+ echo "Next steps:"
87
+ echo "1. Start the backend normally: ./start.sh"
88
+ echo "2. Wait for models to load (~2 min)"
89
+ echo "3. Create checkpoint: ./checkpoint.sh"
90
+ echo "4. Next time, restore: ./restore.sh (should be ~5-10s)"
91
+ echo ""
setup-fast-coldstart.sh ADDED
@@ -0,0 +1,131 @@
1
+ #!/bin/bash
2
+ # =============================================================================
3
+ # FAST COLD START SETUP
4
+ # =============================================================================
5
+ # Este script prepara a VM TensorDock para cold starts rapidos (~60s vs ~487s)
6
+ #
7
+ # Otimizacoes:
8
+ # 1. Pre-download models para SSD local
9
+ # 2. Instala fastsafetensors (loading 4-7x mais rapido)
10
+ # 3. Configura CUDA graph caching
11
+ # 4. Configura environment variables otimizados
12
+ #
13
+ # Uso: ./setup-fast-coldstart.sh
14
+ # =============================================================================
15
+
16
+ set -e
17
+
18
+ echo "=============================================="
19
+ echo "FAST COLD START SETUP"
20
+ echo "=============================================="
21
+
22
+ # Directories
23
+ CACHE_DIR="/var/cache/parle-models"
24
+ HF_CACHE="$CACHE_DIR/huggingface"
25
+ VLLM_CACHE="$CACHE_DIR/vllm"
26
+ CUDA_CACHE="$CACHE_DIR/cuda-cache"
27
+
28
+ # Create directories
29
+ echo "[1/5] Creating cache directories..."
30
+ sudo mkdir -p $CACHE_DIR
31
+ sudo mkdir -p $HF_CACHE
32
+ sudo mkdir -p $VLLM_CACHE
33
+ sudo mkdir -p $CUDA_CACHE
34
+ sudo chmod -R 777 $CACHE_DIR
35
+
36
+ # Set environment variables permanently
37
+ echo "[2/5] Setting environment variables..."
38
+ cat >> ~/.bashrc << 'EOF'
39
+
40
+ # PARLE Fast Cold Start Environment
41
+ export HF_HOME=/var/cache/parle-models/huggingface
42
+ export VLLM_CACHE_DIR=/var/cache/parle-models/vllm
43
+ export CUDA_CACHE_PATH=/var/cache/parle-models/cuda-cache
44
+ export USE_FASTSAFETENSOR=true
45
+ export TOKENIZERS_PARALLELISM=false
46
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
47
+
48
+ # vLLM optimizations
49
+ export VLLM_ATTENTION_BACKEND=FLASH_ATTN
50
+ export VLLM_USE_TRITON_FLASH_ATTN=1
51
+ EOF
52
+
53
+ # Source the new environment
54
+ source ~/.bashrc
55
+
56
+ # Install fastsafetensors
57
+ echo "[3/5] Installing fastsafetensors (4-7x faster loading)..."
58
+ pip install fastsafetensors 2>/dev/null || {
59
+ echo "Warning: fastsafetensors installation failed, will use default loader"
60
+ }
61
+
62
+ # Install NVIDIA Model Streamer (optional, for S3 loading)
63
+ echo "[4/5] Installing nvidia-model-streamer (optional)..."
64
+ pip install nvidia-model-streamer 2>/dev/null || {
65
+ echo "Warning: nvidia-model-streamer not available"
66
+ }
67
+
68
+ # Pre-download models
69
+ echo "[5/5] Pre-downloading models to local cache..."
70
+ echo "This may take 10-30 minutes depending on network speed..."
71
+
72
+ python3 << 'PYTHON_SCRIPT'
73
+ import os
74
+ import time
75
+
76
+ os.environ["HF_HOME"] = "/var/cache/parle-models/huggingface"
77
+
78
+ from huggingface_hub import snapshot_download
79
+
80
+ models = [
81
+ ("RedHatAI/gemma-3-4b-it-quantized.w4a16", "vLLM (Gemma 4B)"),
82
+ ("openai/whisper-small", "Whisper STT"),
83
+ ("marcosremar2/cefr-classifier-pt-mdeberta-v3-enem", "CEFR Classifier"),
84
+ ]
85
+
86
+ print("\n" + "=" * 50)
87
+ for model_id, name in models:
88
+ print(f"\nDownloading {name}: {model_id}")
89
+ start = time.time()
90
+
91
+ try:
92
+ path = snapshot_download(
93
+ model_id,
94
+ cache_dir="/var/cache/parle-models/huggingface",
95
+ )
96
+ elapsed = time.time() - start
97
+ print(f" Downloaded to {path} in {elapsed:.1f}s")
98
+ except Exception as e:
99
+ print(f" ERROR: {e}")
100
+
101
+ # Also download Kokoro voices
102
+ print("\nDownloading Kokoro TTS voices...")
103
+ try:
104
+ from kokoro import KPipeline
105
+ pipeline = KPipeline(lang_code='p', device='cpu') # Just to trigger download
106
+ print(" Kokoro voices downloaded!")
107
+ except Exception as e:
108
+ print(f" Kokoro download skipped: {e}")
109
+
110
+ print("\n" + "=" * 50)
111
+ print("Pre-download complete!")
112
+ print("=" * 50)
113
+ PYTHON_SCRIPT
114
+
115
+ echo ""
116
+ echo "=============================================="
117
+ echo "SETUP COMPLETE!"
118
+ echo "=============================================="
119
+ echo ""
120
+ echo "Expected cold start improvement:"
121
+ echo " Before: ~487s (8 min)"
122
+ echo " After: ~60-90s (1-1.5 min)"
123
+ echo ""
124
+ echo "Optimizations applied:"
125
+ echo " - Models cached locally on SSD"
126
+ echo " - fastsafetensors for 4-7x faster loading"
127
+ echo " - CUDA graph caching enabled"
128
+ echo " - Environment variables optimized"
129
+ echo ""
130
+ echo "To test: ./start-smart.sh"
131
+ echo "=============================================="
start-optimized.sh ADDED
@@ -0,0 +1,198 @@
1
+ #!/bin/bash
2
+ # =============================================================================
3
+ # OPTIMIZED PARLE BACKEND STARTUP
4
+ # =============================================================================
5
+ # Startup script com medicoes de tempo para cada fase
6
+ #
7
+ # Fases:
8
+ # 1. Environment setup
9
+ # 2. Check/restore from checkpoint (se disponivel)
10
+ # 3. Fast model loading (otimizado)
11
+ # 4. Health check
12
+ # =============================================================================
13
+
14
+ set -e
15
+
16
+ SCRIPT_DIR="$(dirname "$0")"
17
+ LOG_FILE="$SCRIPT_DIR/startup.log"
18
+
19
+ # Timing function
20
+ timestamp() {
21
+ date +%s.%N
22
+ }
23
+
24
+ log() {
25
+ echo "[$(date '+%H:%M:%S')] $1" | tee -a "$LOG_FILE"
26
+ }
27
+
28
+ # Start timing
29
+ TOTAL_START=$(timestamp)
30
+
31
+ echo "=============================================="
32
+ echo "PARLE Backend - Optimized Startup"
33
+ echo "=============================================="
34
+ echo "" > "$LOG_FILE"
35
+
36
+ # =============================================================================
37
+ # PHASE 1: Environment Setup
38
+ # =============================================================================
39
+ PHASE1_START=$(timestamp)
40
+ log "PHASE 1: Environment Setup"
41
+
42
+ # Set optimized environment
43
+ export HF_HOME=/var/cache/parle-models/huggingface
44
+ export VLLM_CACHE_DIR=/var/cache/parle-models/vllm
45
+ export CUDA_CACHE_PATH=/var/cache/parle-models/cuda-cache
46
+ export USE_FASTSAFETENSOR=true
47
+ export TOKENIZERS_PARALLELISM=false
48
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
49
+ export VLLM_ATTENTION_BACKEND=FLASH_ATTN
50
+
51
+ # Check if models are pre-cached
52
+ if [ -d "/var/cache/parle-models/huggingface" ]; then
53
+ CACHE_SIZE=$(du -sh /var/cache/parle-models/huggingface 2>/dev/null | cut -f1)
54
+ log " Model cache found: $CACHE_SIZE"
55
+ else
56
+ log " WARNING: No model cache found. First run will be slow."
57
+ fi
58
+
59
+ PHASE1_END=$(timestamp)
60
+ PHASE1_TIME=$(echo "$PHASE1_END - $PHASE1_START" | bc)
61
+ log " Phase 1 complete: ${PHASE1_TIME}s"
62
+
63
+ # =============================================================================
64
+ # PHASE 2: Checkpoint Restore (if available)
65
+ # =============================================================================
66
+ PHASE2_START=$(timestamp)
67
+ log "PHASE 2: Checkpoint Check"
68
+
69
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
70
+ CHECKPOINT_PATH="$CHECKPOINT_DIR/latest"
71
+
72
+ if [ -d "$CHECKPOINT_PATH" ] || [ -L "$CHECKPOINT_PATH" ]; then
73
+ log " Checkpoint found! Attempting restore..."
74
+
75
+ if "$SCRIPT_DIR/restore.sh" "$CHECKPOINT_PATH" 2>/dev/null; then
76
+ PHASE2_END=$(timestamp)
77
+ PHASE2_TIME=$(echo "$PHASE2_END - $PHASE2_START" | bc)
78
+ TOTAL_TIME=$(echo "$PHASE2_END - $TOTAL_START" | bc)
79
+
80
+ log " Restored from checkpoint!"
81
+ log ""
82
+ log "=============================================="
83
+ log "STARTUP COMPLETE (from checkpoint)"
84
+ log " Phase 1 (env): ${PHASE1_TIME}s"
85
+ log " Phase 2 (restore): ${PHASE2_TIME}s"
86
+ log " TOTAL: ${TOTAL_TIME}s"
87
+ log "=============================================="
88
+ exit 0
89
+ else
90
+ log " Checkpoint restore failed, continuing with cold start"
91
+ fi
92
+ else
93
+ log " No checkpoint found, proceeding with cold start"
94
+ fi
95
+
96
+ PHASE2_END=$(timestamp)
97
+ PHASE2_TIME=$(echo "$PHASE2_END - $PHASE2_START" | bc)
98
+ log " Phase 2 complete: ${PHASE2_TIME}s"
99
+
100
+ # =============================================================================
101
+ # PHASE 3: Model Loading (Optimized)
102
+ # =============================================================================
103
+ PHASE3_START=$(timestamp)
104
+ log "PHASE 3: Model Loading"
105
+
106
+ # Start the server with optimized loading
107
+ cd "$SCRIPT_DIR"
108
+
109
+ # Create a Python script for optimized loading
110
+ python3 << 'PYTHON_SCRIPT' &
111
+ import os
112
+ import sys
113
+ import time
114
+
115
+ # Ensure environment
116
+ os.environ["HF_HOME"] = "/var/cache/parle-models/huggingface"
117
+ os.environ["USE_FASTSAFETENSOR"] = "true"
118
+
119
+ print("[STARTUP] Starting optimized model loading...")
120
+ start = time.time()
121
+
122
+ # Import app module (will trigger load_models on startup)
123
+ import uvicorn
124
+
125
+ # Run server
126
+ uvicorn.run(
127
+ "app:app",
128
+ host="0.0.0.0",
129
+ port=8000,
130
+ log_level="info",
131
+ )
132
+ PYTHON_SCRIPT
133
+
134
+ SERVER_PID=$!
135
+ log " Server started (PID: $SERVER_PID)"
136
+
137
+ # =============================================================================
138
+ # PHASE 4: Health Check
139
+ # =============================================================================
140
+ PHASE4_START=$(timestamp)
141
+ log "PHASE 4: Waiting for health..."
142
+
143
+ # Wait for backend to be healthy
144
+ MAX_WAIT=300 # 5 minutes max
145
+ WAIT_INTERVAL=2
146
+
147
+ for i in $(seq 1 $((MAX_WAIT / WAIT_INTERVAL))); do
148
+ HEALTH=$(curl -s --max-time 2 http://localhost:8000/health 2>/dev/null || echo "")
149
+
150
+ if [ ! -z "$HEALTH" ]; then
151
+ # Check if all models are loaded
152
+ WHISPER=$(echo "$HEALTH" | grep -o '"whisper_loaded":true' || true)
153
+ VLLM=$(echo "$HEALTH" | grep -o '"vllm_loaded":true' || true)
154
+ KOKORO=$(echo "$HEALTH" | grep -o '"kokoro_loaded":true' || true)
155
+
156
+ if [ ! -z "$WHISPER" ] && [ ! -z "$VLLM" ] && [ ! -z "$KOKORO" ]; then
157
+ PHASE4_END=$(timestamp)
158
+ PHASE3_TIME=$(echo "$PHASE4_START - $PHASE3_START" | bc)
159
+ PHASE4_TIME=$(echo "$PHASE4_END - $PHASE4_START" | bc)
160
+ TOTAL_TIME=$(echo "$PHASE4_END - $TOTAL_START" | bc)
161
+
162
+ echo ""
163
+ log "=============================================="
164
+ log "STARTUP COMPLETE (cold start)"
165
+ log " Phase 1 (env): ${PHASE1_TIME}s"
166
+ log " Phase 2 (checkpoint): ${PHASE2_TIME}s"
167
+ log " Phase 3 (loading): ${PHASE3_TIME}s"
168
+ log " Phase 4 (health): ${PHASE4_TIME}s"
169
+ log " TOTAL: ${TOTAL_TIME}s"
170
+ log "=============================================="
171
+ log ""
172
+ log "Server running at http://localhost:8000"
173
+ log "Health endpoint: http://localhost:8000/health"
174
+ log ""
175
+
176
+ # Create checkpoint for faster next startup
177
+ if [ ! -d "$CHECKPOINT_PATH" ]; then
178
+ log "TIP: Create checkpoint for faster startup:"
179
+ log " ./checkpoint.sh"
180
+ fi
181
+
182
+ # Keep script running
183
+ wait $SERVER_PID
184
+ exit 0
185
+ fi
186
+ fi
187
+
188
+ # Progress update every 10 seconds
189
+ if [ $((i % 5)) -eq 0 ]; then
190
+ ELAPSED=$((i * WAIT_INTERVAL))
191
+ log " Still loading... (${ELAPSED}s)"
192
+ fi
193
+
194
+ sleep $WAIT_INTERVAL
195
+ done
196
+
197
+ log "ERROR: Timeout waiting for backend (${MAX_WAIT}s)"
198
+ exit 1
start-smart.sh ADDED
@@ -0,0 +1,91 @@
1
+ #!/bin/bash
2
+ # Smart PARLE Backend Startup Script
3
+ # Attempts restore from checkpoint first, falls back to cold start
4
+ #
5
+ # Usage: ./start-smart.sh
6
+
7
+ set -e
8
+
9
+ CHECKPOINT_DIR="/var/lib/parle-checkpoints"
10
+ CHECKPOINT_PATH="$CHECKPOINT_DIR/latest"
11
+ SCRIPT_DIR="$(dirname "$0")"
12
+
13
+ echo "=============================================="
14
+ echo "PARLE Backend Smart Startup"
15
+ echo "=============================================="
16
+
17
+ # Check if checkpoint exists
18
+ if [ -d "$CHECKPOINT_PATH" ] || [ -L "$CHECKPOINT_PATH" ]; then
19
+ echo "Checkpoint found! Attempting fast restore..."
20
+ echo ""
21
+
22
+ START_TIME=$(date +%s)
23
+
24
+ # Try to restore
25
+ if "$SCRIPT_DIR/restore.sh" "$CHECKPOINT_PATH"; then
26
+ END_TIME=$(date +%s)
27
+ DURATION=$((END_TIME - START_TIME))
28
+ echo ""
29
+ echo "Fast restore completed in ${DURATION}s!"
30
+ exit 0
31
+ else
32
+ echo ""
33
+ echo "Restore failed, falling back to cold start..."
34
+ echo ""
35
+ fi
36
+ else
37
+ echo "No checkpoint found at $CHECKPOINT_PATH"
38
+ echo "Performing cold start..."
39
+ echo ""
40
+ fi
41
+
42
+ # Cold start fallback
43
+ echo "=============================================="
44
+ echo "Cold Start Mode"
45
+ echo "=============================================="
46
+
47
+ START_TIME=$(date +%s)
48
+
49
+ # Run the normal start script
50
+ "$SCRIPT_DIR/start.sh" &
51
+ SERVER_PID=$!
52
+
53
+ # Wait for backend to be healthy
54
+ echo "Waiting for backend to be ready..."
55
+ for i in {1..180}; do
56
+ HEALTH=$(curl -s --max-time 2 http://localhost:8000/health 2>/dev/null)
57
+ if [ ! -z "$HEALTH" ]; then
58
+ # Check if all models are loaded
59
+ WHISPER=$(echo "$HEALTH" | grep -o '"whisper_loaded":true' || true)
60
+ VLLM=$(echo "$HEALTH" | grep -o '"vllm_loaded":true' || true)
61
+ KOKORO=$(echo "$HEALTH" | grep -o '"kokoro_loaded":true' || true)
62
+
63
+ if [ ! -z "$WHISPER" ] && [ ! -z "$VLLM" ] && [ ! -z "$KOKORO" ]; then
64
+ END_TIME=$(date +%s)
65
+ DURATION=$((END_TIME - START_TIME))
66
+
67
+ echo ""
68
+ echo "=============================================="
69
+ echo "Backend ready! (cold start: ${DURATION}s)"
70
+ echo "=============================================="
71
+ echo ""
72
+
73
+ # Offer to create checkpoint
74
+ echo "TIP: Create a checkpoint now for faster startup next time:"
75
+ echo " ./checkpoint.sh"
76
+ echo ""
77
+
78
+ # Keep the script running to maintain the server
79
+ wait $SERVER_PID
80
+ exit 0
81
+ fi
82
+ fi
83
+
84
+ if [ $((i % 10)) -eq 0 ]; then
85
+ echo " Still loading... ($i/180)"
86
+ fi
87
+ sleep 1
88
+ done
89
+
90
+ echo "ERROR: Timeout waiting for backend to be ready"
91
+ exit 1
start.sh ADDED
@@ -0,0 +1,40 @@
1
+ #!/bin/bash
2
+ # PARLE Backend Startup Script
3
+ # This script sets up environment variables and starts the FastAPI server
4
+
5
+ # ============================================================================
6
+ # CONFIGURATION - Edit these values before deploying
7
+ # ============================================================================
8
+
9
+ # TensorDock Auto-Stop Configuration
10
+ export TENSORDOCK_API_TOKEN="WBE5UPHOC6Ed1HeLYL2TjqbBqVEwn5MF"
11
+ export TENSORDOCK_INSTANCE_ID="befc5b17-7516-4ccd-a0ff-da2d4ecdb874"
12
+ export IDLE_TIMEOUT_SECONDS="120" # 2 minutes
13
+
14
+ # Email Alerts (get key from https://resend.com)
15
+ export RESEND_API_KEY="" # Set this to receive email alerts
16
+ export ALERT_EMAIL="marcos@marcosrp.com"
17
+
18
+ # ============================================================================
19
+ # STARTUP
20
+ # ============================================================================
21
+
22
+ echo "=================================================="
23
+ echo "PARLE Backend Starting..."
24
+ echo "=================================================="
25
+ echo "Instance ID: $TENSORDOCK_INSTANCE_ID"
26
+ echo "Idle Timeout: ${IDLE_TIMEOUT_SECONDS}s"
27
+ echo "Alert Email: $ALERT_EMAIL"
28
+ echo "Resend Key: $([ -n "$RESEND_API_KEY" ] && echo "SET" || echo "NOT SET")"
29
+ echo "=================================================="
30
+
31
+ # Change to script directory
32
+ cd "$(dirname "$0")"
33
+
34
+ # Activate virtual environment if exists
35
+ if [ -f "/home/user/venv/bin/activate" ]; then
36
+ source /home/user/venv/bin/activate
37
+ fi
38
+
39
+ # Start the server
40
+ exec uvicorn app:app --host 0.0.0.0 --port 8000 --workers 1
stt/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Whisper STT Model
2
+ # HuggingFace: openai/whisper-small
tts/.gitkeep ADDED
@@ -0,0 +1,2 @@
1
+ # Kokoro TTS Model
2
+ # HuggingFace: hexgrad/Kokoro-82M