calettippo committed
Commit eebc859 · 1 Parent(s): 3531b9b

Add gradio app

Files changed (4)
  1. DEPLOYMENT_GUIDE.md +91 -0
  2. app.py +612 -0
  3. assets/ScribeAId.svg +13 -0
  4. requirements.txt +14 -0
DEPLOYMENT_GUIDE.md ADDED
@@ -0,0 +1,91 @@
# 🚀 Deployment Guide for Hugging Face Spaces

## 1. Creating the Space

1. Go to [Hugging Face Spaces](https://huggingface.co/spaces)
2. Click "Create new Space"
3. Choose a name for your Space (e.g. `scribeaid-demo`)
4. Select **Gradio** as the SDK
5. Choose **Public** or **Private** depending on your needs
6. Click "Create Space"

## 2. Uploading the Files

Upload all the files in this folder to the Space (or push them programmatically, as shown in the sketch below):

```
scribeaid-demo/
├── app.py               # ✅ Main app file
├── requirements.txt     # ✅ Python dependencies
├── README.md            # ✅ Metadata and description
├── .gitignore           # ✅ Files to ignore
└── assets/
    └── ScribeAId.svg    # ✅ App logo
```

**IMPORTANT**: do NOT upload the `content/` folder containing the models; these will be downloaded automatically from the Hugging Face Hub.
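
If you prefer to push the files from Python rather than through the web UI, a minimal sketch using `huggingface_hub` might look like this (it assumes the Space already exists and that `YOUR_USERNAME/scribeaid-demo` is a placeholder for its ID):

```python
# Upload sketch using huggingface_hub (installed via requirements.txt).
# YOUR_USERNAME/scribeaid-demo is a placeholder Space ID.
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment or a cached login
api.upload_folder(
    folder_path=".",                      # local folder with app.py, requirements.txt, ...
    repo_id="YOUR_USERNAME/scribeaid-demo",
    repo_type="space",
    ignore_patterns=["content/*"],        # never upload the local models
)
```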

## 3. Configuring the Secrets

In your Space, go to **Settings** → **Variables and secrets** and add:

### Required secrets:

```
HF_MODEL_ID = "ReportAId/whisper-medium-it-finetuned"
BASE_WHISPER_MODEL_ID = "openai/whisper-medium"
```

### Optional secrets (if needed):

```
HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxx"
HUGGINGFACEHUB_API_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxx"
```

**Note**: the HF token is only needed if your models are private.
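
The secrets can also be set from Python instead of the Settings UI; a sketch under the same placeholder-ID assumption (the call requires a token with write access to the Space):

```python
# Programmatic alternative to Settings → Variables and secrets (sketch).
from huggingface_hub import HfApi

api = HfApi()
api.add_space_secret("YOUR_USERNAME/scribeaid-demo", "HF_MODEL_ID",
                     "ReportAId/whisper-medium-it-finetuned")
api.add_space_secret("YOUR_USERNAME/scribeaid-demo", "BASE_WHISPER_MODEL_ID",
                     "openai/whisper-medium")
```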

## 4. Hardware Configuration

For optimal performance, go to **Settings** → **Hardware** and select:

- **CPU basic** (free) - for testing
- **CPU upgrade** - for moderate use
- **T4 small** - for better performance (requires a subscription)
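
Whichever tier you choose, `app.py` selects the device automatically at startup; the check performed in `load_asr_pipeline` boils down to this:

```python
# Device auto-selection, as done in app.py's load_asr_pipeline()
import torch

if torch.cuda.is_available():
    device_str = "cuda"  # e.g. on a T4 small Space
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device_str = "mps"   # Apple Silicon, for local runs
else:
    device_str = "cpu"   # CPU basic / CPU upgrade
print(f"Selected device: {device_str}")
```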

## 5. Deploy and Test

1. After uploading the files and configuring the secrets, the Space will start automatically
2. Check the **Logs** for any errors
3. Once it is running, test the app by uploading an audio file

## 6. Troubleshooting Common Issues

### "Model not found" error
- Check that `HF_MODEL_ID` and `BASE_WHISPER_MODEL_ID` are correct
- If the model is private, make sure `HF_TOKEN` is configured

### "Out of memory" error
- Try more powerful hardware
- The app is optimized to run on CPU as well

### "Audio processing failed" error
- Check that the audio file is in a supported format (WAV, MP3, etc.)
- The audio should be at least 5-10 seconds long for optimal results
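
To rule out the last two points before uploading, you can inspect a file locally; a small sketch with `soundfile` (already listed in `requirements.txt`; `sample.wav` is a placeholder path):

```python
# Quick local sanity check on an audio file (sketch; sample.wav is a placeholder).
import soundfile as sf

info = sf.info("sample.wav")
print(f"format={info.format}, samplerate={info.samplerate} Hz, duration={info.duration:.1f} s")
if info.duration < 5:
    print("Warning: shorter than the recommended 5-10 seconds")
```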

## 7. Monitoring

- Check the Space **Logs** regularly
- Monitor resource usage under **Settings** → **Usage**

## 8. Updates

To update the app:

1. Edit the files locally
2. Upload the updated files to the Space (see the sketch below)
3. The Space will restart automatically
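
For step 2, a single changed file can also be pushed from Python (a sketch with the same placeholder Space ID; the `upload_folder` call from section 2 works for bulk updates too):

```python
# Push one updated file to the Space (sketch; placeholder Space ID).
from huggingface_hub import HfApi

HfApi().upload_file(
    path_or_fileobj="app.py",
    path_in_repo="app.py",
    repo_id="YOUR_USERNAME/scribeaid-demo",
    repo_type="space",
    commit_message="Update app",
)
```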

---

🎯 **Your Space will be available at**: `https://huggingface.co/spaces/YOUR_USERNAME/SPACE_NAME`
app.py ADDED
@@ -0,0 +1,612 @@
import os
import sys
import tempfile
import time
import logging
import gc
from dataclasses import dataclass
from typing import Optional, Tuple, List, Any, Dict
from contextlib import contextmanager

import gradio as gr
import torch
import psutil
from dotenv import load_dotenv

load_dotenv()

# Audio preprocessing not available in Hugging Face Spaces deployment
PREPROCESSING_AVAILABLE = False


def get_env_or_secret(key: str, default: Optional[str] = None) -> Optional[str]:
    """Get environment variable or default."""
    return os.environ.get(key, default)


@dataclass
class InferenceMetrics:
    """Track inference performance metrics."""

    processing_time: float
    memory_usage: float
    device_used: str
    dtype_used: str
    model_size_mb: Optional[float] = None


@dataclass
class PreprocessingConfig:
    """Configuration for audio preprocessing pipeline."""

    normalize_format: bool = True
    normalize_volume: bool = True
    reduce_noise: bool = False
    remove_silence: bool = False


def load_asr_pipeline(
    model_id: str,
    base_model_id: str,
    device_pref: str = "auto",
    hf_token: Optional[str] = None,
    dtype_pref: str = "auto",
    chunk_length_s: Optional[int] = None,
    return_timestamps: bool = False,
):
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    logger.info(f"Loading ASR pipeline for model: {model_id}")
    logger.info(
        f"Device preference: {device_pref}, Token provided: {hf_token is not None}"
    )

    from transformers import pipeline

    # Pick optimal device for inference
    device_str = "cpu"
    if device_pref == "auto":
        if torch.cuda.is_available():
            device_str = "cuda"
            logger.info(f"Using CUDA: {torch.cuda.get_device_name()}")
        elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
            device_str = "mps"
            logger.info("Using Apple Silicon MPS for inference")
        else:
            device_str = "cpu"
            logger.info("Using CPU for inference")
    else:
        device_str = device_pref

    # Pick dtype - optimized for inference performance
    dtype = None
    if dtype_pref == "auto":
        # For whisper-medium models, use float32 for stability in medical transcription
        if "whisper-medium" in model_id:
            dtype = torch.float32
            logger.info(
                f"Using float32 for {model_id} (medical transcription stability)"
            )
        elif device_str == "cuda":
            dtype = torch.float16  # Use half precision on GPU for speed
            logger.info("Using float16 on CUDA for faster inference")
        else:
            dtype = torch.float32
    else:
        dtype = {"float32": torch.float32, "float16": torch.float16}.get(
            dtype_pref, torch.float32
        )

    logger.info("Pipeline configuration:")
    logger.info(f"  Model: {model_id}")
    logger.info(f"  Base model: {base_model_id}")
    logger.info(f"  Dtype: {dtype}")
    logger.info(f"  Device: {device_str}")
    logger.info(f"  Chunk length: {chunk_length_s}s")
    logger.info(f"  Return timestamps: {return_timestamps}")

    # Use ultra-simplified approach to avoid all compatibility issues
    try:
        logger.info(
            "Setting up ultra-simplified pipeline to avoid forced_decoder_ids conflicts..."
        )

        # Create pipeline with absolute minimal configuration
        asr = pipeline(
            task="automatic-speech-recognition",
            model=model_id,
            torch_dtype=dtype,
            device=0
            if device_str == "cuda"
            else ("mps" if device_str == "mps" else "cpu"),
            token=hf_token,
        )

        # Post-loading cleanup to remove any forced_decoder_ids
        if hasattr(asr.model, "generation_config"):
            if hasattr(asr.model.generation_config, "forced_decoder_ids"):
                logger.info("Removing forced_decoder_ids from model generation config")
                asr.model.generation_config.forced_decoder_ids = None

        # chunk_length_s is applied per call in transcribe_local; only log it here
        if chunk_length_s:
            logger.info(f"Setting chunk_length_s to {chunk_length_s}")

        logger.info(f"Successfully created ultra-simplified pipeline for: {model_id}")

    except Exception as e:
        logger.error(f"Ultra-simplified pipeline creation failed: {e}")
        logger.info("Falling back to absolute minimal settings...")

        try:
            # Fallback with absolute minimal settings
            fallback_dtype = torch.float32

            asr = pipeline(
                task="automatic-speech-recognition",
                model=model_id,
                torch_dtype=fallback_dtype,
                device="cpu",  # Force CPU for maximum compatibility
                token=hf_token,
            )

            # Post-loading cleanup
            if hasattr(asr.model, "generation_config"):
                if hasattr(asr.model.generation_config, "forced_decoder_ids"):
                    logger.info("Removing forced_decoder_ids from fallback model")
                    asr.model.generation_config.forced_decoder_ids = None

            device_str = "cpu"
            dtype = fallback_dtype
            logger.info(
                f"Minimal fallback pipeline created with dtype: {fallback_dtype}"
            )

        except Exception as fallback_error:
            logger.error(f"Minimal fallback failed: {fallback_error}")
            raise

    return asr, device_str, str(dtype).replace("torch.", "")

@contextmanager
def memory_monitor():
    """Monitor memory usage during inference.

    Yields a dict whose "delta_mb" key (in MB) is filled in on exit;
    returning the delta from a generator-based context manager would
    silently discard it, so a mutable dict is used instead.
    """
    process = psutil.Process()
    start_memory = process.memory_info().rss / 1024 / 1024  # MB
    stats: Dict[str, float] = {}
    try:
        yield stats
    finally:
        end_memory = process.memory_info().rss / 1024 / 1024  # MB
        stats["delta_mb"] = end_memory - start_memory
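
# Usage sketch for memory_monitor (hypothetical: transcribe_local below
# currently measures memory with psutil directly rather than through this helper):
#
#     with memory_monitor() as stats:
#         result = asr(audio_path)
#     print(f"Inference used {stats['delta_mb']:.1f} MB")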


def transcribe_local(
    audio_path: str,
    model_id: str,
    base_model_id: str,
    language: Optional[str],
    task: str,
    device_pref: str,
    dtype_pref: str,
    hf_token: Optional[str],
    chunk_length_s: Optional[int],
    stride_length_s: Optional[int],
    return_timestamps: bool,
) -> Dict[str, Any]:
    logger = logging.getLogger(__name__)
    logger.info(f"Starting transcription: {os.path.basename(audio_path)}")
    logger.info(f"Model: {model_id}")

    # Validate audio_path
    if audio_path is None:
        raise ValueError("Audio path is None")
    if not isinstance(audio_path, (str, bytes, os.PathLike)):
        raise TypeError(
            f"Audio path must be str, bytes or os.PathLike, got {type(audio_path)}"
        )
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    # Load ASR pipeline with performance monitoring
    start_time = time.time()

    asr, device_str, dtype_str = load_asr_pipeline(
        model_id=model_id,
        base_model_id=base_model_id,
        device_pref=device_pref,
        hf_token=hf_token,
        dtype_pref=dtype_pref,
        chunk_length_s=chunk_length_s,
        return_timestamps=return_timestamps,
    )

    load_time = time.time() - start_time
    logger.info(f"Model loaded in {load_time:.2f}s")

    # Simplified configuration to avoid compatibility issues: let the pipeline
    # handle generation parameters internally (note that the language and task
    # arguments are accepted here but deliberately not forwarded)
    logger.info("Using simplified configuration to avoid model compatibility issues")

    # Setup inference parameters with performance monitoring
    try:
        # Start with minimal parameters to avoid conflicts
        asr_kwargs = {}

        # Only add parameters that are safe and supported
        if return_timestamps:
            asr_kwargs["return_timestamps"] = return_timestamps
            logger.info("Timestamps enabled")

        # Apply chunking strategy only if supported
        if chunk_length_s:
            try:
                asr_kwargs["chunk_length_s"] = chunk_length_s
                logger.info(f"Using chunking strategy: {chunk_length_s}s")
            except Exception as chunk_error:
                logger.warning(f"Chunking not supported: {chunk_error}")

        if stride_length_s is not None:
            try:
                asr_kwargs["stride_length_s"] = stride_length_s
                logger.info(f"Using stride: {stride_length_s}s")
            except Exception as stride_error:
                logger.warning(f"Stride not supported: {stride_error}")

        logger.info(f"Inference parameters configured: {list(asr_kwargs.keys())}")

        # Run inference with performance monitoring
        inference_start = time.time()
        memory_before = psutil.Process().memory_info().rss / 1024 / 1024  # MB

        try:
            # Primary inference attempt with safe parameters
            if asr_kwargs:
                result = asr(audio_path, **asr_kwargs)
            else:
                # Fallback to no parameters if all failed
                result = asr(audio_path)

            inference_time = time.time() - inference_start
            memory_after = psutil.Process().memory_info().rss / 1024 / 1024  # MB
            memory_used = memory_after - memory_before

            logger.info(f"Inference completed successfully in {inference_time:.2f}s")
            logger.info(f"Memory used: {memory_used:.1f}MB")

        except Exception as e:
            error_msg = str(e)
            logger.warning(f"Inference failed with parameters: {error_msg}")

            # Try with absolutely minimal parameters
            if "forced_decoder_ids" in error_msg:
                logger.info(
                    "Detected forced_decoder_ids error, trying with no parameters..."
                )
            elif (
                "probability tensor contains either inf, nan or element < 0"
                in error_msg
            ):
                logger.info(
                    "Detected numerical instability, trying with no parameters..."
                )
            else:
                logger.info("Unknown error, trying with no parameters...")

            try:
                inference_start = time.time()
                result = asr(audio_path)  # No parameters at all
                inference_time = time.time() - inference_start
                memory_used = 0  # Reset memory tracking

                logger.info(f"Minimal inference completed in {inference_time:.2f}s")
            except Exception as final_error:
                logger.error(f"All inference attempts failed: {final_error}")
                raise

    except Exception as e:
        logger.error(f"Inference failed: {e}")
        raise

    # Cleanup GPU memory after inference
    if device_str == "cuda":
        torch.cuda.empty_cache()
        gc.collect()

    # Return results with performance metrics
    meta = {
        "device": device_str,
        "dtype": dtype_str,
        "inference_time": inference_time,
        "memory_used_mb": memory_used,
        "model_type": "original" if model_id == base_model_id else "fine-tuned",
    }

    return {"result": result, "meta": meta}


def handle_whisper_problematic_output(text: str, model_name: str = "Whisper") -> dict:
    """Handle problematic Whisper outputs such as '!', '.', empty strings, etc."""
    if not text:
        return {
            "text": "[WHISPER ISSUE: Output vuoto - Audio troppo corto o silenzioso]",
            "is_problematic": True,
            "original": text,
            "issue_type": "empty",
        }

    text_stripped = text.strip()

    # Common problematic outputs (descriptions shown to the user stay in Italian)
    problematic_outputs = {
        "!": "Audio troppo corto/silenzioso",
        ".": "Audio di bassa qualità",
        "?": "Audio incomprensibile",
        "...": "Audio troppo lungo senza parlato",
        "--": "Audio distorto",
        "—": "Audio con troppo rumore",
        "per!": "Audio parzialmente comprensibile",
    }

    if text_stripped in problematic_outputs:
        return {
            "text": f"[WHISPER ISSUE: '{text_stripped}' - {problematic_outputs[text_stripped]}]",
            "is_problematic": True,
            "original": text,
            "issue_type": text_stripped,
            "suggestion": problematic_outputs[text_stripped],
        }

    # Text too short (two characters or fewer and not alphabetic)
    if len(text_stripped) <= 2 and not text_stripped.isalpha():
        return {
            "text": f"[WHISPER ISSUE: '{text_stripped}' - Output troppo corto/simbolico]",
            "is_problematic": True,
            "original": text,
            "issue_type": "short_symbolic",
        }

    return {"text": text, "is_problematic": False, "original": text}


def transcribe_comparison(audio_file):
    """Main function for the Gradio interface."""
    if audio_file is None:
        return "❌ Nessun file audio fornito", "❌ Nessun file audio fornito"

    # Model configuration
    model_id = get_env_or_secret("HF_MODEL_ID")
    base_model_id = get_env_or_secret("BASE_WHISPER_MODEL_ID")
    hf_token = get_env_or_secret("HF_TOKEN") or get_env_or_secret(
        "HUGGINGFACEHUB_API_TOKEN"
    )

    if not model_id or not base_model_id:
        error_msg = "❌ Modelli non configurati. Impostare HF_MODEL_ID e BASE_WHISPER_MODEL_ID nelle variabili d'ambiente"
        return error_msg, error_msg

    # Preprocessing is always enabled (hidden from the user); it is no longer
    # used in this code but may be useful for future implementations

    # Fixed settings optimized for medical transcription
    language = "it"  # Always Italian for ScribeAId
    task = "transcribe"
    return_ts = True  # Timestamps for medical report segments
    device_pref = "auto"  # Auto-detect best device
    dtype_pref = "auto"  # Auto-select optimal precision
    chunk_len = 7  # 7-second chunks for better context
    stride_len = 1  # Minimal stride for accuracy

    try:
        # Use the audio file path directly from Gradio
        tmp_path = audio_file

        original_result = None
        finetuned_result = None
        original_text = ""
        finetuned_text = ""

        try:
            # Transcribe with original model
            original_result = transcribe_local(
                audio_path=tmp_path,
                model_id=base_model_id,
                base_model_id=base_model_id,
                language=language,
                task=task,
                device_pref=device_pref,
                dtype_pref=dtype_pref,
                hf_token=None,  # Base model doesn't need token
                chunk_length_s=int(chunk_len) if chunk_len else None,
                stride_length_s=int(stride_len) if stride_len else None,
                return_timestamps=return_ts,
            )

            # Extract text from result
            if isinstance(original_result["result"], dict):
                original_text = original_result["result"].get(
                    "text"
                ) or original_result["result"].get("transcription")
            elif isinstance(original_result["result"], str):
                original_text = original_result["result"]

            if original_text:
                result = handle_whisper_problematic_output(
                    original_text, "Original Whisper"
                )
                if result["is_problematic"]:
                    original_text = f"⚠️ {result['text']}\n\n💡 Suggerimenti:\n• Registra almeno 5-10 secondi di audio\n• Parla chiaramente e ad alto volume\n• Avvicinati al microfono\n• Evita rumori di fondo"
                else:
                    original_text = result["text"]
            else:
                original_text = "❌ Nessun testo restituito dal modello originale"

        except Exception as e:
            original_text = f"❌ Errore modello originale: {str(e)}"

        try:
            # Transcribe with fine-tuned model
            finetuned_result = transcribe_local(
                audio_path=tmp_path,
                model_id=model_id,
                base_model_id=base_model_id,
                language=language,
                task=task,
                device_pref=device_pref,
                dtype_pref=dtype_pref,
                hf_token=hf_token or None,
                chunk_length_s=int(chunk_len) if chunk_len else None,
                stride_length_s=int(stride_len) if stride_len else None,
                return_timestamps=return_ts,
            )

            # Extract text from result
            if isinstance(finetuned_result["result"], dict):
                finetuned_text = finetuned_result["result"].get(
                    "text"
                ) or finetuned_result["result"].get("transcription")
            elif isinstance(finetuned_result["result"], str):
                finetuned_text = finetuned_result["result"]

            if finetuned_text:
                result = handle_whisper_problematic_output(
                    finetuned_text, "Fine-tuned Model"
                )
                if result["is_problematic"]:
                    finetuned_text = f"⚠️ {result['text']}\n\n💡 Suggerimenti:\n• Registra almeno 5-10 secondi di audio\n• Parla chiaramente e ad alto volume\n• Avvicinati al microfono\n• Evita rumori di fondo"
                else:
                    finetuned_text = result["text"]
            else:
                finetuned_text = "❌ Nessun testo restituito dal modello fine-tuned"

        except Exception as e:
            finetuned_text = f"❌ Errore modello fine-tuned: {str(e)}"

        # GPU memory cleanup
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            gc.collect()

        return original_text, finetuned_text

    except Exception as e:
        error_msg = f"❌ Errore generale: {str(e)}"
        return error_msg, error_msg


# Gradio interface
def create_interface():
    """Create and configure the Gradio interface."""

    model_id = get_env_or_secret("HF_MODEL_ID", "ReportAId/whisper-medium-it-finetuned")
    base_model_id = get_env_or_secret("BASE_WHISPER_MODEL_ID", "openai/whisper-medium")

    # Load the SVG logo inline so it displays even without file routing
    logo_html = None
    try:
        logo_path = os.path.join(os.path.dirname(__file__), "assets", "ScribeAId.svg")
        with open(logo_path, "r", encoding="utf-8") as f:
            svg_content = f.read()
        # Wrap the SVG in a centered container
        logo_html = f"""
        <div style=\"text-align: center; margin: 16px 0 8px;\">
            <div style=\"display:inline-block; height:60px;\">{svg_content}</div>
        </div>
        """
    except Exception:
        # Fall back to the file= path if for some reason the file cannot be read
        logo_html = """
        <div style=\"text-align: center; margin: 16px 0 8px;\">
            <img src=\"file=assets/ScribeAId.svg\" alt=\"ScribeAId\" style=\"height: 60px; margin-bottom: 8px;\">
        </div>
        """

    with gr.Blocks(
        title="ScribeAId - Medical Transcription",
        theme=gr.themes.Default(primary_hue="blue"),
        css=".gradio-container{max-width: 900px !important; margin: 0 auto !important;} .center-col{display:flex;flex-direction:column;align-items:center;} .center-col .wrap{width:100%;}",
    ) as demo:
        # Header with the ScribeAId logo (simple, black/white)
        gr.HTML(logo_html)
        gr.Markdown("""
        Quest’applicazione confronta il modello Whisper medium di base con il modello open-source fine-tuned pubblicato da ReportAId su dati ambulatoriali italiani. È progettato per mitigare errori noti e migliorare le performance. Carica un audio o registra la voce: noterai trascrizioni più accurate di termini clinici come “Holter delle 24 ore”, “fibrillazione atriale” o “pressione sistolica”.
        """)

        with gr.Row():
            with gr.Column():
                gr.Markdown(f"""
                **⚙️ Impostazioni**
                - Modello originale: `{base_model_id}`
                - Modello fine-tuned: `{model_id}`
                - Lingua: Italiano (it)
                - Preprocessing audio: ottimizzato per registrazioni mediche
                """)

        gr.Markdown("---")

        # Input section title
        gr.Markdown("## Input")

        # Audio input and button aligned left
        audio_input = gr.Audio(
            label="📥 Registra dal microfono o carica un file",
            type="filepath",
            sources=["microphone", "upload"],
            format="wav",
            streaming=False,
            interactive=True,
        )
        transcribe_btn = gr.Button("🚀 Trascrivi e Confronta", variant="primary")

        gr.Markdown("---")

        gr.Markdown("## Output")

        with gr.Row():
            with gr.Column():
                gr.Markdown("### Modello base (Whisper medium)")
                original_output = gr.Textbox(
                    label="Transcription",
                    lines=12,
                    interactive=False,
                    show_copy_button=True,
                )

            with gr.Column():
                gr.Markdown("### Modello fine-tuned ReportAId")
                finetuned_output = gr.Textbox(
                    label="Transcription",
                    lines=12,
                    interactive=False,
                    show_copy_button=True,
                )

        # Click event
        transcribe_btn.click(
            fn=transcribe_comparison,
            inputs=[audio_input],
            outputs=[original_output, finetuned_output],
            show_progress="full",  # Gradio 5 expects "full" / "minimal" / "hidden"
        )

    return demo


if __name__ == "__main__":
    # Configure logging
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    demo = create_interface()
    # Launch configuration for Hugging Face Spaces
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False,
        show_error=True,
        inbrowser=False,
        quiet=False,
    )
assets/ScribeAId.svg ADDED
requirements.txt ADDED
@@ -0,0 +1,14 @@
gradio==5.45.0
transformers>=4.41.0
accelerate>=0.31.0
torch>=2.1.0
torchaudio>=2.1.0
numpy>=1.24.0
requests>=2.31.0
soundfile>=0.12.0
librosa>=0.10.0
pydub>=0.25.0
psutil>=5.9.0
python-dotenv>=1.0.0
datasets>=2.14.0
huggingface-hub>=0.17.0