---
title: Political News AI Service
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# Political News AI Service

AI-powered **multi-model processing service** for Korean political news: Summarization + Embedding + Topic Generation + Stance Analysis.

## 🌐 Overview

Unified batch processing service using ML models deployed on Hugging Face Spaces. It integrates with the backend via an HTTP API for asynchronous news processing.

**Deployed at**: https://zedwrkc-news-stance-detection.hf.space

### πŸ“Š Current Performance (2025-12-04)

- **Summarization Improvements**: Enhanced quality and coverage ⭐ NEW!
  - **Anti-repetition**: `no_repeat_ngram_size=3`, `repetition_penalty=1.5` prevent word/sentence duplication
  - **Max token increase**: 300 β†’ 512 tokens for richer summaries (~8-10 sentences)
  - **Token allocation fix**: `chunk_max = max_length // len(chunks)` keeps total output within the 512-token limit
  - **Chunk size reduction**: 2000 β†’ 1000 chars prevents loss of content at the end of articles
  - **Full article coverage**: Improved chunking strategy covers the entire article (not just the first part)
- **CUDA Error Fixed**: 100% success rate after the tokenization fix ⭐
  - Changed from `tokenizer.encode()` to `tokenizer()` with safe parameters
  - Prevents CUDA index out-of-bounds errors during embedding lookup
  - Production verified: 55 articles, 0 errors (2025-11-29)
- **Articles Processed**: 1,000+ articles successfully processed
- **Average Processing Time**: 30-50 seconds per 5-article batch
- **Batch Size**: Optimized at 5 articles (configurable up to 50)
- **Cold Start Handling**: Automatic 60s warmup with 3 retries
- **Backend Integration**: βœ… FastAPI endpoints operational
- **Memory Management**: βœ… Enhanced memory-leak fixes + psutil monitoring + pre-cleanup at >85% memory usage ⭐
- **Recommended Timeout**: 600s (for the backend client)
- **OOM Prevention**: Automatic `gc.collect()` per article + pre-processing cleanup
- **Backend Worker**: Sequential processing (concurrency=1) to prevent simultaneous requests ⭐

## ✨ Features

### Currently Deployed

- βœ… **Summary Generation**: KoBART model (gogamza/kobart-summarization) ⭐ IMPROVED! (see the sketch after this list)
  - **Safe tokenization**: `tokenizer()` with `return_tensors='pt'`, `truncation=True` ⭐
  - **Anti-repetition**: `no_repeat_ngram_size=3`, `repetition_penalty=1.5` ⭐ NEW!
  - **Improved output**: 150-512 token range (~8-10 sentences) ⭐ NEW!
  - **Dynamic chunking**: Smart splitting for long articles (prevents loss of content at the end) ⭐ NEW!
    - 600-1200 chars: 2 chunks (256 tokens each)
    - 1200-1800 chars: 3 chunks (170 tokens each)
    - 1800+ chars: 4 chunks (128 tokens each)
  - **Token-aware allocation**: Ensures total output ≀ 512 tokens ⭐ NEW!
  - Prevents CUDA index errors during embedding lookup
- βœ… **Embedding Generation**: ko-sroberta-multitask (768-dim) ⭐
  - **From Title + Summary** (not just the summary)
  - Normalized for cosine similarity
  - Used for BERTopic clustering in the backend
- βœ… **Batch Processing**: Up to 50 articles per request
- βœ… **Error Handling**: Partial failures supported
- βœ… **GPU/CPU Auto-detection**
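For illustration, here is a minimal sketch of the summarization and embedding path under the settings listed above, assuming `transformers` and `sentence-transformers`. Function and variable names are illustrative, not the service's actual module layout, and dynamic chunking of long articles is omitted for brevity.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

summarizer_tok = AutoTokenizer.from_pretrained("gogamza/kobart-summarization")
summarizer = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-summarization").to(DEVICE)
embedder = SentenceTransformer("jhgan/ko-sroberta-multitask", device=DEVICE)


def summarize(text: str, max_length: int = 512, min_length: int = 150) -> str:
    # Safe tokenization: tokenizer(...) with truncation keeps token ids inside the
    # model's vocabulary, avoiding CUDA index out-of-bounds errors during embedding lookup.
    inputs = summarizer_tok(
        text,
        return_tensors="pt",
        truncation=True,
        add_special_tokens=True,
        max_length=1024,
    ).to(DEVICE)

    # Anti-repetition settings described above; beam-search values are illustrative.
    output_ids = summarizer.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=min_length,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
        num_beams=4,
        early_stopping=True,
    )
    return summarizer_tok.decode(output_ids[0], skip_special_tokens=True)


def embed(title: str, summary: str) -> list[float]:
    # Embeddings come from "Title + Summary", normalized for cosine similarity.
    vector = embedder.encode(f"{title}. {summary}", normalize_embeddings=True)
    return vector.tolist()  # 768-dim
```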
### BERTopic Clustering (HF Spaces) ⭐

- βœ… **Improved BERTopic Clustering**: Noun-only tokenization with 6-word topic titles ⭐ NEW!
  - **ImprovedNounTokenizer**: Extracts only nouns (NNG, NNP, NNB, NR) using Mecab
  - **Optimized vectorizer**: ngram_range=(1,2), min_df=2, max_df=0.90
  - **6-word topic titles**: Consistent, detailed, no duplicates
  - **5x faster processing**: 28.6s β†’ 5.6s per clustering task
  - Uses pre-computed embeddings from the backend DB
  - sklearn-based implementation (BERTopic 0.17.3)
  - **Full article clustering**: Processes ALL articles with embeddings (no limit) ⭐
  - **Integrated visualization**: DataMapPlot generated in the same API call ⭐
  - **Coverage improvement**: 38.9% β†’ 92.2% (2025-11-27) ⭐
  - **Keyword extraction**: Top 10 keywords per topic with c-TF-IDF scores ⭐
- βœ… **Cosine Similarity Calculation**: Real similarity scores (article ↔ topic centroid)
  - Calculated in HF Spaces using sklearn
  - Range: 0.33-0.93 (verified 2025-11-11)
  - Stored with topic centroids for ranking
- βœ… **Topic Ranking**: Article-count-based ranking (verified 2025-11-14)
  - Top 10 topics automatically ranked 1-10
  - cluster_score = article_count
  - No DB triggers (manual management)
- βœ… **Visualization**: DataMapPlot with Korean font support (NanumGothic) ⭐
  - 1400x1400 px PNG image
  - Generated during clustering (prevents topic-name mismatch)
  - Returned as a base64-encoded string in the API response

### In Development πŸ†•

- ⏳ **Topic Generation**: KeyBERT (keyword extraction)
  - Extract meaningful topic titles from clustered articles
  - API: `POST /generate-topics`
- ⏳ **Topic-based Stance Analysis**: Fine-tuned KoBERT
  - Classify article stance towards specific topics: support/neutral/oppose
  - Training dataset: GPT-5-mini few-shot labeling
  - API: `POST /analyze-stance-batch`

## πŸ“Š API Endpoints

### GET /

Root endpoint with service information.

**Response:**
```json
{
  "service": "Political News AI Service",
  "version": "1.0",
  "endpoints": {
    "health": "/health",
    "batch_process": "/batch-process-articles",
    "docs": "/docs"
  }
}
```

### GET /health

Health check with model status.

**Response:**
```json
{
  "status": "healthy",
  "summarization_model": "gogamza/kobart-summarization",
  "embedding_model": "jhgan/ko-sroberta-multitask",
  "stance_model": null,
  "device": "cpu"
}
```

### POST /batch-process-articles

**Primary endpoint** - Unified 3-in-1 processing (Summary + Embedding + Stance)

**Request:**
```json
{
  "articles": [
    {
      "article_id": 1,
      "title": "μ •λΆ€, 뢀동산 규제 μ™„ν™” λ°©μ•ˆ λ°œν‘œ",
      "content": "μ •λΆ€κ°€ 였늘 뢀동산 규제 μ™„ν™” λ°©μ•ˆμ„ λ°œν‘œν–ˆλ‹€..."
    },
    {
      "article_id": 2,
      "title": "μ•Όλ‹Ή, μ •λΆ€ μ •μ±… λΉ„νŒ",
      "content": "야당은 μ •λΆ€ 정책에 λŒ€ν•΄ κ°•ν•˜κ²Œ λΉ„νŒν–ˆλ‹€..."
    }
  ],
  "max_summary_length": 300,  // Optional, default: 300
  "min_summary_length": 150   // Optional, default: 150
}
```

**Response:**
```json
{
  "results": [
    {
      "article_id": 1,
      "summary": "μ •λΆ€κ°€ 뢀동산 규제λ₯Ό μ™„ν™”ν•˜λŠ” λ°©μ•ˆμ„ λ°œν‘œν–ˆλ‹€...",
      "embedding": [0.123, -0.456, 0.789, ...],  // 768-dim vector
      "stance": {
        "stance_label": "neutral",
        "prob_positive": 0.3,
        "prob_neutral": 0.5,
        "prob_negative": 0.2,
        "stance_score": 0.1
      },  // null if stance model not available
      "error": null
    },
    {
      "article_id": 2,
      "summary": "야당이 μ •λΆ€ 정책을 κ°•ν•˜κ²Œ λΉ„νŒν–ˆλ‹€...",
      "embedding": [0.234, -0.567, 0.890, ...],
      "stance": {
        "stance_label": "oppose",
        "prob_positive": 0.1,
        "prob_neutral": 0.2,
        "prob_negative": 0.7,
        "stance_score": -0.6
      },
      "error": null
    }
  ],
  "total_processed": 2,
  "successful": 2,
  "failed": 0,
  "processing_time_seconds": 3.5
}
```

**Processing Flow:**
1. Content β†’ Summary (KoBART)
2. Title + Summary β†’ Embedding (ko-sroberta-multitask, 768-dim) ⭐
3. Summary β†’ Stance (if model available)
4. Return all results in a single response

**Note**: Embeddings are generated from "Title + Summary" (not just the summary) for consistency with BERTopic clustering in the backend.
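For backend integration, a minimal client-side sketch of calling this endpoint with the `requests` library, using the request shape above and the 600-second timeout recommended in the performance notes; the error handling shown is illustrative.

```python
import requests

API_URL = "https://zedwrkc-news-stance-detection.hf.space"

payload = {
    "articles": [
        {
            "article_id": 1,
            "title": "μ •λΆ€, 뢀동산 규제 μ™„ν™” λ°©μ•ˆ λ°œν‘œ",
            "content": "μ •λΆ€κ°€ 였늘 뢀동산 규제 μ™„ν™” λ°©μ•ˆμ„ λ°œν‘œν–ˆλ‹€...",
        }
    ],
    "max_summary_length": 300,
    "min_summary_length": 150,
}

# 600s timeout covers cold starts and larger batches, as recommended above.
response = requests.post(f"{API_URL}/batch-process-articles", json=payload, timeout=600)
response.raise_for_status()

for result in response.json()["results"]:
    if result["error"] is None:
        print(result["article_id"], result["summary"][:50], len(result["embedding"]))
```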
### POST /cluster-topics-mecab ⭐

**BERTopic clustering endpoint** with Mecab tokenizer and integrated visualization

**Request:**
```json
{
  "embeddings": [[0.123, -0.456, ...], ...],   // List of 768-dim embeddings
  "texts": ["μ •λΆ€, 뢀동산 규제 μ™„ν™”...", ...],  // List of "title. summary" texts
  "article_ids": [1, 2, 3, ...],               // List of article IDs
  "news_date": "2025-11-27",
  "min_topic_size": 5,            // Optional, default: 5
  "nr_topics": "auto",            // Optional, "auto" or integer
  "include_visualization": true,  // Optional, default: false ⭐
  "viz_dpi": 150,                 // Optional, default: 150
  "viz_width": 1400,              // Optional, default: 1400
  "viz_height": 1400              // Optional, default: 1400
}
```

**Response:**
```json
{
  "success": true,
  "topics": [
    {
      "topic_id": 0,
      "topic_title": "μ •λΆ€ μ •μ±… λ°œν‘œ",
      "article_count": 25,
      "article_ids": [1, 5, 12, ...],
      "centroid": [0.234, -0.567, ...],                  // 768-dim centroid
      "similarity_scores": {"1": 0.85, "5": 0.78, ...},  // Article ID β†’ similarity
      "topic_rank": 1,      // Rank 1-10 or null
      "cluster_score": 25   // Same as article_count
    }
  ],
  "total_topics": 8,
  "total_articles": 150,
  "outliers": 10,
  "news_date": "2025-11-27",
  "visualization": "iVBORw0KGgoAAAANSUhEUgAA..."  // Base64-encoded PNG if include_visualization=true ⭐
}
```

**Features:**
- **CustomTokenizer**: Regex-based Korean text processing
- **Real cosine similarity**: Article ↔ centroid similarity scores
- **Topic centroids**: Mean of article embeddings
- **Integrated visualization**: DataMapPlot generated in the same call (prevents topic-name mismatch) ⭐
- **Full article clustering**: Processes all articles provided (no internal limit) ⭐
- **Korean font support**: NanumGothic for visualization text

A simplified Python sketch of this clustering flow appears after the Tech Stack section below.

### POST /batch-summarize (Legacy)

Legacy endpoint kept for backward compatibility. Use `/batch-process-articles` instead.

## πŸ› οΈ Tech Stack

- **Framework**: FastAPI 0.115.5 + Uvicorn
- **ML Framework**: HuggingFace Transformers 4.46.3 + PyTorch 2.5.1
- **Clustering**: BERTopic 0.17.3 + DataMapPlot 0.4.1 + matplotlib 3.9.3 ⭐
- **Models**:
  - Summarization: `gogamza/kobart-summarization`
  - Embedding: `jhgan/ko-sroberta-multitask`
  - BERTopic: sklearn-based with CustomTokenizer (regex) ⭐
  - Topic Generation: KeyBERT (TODO) πŸ†•
  - Stance: Fine-tuned KoBERT (training in progress) πŸ†•
- **Training Dataset**: GPT-5-mini few-shot labeling πŸ†•
- **Deployment**: Hugging Face Spaces (Docker SDK, 16GB RAM)

### New API Endpoints (Coming Soon) πŸ†•

**POST /generate-topics** - KeyBERT topic generation

**POST /analyze-stance-batch** - Topic-based stance analysis

See [HF_SPACES_API_SPEC.md](./HF_SPACES_API_SPEC.md) for detailed API specifications.
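For reference, here is a simplified sketch of the sklearn-based BERTopic flow behind `/cluster-topics-mecab`: pre-computed embeddings, the vectorizer settings listed above, topic centroids as the mean of member embeddings, article ↔ centroid cosine similarity, and article-count ranking. The whitespace tokenizer is a stand-in for the Mecab noun tokenizer, and all names are illustrative rather than the actual implementation.

```python
import numpy as np
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def simple_tokenizer(text: str) -> list[str]:
    # Stand-in for the Mecab-based noun tokenizer (NNG, NNP, NNB, NR).
    return [tok for tok in text.split() if len(tok) > 1]


def cluster_topics(texts: list[str], embeddings: list[list[float]], article_ids: list[int]):
    emb = np.asarray(embeddings, dtype=np.float32)

    vectorizer = CountVectorizer(
        tokenizer=simple_tokenizer, ngram_range=(1, 2), min_df=2, max_df=0.90
    )
    topic_model = BERTopic(vectorizer_model=vectorizer, min_topic_size=5, nr_topics="auto")

    # Cluster with pre-computed embeddings instead of re-encoding the texts.
    assignments, _ = topic_model.fit_transform(texts, embeddings=emb)

    topics = []
    for topic_id in set(assignments):
        if topic_id == -1:  # -1 marks outliers in BERTopic
            continue
        member_idx = [i for i, t in enumerate(assignments) if t == topic_id]
        centroid = emb[member_idx].mean(axis=0)
        sims = cosine_similarity(emb[member_idx], centroid.reshape(1, -1)).ravel()
        topics.append({
            "topic_id": topic_id,
            "article_ids": [article_ids[i] for i in member_idx],
            "article_count": len(member_idx),
            "centroid": centroid.tolist(),
            "similarity_scores": {str(article_ids[i]): float(s) for i, s in zip(member_idx, sims)},
            "keywords": [w for w, _ in topic_model.get_topic(topic_id)][:10],  # c-TF-IDF keywords
        })

    # Rank top 10 topics by article count (cluster_score = article_count).
    topics.sort(key=lambda t: t["article_count"], reverse=True)
    for rank, topic in enumerate(topics[:10], start=1):
        topic["topic_rank"] = rank
    return topics
```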
## πŸ“ Project Structure ``` AI/ β”œβ”€β”€ src/ β”‚ β”œβ”€β”€ api/ # FastAPI application β”‚ β”‚ β”œβ”€β”€ main.py # Main API app β”‚ β”‚ └── schemas.py # Pydantic models β”‚ β”‚ β”‚ β”œβ”€β”€ models/ # ML models β”‚ β”‚ β”œβ”€β”€ summarizer.py # KoBART wrapper β”‚ β”‚ └── embedder.py # ko-sroberta wrapper β”‚ β”‚ β”‚ β”œβ”€β”€ services/ # BERTopic clustering ⭐ β”‚ β”‚ β”œβ”€β”€ bertopic_clustering_mecab.py # BERTopic with CustomTokenizer β”‚ β”‚ └── visualization.py # DataMapPlot visualization (if separate) β”‚ β”‚ β”‚ └── utils/ # Utilities β”‚ β”œβ”€β”€ config.py # Configuration β”‚ β”œβ”€β”€ logger.py # Logging setup β”‚ └── custom_tokenizer.py # Regex-based Korean tokenizer ⭐ β”‚ β”œβ”€β”€ app.py # HF Spaces entry point β”œβ”€β”€ test_api.py # API test script β”œβ”€β”€ Dockerfile # Docker configuration β”œβ”€β”€ start.sh # Startup script β”œβ”€β”€ requirements.txt # Dependencies (includes BERTopic, DataMapPlot) β”œβ”€β”€ .env.example # Environment template β”œβ”€β”€ README.md # This file └── DEPLOYMENT.md # Deployment guide ``` ## πŸš€ Local Development ### Prerequisites - Python 3.10+ - CUDA (optional, for GPU) ### Setup ```bash # 1. Navigate to AI directory cd AI # 2. Create virtual environment python -m venv venv source venv/bin/activate # Linux/Mac # or venv\Scripts\activate # Windows # 3. Install dependencies pip install -r requirements.txt # 4. Configure environment (optional) cp .env.example .env # Edit .env if needed # 5. Run development server python app.py # or uvicorn src.api.main:app --reload --host 0.0.0.0 --port 7860 ``` ### Test API ```bash # Run test script python test_api.py # Or use curl curl -X POST http://localhost:7860/batch-process-articles \ -H "Content-Type: application/json" \ -d '{ "articles": [ { "article_id": 1, "title": "μ •λΆ€, 뢀동산 규제 μ™„ν™” λ°©μ•ˆ λ°œν‘œ", "content": "μ •λΆ€κ°€ 였늘 뢀동산 규제 μ™„ν™” λ°©μ•ˆμ„ λ°œν‘œν–ˆλ‹€..." } ] }' ``` ## 🐳 Docker ### Build ```bash docker build -t politics-news-ai . ``` ### Run ```bash docker run -p 7860:7860 politics-news-ai ``` ## 🌐 Deployment (Hugging Face Spaces) ### Prerequisites 1. Hugging Face account 2. Git ### Steps ```bash # 1. Create new Space on HF # - SDK: Docker # - Hardware: CPU Basic (or GPU if needed) # 2. Clone Space repository git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME cd YOUR_SPACE_NAME # 3. Copy AI service files cp -r ../AI/* . # 4. Commit and push git add . git commit -m "Initial deployment" git push ``` ### Environment Variables (HF Spaces) - `TRANSFORMERS_CACHE`: Model cache directory (auto-configured) - `HF_HOME`: Hugging Face home directory (auto-configured) ## πŸ“Š Performance ### Model Sizes - KoBART: ~450MB - ko-sroberta-multitask: ~440MB - BERTopic + DataMapPlot: ~100MB ⭐ - Total: ~1GB RAM required ### Processing Speed - **CPU**: ~6 seconds per article (summarization + embedding) - **GPU (T4)**: ~1-2 seconds per article - **Batch (50 articles, CPU)**: ~5 minutes - **Improved BERTopic Clustering** (333 articles): ~5.6 seconds (5x faster!) ⭐ NEW! - **Visualization Generation**: ~5-10 seconds (included in clustering) ⭐ ### Optimization - Model loaded once at startup - Batch processing for efficiency - CPU/GPU auto-detection - Partial failure handling - **CUDA Error Prevention** (2025-11-29) ⭐ NEW! 
### Optimization

- Model loaded once at startup
- Batch processing for efficiency
- CPU/GPU auto-detection
- Partial failure handling
- **CUDA Error Prevention** (2025-11-29) ⭐ NEW!
  - **Safe tokenization**: Use `tokenizer()` instead of `tokenizer.encode()`
  - **Parameters**: `return_tensors='pt'`, `truncation=True`, `add_special_tokens=True`, `max_length=1024`
  - **Prevents**: CUDA index out-of-bounds during embedding lookup
  - **Result**: 100% success rate (55 articles, 0 errors in production)
- **Memory Management** (2025-11-23) ⭐
  - psutil monitoring before/after processing
  - Automatic tensor cleanup (`del tensor`, `torch.cuda.empty_cache()`)
  - `gc.collect()` for garbage collection
  - Pre-processing cleanup when memory > 85%
  - Logger handler duplication prevention
  - Matplotlib figure cleanup (`plt.close()`) - critical for visualization ⭐
  - BERTopic model cleanup after clustering (`del embeddings_array, topic_model`) ⭐
  - Prevents OOM errors on long-running instances
- **BERTopic Performance** (2025-11-28) ⭐
  - **Improved clustering**: ImprovedNounTokenizer with noun-only extraction (NNG, NNP, NNB, NR)
  - **6-word topic titles**: Consistent, detailed, no duplicates
  - **5x faster processing**: 28.6s β†’ 5.6s (optimized vectorizer)
  - Full article clustering (no limit, processes all articles with embeddings)
  - Integrated visualization (prevents duplicate clustering, saves API calls)
  - Real cosine similarity calculation (article ↔ centroid)

## πŸ”§ Configuration

### Environment Variables

```bash
# Optional
PORT=7860               # API port
LOG_LEVEL=INFO          # Logging level
MAX_SUMMARY_LENGTH=300  # Default max summary length
MIN_SUMMARY_LENGTH=150  # Default min summary length
```

## πŸ§ͺ Testing

### Unit Tests

```bash
pytest tests/  # TODO: Add tests
```

### Integration Test

```bash
# Test with backend
python test_api.py
```

### Manual Test

```bash
# Health check
curl http://localhost:7860/health

# Process single article
curl -X POST http://localhost:7860/batch-process-articles \
  -H "Content-Type: application/json" \
  -d '{
    "articles": [{"article_id": 1, "title": "ν…ŒμŠ€νŠΈ 기사 제λͺ©", "content": "ν…ŒμŠ€νŠΈ 기사 λ‚΄μš©..."}]
  }'
```

## πŸ› Troubleshooting

### Model Download Issues

```bash
# Clear cache
rm -rf ~/.cache/huggingface

# Re-download models
python app.py
```

### Memory Issues

```bash
# Memory is now auto-managed with comprehensive cleanup (2025-11-23)
# If issues persist:
# 1. Reduce batch size (in backend config)
# 2. Check psutil logs for memory usage patterns
# 3. Upgrade to GPU instance if needed

# Memory cleanup is automatic via:
# - Logger handler duplication prevention
# - del tensor after each summarization
# - torch.cuda.empty_cache() after encoding
# - gc.collect() after batch processing
# - plt.close() for matplotlib figures
# - BERTopic model cleanup after clustering
# - Pre-processing cleanup when memory > 85%

# Note: Backend Worker uses concurrency=1 to prevent
# simultaneous HF Spaces requests (sequential processing)
```
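A minimal sketch of the automatic cleanup pattern described above (psutil check, `gc.collect()`, CUDA cache clear, matplotlib figure close); the helper name and threshold handling are illustrative, not the service's exact code.

```python
import gc

import matplotlib.pyplot as plt
import psutil
import torch


def cleanup_memory(threshold_pct: float = 85.0) -> float:
    """Run the cleanup steps listed above; returns current memory usage in percent."""
    usage = psutil.virtual_memory().percent
    if usage > threshold_pct:
        plt.close("all")              # drop any lingering matplotlib figures
        gc.collect()                  # force garbage collection
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU memory
        usage = psutil.virtual_memory().percent
    return usage
```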
### Permission Issues (HF Spaces)

```bash
# Ensure cache directories have write permissions
# (handled automatically in Dockerfile)
```

## πŸ“š Model Details

### KoBART (Summarization)

- **Model**: gogamza/kobart-summarization
- **Type**: Encoder-decoder Transformer
- **Language**: Korean
- **Task**: Abstractive summarization
- **Output**: 3-5 sentences (dynamic length)

### ko-sroberta-multitask (Embedding)

- **Model**: jhgan/ko-sroberta-multitask
- **Type**: Sentence-BERT
- **Language**: Korean
- **Output**: 768-dimensional embeddings
- **Normalized**: For cosine similarity

### Stance Model (TODO)

- **Status**: Under development (ML engineer)
- **Task**: 3-class classification (support/neutral/oppose)
- **Input**: Title + Summary
- **Output**: Probabilities + label + score

## πŸ”— Links

- **Deployed Service**: https://zedwrkc-news-stance-detection.hf.space
- **API Docs**: https://zedwrkc-news-stance-detection.hf.space/docs
- **Backend Repo**: (link to backend repository)

## πŸ“„ License

MIT

## πŸ‘₯ Contributors

- Backend Developer: Integration & deployment
- ML Engineer: Stance model development
- Frontend Developer: Consumer of API results

---

**Status**: βœ… Deployed | πŸš€ Production Ready