---
title: Political News AI Service
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# Political News AI Service

AI-powered **multi-model processing service** for Korean political news: Summarization + Embedding + Topic Generation + Stance Analysis.

## 🌐 Overview

Unified batch processing service using ML models deployed on Hugging Face Spaces. Integrates with backend via HTTP API for asynchronous news processing.

**Deployed at**: https://zedwrkc-news-stance-detection.hf.space

### 📊 Current Performance (2025-12-04)

- **Summarization Improvements**: Enhanced quality and coverage ⭐ NEW!
  - **Anti-repetition**: `no_repeat_ngram_size=3`, `repetition_penalty=1.5` prevent word/sentence duplication
  - **Max token increase**: 300 → 512 tokens for richer summaries (~8-10 sentences)
  - **Token allocation fix**: `chunk_max = max_length // len(chunks)` ensures proper 512 token limit
  - **Chunk size reduction**: 2000 → 1000 chars prevents rear content loss
  - **Full article coverage**: Improved chunking strategy covers entire article (not just first part)
- **CUDA Error Fixed**: 100% success rate after tokenization fix ⭐
  - Changed from `tokenizer.encode()` to `tokenizer()` with safe parameters
  - Prevents CUDA index out-of-bounds during embedding lookup
  - Production verified: 55 articles, 0 errors (2025-11-29)
- **Articles Processed**: 1,000+ articles successfully processed
- **Average Processing Time**: 30-50 seconds per 5-article batch
- **Batch Size**: Optimized at 5 articles (configurable up to 50)
- **Cold Start Handling**: Automatic 60s warmup with 3 retries
- **Backend Integration**: ✅ FastAPI endpoints operational
- **Memory Management**: ✅ Enhanced memory leak fixes + psutil monitoring + pre-cleanup at >85% ⭐
- **Recommended Timeout**: 600s (for backend client)
- **OOM Prevention**: Auto gc.collect() per article + pre-processing cleanup
- **Backend Worker**: Sequential processing (concurrency=1) to prevent simultaneous requests ⭐

## ✨ Features

### Currently Deployed
- ✅ **Summary Generation**: KoBART model (gogamza/kobart-summarization) ⭐ IMPROVED!
  - **Safe tokenization**: `tokenizer()` with `return_tensors='pt'`, `truncation=True` ⭐
  - **Anti-repetition**: `no_repeat_ngram_size=3`, `repetition_penalty=1.5` ⭐ NEW!
  - **Improved output**: 150-512 token range (~8-10 sentences) ⭐ NEW!
  - **Dynamic chunking**: Smart splitting for long articles (prevents rear content loss) ⭐ NEW!
    - 600-1200 chars: 2 chunks (256 tokens each)
    - 1200-1800 chars: 3 chunks (170 tokens each)
    - 1800+ chars: 4 chunks (128 tokens each)
  - **Token-aware allocation**: Ensures total output ≤ 512 tokens ⭐ NEW!
  - Prevents CUDA index errors during embedding lookup
- ✅ **Embedding Generation**: ko-sroberta-multitask (768-dim) ⭐
  - **From Title + Summary** (not just summary)
  - Normalized for cosine similarity
  - Used for BERTopic clustering in backend
- ✅ **Batch Processing**: Up to 50 articles per request
- ✅ **Error Handling**: Partial failures supported
- ✅ **GPU/CPU Auto-detection**

### BERTopic Clustering (HF Spaces) ⭐
- ✅ **Improved BERTopic Clustering**: Noun-only tokenization with 6-word topic titles ⭐ NEW!
  - **ImprovedNounTokenizer**: Extracts only nouns (NNG, NNP, NNB, NR) using Mecab
  - **Optimized vectorizer**: ngram_range=(1,2), min_df=2, max_df=0.90
  - **6-word topic titles**: Consistent, detailed, no duplicates
  - **5x faster processing**: 28.6s → 5.6s per clustering task
  - Uses pre-computed embeddings from Backend DB
  - sklearn-based implementation (BERTopic 0.17.3)
  - **Full article clustering**: Processes ALL articles with embeddings (no limit) ⭐
  - **Integrated visualization**: DataMapPlot generated in same API call ⭐
  - **Coverage improvement**: 38.9% → 92.2% (2025-11-27) ⭐
  - **Keyword extraction**: Top 10 keywords per topic with c-TF-IDF scores ⭐
- ✅ **Cosine Similarity Calculation**: Real similarity scores (article ↔ topic centroid)
  - Calculated in HF Spaces using sklearn
  - Range: 0.33-0.93 (verified 2025-11-11)
  - Stored with topic centroids for ranking
- ✅ **Topic Ranking**: Article count-based ranking (verified 2025-11-14)
  - Top 10 topics automatically ranked 1-10
  - cluster_score = article_count
  - No DB triggers (manual management)
- ✅ **Visualization**: DataMapPlot with Korean font support (NanumGothic) ⭐
  - 1400x1400 px PNG image
  - Generated during clustering (prevents topic name mismatch)
  - Returned as base64-encoded string in API response

### In Development 🆕
- ⏳ **Topic Generation**: KeyBERT (keyword extraction)
  - Extract meaningful topic titles from clustered articles
  - API: `POST /generate-topics`
- ⏳ **Topic-based Stance Analysis**: Fine-tuned KoBERT
  - Classify article stance towards specific topics: support/neutral/oppose
  - Training dataset: GPT-5-mini few-shot labeling
  - API: `POST /analyze-stance-batch`

## 📊 API Endpoints

### GET /

Root endpoint with service information

**Response:**
```json
{
  "service": "Political News AI Service",
  "version": "1.0",
  "endpoints": {
    "health": "/health",
    "batch_process": "/batch-process-articles",
    "docs": "/docs"
  }
}
```

### GET /health

Health check with model status

**Response:**
```json
{
  "status": "healthy",
  "summarization_model": "gogamza/kobart-summarization",
  "embedding_model": "jhgan/ko-sroberta-multitask",
  "stance_model": null,
  "device": "cpu"
}
```

### POST /batch-process-articles

**Primary endpoint** - Unified 3-in-1 processing (Summary + Embedding + Stance)

**Request:**
```json
{
  "articles": [
    {
      "article_id": 1,
      "title": "정부, 부동산 규제 완화 방안 발표",
      "content": "정부가 오늘 부동산 규제 완화 방안을 발표했다..."
    },
    {
      "article_id": 2,
      "title": "야당, 정부 정책 비판",
      "content": "야당은 정부 정책에 대해 강하게 비판했다..."
    }
  ],
  "max_summary_length": 300,  // Optional, default: 300
  "min_summary_length": 150   // Optional, default: 150
}
```

**Response:**
```json
{
  "results": [
    {
      "article_id": 1,
      "summary": "정부가 부동산 규제를 완화하는 방안을 발표했다...",
      "embedding": [0.123, -0.456, 0.789, ...],  // 768-dim vector
      "stance": {
        "stance_label": "neutral",
        "prob_positive": 0.3,
        "prob_neutral": 0.5,
        "prob_negative": 0.2,
        "stance_score": 0.1
      },  // null if stance model not available
      "error": null
    },
    {
      "article_id": 2,
      "summary": "야당이 정부 정책을 강하게 비판했다...",
      "embedding": [0.234, -0.567, 0.890, ...],
      "stance": {
        "stance_label": "oppose",
        "prob_positive": 0.1,
        "prob_neutral": 0.2,
        "prob_negative": 0.7,
        "stance_score": -0.6
      },
      "error": null
    }
  ],
  "total_processed": 2,
  "successful": 2,
  "failed": 0,
  "processing_time_seconds": 3.5
}
```

**Processing Flow:**
1. Content → Summary (KoBART)
2. Title + Summary → Embedding (ko-sroberta-multitask, 768-dim) ⭐
3. Summary → Stance (if model available)
4. Return all results in single response

**Note**: Embeddings are generated from "Title + Summary" (not just summary) for consistency with BERTopic clustering in backend.

### POST /cluster-topics-mecab ⭐

**BERTopic clustering endpoint** with Mecab tokenizer and integrated visualization

**Request:**
```json
{
  "embeddings": [[0.123, -0.456, ...], ...],  // List of 768-dim embeddings
  "texts": ["정부, 부동산 규제 완화...", ...],  // List of "title. summary" texts
  "article_ids": [1, 2, 3, ...],  // List of article IDs
  "news_date": "2025-11-27",
  "min_topic_size": 5,  // Optional, default: 5
  "nr_topics": "auto",  // Optional, "auto" or integer
  "include_visualization": true,  // Optional, default: false ⭐
  "viz_dpi": 150,  // Optional, default: 150
  "viz_width": 1400,  // Optional, default: 1400
  "viz_height": 1400  // Optional, default: 1400
}
```

**Response:**
```json
{
  "success": true,
  "topics": [
    {
      "topic_id": 0,
      "topic_title": "정부 정책 발표",
      "article_count": 25,
      "article_ids": [1, 5, 12, ...],
      "centroid": [0.234, -0.567, ...],  // 768-dim centroid
      "similarity_scores": {"1": 0.85, "5": 0.78, ...},  // Article ID → similarity
      "topic_rank": 1,  // Rank 1-10 or null
      "cluster_score": 25  // Same as article_count
    }
  ],
  "total_topics": 8,
  "total_articles": 150,
  "outliers": 10,
  "news_date": "2025-11-27",
  "visualization": "iVBORw0KGgoAAAANSUhEUgAA..."  // Base64-encoded PNG if include_visualization=true ⭐
}
```

**Features:**
- **CustomTokenizer**: Regex-based Korean text processing
- **Real cosine similarity**: Article ↔ centroid similarity scores
- **Topic centroids**: Mean of article embeddings
- **Integrated visualization**: DataMapPlot generated in same call (prevents topic name mismatch) ⭐
- **Full article clustering**: Processes all articles provided (no internal limit) ⭐
- **Korean font support**: NanumGothic for visualization text

### POST /batch-summarize (Legacy)

Legacy endpoint for backward compatibility. Use `/batch-process-articles` instead.

## 🛠️ Tech Stack

- **Framework**: FastAPI 0.115.5 + Uvicorn
- **ML Framework**: HuggingFace Transformers 4.46.3 + PyTorch 2.5.1
- **Clustering**: BERTopic 0.17.3 + DataMapPlot 0.4.1 + matplotlib 3.9.3 ⭐
- **Models**:
  - Summarization: `gogamza/kobart-summarization`
  - Embedding: `jhgan/ko-sroberta-multitask`
  - BERTopic: sklearn-based with CustomTokenizer (regex) ⭐
  - Topic Generation: KeyBERT (TODO) 🆕
  - Stance: Fine-tuned KoBERT (Training in progress) 🆕
- **Training Dataset**: GPT-5-mini few-shot labeling 🆕
- **Deployment**: Hugging Face Spaces (Docker SDK, 16GB RAM)

### New API Endpoints (Coming Soon) 🆕

**POST /generate-topics** - KeyBERT topic generation

**POST /analyze-stance-batch** - Topic-based stance analysis

See [HF_SPACES_API_SPEC.md](./HF_SPACES_API_SPEC.md) for detailed API specifications.

## 📁 Project Structure

```
AI/
├── src/
│   ├── api/                    # FastAPI application
│   │   ├── main.py             # Main API app
│   │   └── schemas.py          # Pydantic models
│   │
│   ├── models/                 # ML models
│   │   ├── summarizer.py       # KoBART wrapper
│   │   └── embedder.py         # ko-sroberta wrapper
│   │
│   ├── services/               # BERTopic clustering ⭐
│   │   ├── bertopic_clustering_mecab.py  # BERTopic with CustomTokenizer
│   │   └── visualization.py              # DataMapPlot visualization (if separate)
│   │
│   └── utils/                  # Utilities
│       ├── config.py           # Configuration
│       ├── logger.py           # Logging setup
│       └── custom_tokenizer.py # Regex-based Korean tokenizer ⭐
│
├── app.py                      # HF Spaces entry point
├── test_api.py                 # API test script
├── Dockerfile                  # Docker configuration
├── start.sh                    # Startup script
├── requirements.txt            # Dependencies (includes BERTopic, DataMapPlot)
├── .env.example                # Environment template
├── README.md                   # This file
└── DEPLOYMENT.md               # Deployment guide
```

## 🚀 Local Development

### Prerequisites

- Python 3.10+
- CUDA (optional, for GPU)

### Setup

```bash
# 1. Navigate to AI directory
cd AI

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment (optional)
cp .env.example .env
# Edit .env if needed

# 5. Run development server
python app.py
# or
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 7860
```

### Test API

```bash
# Run test script
python test_api.py

# Or use curl
curl -X POST http://localhost:7860/batch-process-articles \
  -H "Content-Type: application/json" \
  -d '{
    "articles": [
      {
        "article_id": 1,
        "title": "정부, 부동산 규제 완화 방안 발표",
        "content": "정부가 오늘 부동산 규제 완화 방안을 발표했다..."
      }
    ]
  }'
```

## 🐳 Docker

### Build

```bash
docker build -t politics-news-ai .
```

### Run

```bash
docker run -p 7860:7860 politics-news-ai
```

## 🌐 Deployment (Hugging Face Spaces)

### Prerequisites

1. Hugging Face account
2. Git

### Steps

```bash
# 1. Create new Space on HF
#    - SDK: Docker
#    - Hardware: CPU Basic (or GPU if needed)

# 2. Clone Space repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME

# 3. Copy AI service files
cp -r ../AI/* .

# 4. Commit and push
git add .
git commit -m "Initial deployment"
git push
```

### Environment Variables (HF Spaces)

- `TRANSFORMERS_CACHE`: Model cache directory (auto-configured)
- `HF_HOME`: Hugging Face home directory (auto-configured)

## 📊 Performance

### Model Sizes

- KoBART: ~450MB
- ko-sroberta-multitask: ~440MB
- BERTopic + DataMapPlot: ~100MB ⭐
- Total: ~1GB RAM required

### Processing Speed

- **CPU**: ~6 seconds per article (summarization + embedding)
- **GPU (T4)**: ~1-2 seconds per article
- **Batch (50 articles, CPU)**: ~5 minutes
- **Improved BERTopic Clustering** (333 articles): ~5.6 seconds (5x faster!) ⭐ NEW!
- **Visualization Generation**: ~5-10 seconds (included in clustering) ⭐

### Optimization

- Model loaded once at startup
- Batch processing for efficiency
- CPU/GPU auto-detection
- Partial failure handling
- **CUDA Error Prevention** (2025-11-29) ⭐ NEW!
  - **Safe tokenization**: Use `tokenizer()` instead of `tokenizer.encode()`
  - **Parameters**: `return_tensors='pt'`, `truncation=True`, `add_special_tokens=True`, `max_length=1024`
  - **Prevents**: CUDA index out-of-bounds during embedding lookup
  - **Result**: 100% success rate (55 articles, 0 errors in production)
- **Memory Management** (2025-11-23) ⭐
  - psutil monitoring before/after processing
  - Automatic tensor cleanup (`del tensor`, `torch.cuda.empty_cache()`)
  - gc.collect() for garbage collection
  - Pre-processing cleanup when memory > 85%
  - Logger handler duplication prevention
  - Matplotlib figure cleanup (`plt.close()`) - critical for visualization ⭐
  - BERTopic model cleanup after clustering (del embeddings_array, topic_model) ⭐
  - Prevents OOM errors on long-running instances
- **BERTopic Performance** (2025-11-28) ⭐
  - **Improved clustering**: ImprovedNounTokenizer with noun-only extraction (NNG, NNP, NNB, NR)
  - **6-word topic titles**: Consistent, detailed, no duplicates
  - **5x faster processing**: 28.6s → 5.6s (optimized vectorizer)
  - Full article clustering (no limit, processes all articles with embeddings)
  - Integrated visualization (prevents duplicate clustering, saves API calls)
  - Real cosine similarity calculation (article ↔ centroid)

## 🔧 Configuration

### Environment Variables

```bash
# Optional
PORT=7860                    # API port
LOG_LEVEL=INFO              # Logging level
MAX_SUMMARY_LENGTH=300      # Default max summary length
MIN_SUMMARY_LENGTH=150      # Default min summary length
```

## 🧪 Testing

### Unit Tests

```bash
pytest tests/  # TODO: Add tests
```

### Integration Test

```bash
# Test with backend
python test_api.py
```

### Manual Test

```bash
# Health check
curl http://localhost:7860/health

# Process single article
curl -X POST http://localhost:7860/batch-process-articles \
  -H "Content-Type: application/json" \
  -d '{
    "articles": [{"article_id": 1, "title": "테스트 기사 제목", "content": "테스트 기사 내용..."}]
  }'
```

## 🐛 Troubleshooting

### Model Download Issues

```bash
# Clear cache
rm -rf ~/.cache/huggingface

# Re-download models
python app.py
```

### Memory Issues

```bash
# Memory is now auto-managed with comprehensive cleanup (2025-11-23)
# If issues persist:
# 1. Reduce batch size (in backend config)
# 2. Check psutil logs for memory usage patterns
# 3. Upgrade to GPU instance if needed

# Memory cleanup is automatic via:
# - Logger handler duplication prevention
# - del tensor after each summarization
# - torch.cuda.empty_cache() after encoding
# - gc.collect() after batch processing
# - plt.close() for matplotlib figures
# - BERTopic model cleanup after clustering
# - Pre-processing cleanup when memory > 85%

# Note: Backend Worker uses concurrency=1 to prevent
# simultaneous HF Spaces requests (sequential processing)
```

### Permission Issues (HF Spaces)

```bash
# Ensure cache directories have write permissions
# (handled automatically in Dockerfile)
```

## 📚 Model Details

### KoBART (Summarization)

- **Model**: gogamza/kobart-summarization
- **Type**: Encoder-Decoder Transformer
- **Language**: Korean
- **Task**: Abstractive summarization
- **Output**: 3-5 sentences (dynamic length)

### ko-sroberta-multitask (Embedding)

- **Model**: jhgan/ko-sroberta-multitask
- **Type**: Sentence-BERT
- **Language**: Korean
- **Output**: 768-dimensional embeddings
- **Normalized**: For cosine similarity

### Stance Model (TODO)

- **Status**: Under development (ML engineer)
- **Task**: 3-class classification (support/neutral/oppose)
- **Input**: Title + Summary
- **Output**: Probabilities + label + score

## 🔗 Links

- **Deployed Service**: https://zedwrkc-news-stance-detection.hf.space
- **API Docs**: https://zedwrkc-news-stance-detection.hf.space/docs
- **Backend Repo**: (link to backend repository)

## 📄 License

MIT

## 👥 Contributors

- Backend Developer: Integration & deployment
- ML Engineer: Stance model development
- Frontend Developer: Consumer of API results

---

**Status**: ✅ Deployed | 🚀 Production Ready