---
tags:
- text-classification
- scientific-abstract
- multi-label
- sentiment-analysis
- distilbert
datasets:
- SciTopicSentimentDataset
license: apache-2.0
---

# SciTopicSentimentClassifier

## 🔬 Overview

SciTopicSentimentClassifier is a **multi-label classification** model fine-tuned to predict both the **primary scientific topic** and the **underlying sentiment** (high-positive or low-negative) of a research paper's abstract. It is well suited to automated paper categorization, literature-review triage, and scientific trend analysis.

The model was trained on the SciTopicSentimentDataset (a proprietary dataset similar to the generated Dataset 1), which links each abstract to a set of predefined scientific topics and a sentiment label binarized from the dataset's original continuous sentiment score.
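
For illustration, here is a minimal sketch of how such a 12-dimensional multi-hot training target could be assembled from topic assignments and the continuous sentiment score. The 0.5 binarization threshold, the index layout (topics first, then sentiment), and the helper `make_target` are assumptions for this example, not the dataset's documented schema:

```python
# Illustrative only: build a 12-dim multi-hot target from topic indices and a
# continuous sentiment score. The threshold and index layout are assumptions.
NUM_TOPICS = 10
SENTIMENTS = ["High-Positive-Sentiment", "Low-Negative-Sentiment"]

def make_target(topic_ids, sentiment_score):
    target = [0.0] * (NUM_TOPICS + len(SENTIMENTS))
    for t in topic_ids:  # an abstract can carry several topics
        target[t] = 1.0
    # Binarize the continuous score: high-positive vs. low-negative
    target[NUM_TOPICS if sentiment_score >= 0.5 else NUM_TOPICS + 1] = 1.0
    return target

print(make_target([0, 3], 0.92))
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```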

## 🧠 Model Architecture

This model is an adaptation of **DistilBERT**, a smaller, faster, and lighter version of BERT.

* **Base Model:** `distilbert-base-uncased`
* **Modification:** A custom classification head is added on top of the DistilBERT encoder output (the final hidden state of the first, [CLS]-position token, since DistilBERT has no pooler), as sketched below.
* **Output Layer:** The final layer is a dense layer with **12 outputs** (10 scientific topics + 2 sentiment classes), followed by a sigmoid activation so that an abstract can belong to multiple topics/sentiments (multi-label prediction).
* **Input:** Tokenized abstract text (up to 512 tokens).
* **Task:** Multi-Label Text Classification.

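A minimal sketch of this architecture, assuming the head is a single linear layer over that first-token state (the class name, layer names, and dropout value here are illustrative, not the checkpoint's exact implementation):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SciTopicSentimentModel(nn.Module):
    """Sketch: DistilBERT encoder + 12-way multi-label classification head."""

    def __init__(self, num_labels: int = 12):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.dropout = nn.Dropout(0.1)  # illustrative value
        self.classifier = nn.Linear(self.encoder.config.dim, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = out.last_hidden_state[:, 0]  # first-token ([CLS]) state
        logits = self.classifier(self.dropout(cls_state))
        return torch.sigmoid(logits)  # independent per-label probabilities
```

For fine-tuning, a similar setup can be obtained by loading the checkpoint with `AutoModelForSequenceClassification` and `problem_type="multi_label_classification"`, which makes `transformers` apply `BCEWithLogitsLoss` when labels are passed.
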
## 🚀 Intended Use

* **Automated Labeling:** Automatically assign relevant topic tags to new scientific publication abstracts.
* **Research Triage:** Quickly filter papers by subject matter and by the perceived 'success' or 'novelty' signaled by the abstract's sentiment.
* **Scientific Landscape Mapping:** Analyze large corpora of papers to track emerging positive/negative trends in specific research areas.
* **Indexing Systems:** Integration into library or repository indexing services.

## ⚠️ Limitations

* **Topic Granularity:** The model is limited to the 10 predefined topics in its training set. It may perform poorly on highly niche or interdisciplinary topics outside this scope.
* **Sentiment Scope:** The sentiment is coarse-grained (high vs. low), derived from the abstract's language (e.g., words like "novel," "significant," "limitations," "challenges"). It does not capture nuanced, human-level emotional sentiment.
* **Language:** Trained exclusively on English abstracts.
* **Max Length:** Input texts longer than 512 tokens are truncated.

## 💻 Example Code

To use the model for prediction:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "your-username/SciTopicSentimentClassifier"  # Replace with actual HuggingFace path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample abstract
abstract = "We propose a novel architecture combining convolutional and recurrent neural networks for multi-modal data fusion, demonstrating significant performance gains in complex classification tasks, overcoming prior limitations."

# Preprocess the input (truncated to the 512-token limit)
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Apply a sigmoid for independent multi-label scores
probs = torch.sigmoid(logits)

# Keep every label whose probability exceeds the 0.5 threshold
labels = model.config.id2label
predictions = []
for i, prob in enumerate(probs[0]):
    if prob > 0.5:
        predictions.append(labels[i])

print(f"Abstract: {abstract[:80]}...")
print(f"Predicted Labels: {predictions}")
# Expected Output: ['Deep Learning/AI', 'High-Positive-Sentiment']
```
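
Note that the 0.5 decision threshold above is a common default for multi-label heads; it can be tuned per label on a validation set to trade precision against recall.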