---
tags:
- text-generation
- reasoning
- coding
- mathematics
- quantization
license: apache-2.0
datasets:
- synthetic
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
language:
- en
- hi
library_name: transformers
pipeline_tag: text-generation
---

# Alpie-Core: 4-bit Quantized Reasoning Model

---

<p align="center">
  <img src="./Frame%202018777151.png" alt="Alpie-Core Architecture" width="700"/>
</p>

*[Space reserved for blog paper, technical report links]*

---

## 1. Introduction

Alpie-Core is one of the first fine-tuned 4-bit reasoning models, demonstrating that aggressive quantization can match or surpass full-precision baselines in reasoning, mathematics, and coding. By combining quantization-aware training with synthetic STEM-rich datasets, Alpie-Core achieves frontier-level reasoning while remaining practical for real-world deployment at scale.
## 2. Model Summary |
|
|
|
|
|
- **Base Architecture**: DeepSeek-R1-Distill-Qwen-32B |
|
|
- **Parameters**: 32 billion (quantized to 4-bit) |
|
|
- **Training Method**: Supervised Fine-Tuning (SFT) using LoRA/QLoRA techniques |
|
|
- **Quantization**: 4-bit NF4 with double quantization |
|
|
- **Context Length**: 65,536 tokens |
|
|
- **Max Output Length**: 16,384 tokens |
|
|
- **License**: Apache 2.0 |
|
|
|
|
|
|
|
|
## 3. Approach |
|
|
|
|
|
**Alpie-Core** has undergone extensive **supervised fine-tuning (SFT)** to strengthen reasoning, robustness, and safety. The training leveraged a diverse mixture of curated open-source datasets and proprietary synthetic data, optimized with high-quality LLM-generated responses. The fine-tuning process emphasized adherence to rigorous safety and usability standards, including: |
|
|
|
|
|
1)**User Understanding and Clarity** – ensuring outputs are direct, interpretable, and pedagogically sound. |
|
|
|
|
|
2)**Security and Ethical Guidelines** – filtering unsafe or harmful generations during and after training. |
|
|
|
|
|
3)**Limitations, Disclaimers, and Knowledge Boundaries** – transparently communicating uncertainty and scope. |
|
|
|
|
|
4)**Handling Complex and Sensitive Topics** – balancing informativeness with responsible guardrails. |
|
|
|
|
|
5)**Safety and Respectful Engagement** – maintaining politeness, inclusivity, and cultural sensitivity. |
|
|
|
|
|
6)**Confidentiality and Responsible Use** – preventing leakage of private training data, proprietary prompts, or internal reasoning traces. |
|
|
|
|
|
This SFT approach enables Alpie-Core to deliver reliable, aligned, and context-aware responses while maintaining safety across a broad range of use cases. |
|
|
|
|

## 4. Model Features

1. **Supports Streaming** – Real-time token-level responses
2. **OpenAI-Compatible API** – Seamless integration with OpenAI client libraries (see the client sketch after this list)
3. **65K Context Length** – Handles very large inputs and conversations
4. **16,384 Max Output Length** – Enables extremely long generations
5. **4-Bit Quantization** – Memory-efficient and optimized for deployment
6. **High-Throughput Inference** – Powered by vLLM for efficient large-scale serving
7. **Low-Latency Inference** – Fast response times optimized for production
8. **Customizable Safety & Moderation Filters** – Built-in guardrails for safer outputs
9. **Supports Function Calling / Tool Use** – Enables structured outputs and external API integration
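
Because the model is exposed through an OpenAI-compatible API, the standard `openai` Python client can stream responses directly. A minimal sketch, assuming a local server at `http://localhost:8000/v1` and the served model name `169Pi/Alpie-core` (both placeholders for your own deployment):

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your deployment's values
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request a streamed chat completion and print tokens as they arrive
stream = client.chat.completions.create(
    model="169Pi/Alpie-core",  # placeholder served-model name
    messages=[{"role": "user", "content": "Explain NF4 quantization in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```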

## 5. Key Highlights

1. **Frontier Performance in 4-bit**: 81.28% MMLU, 92.75% GSM8K, 57.8% SWE-Bench Verified
2. **STEM + Coding Excellence**: Outperforms full-precision peers in mathematics and programming
3. **Enhanced Content Access**: Provides factual responses to geopolitically sensitive topics
4. **Quantization Efficiency**: The 4-bit quantized variant achieves competitive performance retention compared to full-precision models, demonstrating that aggressive quantization can preserve task accuracy while substantially reducing hardware requirements.
5. **Benchmark Competitiveness**: Across more than ten standard evaluation benchmarks, the model performs on par with or better than larger 70B+ parameter systems, highlighting the effectiveness of our training and optimization strategies.
6. **Environmental Benefits**: Through quantization and efficiency-focused design, the model requires significantly fewer computational resources, translating into lower energy consumption and a reduced carbon footprint relative to full-precision deployments.
## 6. Benchmark Results

| Benchmark | Alpie-Core (32B-4bit) | DeepSeek-V2 (236B) | Qwen2.5 72B | Llama 3.1 405B | Llama 3.1 70B | Gemma-3 27B-PT | Mistral-Small-24B-Base-2501 |
|-----------|----------------------|-------------------|-------------|---------------|---------------|----------------|----------------------------|
| MMLU (5-shot) | **81.28%** | 78.4% | 85.0% | 84.4% | 79.3% | 78.6% | 80.73% |
| GSM8K (8-shot) | **92.75%** | 81.6% | 88.3% | 83.5% | - | 82.2% | 80.73% |
| BBH (3-shot) | **85.12%** | 78.8% | 79.8% | 82.9% | 81.6% | 77.7% | - |
| MMLU-Pro (5-shot) | **64.78%** | 51.4% | 58.3% | 52.8% | 53.8% | 52.2% | 54.37% |
| MBPP (pass@1) | **75.20%** | 65.0% | 72.6% | 68.4% | - | 65.6% | 69.64% |
| HumanEval (pass@1) | **57.23%** | 43.3% | 53.0% | 54.9% | - | 48.8% | - |
### SWE-Bench Verified Performance

| Rank | Model | Accuracy (%) | Performance vs Alpie |
|------|-------|-------------|---------------------|
| **1** | **Alpie Core** | **57.8** | **Alpie** |
| 2 | Qwen3-Coder-30B-A3B-Instruct | 51.6 | Below Alpie |
| 3 | o3-mini (high) | 49.3 | Below Alpie |
| 4 | DeepSeek R1 | 49.2 | Below Alpie |
| 5 | Claude 3.5 Sonnet | 49.0 | Below Alpie |
| 6 | o1 | 48.9 | Below Alpie |
| 7 | Devstral | 46.8 | Below Alpie |
### Humanity's Last Exam Leaderboard Performance

| Rank | Model | Accuracy (%) | Performance vs Alpie |
|------|-------|-------------|---------------------|
| 1 | GPT 4.5 Preview | 5.8 | Above Alpie |
| 2 | Claude Sonnet 4 | 5.42 | Above Alpie |
| **3** | **Alpie Core 32B (4-bit)** | **5.41** | **Alpie** |
| 4 | Llama 4 Maverick | 5.34 | Below Alpie |
| 5 | GPT 4.1 | 4.97 | Below Alpie |
| 6 | Kimi K2 Instruct | 4.68 | Below Alpie |
| 7 | DeepSeek V3 | 4.55 | Below Alpie |
| 8 | Gemini 1.5 Pro 002 | 4.55 | Below Alpie |
### Additional Benchmarks

| Benchmark | Alpie-Core (32B-4bit) | Category |
|-----------|----------------------|----------|
| AIME | **47.34%** | Advanced Mathematics |
| GPQA (Diamond) | **40.91%** | Graduate-level QA |
| TruthfulQA (MC2) | **60.05%** | Truthfulness |
| HellaSwag | **84.66%** | Commonsense |
| PIQA | **83.24%** | Physical Reasoning |
| ARC Challenge | **67.58%** | Science QA |
| CommonSenseQA | **87.06%** | Commonsense |
| AGIEval | **64.98%** | General Intelligence |
| Winogrande | **79.53%** | Commonsense Reasoning |
## 7. Training Details

- **Hardware**: 8× NVIDIA H100-80GB GPUs
- **Training Duration**: 408 hours
- **Fine-tuning Method**: LoRA/QLoRA with the following configuration (a configuration sketch follows this list):
  - LoRA Alpha: 8
  - LoRA Dropout: 0.05
  - LoRA Rank: 8
- **Quantization**: 4-bit NF4 + double quantization + FP16 compute
- **Dataset Domains**: Mathematics, coding, reasoning, science, general knowledge, competitive exams, Indian context and law, multilingual (Hindi and Hinglish)
- **Synthetic Data Advantage**: +15–20% performance boost in STEM & coding domains
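
For reference, the LoRA/QLoRA and NF4 settings above map onto standard `peft` and `bitsandbytes` configuration objects. A minimal sketch under that assumption; the `target_modules` shown are a typical choice for attention projections and are not taken from the report:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 + double quantization + FP16 compute, matching the list above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA rank 8, alpha 8, dropout 0.05, matching the list above
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, illustrative only
)
```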

## 8. Environmental Impact

**Carbon Footprint**: We estimated the environmental impact of training Alpie-Core (32B) on 8× NVIDIA H100-80GB GPUs by calculating carbon emissions from GPU energy consumption. The calculation follows the formula:

CO₂e (kg) = Grid CO₂ Factor (kg/kWh) × Runtime (hours) × Power per GPU (kW) × Number of GPUs

**Training parameters:**

- Grid CO₂ Factor (Azure average): 0.364 kg CO₂e per kWh
- Runtime: 408 hours
- GPUs: 8× H100-80GB

We report results under two assumption modes:

- **Realistic mode** (average training draw ≈ 250 W per GPU = 0.25 kWh/hr): 0.364 × 408 × 0.25 × 8 ≈ 297 kg CO₂e
- **Conservative mode** (near TDP ≈ 700 W per GPU = 0.70 kWh/hr): 0.364 × 408 × 0.70 × 8 ≈ 832 kg CO₂e

The total training footprint therefore ranges from ~297 kg CO₂e (realistic) to ~832 kg CO₂e (conservative worst case).
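
As a quick sanity check, the formula above is a single product; a few lines of Python reproduce both modes:

```python
def training_co2e_kg(grid_factor_kg_per_kwh, runtime_hours, kw_per_gpu, num_gpus):
    """CO2e (kg) = grid factor (kg/kWh) x runtime (h) x power per GPU (kW) x GPU count."""
    return grid_factor_kg_per_kwh * runtime_hours * kw_per_gpu * num_gpus

print(training_co2e_kg(0.364, 408, 0.25, 8))  # 297.024 -> ~297 kg CO2e (realistic)
print(training_co2e_kg(0.364, 408, 0.70, 8))  # 831.667 -> ~832 kg CO2e (conservative)
```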

## 9. Use Cases

Best suited for **STEM**, **complex mathematical reasoning**, **coding**, and **Indian-context** tasks:

1. **STEM**: Excels at solving advanced problems in science, technology, engineering, and mathematics with high accuracy.
2. **Complex Mathematical Reasoning**: Handles multi-step logical and quantitative reasoning tasks with strong reliability.
3. **Coding**: Supports software development, debugging, and algorithmic problem-solving across multiple programming languages.
4. **Indian Context**: Provides culturally aware insights, competitive exam assistance (JEE, NEET, UPSC), and multilingual support in Hindi/Hinglish.

## 10. Safety and Limitations

### Enhanced Content Access

Unlike the base DeepSeek model, Alpie-Core provides factual, balanced responses to geopolitically sensitive questions, offering broader accessibility and accuracy on topics such as Taiwan's status, Arunachal Pradesh sovereignty, and other sensitive geopolitical issues.

### Current Limitations

- Multilingual reasoning in Hindi/Hinglish shows room for improvement
- Fixed knowledge cutoff without real-time information retrieval
- Occasional struggles with complex multi-hop mathematical reasoning
- Potential hallucinations in factual question answering

### Mitigations

- Safety classifiers and output filtering systems
- Model-assisted safety pipeline using RLHF
- Comprehensive adversarial testing by domain experts

## 11. How to Use

### Non-Streaming Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import torch

# Load LoRA adapter configuration to find the base model
peft_model_id = "169Pi/Alpie-core"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load LoRA weights
model = PeftModel.from_pretrained(base_model, peft_model_id)

# Ensure evaluation mode
model.eval()

# Sample inference
prompt = "Solve the Riemann Hypothesis and provide a final answer?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1000)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Response:\n", response)
```
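
Optionally, the LoRA adapter can be folded into the base weights so that later loads no longer need `peft`. A minimal sketch continuing from the code above; the output directory name is a placeholder, and merging assumes the base model was loaded unquantized, as in the FP16 example here:

```python
# Fold the LoRA weights into the base model and drop the adapter wrapper
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer for plain `transformers` loading later
merged_model.save_pretrained("alpie-core-merged")  # placeholder path
tokenizer.save_pretrained("alpie-core-merged")
```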

### Streaming Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from peft import PeftModel, PeftConfig
import torch

# Load LoRA adapter configuration to find the base model
peft_model_id = "169Pi/Alpie-core"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load LoRA weights
model = PeftModel.from_pretrained(base_model, peft_model_id)

# Ensure evaluation mode
model.eval()

# Initialize streamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Sample streaming inference
prompt = "Solve the Riemann Hypothesis and provide a final answer?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("Streaming Response:")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
        streamer=streamer,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
```

### Deployment Options

- **Transformers**: Python, PyTorch integration
- **vLLM**: High-throughput inference (see the sketch below)
- **LMDeploy/Ollama/TensorRT-LLM**: Production deployments
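
For the vLLM path, offline batch inference takes only a few lines. A minimal sketch, assuming a merged full-weight checkpoint such as the `alpie-core-merged` directory from the merge example above (a placeholder) and illustrative sampling values:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; use a merged model directory or an HF repo id
llm = LLM(model="alpie-core-merged")

# Illustrative sampling settings, mirroring the streaming example above
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)

outputs = llm.generate(["Prove that the sum of two even integers is even."], sampling)
print(outputs[0].outputs[0].text)
```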

## 12. Citation

```bibtex
@misc{alpie2025core,
  title  = {Alpie-Core: A 4-bit Quantized Reasoning Model Surpassing Full-Precision Benchmarks},
  author = {Alpie AI},
  year   = {2025},
  url    = {https://huggingface.co/alpie/Alpie-Core-4bit}
}
```

## 13. License

Apache 2.0 – Free for research and commercial use.

---

*For technical details, training methodology, and comprehensive evaluation results, please refer to our technical report.*