File size: 2,523 Bytes
5506306
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7acc317
5506306
 
7acc317
5506306
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
# Week 2: Publish a Hub Dataset

Create and share high-quality datasets on the Hub. Good data is the foundation of good models—help the community by contributing datasets others can train on.

## Why This Matters

The best open source models are built on openly available datasets. By publishing well-documented, properly structured datasets, you're directly enabling the next generation of model development. Quality matters more than quantity.

## The Skill

Use `hf_dataset_creator/` for this quest. Key capabilities:

- Initialize dataset repos with proper structure
- Multi-format support: chat, classification, QA, completion, tabular
- Template-based validation for data quality
- Streaming uploads without downloading entire datasets

```bash
# Quick setup with a template
python hf_dataset_creator/scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" --template chat
```

## XP Tiers

### 🐢 Starter — 50 XP

**Upload a small, clean dataset with a complete dataset card.**

1. Create a dataset with ≤1,000 rows
2. Write a dataset card covering: license, splits, and data provenance
3. Upload to the Hub under the hackathon organization (or your own account)

**What counts:** Clean data, clear documentation, proper licensing.

```bash
python hf_dataset_creator/scripts/dataset_manager.py init \
  --repo_id "hf-skills/your-dataset-name"

python hf_dataset_creator/scripts/dataset_manager.py add_rows \
  --repo_id "hf-skills/your-dataset-name" \
  --template classification \
  --rows_json "$(cat your_data.json)"
```

### 🐕 Standard — 100 XP

**Publish a conversational dataset with a complete dataset card.**

1. Create a dataset with ≤1,000 rows
2. Write a dataset card covering: license and splits.
3. Upload to the Hub under the hackathon organization.

**What counts:** Clean data, clear documentation, proper licensing.

### 🦁 Advanced — 200 XP

**Translate a dataset into multiple languages and publish it on the Hub.**

1. Find a dataset on the Hub
2. Translate the dataset into multiple languages
3. Publish the translated datasets on the Hub under the hackathon organization

**What counts:** Translated datasets and merged PRs.

## Resources

- [SKILL.md](../hf_dataset_creator/SKILL.md) — Full skill documentation
- [Templates](../hf_dataset_creator/templates/) — JSON templates for each format
- [Examples](../hf_dataset_creator/examples/) — Sample data and system prompts

---

**Next Quest:** [Supervised Fine-Tuning](04_sft-finetune-hub.md)