# Training Procedure

## Data Sources

- **Summarization** – expects JSONL files with `source` and `summary` fields under `data/processed/summarization`.
- **Emotion Classification** – multi-label samples loaded from JSONL files with `text` and `emotions` arrays. The dataset owns a `MultiLabelBinarizer` for consistent encoding.
- **Topic Classification** – single-label categorical samples with `text` and `topic` fields, encoded via `LabelEncoder`.

Paths and tokenizer defaults are configured in `configs/data/datasets.yaml`. The tokenizer section selects the Hugging Face backbone (`facebook/bart-base` by default) and the maximum sequence length. Gutenberg book downloads are controlled via the `downloads.books` list (each entry includes `name`, `url`, and `output`).

## Dataloaders & Collators

- `SummarizationCollator` encodes encoder/decoder inputs, prepares decoder input IDs via `Tokenizer.prepare_decoder_inputs`, and masks padding tokens with `-100` for loss computation.
- `EmotionCollator` applies the dataset's `MultiLabelBinarizer`, returning dense float tensors suitable for `BCEWithLogitsLoss`.
- `TopicCollator` emits integer class IDs via the dataset's `LabelEncoder` for `CrossEntropyLoss`.

These collators centralize all tokenization, which reduces duplication and makes it easy to plug additional sklearn transformations into `TextPreprocessor` if we later extend cleaning or normalization.

## Model Assembly

- `src/models/factory.build_multitask_model` rebuilds the encoder, decoder, and heads from the tokenizer metadata and YAML config. The same factory is used during training and inference, eliminating drift between the two environments.
- The model wraps:
  - Transformer encoder/decoder stacks with shared positional encodings.
  - An LM head tied to the decoder embeddings for summarization.
  - Mean-pooled classification heads for the emotion and topic tasks.

## Optimisation Loop

- `src/training/trainer.Trainer` orchestrates multi-task training:
  - Cross-entropy is used for summarization (seq2seq logits vs. shifted labels).
  - `BCEWithLogitsLoss` handles multi-label emotions.
  - `CrossEntropyLoss` handles topic classification.
- Gradient clipping keeps updates stable, and per-task weights can be configured via `TrainerConfig.task_weights` to balance gradients if needed.
- Metrics tracked per task:
  - **Summarization** – ROUGE-like overlap metric (`training.metrics.rouge_like`).
  - **Emotion** – micro F1 score for multi-label predictions.
  - **Topic** – categorical accuracy.

## Checkpoints & Artifacts

- `src/utils/io.save_state` stores model weights; checkpoints live under `checkpoints/`.
- `artifacts/labels.json` captures the ordered emotion/topic vocabularies immediately after training. This file is required at inference time so class indices map back to human-readable labels.
- The tokenizer is exported to `artifacts/hf_tokenizer/` for reproducible vocabularies.

## Running Training

1. Ensure processed datasets are available (see the `data/processed/` structure).
2. Choose a configuration (e.g., `configs/training/default.yaml`) for hyperparameters and data splits.
3. Instantiate the tokenizer via `TokenizerConfig` and build datasets/dataloaders.
4. Use `build_multitask_model` to construct the model, create an optimizer, and run `Trainer.fit(train_loaders, val_loaders)` (see the sketch below).
5. Save checkpoints and update `artifacts/labels.json` with the dataset label order.
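Putting those steps together, a minimal script might look like the sketch below. The module paths, constructor arguments, batch sizes, file names, and dataset/collator attribute names are illustrative assumptions; only the names called out in the sections above (`TokenizerConfig`, `build_multitask_model`, `Trainer`, `TrainerConfig.task_weights`, `Trainer.fit`, `save_state`) come from this document.

```python
"""Hypothetical end-to-end run of the steps above.

Module paths, constructor arguments, and attribute names are assumptions and
need to be aligned with the actual code in `src/`.
"""
import json

import torch
from torch.utils.data import DataLoader

from src.data.collators import EmotionCollator, SummarizationCollator, TopicCollator  # assumed path
from src.data.datasets import EmotionDataset, SummarizationDataset, TopicDataset      # assumed path
from src.data.tokenizer import Tokenizer, TokenizerConfig                             # assumed path
from src.models.factory import build_multitask_model
from src.training.trainer import Trainer, TrainerConfig
from src.utils.io import save_state

# Tokenizer backed by the YAML defaults (backbone + max length).
tokenizer = Tokenizer(TokenizerConfig(name="facebook/bart-base", max_length=512))

# Datasets and task-specific collators; file names and constructor shapes are assumed.
summ_ds = SummarizationDataset("data/processed/summarization/train.jsonl")
emotion_ds = EmotionDataset("data/processed/emotion/train.jsonl")
topic_ds = TopicDataset("data/processed/topic/train.jsonl")

train_loaders = {
    "summarization": DataLoader(summ_ds, batch_size=8, shuffle=True,
                                collate_fn=SummarizationCollator(tokenizer)),
    "emotion": DataLoader(emotion_ds, batch_size=16, shuffle=True,
                          collate_fn=EmotionCollator(tokenizer, emotion_ds.binarizer)),
    "topic": DataLoader(topic_ds, batch_size=16, shuffle=True,
                        collate_fn=TopicCollator(tokenizer, topic_ds.encoder)),
}
val_loaders = train_loaders  # placeholder: build these from the validation splits in practice

# Model, optimizer, and trainer; per-task weights balance the three losses.
model = build_multitask_model(tokenizer, config_path="configs/training/default.yaml")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
trainer = Trainer(model, optimizer,
                  TrainerConfig(task_weights={"summarization": 1.0, "emotion": 0.5, "topic": 0.5}))
trainer.fit(train_loaders, val_loaders)

# Persist weights and the label order needed at inference time.
save_state(model, "checkpoints/multitask.pt")
with open("artifacts/labels.json", "w") as f:
    json.dump({"emotions": [str(c) for c in emotion_ds.binarizer.classes_],
               "topics": [str(c) for c in topic_ds.encoder.classes_]}, f, indent=2)
```

Even if the real constructors differ, the flow worth mirroring is the same: tokenizer → datasets/collators → factory → `Trainer.fit` → checkpoint plus `artifacts/labels.json`.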
> **Note:** A full CLI for training is forthcoming. The scripts in `scripts/` currently act
> as scaffolding; once the Gradio UI is introduced we will extend these utilities to launch
> training jobs with configuration files directly.

## Future Enhancements

- Integrate curriculum scheduling or task-balanced sampling once empirical results call for it.
- Capture attention maps during training to support visualization in the planned Gradio UI.
- Leverage the optional `sklearn_transformer` hook in `TextPreprocessor` for lemmatization or domain-specific normalization when datasets require it (see the sketch below).
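As an illustration of that last item, the hook could be exercised with a stock sklearn transformer. The `TextPreprocessor` import path, constructor keyword, and call style below are assumptions made for the sketch; only the `sklearn_transformer` hook itself is named in this document.

```python
"""Hypothetical use of the `sklearn_transformer` hook; import path and call style are assumed."""
from sklearn.preprocessing import FunctionTransformer

from src.data.preprocessing import TextPreprocessor  # assumed module path


def normalize(texts):
    """Toy domain-specific normalization: lowercase and collapse whitespace."""
    return [" ".join(t.lower().split()) for t in texts]


# FunctionTransformer wraps any callable in the sklearn transformer interface,
# so no custom transformer class is needed for simple normalization.
preprocessor = TextPreprocessor(sklearn_transformer=FunctionTransformer(normalize))
cleaned = preprocessor(["  The Project GUTENBERG   eBook  "])  # call style assumed
print(cleaned)
```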