DistilBERT for Host-based Intrusion Detection System (HIDS)

This is a DistilBERT model fine-tuned for binary classification of system call sequences, used to detect intrusions in the ADFA-LD dataset. The training hyperparameters were selected through tuning to improve host-based intrusion detection performance.

Model Details

Base Model

Architecture: DistilBERT (DistilBertForSequenceClassification)
Base Model: distilbert-base-uncased
Task: Binary Sequence Classification (Normal vs Attack)
Number of Labels: 2

Training Configuration

Training Epochs: 8
Batch Size: 32
Learning Rate: 2e-05
Weight Decay: 0.0
Warmup Ratio: 0.1
Optimizer: AdamW
Scheduler: LinearLR
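
The configuration above lists a linear scheduler and a warmup ratio of 0.1 but does not say how the two were combined. As a rough sketch, assuming the common pattern of linear warmup followed by linear decay to zero (which is what `get_linear_schedule_with_warmup` in transformers implements, though the card does not confirm it was used), the learning rate at a given optimizer step could be computed as follows. The function name `linear_warmup_lr` and the step counts are hypothetical.

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`, ASSUMING linear warmup then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: ramp from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1000 optimizer steps (100 of them warmup).
for s in (0, 50, 100, 550, 1000):
    print(s, linear_warmup_lr(s, total_steps=1000))
```

Under this assumption the rate peaks at 2e-05 once warmup ends and falls back to zero at the final step.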

Dataset

Dataset: ADFA-LD (Australian Defence Force Academy Linux Dataset)
Preprocessing: 18-gram sequences

Performance

Validation Metrics

Accuracy: 94.03%
F1 Score: 94.50%
Precision: 92.45%
Recall: 96.64%
AUC-ROC: 96.30%
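
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is only illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the validation metrics above.
precision, recall = 0.9245, 0.9664
f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.4f}")  # prints F1 = 0.9450
```

The result agrees with the reported F1 score of 94.50%.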

Usage

You can use this model directly with a pipeline for text classification:

>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)  # top_k=None returns scores for both labels

[{'label': 'LABEL_0',
  'score': 0.9876},
 {'label': 'LABEL_1',
  'score': 0.0124}]
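
The pipeline reports generic `LABEL_0` and `LABEL_1` names. Assuming the label order used in the PyTorch example on this card (index 0 is Normal, index 1 is Attack), a small helper can make the output easier to read. The names `LABEL_NAMES` and `rename_labels` are hypothetical and not part of the model or the transformers library.

```python
# Assumed mapping: index 0 = Normal, index 1 = Attack.
LABEL_NAMES = {"LABEL_0": "Normal", "LABEL_1": "Attack"}


def rename_labels(predictions):
    """Replace generic pipeline labels with readable class names."""
    return [
        {"label": LABEL_NAMES.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


# Sample pipeline output copied from the example above.
raw = [{"label": "LABEL_0", "score": 0.9876}, {"label": "LABEL_1", "score": 0.0124}]
print(rename_labels(raw))
```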

Here is how to use this model to get the classification of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
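
To illustrate what the softmax and argmax steps do, the sketch below reproduces them in plain Python on made-up logits. It also shows one way to apply a custom decision threshold to the Attack probability; that threshold is an assumption for illustration, since the card only describes argmax-based prediction. The names `softmax` and `is_attack` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_attack(logits, threshold=0.5):
    """Flag a window as Attack when P(Attack) reaches `threshold`."""
    p_attack = softmax(logits)[1]  # index 1 = Attack
    return p_attack >= threshold


# Hypothetical logits for one 18-gram window, ordered [Normal, Attack].
logits = [2.0, -1.0]
probs = softmax(logits)
print(f"Normal={probs[0]:.4f}, Attack={probs[1]:.4f}")
print(is_attack(logits), is_attack(logits, threshold=0.01))
```

Lowering the threshold trades precision for recall, which may matter when missed attacks are costlier than false alarms.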

Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

Extract system calls from trace files
Convert to n-grams (n=18)
Format as space-separated string
Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

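def create_ngrams(trace, n=18):
    """Convert a system call trace into space-separated n-gram strings.

    Returns an empty list when the trace has fewer than n calls.
    """
    # Slide a window of n calls over the trace, one step at a time.
    return [" ".join(map(str, trace[i:i + n])) for i in range(len(trace) - n + 1)]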
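
The four steps listed above can be sketched end to end as follows. This assumes a raw trace is a single string of whitespace-separated system call numbers, which is an assumption about the file layout. The helper names and the pad token `"0"` are hypothetical: the card does not say how short traces were padded during training, so the padding choice should be checked against the original training setup.

```python
def parse_trace(text):
    """Split a raw trace string into a list of system call tokens."""
    return text.split()


def pad_trace(calls, n=18, pad_token="0"):
    """Right-pad traces shorter than n; pad_token is a HYPOTHETICAL choice."""
    if len(calls) < n:
        return calls + [pad_token] * (n - len(calls))
    return calls


def trace_to_windows(text, n=18):
    """Return every n-gram window of a trace as a space-separated string."""
    calls = pad_trace(parse_trace(text), n)
    return [" ".join(calls[i:i + n]) for i in range(len(calls) - n + 1)]


# A made-up 20-call trace yields 3 windows of 18 tokens each.
trace = " ".join(str(i) for i in range(1, 21))
windows = trace_to_windows(trace)
print(len(windows))
print(windows[0])
```

Without the padding step, a trace shorter than 18 calls produces no windows at all, so it would never reach the classifier.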

Limitations and Considerations

Domain Specific: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.
Input Format: The model expects 18-gram sequences. Raw system calls must be preprocessed accordingly.
Binary Classification: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.

BibTeX entry and citation info

@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}

References

ADFA-LD Dataset: ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems
DistilBERT: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arXiv:1910.01108)

License

This model is licensed under the Apache 2.0 license.

Model size: 67M params (Safetensors, tensor type F32)
