DistilBERT for Host-based Intrusion Detection System (HIDS)

This is a DistilBERT model fine-tuned for binary classification of system call sequences, used to detect intrusions in the ADFA-LD dataset. The training configuration listed below was selected through hyperparameter tuning.

Model Details

Base Model

  • Architecture: DistilBERT (DistilBertForSequenceClassification)
  • Base Model: distilbert-base-uncased
  • Task: Binary Sequence Classification (Normal vs Attack)
  • Number of Labels: 2

Training Configuration

  • Training Epochs: 8
  • Batch Size: 32
  • Learning Rate: 2e-05
  • Weight Decay: 0.0
  • Warmup Ratio: 0.1
  • Optimizer: AdamW
  • Scheduler: LinearLR
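
The learning rate, warmup ratio, and scheduler settings above can be read as a learning-rate curve. The sketch below assumes that "LinearLR" means linear warmup over the first 10% of steps followed by linear decay to zero (the behaviour of `get_linear_schedule_with_warmup` in transformers); the card does not confirm this. The function name and `total_steps` are hypothetical, since the number of training samples is not stated.

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step (assumed schedule, not confirmed by the card)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical: depends on dataset size, batch size 32, 8 epochs
for step in (0, 50, 100, 550, 1000):
    print(step, f"{linear_warmup_decay_lr(step, total_steps):.2e}")
```

Under these assumptions the rate peaks at 2e-05 once warmup ends and falls back to zero at the final step.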

Dataset

  • Dataset: ADFA-LD (Australian Defence Force Academy Linux Dataset)
  • Preprocessing: 18-gram sequences

Performance

Validation Metrics

  • Accuracy: 94.03%
  • F1 Score: 94.50%
  • Precision: 92.45%
  • Recall: 96.64%
  • AUC-ROC: 96.30%
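
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The snippet below uses only the values listed above.

```python
precision = 0.9245
recall = 0.9664

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9450"
```

The result agrees with the reported F1 score of 94.50%.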

Usage

You can use this model directly with a pipeline for text classification:

>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> # top_k=None returns a score for every label; by default only the top label is returned
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)

[{'label': 'LABEL_0',
  'score': 0.9876},
 {'label': 'LABEL_1',
  'score': 0.0124}]
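
The pipeline reports generic label names. A minimal sketch for turning them into readable class names follows; the mapping of `LABEL_0` to "Normal" and `LABEL_1` to "Attack" is an assumption based on the class order used in this card's PyTorch example, and `LABEL_NAMES` and `name_scores` are hypothetical helpers, not part of the model or of transformers.

```python
# Assumed mapping, taken from the class order used in this card's PyTorch example
LABEL_NAMES = {"LABEL_0": "Normal", "LABEL_1": "Attack"}


def name_scores(results):
    """Map raw pipeline labels to readable class names."""
    return {LABEL_NAMES[r["label"]]: r["score"] for r in results}


# Illustrative scores copied from the example output above
raw = [{"label": "LABEL_0", "score": 0.9876}, {"label": "LABEL_1", "score": 0.0124}]
scores = name_scores(raw)
print(scores)
print("Prediction:", max(scores, key=scores.get))
```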

Here is how to classify a system call sequence with this model directly in PyTorch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")

Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

  1. Extract system calls from trace files
  2. Convert to n-grams (n=18)
  3. Format as space-separated string
  4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

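def create_ngrams(trace, n=18):
    """Convert a system call trace into overlapping n-grams (stride 1).

    Returns an empty list when the trace has fewer than n calls.
    """
    ngrams = []
    for i in range(len(trace) - n + 1):
        # Window of n consecutive calls, joined as a space-separated string
        ngram = trace[i:i+n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams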
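
Steps 1 and 4 of the list above are not covered by `create_ngrams`. The sketch below shows one way to handle them, assuming a trace is stored as whitespace-separated system call numbers. The card does not say how short traces were padded during training, so `pad_value=0` is only a placeholder, and `parse_trace` and `fit_to_length` are hypothetical helper names. The sample trace is illustrative.

```python
def parse_trace(text):
    """Parse one trace given as whitespace-separated system call numbers."""
    return [int(tok) for tok in text.split()]


def fit_to_length(calls, n=18, pad_value=0):
    """Truncate to n calls, or right-pad short traces with pad_value (assumed)."""
    calls = list(calls[:n])
    return calls + [pad_value] * (n - len(calls))


# Illustrative trace with fewer than 18 calls
short_trace = parse_trace("6 6 63 6 42 120 6 195")
window = fit_to_length(short_trace)
print(" ".join(map(str, window)))
```

Traces with at least 18 calls can go straight to `create_ngrams`; padding is only needed for shorter ones.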

Limitations and Considerations

  1. Domain Specific: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

  2. Input Format: The model expects 18-gram sequences. Raw system calls must be preprocessed accordingly.

  3. Binary Classification: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.

BibTeX entry and citation info

@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}

License

This model is licensed under the Apache 2.0 license.
