Arabic Misogyny Detection: Hybrid CNN-BiGRU Model
This repository hosts a deep learning model for detecting misogynistic content in Arabic social media text. Developed as an academic project, it addresses key linguistic challenges of the domain, such as dialectal variation and a high Out-of-Vocabulary (OOV) rate.
Performance Summary
- Final Test Accuracy: 83.53%
- Mean Accuracy (stability check over 7 independent runs): 82.82%
- Recall (Misogyny Class): 84.35%
- F1-Score: 0.8599
Architecture: The Hybrid Breakthrough
The model uses the Keras Functional API to combine local and sequential context through parallel feature extraction (a minimal sketch follows this list):
- FastText Embeddings: Leverages sub-word information to mitigate the 37.99% OOV rate encountered in dialectal Arabic.
- 1D Convolutional Layer: Captures local n-gram patterns and misogynistic keywords.
- Bidirectional GRU: Learns long-term dependencies and the global context of the tweet.
- Dual Pooling (Global Average & Max): Simultaneously captures the "overall tone" and "peak intensity" of the text.
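The exact hyperparameters are not listed in this card, so the snippet below is only a minimal Keras Functional API sketch of the wiring described above, assuming the Conv1D and BiGRU branches read the shared FastText embeddings in parallel and their pooled outputs are concatenated. The vocabulary size, filter count, GRU units, and dropout rate are illustrative assumptions; the 48-token input length matches the Usage section further down, and 300 dimensions is the standard FastText vector size (also an assumption here).

import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 48        # padded sequence length (matches prepare_input in the Usage section)
EMBED_DIM = 300     # standard FastText dimensionality (assumed)
VOCAB_SIZE = 30000  # placeholder vocabulary size (assumed)

def build_hybrid_model(embedding_matrix):
    """Illustrative sketch of the parallel CNN / BiGRU architecture."""
    inputs = layers.Input(shape=(MAX_LEN,), name="token_ids")

    # FastText embeddings: sub-word-aware vectors reduce the impact of OOV tokens
    x = layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False)(inputs)

    # Branch 1: 1D convolution picks up local n-gram patterns and keywords
    cnn = layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(x)

    # Branch 2: Bidirectional GRU models the global context of the tweet
    gru = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)

    # Dual pooling over the merged features:
    # global average ("overall tone") and global max ("peak intensity")
    merged = layers.Concatenate()([cnn, gru])
    pooled = layers.Concatenate()([layers.GlobalAveragePooling1D()(merged),
                                   layers.GlobalMaxPooling1D()(merged)])

    outputs = layers.Dense(1, activation="sigmoid")(layers.Dropout(0.3)(pooled))
    return tf.keras.Model(inputs, outputs, name="hybrid_cnn_bigru")

Concatenating the two pooled vectors gives the final classifier both a smoothed summary of the whole sequence and its strongest single activation, which is what the "overall tone" / "peak intensity" description above refers to.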
Specialized Preprocessing
To handle the "noise" of Arabic Twitter, a custom "Semantic Preservation" pipeline was developed (see preprocess.py; a simplified sketch follows this list):
- Emoji-to-Text: Converts visual icons (e.g., 🤮) into text tokens to retain emotional intensity.
- Orthographic Normalization: Standardizes different forms of Alif, Hamza, and Ya to reduce lexical sparsity.
- Hashtag Normalization: Converts hashtags into plain text to extract hidden sentiment (e.g., #يا_واطية).
- Trigger Weighting: Replaces question and exclamation marks with explicit [QUESTION] and [EXCLAMATION] tokens so the model can use them as semantic cues for mockery.
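preprocess.py is the authoritative implementation; the function below is only a simplified, hypothetical sketch of the steps listed above (the clean_arabic_tweet name is made up, and the emoji package is an assumed dependency for the emoji-to-text step).

import re

import emoji  # assumed dependency for the emoji-to-text step

def clean_arabic_tweet(text: str) -> str:
    """Hypothetical, simplified version of the semantic-preservation cleaning."""
    # Emoji-to-text: e.g. 🤮 becomes the token "face_vomiting"
    text = emoji.demojize(text, delimiters=(" ", " "))

    # Hashtag normalization: "#يا_واطية" -> "يا واطية"
    text = re.sub(r"#(\S+)", lambda m: m.group(1).replace("_", " "), text)

    # Orthographic normalization: unify Alif/Hamza variants and Alif Maqsura / Ya
    text = re.sub("[إأآ]", "ا", text)
    text = text.replace("ى", "ي")

    # Trigger weighting: keep question/exclamation marks as explicit semantic tokens
    text = re.sub(r"[؟?]+", " [QUESTION] ", text)
    text = re.sub(r"!+", " [EXCLAMATION] ", text)

    return re.sub(r"\s+", " ", text).strip()

Applied to the sarcastic example from the Usage section, a sketch like this would yield a string containing the "[QUESTION]" and "face_vomiting" tokens, preserving the cues the model relies on.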
Data Attribution
The model was trained on a consolidated corpus of ~13,000 tweets from the following benchmarks:
- LeT-Mi: Levantine Twitter Misogyny Dataset (Mulki & Ghanem, 2021).
- ArMI: Arabic Misogyny Identification Shared Task (FIRE 2021).
Note: Raw datasets are not hosted here out of respect for the authors' distribution policies. Please contact the original researchers via their official request forms to access the data.
Usage
To use this model, make sure raw text goes through the preprocessing in preprocess.py (the data_cleaning step plus tokenization and padding) before it reaches the model. The example below uses the prepare_input helper, which bundles these steps.
import tensorflow as tf
from preprocess import prepare_input

# 1. Load the Hybrid CNN-BiGRU model.
#    Ensure 'tokenizer.json' and 'preprocess.py' are in the same directory.
model = tf.keras.models.load_model("arabic_misogyny_hybrid_model.keras")

def predict_misogyny(tweets):
    """Cleans, tokenizes, pads, and classifies a list of raw Arabic strings."""
    for tweet in tweets:
        # Unified pipeline: Clean -> Tokenize -> Pad; returns a (1, 48) row.
        # (Rows could also be stacked, e.g. with np.vstack, for one batched predict call.)
        input_row = prepare_input(tweet)
        prediction = model.predict(input_row, verbose=0)[0][0]
        label = "MISOGYNY" if prediction > 0.5 else "NEUTRAL"
        confidence = prediction if prediction > 0.5 else 1 - prediction
        print(f"Tweet: {tweet}")
        print(f"Result: {label} | Confidence: {confidence * 100:.2f}%\n")

# --- Example Run ---
examples = [
    "أنتِ رائعة حقاً",        # Neutral ("You are truly wonderful")
    "أنتِ رائعة حقاً؟ 🤮",     # Misogynistic (the same phrase turned sarcastic by "؟ 🤮")
    "سوف تندمين #يا_واطية",   # Misogynistic ("You will regret it" + abusive hashtag)
]
predict_misogyny(examples)
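For completeness, the glue inside prepare_input is presumably a clean → tokenize → pad chain built around the saved tokenizer.json. The sketch below is hypothetical: it assumes tokenizer.json was produced by Keras' Tokenizer.to_json() and reuses the clean_arabic_tweet sketch from the preprocessing section; the real helper in preprocess.py may differ.

from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 48  # the model expects padded sequences of length 48

with open("tokenizer.json", encoding="utf-8") as f:
    tokenizer = tokenizer_from_json(f.read())

def prepare_input_sketch(raw_tweet: str):
    """Hypothetical stand-in for preprocess.prepare_input: clean -> tokenize -> pad."""
    cleaned = clean_arabic_tweet(raw_tweet)             # cleaning sketch from the preprocessing section
    sequence = tokenizer.texts_to_sequences([cleaned])  # map tokens to integer ids
    return pad_sequences(sequence, maxlen=MAX_LEN)      # shape (1, 48)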
License
This project is licensed under the MIT License.
Evaluation results
- Test Accuracy on the LeT-Mi & ArMI consolidated corpus (self-reported): 83.53%
- F1-Score on the LeT-Mi & ArMI consolidated corpus (self-reported): 0.860