Arabic Misogyny Detection: Hybrid CNN-BiGRU Model
This repository hosts a deep learning model for detecting misogynistic content in Arabic social media text. Developed as an academic project, it addresses key linguistic challenges of the domain, such as dialectal variation and a high Out-of-Vocabulary (OOV) rate.
Performance Summary
- Final Test Accuracy: 83.53%
- Mean Accuracy (stability check over 7 independent runs): 82.82%
- Recall (Misogyny Class): 84.35%
- F1-Score: 0.8599
Architecture: The Hybrid Breakthrough
The model uses the Keras Functional API to combine local and sequential context through parallel feature extraction (a minimal sketch follows this list):
- FastText Embeddings: Leverages sub-word information to mitigate the 37.99% OOV rate encountered in dialectal Arabic.
- 1D Convolutional Layer: Captures local n-gram patterns and misogynistic keywords.
- Bidirectional GRU: Learns long-term dependencies and the global context of the tweet.
- Dual Pooling (Global Average & Max): Simultaneously captures the "overall tone" and "peak intensity" of the text.
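The exact hyperparameters are not listed in this card, so the snippet below is only a minimal Keras Functional API sketch of the wiring described above, assuming the Conv1D and BiGRU branches read the shared FastText embeddings in parallel and their pooled outputs are concatenated. The vocabulary size, filter count, GRU units, and dropout rate are illustrative assumptions; the 48-token input length matches the Usage section further down, and 300 dimensions is the standard FastText vector size (also an assumption here).

import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 48        # padded sequence length (matches prepare_input in the Usage section)
EMBED_DIM = 300     # standard FastText dimensionality (assumed)
VOCAB_SIZE = 30000  # placeholder vocabulary size (assumed)

def build_hybrid_model(embedding_matrix):
    """Illustrative sketch of the parallel CNN / BiGRU architecture."""
    inputs = layers.Input(shape=(MAX_LEN,), name="token_ids")

    # FastText embeddings: sub-word-aware vectors reduce the impact of OOV tokens
    x = layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False)(inputs)

    # Branch 1: 1D convolution picks up local n-gram patterns and keywords
    cnn = layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(x)

    # Branch 2: Bidirectional GRU models the global context of the tweet
    gru = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)

    # Dual pooling over the merged features:
    # global average ("overall tone") and global max ("peak intensity")
    merged = layers.Concatenate()([cnn, gru])
    pooled = layers.Concatenate()([layers.GlobalAveragePooling1D()(merged),
                                   layers.GlobalMaxPooling1D()(merged)])

    outputs = layers.Dense(1, activation="sigmoid")(layers.Dropout(0.3)(pooled))
    return tf.keras.Model(inputs, outputs, name="hybrid_cnn_bigru")

Concatenating the two pooled vectors gives the final classifier both a smoothed summary of the whole sequence and its strongest single activation, which is what the "overall tone" / "peak intensity" description above refers to.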
Specialized Preprocessing
To handle the "noise" of Arabic Twitter, a custom "Semantic Preservation" pipeline was developed (see preprocess.py; a simplified sketch follows this list):
- Emoji-to-Text: Converts visual icons (e.g., 🤮) into text tokens to retain emotional intensity.
- Orthographic Normalization: Standardizes different forms of Alif, Hamza, and Ya to reduce lexical sparsity.
- Hashtag Normalization: Converts hashtags into plain text to extract hidden sentiment (e.g., #يا_واطية).
- Trigger Weighting: Replaces question and exclamation marks with explicit [QUESTION] and [EXCLAMATION] tokens so the model can use them as semantic cues for mockery.
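preprocess.py is the authoritative implementation; the function below is only a simplified, hypothetical sketch of the steps listed above (the clean_arabic_tweet name is made up, and the emoji package is an assumed dependency for the emoji-to-text step).

import re

import emoji  # assumed dependency for the emoji-to-text step

def clean_arabic_tweet(text: str) -> str:
    """Hypothetical, simplified version of the semantic-preservation cleaning."""
    # Emoji-to-text: e.g. 🤮 becomes the token "face_vomiting"
    text = emoji.demojize(text, delimiters=(" ", " "))

    # Hashtag normalization: "#يا_واطية" -> "يا واطية"
    text = re.sub(r"#(\S+)", lambda m: m.group(1).replace("_", " "), text)

    # Orthographic normalization: unify Alif/Hamza variants and Alif Maqsura / Ya
    text = re.sub("[إأآ]", "ا", text)
    text = text.replace("ى", "ي")

    # Trigger weighting: keep question/exclamation marks as explicit semantic tokens
    text = re.sub(r"[؟?]+", " [QUESTION] ", text)
    text = re.sub(r"!+", " [EXCLAMATION] ", text)

    return re.sub(r"\s+", " ", text).strip()

Applied to the sarcastic example from the Usage section, a sketch like this would yield a string containing the "[QUESTION]" and "face_vomiting" tokens, preserving the cues the model relies on.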
Data Attribution
The model was trained on a consolidated corpus of ~13,000 tweets from the following benchmarks:
- LeT-Mi: Levantine Twitter Misogyny Dataset (Mulki & Ghanem, 2021).
- ArMI: Arabic Misogyny Identification Shared Task (FIRE 2021).
Note: Raw datasets are not hosted here out of respect for the authors' distribution policies. Please contact the original researchers via their official request forms to access the data.
Usage
To use this model, make sure raw text goes through the preprocessing in preprocess.py (the data_cleaning step plus tokenization and padding) before it reaches the model. The example below uses the prepare_input helper, which bundles these steps.
import tensorflow as tf
from preprocess import prepare_input

# 1. Load the Hybrid CNN-BiGRU model.
#    Ensure 'tokenizer.json' and 'preprocess.py' are in the same directory.
model = tf.keras.models.load_model("arabic_misogyny_hybrid_model.keras")

def predict_misogyny(tweets):
    """Cleans, tokenizes, pads, and classifies a list of raw Arabic strings."""
    for tweet in tweets:
        # Unified pipeline: Clean -> Tokenize -> Pad; returns a (1, 48) row.
        # (Rows could also be stacked, e.g. with np.vstack, for one batched predict call.)
        input_row = prepare_input(tweet)
        prediction = model.predict(input_row, verbose=0)[0][0]
        label = "MISOGYNY" if prediction > 0.5 else "NEUTRAL"
        confidence = prediction if prediction > 0.5 else 1 - prediction
        print(f"Tweet: {tweet}")
        print(f"Result: {label} | Confidence: {confidence * 100:.2f}%\n")

# --- Example Run ---
examples = [
    "أنتِ رائعة حقاً",        # Neutral ("You are truly wonderful")
    "أنتِ رائعة حقاً؟ 🤮",     # Misogynistic (the same phrase turned sarcastic by "؟ 🤮")
    "سوف تندمين #يا_واطية",   # Misogynistic ("You will regret it" + abusive hashtag)
]
predict_misogyny(examples)
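For completeness, the glue inside prepare_input is presumably a clean → tokenize → pad chain built around the saved tokenizer.json. The sketch below is hypothetical: it assumes tokenizer.json was produced by Keras' Tokenizer.to_json() and reuses the clean_arabic_tweet sketch from the preprocessing section; the real helper in preprocess.py may differ.

from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 48  # the model expects padded sequences of length 48

with open("tokenizer.json", encoding="utf-8") as f:
    tokenizer = tokenizer_from_json(f.read())

def prepare_input_sketch(raw_tweet: str):
    """Hypothetical stand-in for preprocess.prepare_input: clean -> tokenize -> pad."""
    cleaned = clean_arabic_tweet(raw_tweet)             # cleaning sketch from the preprocessing section
    sequence = tokenizer.texts_to_sequences([cleaned])  # map tokens to integer ids
    return pad_sequences(sequence, maxlen=MAX_LEN)      # shape (1, 48)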
License
This project is licensed under the MIT License.
Evaluation results
- Test Accuracy on the LeT-Mi & ArMI consolidated corpus (self-reported): 83.53%
- F1-Score on the LeT-Mi & ArMI consolidated corpus (self-reported): 0.860