---
license: mit
language:
  - fr
base_model:
  - cmarkea/distilcamembert-base
datasets:
  - Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET
---

Model Overview

This model is a fine-tuned version of cmarkea/distilcamembert-base, adapted for binary text classification of French text.

Model Type

  • Architecture: CamembertForSequenceClassification
  • Base Model: DistilCamemBERT
  • Hidden Layers: 6
  • Attention Heads: 12
  • Tokenizer: Based on CamemBERT's tokenizer
  • Vocab Size: 32,005 tokens
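
As a quick sanity check, the architecture details listed above can be read back from the model configuration (a minimal sketch; it uses the model path given in the How to Use section below):

from transformers import AutoConfig

# Inspect the configuration of the fine-tuned model hosted on the Hugging Face Hub.
config = AutoConfig.from_pretrained("InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION")
print(config.num_hidden_layers)    # expected: 6
print(config.num_attention_heads)  # expected: 12
print(config.vocab_size)           # expected: 32005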

Intended Use

This model is designed for classifying sentences as either travel-related or non-travel-related, with high accuracy on French datasets.

Example Use Case:

Given a sentence such as "Je veux aller de Paris à Lyon", the model returns:

  • Label: POSITIVE
  • Score: 0.9999655485153198

Given a sentence such as "Je veux acheter du pain", the model returns:

  • Label: NEGATIVE
  • Score: 0.9999724626541138

Limitations:

  • Language: Optimized for French text; performance on other languages is not guaranteed.
  • Performance: Specifically trained for binary classification. Performance may degrade on multi-class or unrelated tasks.

Labels

The model uses the following classification labels:

  • POSITIVE: Travel-related sentences
  • NEGATIVE: Non-travel-related sentences
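
The mapping between class indices and these label names is stored in the model configuration; a minimal sketch for reading it (again assuming the model path from the How to Use section below):

from transformers import AutoConfig

# Print the id-to-label mapping stored in the fine-tuned model's config.
config = AutoConfig.from_pretrained("InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION")
print(config.id2label)  # should map the two class indices to POSITIVE / NEGATIVE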

Training Data

The model was fine-tuned using a proprietary French dataset: Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET. This dataset contains thousands of labeled examples for travel and non-travel sentences.

Hyperparameters and Fine-Tuning:

  • Learning Rate: 5e-5
  • Batch Size: 16
  • Epochs: 3
  • Evaluation Strategy: Epoch-based
  • Optimizer: AdamW
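
For reference, these settings correspond roughly to the following transformers Trainer configuration (an illustrative sketch only; the original training script is not included with this model card, and output_dir is a placeholder):

from transformers import TrainingArguments

# Sketch of a Trainer configuration matching the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./results",            # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",       # evaluate at the end of each epoch
)
# Note: the Trainer uses AdamW as its default optimizer, matching the setup listed above.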

Tokenizer

The tokenizer is the pre-trained CamemBERT tokenizer, reused for this classification task. It performs SentencePiece-based subword (BPE) tokenization, which splits words into smaller units.

Tokenizer special settings:

  • Max Length: 128
  • Padding: Right-padded to 128 tokens
  • Truncation: Longest-first strategy, truncating tokens beyond 128.
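
A minimal sketch of tokenizing an input with these settings (the model path is the one given in the How to Use section below; padding="max_length" reproduces the right-padding to 128 tokens):

from transformers import AutoTokenizer

# Tokenize a sentence with the settings listed above: max length 128, right padding, longest-first truncation.
tokenizer = AutoTokenizer.from_pretrained("InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION")
encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # pad on the right up to 128 tokens
    truncation="longest_first",  # drop tokens beyond 128
)
print(len(encoded["input_ids"]))  # 128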

How to Use

You can load this model with Hugging Face’s transformers library and create a text-classification pipeline as follows:

from transformers import pipeline

# Load the fine-tuned model and its tokenizer into a text-classification pipeline.
model_path = "InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)

# Classify a French sentence; the result contains a label and a confidence score.
sentence = "Je veux aller de Paris à Lyon"
result = classifier(sentence)
print(result)
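
Running the two example sentences from the Intended Use section through this pipeline yields output of the following shape (the scores are the ones quoted earlier in this card):

# Continuing from the snippet above, using the same classifier.
print(classifier("Je veux aller de Paris à Lyon"))
# [{'label': 'POSITIVE', 'score': 0.9999655485153198}]
print(classifier("Je veux acheter du pain"))
# [{'label': 'NEGATIVE', 'score': 0.9999724626541138}]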

Limitations and Bias

While the model performs well on the training and test datasets, there are some known limitations:

  • Bias in Dataset: Performance may reflect the biases in the training data.
  • Generalization: Results may be biased towards specific named entities frequently seen in the training data (such as city names).

License

This model is released under the MIT License, as declared in the metadata above.