Mira190 committed on
Commit 492c757 · verified · 1 Parent(s): df607ed

Update README.md

Files changed (1)
  1. README.md +58 -18
README.md CHANGED
@@ -20,16 +20,16 @@ language:
  - multilingual
  extra_gated_eu_disallowed: true
  ---
  <h1 align="center">Euler-Legal-Embedding-V1</h1>
  <p align="center">
- <a href="https://huggingface.co/LawRank/Euler-Legal-Embedding-V1">
  <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
-
  </p>

  ## Short Description
- Euler-Legal-Embedding-V1 is a specialized embedding model for the legal domain, fine-tuned on [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B). It achieves strong performance on legal retrieval and reasoning tasks within the MTEB benchmark.

  ## Model Details
  - **Base Model**: Qwen/Qwen3-Embedding-8B
@@ -38,26 +38,36 @@ Euler-Legal-Embedding-V1 is a specialized embedding model for the legal domain,
  - **Max Input Tokens**: 1536
  - **Pooling**: Last token pooling (Standard for Qwen-Embedding)
  - **Training Data**: Legal domain specific dataset (`final-data-new-anonymized-grok4-filtered.jsonl`)
  ## Usage
  ### sentence-transformers support
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
  ```bash
  pip install -U sentence-transformers
  You can use the model like this:

  from sentence_transformers import SentenceTransformer
  import torch
  # Load the model
  # trust_remote_code=True is required for Qwen-based models
  model = SentenceTransformer(
-     "LawRank/Euler-Legal-Embedding-V1",
      trust_remote_code=True,
      model_kwargs={
          "torch_dtype": torch.bfloat16,
          "attn_implementation": "flash_attention_2",  # Optional, requires flash-attn installed
      },
  )
  model.max_seq_length = 1536
  sentences = [
      "The plaintiff filed a motion for summary judgment.",
      "The court granted the motion based on lack of genuine dispute of material fact."
@@ -70,13 +80,22 @@ embeddings = model.encode(
      batch_size=16,
      show_progress_bar=True,
  )
  print(embeddings.shape)
- Transformers support
- You can also use the model directly with the transformers library:

  import torch
  from transformers import AutoModel, AutoTokenizer
- model_id = "LawRank/Euler-Legal-Embedding-V1"
  tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
  model = AutoModel.from_pretrained(
      model_id,
@@ -84,30 +103,51 @@ model = AutoModel.from_pretrained(
      torch_dtype=torch.bfloat16,
      device_map="auto"
  )
  sentences = ["This is a legal document.", "This is another legal document."]
- inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=1536)
  with torch.no_grad():
      outputs = model(**inputs)
-     # Last token pooling
      embeddings = outputs.last_hidden_state[:, -1]
  # Normalize embeddings
  embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
- print(embeddings)

- Training Details
  The model was fine-tuned using LoRA (Low-Rank Adaptation) via the Swift framework.

- Framework: Swift
- Loss Function: InfoNCE (Temperature: 0.03)
- Batch Size: 4 (per device)
- Learning Rate: 2e-5
- LoRA Config: Rank 8, Alpha 32, Dropout 0.05
- Citation
  If you find this model useful, please consider citing:

  @misc{euler2025legal,
  title={Euler-Legal-Embedding: Advanced Legal Representation Learning},
  author={LawRank Team},
  year={2025},
  publisher={Hugging Face}
- }
 
 
  - multilingual
  extra_gated_eu_disallowed: true
  ---
+
  <h1 align="center">Euler-Legal-Embedding-V1</h1>
  <p align="center">
+ <a href="https://huggingface.co/Mira190/Euler-Legal-Embedding-V1">
  <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
  </p>

  ## Short Description
+ Euler-Legal-Embedding-V1 is a specialized embedding model for the legal domain, fine-tuned on [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B). It achieves strong performance on legal retrieval and reasoning tasks within the MTEB benchmark.

  ## Model Details
  - **Base Model**: Qwen/Qwen3-Embedding-8B
 
  - **Max Input Tokens**: 1536
  - **Pooling**: Last token pooling (Standard for Qwen-Embedding)
  - **Training Data**: Legal domain specific dataset (`final-data-new-anonymized-grok4-filtered.jsonl`)
+
  ## Usage
+
  ### sentence-transformers support
+
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
  ```bash
  pip install -U sentence-transformers
+ ```
+
  You can use the model like this:

+ ```python
  from sentence_transformers import SentenceTransformer
  import torch
+
  # Load the model
  # trust_remote_code=True is required for Qwen-based models
  model = SentenceTransformer(
+     "Mira190/Euler-Legal-Embedding-V1",
      trust_remote_code=True,
      model_kwargs={
          "torch_dtype": torch.bfloat16,
          "attn_implementation": "flash_attention_2",  # Optional, requires flash-attn installed
      },
  )
+
  model.max_seq_length = 1536
+
  sentences = [
      "The plaintiff filed a motion for summary judgment.",
      "The court granted the motion based on lack of genuine dispute of material fact."
 
      batch_size=16,
      show_progress_bar=True,
  )
+
  print(embeddings.shape)
+ # Output: (2, 4096)
+ ```
+
+ ### Transformers support

+ You can also use the model directly with the `transformers` library:
+
+ ```python
  import torch
  from transformers import AutoModel, AutoTokenizer
+
+ model_id = "Mira190/Euler-Legal-Embedding-V1"
+
+ # Load tokenizer and model
  tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
  model = AutoModel.from_pretrained(
      model_id,
 
      torch_dtype=torch.bfloat16,
      device_map="auto"
  )
+
  sentences = ["This is a legal document.", "This is another legal document."]
+
+ # Tokenize sentences
+ inputs = tokenizer(
+     sentences,
+     return_tensors="pt",
+     padding=True,
+     truncation=True,
+     max_length=1536
+ )
+
+ # Move inputs to the same device as the model
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
  with torch.no_grad():
      outputs = model(**inputs)
+     # Last token pooling (Standard for Qwen-Embedding)
+     # Note: Qwen embeddings typically use the last hidden state of the last token (EOS or specific token)
      embeddings = outputs.last_hidden_state[:, -1]
+
  # Normalize embeddings
  embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

+ print(embeddings.shape)
+ # Output: (2, 4096)
+ ```
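The `[:, -1]` indexing in the snippet above picks the final position of every row, which is the last real token only when the batch is left-padded or has no padding. The README does not state which side this tokenizer pads on, so the following is a minimal sketch, not the author's method, of a pooling helper that handles both cases. The name `last_token_pool` and the toy tensors are assumptions made for illustration.

```python
import torch


def last_token_pool(last_hidden_state: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the hidden state of the last non-padding token in each row."""
    # If every row ends in a real token, the batch is left-padded (or unpadded)
    # and position -1 is already the correct one.
    if int(attention_mask[:, -1].sum()) == attention_mask.size(0):
        return last_hidden_state[:, -1]
    # Otherwise assume right padding: index each row at (sequence length - 1).
    last_index = attention_mask.sum(dim=1) - 1
    rows = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
    return last_hidden_state[rows, last_index]


# Toy batch (hypothetical): 2 sequences, 4 positions, hidden size 3.
# The second row has one padding position at the end.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 1, 0]])

# Row 0 is pooled at position 3, row 1 at position 2 (its last real token).
pooled = last_token_pool(hidden, mask)
print(pooled)
```

In the snippet above this would be called as `last_token_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Note also that printing `.shape` on a PyTorch tensor shows `torch.Size([2, 4096])` rather than a bare tuple.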
+
+ ## Training Details
  The model was fine-tuned using LoRA (Low-Rank Adaptation) via the Swift framework.

+ - **Framework**: Swift
+ - **Loss Function**: InfoNCE (Temperature: 0.03)
+ - **Batch Size**: 4 (per device)
+ - **Learning Rate**: 2e-5
+ - **LoRA Config**: Rank 8, Alpha 32, Dropout 0.05
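The README names InfoNCE with a temperature of 0.03 but does not show how the Swift framework computes it (for example, how negatives are chosen). As a rough illustration only, the textbook per-query form can be sketched in plain Python. The function name `info_nce` and the toy vectors are hypothetical, and the sketch assumes L2-normalized embeddings so that a dot product equals cosine similarity.

```python
import math


def info_nce(query, positive, negatives, temperature=0.03):
    """Textbook InfoNCE loss for one query; vectors are plain lists of floats."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity logits, positive first, each divided by the temperature.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Cross-entropy with the positive as the target class.
    return log_denominator - logits[0]


# Toy 2-d unit vectors (hypothetical data).
query = [1.0, 0.0]
easy = info_nce(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
hard = info_nce(query, positive=[1.0, 0.0], negatives=[[1.0, 0.0]])
print(easy, hard)
```

With a temperature this small, similarity gaps are scaled by a factor of about 33 before the softmax, so a negative that is exactly as similar as the positive costs log 2 while a clearly dissimilar one costs almost nothing.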
+
+ ## Citation
  If you find this model useful, please consider citing:

+ ```bibtex
  @misc{euler2025legal,
  title={Euler-Legal-Embedding: Advanced Legal Representation Learning},
  author={LawRank Team},
  year={2025},
  publisher={Hugging Face}
+ }
+ ```