charent/Phi2-Chinese-0.2B

@@ -7,13 +7,14 @@ language:
 library_name: transformers
 tags:
 - text-generation-inference
 ---
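 # Phi2-Chinese-0.2B: Train Your Own Small Chinese Phi2 Model from Scratch
 **This is an experimental project. The code and model weights are open source, but the pretraining data is limited. If you need a better-performing small Chinese model, see [ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese).**
 # 1. ⚗️ Data cleaning
-Code: [dataset.ipynb](./0.dataset.ipynb).
 Cleaning steps include adding a full stop at the end of sentences, converting Traditional Chinese to Simplified Chinese, converting full-width characters to half-width, and removing repeated punctuation (some dialogue corpora contain many runs such as `"。。。。。"`).
 For the detailed data cleaning process, see [ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese).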
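@@ -33,7 +34,7 @@ tokenizer training is very memory-hungry:
 # 3. ⛏️ CLM (causal language model) pretraining
-Code: [pretrain.ipynb](./2.pretrain.ipynb)
 Unsupervised pretraining on a large amount of text. In addition to the base dataset, I also added `wiki` encyclopedia data.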
@@ -45,7 +46,7 @@ During CLM pretraining, the model input and output are the same; computing the cross-entropy loss
 # 4. ⚒️ SFT instruction fine-tuning
-Code: [sft.ipynb](./3.sft.ipynb)
 Mainly uses the `bell open source` datasets. Thanks to the [BELLE](https://github.com/LianjiaTech/BELLE) team.
@@ -60,7 +61,7 @@ text = f"##提问:\n{example['instruction']}\n##回答:\n{example['output'][EOS]
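 # 5. 📝 DPO preference optimization
-Code: [dpo.ipynb](./4.dpo.ipynb)
 Fine-tunes the SFT model according to personal preference. The dataset needs three columns: `prompt`, `chosen`, and `rejected`. Part of the `rejected` column was generated by an early model from the SFT stage (for example, if SFT runs for 4 `epoch`s, the checkpoint taken at 0.5 `epoch`). If a generated `rejected` response has a similarity of 0.9 or higher with `chosen`, that sample is discarded.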
@@ -68,14 +69,17 @@ DPO requires two models: one is the model being trained, the other is the reference
 # 6. 📑 How to use this project's model
 Model weights on `huggingface`: [Phi2-Chinese-0.2B](https://huggingface.co/charent/Phi2-Chinese-0.2B)
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
 import torch
-tokenizer = AutoTokenizer.from_pretrained('charent/Phi2-Chinese-0.2B')
-model = AutoModelForCausalLM.from_pretrained('charent/Phi2-Chinese-0.2B')
 device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
 txt = '感冒了要怎么办？'
 prompt = f"##提问:\n{txt}\n##回答:\n"

 library_name: transformers
 tags:
 - text-generation-inference
+pipeline_tag: text-generation
 ---
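 # Phi2-Chinese-0.2B: Train Your Own Small Chinese Phi2 Model from Scratch
 **This is an experimental project. The code and model weights are open source, but the pretraining data is limited. If you need a better-performing small Chinese model, see [ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese).**
 # 1. ⚗️ Data cleaning
+Code: [dataset.ipynb](https://github.com/charent/Phi2-mini-Chinese/blob/main/0.dataset.ipynb).
 Cleaning steps include adding a full stop at the end of sentences, converting Traditional Chinese to Simplified Chinese, converting full-width characters to half-width, and removing repeated punctuation (some dialogue corpora contain many runs such as `"。。。。。"`).
 For the detailed data cleaning process, see [ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese).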
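The cleaning steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from `dataset.ipynb`: the function names, the punctuation sets, and the order of the steps are all assumptions. Traditional-to-Simplified conversion is left out because it needs a conversion table or a third-party library such as OpenCC.

```python
import re

# Illustrative punctuation set whose consecutive repeats are collapsed (an assumption).
_REPEATABLE = "。、.,;:!?"
_REPEAT_RE = re.compile("([" + re.escape(_REPEATABLE) + r"])\1+")
# Characters treated as a valid sentence ending (an assumption).
_SENTENCE_END = "。.!?"


def fullwidth_to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' .. '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def collapse_repeated_punctuation(text: str) -> str:
    """Replace a run of the same punctuation mark with a single mark."""
    return _REPEAT_RE.sub(r"\1", text)


def ensure_final_period(text: str) -> str:
    """Append a Chinese full stop when the text lacks ending punctuation."""
    text = text.strip()
    if text and text[-1] not in _SENTENCE_END:
        text += "。"
    return text


def clean_text(text: str) -> str:
    """Apply the three steps in a fixed (hypothetical) order."""
    text = fullwidth_to_halfwidth(text)
    text = collapse_repeated_punctuation(text)
    return ensure_final_period(text)
```

Note that the width conversion also turns full-width punctuation such as `？` into `?`, while `。` (U+3002) lies outside the full-width ASCII range and is left unchanged.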
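 # 3. ⛏️ CLM (causal language model) pretraining
+Code: [pretrain.ipynb](https://github.com/charent/Phi2-mini-Chinese/blob/main/2.pretrain.ipynb)
 Unsupervised pretraining on a large amount of text. In addition to the base dataset, I also added `wiki` encyclopedia data.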
 # 4. ⚒️ SFT instruction fine-tuning
+Code: [sft.ipynb](https://github.com/charent/Phi2-mini-Chinese/blob/main/3.sft.ipynb)
 Mainly uses the `bell open source` datasets. Thanks to the [BELLE](https://github.com/LianjiaTech/BELLE) team.
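 # 5. 📝 DPO preference optimization
+Code: [dpo.ipynb](https://github.com/charent/Phi2-mini-Chinese/blob/main/4.dpo.ipynb)
 Fine-tunes the SFT model according to personal preference. The dataset needs three columns: `prompt`, `chosen`, and `rejected`. Part of the `rejected` column was generated by an early model from the SFT stage (for example, if SFT runs for 4 `epoch`s, the checkpoint taken at 0.5 `epoch`). If a generated `rejected` response has a similarity of 0.9 or higher with `chosen`, that sample is discarded.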
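The similarity filter described above could look roughly like the sketch below. The README does not say which similarity measure is used, so `difflib.SequenceMatcher` from the standard library stands in here as an assumption; the helper names `similarity` and `filter_preference_pairs` are hypothetical as well.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()


def filter_preference_pairs(rows, threshold=0.9):
    """Keep only rows whose `rejected` text differs enough from `chosen`.

    Each row is a dict with the three columns `prompt`, `chosen`, `rejected`.
    Rows with similarity at or above `threshold` are dropped.
    """
    kept = []
    for row in rows:
        if similarity(row["chosen"], row["rejected"]) >= threshold:
            continue  # too close to the preferred answer to be a useful negative
        kept.append(row)
    return kept
```

A pair whose `rejected` text is nearly identical to `chosen` carries almost no preference signal, which is presumably why such rows are discarded before DPO training.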
 # 6. 📑 How to use this project's model
 Model weights on `huggingface`: [Phi2-Chinese-0.2B](https://huggingface.co/charent/Phi2-Chinese-0.2B)
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
 import torch
 device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+tokenizer = AutoTokenizer.from_pretrained('charent/Phi2-Chinese-0.2B')
+model = AutoModelForCausalLM.from_pretrained('charent/Phi2-Chinese-0.2B').to(device)
 txt = '感冒了要怎么办？'
 prompt = f"##提问:\n{txt}\n##回答:\n"