# my_ml_text_classfication

**Repository Path**: allenchennn/my_ml_text_classfication

## Basic Information

- **Project Name**: my_ml_text_classfication
- **Description**: label studio 文本分类：label studio ml+deepseek
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-07-23
- **Last Updated**: 2025-08-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Python, labelstudio

## README

# Label Studio 文本分类集成 DeepSeek LLM 推理引擎

# 一、项目简介

基于 [Label Studio](https://labelstud.io/) 的文本分类任务，基于[label-studio-ml-backend](https://github.com/HumanSignal/label-studio-ml-backend)自建推理逻辑，集成 OpenAI 接口兼容的大模型服务 [DeepSeek](https://deepseek.com/)，实现自动化文本分类预测，并以标准格式回传结果至 Label Studio，可用于主动学习、批量辅助标注等场景。

# 二、模型开发

**备注：部分信息做了隐私处理用“`*`”代替。**

## 1、本demo目录结构

```
my_ml_text_classfication/
├── Dockerfile
├── docker-compose.yml
├── model.py
├── _wsgi.py
├── README.md
├── .env
└── requirements.txt
```

## 2、安装（忽略）

安装label-studio-ml-backend，见官网

```
git clone https://github.com/HumanSignal/label-studio-ml-backend.git
cd label-studio-ml-backend/
pip install -e .
```

## 3、新建ML后端

```
label-studio-ml create my_ml_text_classfication
```

## 4、添加环境变量

**备注：**添加环境变量是为了后续模型启动后可以读取远端Label Studio的部署地址来获取内部文件信息。

1）第一种方式：直接在终端命令中设置环境变量（暂时）

```
# Label Studio的部署地址和端口
set LABEL_STUDIO_URL=http://***.***.***.***:8080
# Label Studio的Legacy Token
set LABEL_STUDIO_API_KEY=*****************************
```

2）第二种方式：创建`.env`文件，并加载`.env`

.env文件：

```
LABEL_STUDIO_URL=http://***.***.***.***:8080
LABEL_STUDIO_API_KEY=*****************************
```

model.py

```
from dotenv import load_dotenv

# 确保环境变量可读取
load_dotenv()
```

## 5、主要文件结构介绍

打开 .\my_ml_text_classfication\model.py 文件，根据需要进行修改：

- `predict()`：在这里定义你的预测逻辑  
- `fit()`：在这里定义你的训练逻辑（可选）=====>由于我这里只是针对预测，所以这里保持不动。

## 5、predict逻辑

1）初始化客户端

```
client = OpenAI(
            api_key="xxxx",
            base_url="https://api.deepseek.com")
```

2）获取标签定义与输入字段映射

从 `label_config` 中提取：

- `from_name`：选项标签组件名称（如 `category`）
- `to_name`：绑定文本组件名称（如 `text`）
- `value`：实际字段名（即输入数据的键，如 `"text"`）

3）提取任务文本内容

判断是上传文件还是文本输入，读取内容

4）构造 Prompt 提示语

构造 prompt，指定标签分类任务

5）调用 DeepSeek 推理模型

调用 `client.chat.completions.create(...)`：

6）解析返回 JSON，获取 label 与 score

- 读取返回的字符串内容 `result_text`；
- 使用 `json.loads()` 解析为字典；
- 提取 `label` 与 `score`；
- 校验 `label` 是否在预定义标签列表中，避免无效输出。

7）构造符合 Label Studio 格式的预测结果

- 使用 UUID 生成预测条目的唯一标识；
- 填入模型版本、得分、预测标签等；
- 添加到 `predictions` 列表中。

8）返回 ModelResponse

- 返回所有预测结果；
- 接口响应格式符合 Label Studio 要求。

## 6、启动模型命令

```
label-studio-ml start .\my_ml_text_classfication
```

```
(label-studio-ml) D:\v_desktop\Annotation_platform\label-studio-ml-backend>label-studio-ml start .\my_ml_text_classfication
[2025-07-23 16:36:56,361] [INFO] [model::<module>::16] 读取的 LS 地址是：http://***.***.**.***:***
 * Serving Flask app 'label_studio_ml.api'
 * Debug mode: off
[2025-07-23 16:36:56,957] [INFO] [werkzeug::_log::97] WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:9090
 * Running on http://***.***.*.***:****
[2025-07-23 16:36:56,957] [INFO] [werkzeug::_log::97] Press CTRL+C to quit
```

# 三、备注

## 1、示例标注模板说明

```
<View>
  <Text name="text" value="$text" valueType="url"/>
  <View style="box-shadow: 2px 2px 5px #999;padding: 20px; margin-top: 2em;border-radius: 5px;">
    <Choices name="category" toName="text" choice="single" showInLine="true">
      <Choice value="财经"/>
      <Choice value="科技"/>
      <Choice value="餐饮评价"/>
      <Choice value="健康"/>
      <Choice value="娱乐"/>
    </Choices>
  </View>
</View>
```

当valueType="url"则适用于整个文件（如 `/data/upload/xx/xxx.txt`）。

当无valueType="url"则适用于划分文本的模板。

## 2、文件说明

Treat CSV/TSV as 

- List of tasks      # 文本按换行符分割====>对应文本file_use_split
- Time Series or Whole Text File   # 按整个文本加载====>需要valueType="url"

在test_files文件夹中：

```
test_files/
├── file_use_split.txt       
└── flie_use_whole.txt
```