# node-tarpit

**Repository Path**: qingfeng0512/node-tarpit

## Basic Information

- **Project Name**: node-tarpit
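- **Description**: Node Tarpit, a crawler honeypot/tarpit. 🕷️ An anti-crawler honeypot system built with Node.js that drains crawler resources by generating an endless supply of junk links and content.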
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-25
- **Last Updated**: 2026-05-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

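# Node Tarpit - Crawler Honeypot/Tarpit

🕷️ An anti-crawler honeypot system built with Node.js. It lures crawlers into registering through a realistic payment-platform page, then generates endless junk links to drain their resources.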

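## 📋 Table of Contents

- [Features](#-features)
- [Request Flow](#-request-flow)
- [Quick Start](#-quick-start)
- [Configuration](#-configuration)
- [Nginx Integration](#-nginx-integration)
- [Production Deployment](#-production-deployment)
- [Advanced Usage](#-advanced-usage)
- [Notes](#-notes)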

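## ✨ Features

| Feature | Description |
|---------|-------------|
| 🎣 Registration lure | Polished payment-platform landing page that lures crawlers into submitting sensitive information |
| 📝 Data collection | Records company name, legal representative, ID number, phone number, and more |
| 🎲 Random delay | Random 2-5 second response time to slow crawlers down |
| 🔗 Endless links | Tarpit page with 90 random links, looping indefinitely |
| 📊 Request monitoring | Automatically identifies and logs crawler requests |
| 🔄 Never repeats | Uses crypto to generate unique paths |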

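## 🔄 Request Flow

```
┌─────────────┐
│   /index    │  Polished payment-platform landing page
│  (entry)    │  Lures crawlers into clicking "register"
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  /register  │  Registration form honeypot
│ (sign-up)   │  Collects sensitive information
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   /news     │  Tarpit page
│ (trap links)│  90 looping links that drain the crawler
└──────┬──────┘
       │
       └───────→ Loops forever...
```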

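### Routes

| Route | Function | Description |
|-------|----------|-------------|
| `/` | Redirect | Automatically redirects to `/index` |
| `/index` | Home page | Payment-platform landing page that steers visitors toward registration |
| `/register` | Registration page | Multi-field form that collects sensitive information |
| `/news` | Tarpit | 90 random links, looping indefinitely |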
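
The routing table can be read as a small dispatch function. The sketch below is illustrative only: `resolveRoute` is a hypothetical name, the 302 status is an assumption, and treating every path under `/news` as a tarpit page is a guess at how the looping links are served. It is not code from `app.js`.

```javascript
// Hypothetical route resolver mirroring the routing table; not taken from app.js.
function resolveRoute(pathname) {
  if (pathname === '/') {
    // Assumption: a temporary redirect is used for the root path
    return { status: 302, location: '/index' };
  }
  if (pathname === '/index') {
    return { status: 200, page: 'landing' };
  }
  if (pathname === '/register') {
    return { status: 200, page: 'register' };
  }
  // Assumption: every path under /news, however deep, renders the tarpit page
  if (pathname === '/news' || pathname.startsWith('/news/')) {
    return { status: 200, page: 'tarpit' };
  }
  return { status: 404, page: 'not-found' };
}

console.log(resolveRoute('/'));
console.log(resolveRoute('/news/3f9a1c'));
```

Keeping the decision in a pure function like this makes the flow easy to unit test before wiring it into Express handlers.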

## 🚀 Quick Start

### 1. Install dependencies

```bash
cd node-tarpit
npm install
```

### 2. Start the service

```bash
# Development mode
npm run dev

# Production mode
npm start
```

### 3. Verify it is running

Open `http://localhost:3000` to see the honeypot page.

## ⚙️ Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `3000` | Port the service listens on |
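
As a rough sketch of how the `PORT` variable and its default might be resolved, assuming the value is read from `process.env` (the helper name `resolvePort` is hypothetical and the validation is an addition, not something `app.js` is known to do):

```javascript
// Hypothetical helper: pick the listening port from the environment,
// falling back to the documented default of 3000 when PORT is unset or invalid.
function resolvePort(env) {
  const parsed = Number.parseInt(env.PORT ?? '', 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return valid ? parsed : 3000;
}

console.log(resolvePort(process.env));
```

With this in place, starting the service as `PORT=8080 npm start` would listen on 8080, while an unset or malformed value falls back to 3000.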

### Tuning Core Parameters

Edit `app.js` to adjust the following parameters:

```javascript
// Delay in milliseconds
const delay = Math.floor(Math.random() * 3000) + 2000;  // 2-5 seconds

// Length of the junk content
generateGibberish(80, 200);  // 80-200 words

// Number of links
generateSpamLinks(req.path, 80);  // 30-80 links
```
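
The helpers called above live in `app.js` and are not reproduced in this README. To make the parameters concrete, here is a minimal sketch of what such helpers could look like. The function bodies, the placeholder `WORDS` list, the 16-character hex paths, and the shape of the returned link objects are all assumptions, not the project's actual implementation.

```javascript
const crypto = require('node:crypto');

// Placeholder vocabulary; the real WORDS array lives in app.js
const WORDS = ['payment', 'gateway', 'merchant', 'settlement', 'invoice', 'ledger'];

// Uniform integer in [min, max]
function randomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// 16 hex characters from 8 random bytes, so collisions are very unlikely
function generateRandomPath() {
  return crypto.randomBytes(8).toString('hex');
}

// Between minWords and maxWords randomly chosen words joined by spaces
function generateGibberish(minWords = 80, maxWords = 200) {
  const count = randomInt(minWords, maxWords);
  return Array.from({ length: count }, () => WORDS[randomInt(0, WORDS.length - 1)]).join(' ');
}

// `count` links that all point back under the current path's first segment
function generateSpamLinks(currentPath, count) {
  const base = '/' + (currentPath.split('/')[1] || 'news');
  return Array.from({ length: count }, () => ({
    href: `${base}/${generateRandomPath()}`,
    text: generateGibberish(2, 5),
  }));
}

const links = generateSpamLinks('/news/abc', 3);
console.log(links);
```

Because every link points back under the same prefix with a fresh random suffix, a crawler that follows them keeps landing on new tarpit pages.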

## 🌐 Nginx Integration

### Basic Configuration

Create the Nginx configuration file `/etc/nginx/sites-available/tarpit`:

```nginx
server {
    listen 80;
    server_name your_domain.com;  # replace with your domain or IP

    # Honeypot entry point
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Tarpit tuning: longer timeouts so delayed responses are not cut off
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
        proxy_connect_timeout 60s;
    }
}
```

### Enabling the Configuration

```bash
# Create the symlink
sudo ln -s /etc/nginx/sites-available/tarpit /etc/nginx/sites-enabled/

# Test the configuration
sudo nginx -t

# Reload Nginx
sudo systemctl reload nginx
```

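### Targeted Mode: Crawlers Only

```nginx
server {
    listen 80;
    server_name your_domain.com;

    # Internal redirect: a 418 raised below is handled by the tarpit location
    error_page 418 = @tarpit;

    # Regular visitors reach the real site
    location / {
        # Match common crawler user agents
        if ($http_user_agent ~* "(GPTBot|ClaudeBot|Googlebot|Bingbot|Bytespider|AppleBot)") {
            return 418;
        }
        proxy_pass http://127.0.0.1:8080;  # real site port
    }

    # Known crawler user agents are forwarded to the honeypot;
    # the request path is passed through unchanged
    location @tarpit {
        proxy_pass http://127.0.0.1:3000;
        proxy_read_timeout 300s;
    }
}
```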

## 📦 Production Deployment

### Using PM2

```bash
# Install PM2 globally
sudo npm install -g pm2

# Start the honeypot
pm2 start app.js --name "crawler-trap"

# Start on boot
pm2 startup
pm2 save

# Check status
pm2 status

# View logs
pm2 logs crawler-trap
```

### Cluster Mode (Multi-core CPUs)

```bash
# Start one instance per CPU core
pm2 start app.js --name "crawler-trap" -i max
```

### Docker Deployment

Create a `Dockerfile`:

```dockerfile
FROM node:18-alpine

WORKDIR /app
COPY package*.json ./
RUN npm install --production

COPY app.js ./

EXPOSE 3000

CMD ["node", "app.js"]
```

Build and run:

```bash
docker build -t node-tarpit .
docker run -d -p 3000:3000 --name tarpit node-tarpit
```

## 🔧 Advanced Usage

### Custom Vocabulary

Edit the `WORDS` array in `app.js` to add domain-specific vocabulary:

```javascript
const WORDS = [
  'your', 'custom', 'words',
  // ... existing words
];
```

### Adding More Routes

```javascript
// API honeypot
// Express 4 wildcard syntax; Express 5 requires a named wildcard such as '/api/*splat'
app.get('/api/*', (req, res) => {
  setTimeout(() => {
    res.json({
      status: 'loading',
      data: generateGibberish(),
      next_page: `/api/${generateRandomPath()}`
    });
  }, 3000);
});

// JSON response honeypot
app.get('/data/:id', (req, res) => {
  res.json({
    id: req.params.id,
    content: generateGibberish(),
    links: Array(50).fill(null).map(() => ({
      url: `/data/${generateRandomPath()}`,
      title: generateGibberish(2, 5)
    }))
  });
});
```

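### Monitoring and Logging

```javascript
// Add a logging middleware
const fs = require('fs');

// Create the log directory up front so the first append does not fail
fs.mkdirSync('logs', { recursive: true });

app.use((req, res, next) => {
  const userAgent = req.get('user-agent') || '';
  const log = {
    timestamp: new Date().toISOString(),
    ip: req.get('x-forwarded-for') || req.ip,
    path: req.path,
    userAgent,
    isCrawler: /bot|spider|crawler/i.test(userAgent)
  };

  // Append asynchronously so logging does not block the event loop
  fs.appendFile('logs/requests.log', JSON.stringify(log) + '\n', (err) => {
    if (err) console.error('Failed to write request log:', err.message);
  });
  next();
});
```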

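## ⚠️ Notes

### 🚨 Important Warnings

1. **SEO impact**: A domain that hosts the honeypot may be demoted or delisted by search engines
   - ✅ Use a dedicated subdomain (for example `trap.example.com`)
   - ❌ Do not deploy directly on your main site's domain

2. **Crawler identification**: Legitimate crawlers may be caught by mistake
   - Legitimate crawlers such as Googlebot and Bingbot can be led into the trap
   - Pair the honeypot with a `robots.txt` file

3. **Resource consumption**:
   - Each request holds a connection open for 2-5 seconds
   - High traffic may require tuning the server configuration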
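
To put the resource warning in numbers: by Little's law, the average number of connections held open is roughly the request rate multiplied by the average hold time. The sketch below assumes the delay is uniformly distributed over the 2-5 second window, so the mean is 3.5 seconds. `estimateOpenConnections` is a hypothetical helper, and the estimate ignores network and rendering time.

```javascript
// Rough estimate of concurrently open connections (Little's law: L = rate * mean wait).
// Assumes the artificial delay is uniform between minDelayMs and maxDelayMs.
function estimateOpenConnections(requestsPerSecond, minDelayMs = 2000, maxDelayMs = 5000) {
  const meanDelaySeconds = (minDelayMs + maxDelayMs) / 2 / 1000;
  return requestsPerSecond * meanDelaySeconds;
}

console.log(estimateOpenConnections(100)); // 350
```

At 100 requests per second this works out to about 350 connections open at any moment, which is a useful starting point when sizing file-descriptor limits and Nginx worker connections.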

### robots.txt Example

```txt
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

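### Legal Compliance

- Make sure your deployment complies with local laws and regulations
- Do not use `robots.txt` to deliberately mislead crawlers (this may be illegal in some jurisdictions)
- Use the honeypot for security protection purposes only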

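## 📊 Monitoring Metrics

The following metrics are worth monitoring:

| Metric | Description |
|--------|-------------|
| Requests per minute | Gauges crawler traffic |
| Average response time | Confirms the delay is taking effect |
| Connection count | Gauges resource consumption |
| Crawler UA distribution | Identifies the main crawler sources |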
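
Two of these metrics, requests per minute and crawler UA distribution, can be derived from the JSON-lines file written by the logging middleware shown earlier. The sketch below assumes each line carries the `timestamp`, `userAgent`, and `isCrawler` fields from that middleware. `summarizeLog` is a hypothetical helper, and the sample entries are made up for illustration.

```javascript
// Summarise JSON-lines request logs: requests per minute and crawler UA counts
function summarizeLog(text) {
  const perMinute = {};
  const crawlerAgents = {};
  for (const line of text.split('\n')) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    const minute = String(entry.timestamp).slice(0, 16); // "YYYY-MM-DDTHH:MM"
    perMinute[minute] = (perMinute[minute] || 0) + 1;
    if (entry.isCrawler) {
      const ua = entry.userAgent || 'unknown';
      crawlerAgents[ua] = (crawlerAgents[ua] || 0) + 1;
    }
  }
  return { perMinute, crawlerAgents };
}

// Made-up sample entries in the same shape the logging middleware writes
const sample = [
  '{"timestamp":"2026-05-25T10:00:05.000Z","path":"/index","userAgent":"GPTBot/1.0","isCrawler":true}',
  '{"timestamp":"2026-05-25T10:00:40.000Z","path":"/news","userAgent":"GPTBot/1.0","isCrawler":true}',
  '{"timestamp":"2026-05-25T10:01:02.000Z","path":"/index","userAgent":"Mozilla/5.0","isCrawler":false}',
].join('\n');

console.log(summarizeLog(sample));
```

For a real run, the input could come from `fs.readFileSync('logs/requests.log', 'utf8')`. For large log files a streaming reader would be a better fit than loading everything into memory.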

## 📝 License

MIT License

## 🤝 Contributing

Issues and pull requests are welcome!