Design and Implementation of a Machine-Learning-Based WebShell Intrusion Detection System

1. Project Title

Design and Implementation of a Machine-Learning-Based WebShell Intrusion Detection System

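2. Research Background and Significance

With the rapid growth of the Internet, Web applications have become a core component of information systems in government, enterprise, education, and other sectors. Their openness and dynamic nature, however, also make them a prime target for network attacks. WebShell attacks, which are both highly stealthy and highly damaging, have appeared frequently in security incidents in recent years.

A WebShell is a malicious script that an attacker uploads to a Web server, usually written in a scripting language such as PHP, JSP, or ASP. It can bypass login authentication and lets the attacker execute commands remotely through a browser, steal data, upload further malicious programs, or even take control of the whole server. Where weak passwords and file-upload vulnerabilities are common, the WebShell becomes a key link in the attack chain.

Traditional WebShell detection relies mostly on signature matching or regular-expression rules. These methods work reasonably well against known attacks but are largely ineffective against WebShells that have been mutated, obfuscated, or encrypted. In addition, as attacker techniques evolve, WebShell behavior looks increasingly "normal" and is harder to identify with static rules. Achieving smarter, more dynamic, and more accurate WebShell detection has therefore become both an active research topic and a technical challenge in network security.

This project proposes to use machine learning to build an intelligent intrusion detection system by extracting features from Web requests and log behavior data and training models on them. The system is intended to generalize well enough to recognize WebShell variants, and to be real-time and extensible enough for practical deployments such as government websites and enterprise servers. The project aims to improve the security defenses of Web applications, and it has both theoretical research value and practical engineering significance.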

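3. Research Content

This study addresses the key problems in WebShell detection by designing and implementing a machine-learning-based intrusion detection system. The work covers three layers (data, algorithms, and system) and consists of the following parts: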

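Data collection and cleaning

Collect server access logs and request messages from real Web applications (for example, Apache/Nginx access logs and HTTP request bodies);
Incorporate publicly available WebShell samples (such as C99, WSO, and China Chopper);
Clean the data with tools such as Pandas: handle missing values and outliers, normalize field formats, and build a labeled dataset;
Separate normal requests from malicious WebShell requests to produce the sample structure required for supervised learning.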
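As a rough illustration of the log-collection step, the sketch below parses one line in the common Apache/Nginx "combined" log format into named fields using only the standard library. The regular expression, the `parse_access_log_line` name, and the sample line are assumptions made for this example; real deployments often use custom log formats and would need a different pattern.

```python
import re

# Assumed layout: the widely used "combined" access log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_access_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical log line, invented for this example.
sample = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0800] '
          '"POST /uploads/x.php?cmd=id HTTP/1.1" 200 512 '
          '"-" "curl/8.0"')
record = parse_access_log_line(sample)
print(record["ip"], record["method"], record["url"], record["status"])
```

Lines that fail to parse return `None`, so they can be counted and inspected separately instead of being silently dropped during cleaning.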

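Feature engineering and vectorization

Static features: request path length, number of parameters, and whether dangerous functions (eval, assert, exec, etc.) are present;
Content features: use regular expressions to detect base64-encoded strings, and compute character entropy and shell-specific keywords;
Dynamic behavior features: analyze access frequency, IP distribution, number of consecutive requests, activity at unusual hours, and so on;
Text vectorization: convert text into vectors with TF-IDF, bag-of-words, and n-gram tokenization so it can be fed to machine learning algorithms.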
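Two of the content features listed above, character entropy and base64 detection, can be sketched with the standard library alone. The function names and the 40-character threshold for a "base64 run" are assumptions chosen for this example, not values taken from the project; they would need tuning on real data.

```python
import math
import re
from collections import Counter

# Runs of 40 or more base64-alphabet characters. The threshold is an
# arbitrary choice for this sketch.
BASE64_RUN = re.compile(r'[A-Za-z0-9+/]{40,}={0,2}')


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def content_features(text):
    """Return a small dict of content features for one request or file."""
    return {
        "entropy": shannon_entropy(text),
        "num_base64_runs": len(BASE64_RUN.findall(text)),
    }


print(content_features('<?php echo "hello world"; ?>'))
print(content_features("QUJD" * 12))  # a long base64-looking run
```

Encoded or encrypted payloads tend to have a flatter character distribution than ordinary source code, which is why entropy is often used as a cheap signal alongside keyword counts.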

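Model training, evaluation, and optimization

Select suitable classification models, including decision trees, random forests, support vector machines (SVM), XGBoost, and logistic regression;
Tune hyperparameters with cross-validation and grid search;
Compare the models on accuracy, recall, F1 score, AUC, and other metrics;
Try ensemble learning or deep learning (such as TextCNN) to further improve detection accuracy.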
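The cross-validation and grid-search step might look like the following sketch with scikit-learn. The data here is synthetic (`make_classification`) and stands in for a real feature matrix; the parameter grid and the choice of F1 as the selection metric are illustrative assumptions, not settings from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for a real feature matrix (10 features per sample).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Small illustrative grid; a real search would cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    # Stratified folds keep the class ratio similar in every split.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", round(search.best_score_, 3))
```

Other estimators from the list above can be compared the same way by swapping the classifier and its grid while keeping the folds and the scoring metric fixed.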

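Model deployment and system integration

Build a simple Web service interface with the Flask framework so the model can be called online;
Users can upload request logs or text files through a Web page, or enter request parameters manually, and the system returns whether the request is malicious;
Save the model in pickle or joblib format for fast loading and prediction;
Connect to a log database and, together with the loguru logging framework, audit and record every prediction.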

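System testing and real-world validation

Build a test set for black-box and white-box testing of the system, simulating a variety of attack scenarios;
Compare the detection capability of the machine learning model with that of traditional regex rules in a real environment;
Iteratively refine the features and model parameters based on the false positive and false negative rates;
Write the user manual and prepare deployment documentation and an operations guide.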
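To make the comparison between a rule-based detector and a learned model concrete, the false positive and false negative rates can be computed from each detector's predictions on the same labeled test set. The sketch below uses only the standard library; the labels and predictions are made-up values for illustration, not measured results.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = WebShell, 0 = normal)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def error_rates(y_true, y_pred):
    """Return the false positive rate and false negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # normal files flagged
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # WebShells missed
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}


# Made-up labels and predictions, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rule_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
model_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

print("regex rules:", error_rates(y_true, rule_pred))
print("ML model:   ", error_rates(y_true, model_pred))
```

Reporting both rates side by side matters because a detector can lower one simply by raising the other; tuning should track the pair rather than accuracy alone.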

4. Environment Setup

4.1 System Environment

Python >= 3.7
pip (the Python package manager)

4.2 Installing Dependencies

pip install -r requirements.txt

The contents of `requirements.txt` are as follows:

flask
scikit-learn
joblib

4.3 Project Structure

webshell_detection/
├── data/
│   ├── webshell/               # WebShell samples (for training)
│   └── normal/                 # Normal web page files (for training)
├── model/
│   └── webshell_detector.pkl   # Trained machine learning model
├── extract_features.py         # Feature extraction script
├── train_model.py              # Model training script
├── app.py                      # Flask Web detection service
├── requirements.txt            # Python dependency list
└── utils.py                    # Utility functions (optional extension)

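5. Module Descriptions

5.1 `extract_features.py` - Feature Extraction Module

Purpose:

Extracts static code features from PHP files, such as the number of occurrences of sensitive function names (eval, base64_decode, assert, etc.).

Main features:

File length
Number of occurrences of eval
Number of occurrences of base64 encoding functions
Frequency of keywords such as system, exec, passthru, assert, preg_replace, and str_rot13

Example function: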

def extract_features_from_file(filepath):
    ...
    features = {
        "length": len(content),
        "num_eval": content.count("eval"),
        ...
    }
    return list(features.values()), features
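One limitation of `content.count("eval")` is that it counts substrings, so identifiers such as `evaluate` or `shell_exec` inflate the `eval` and `exec` counts. A possible refinement, shown here as a sketch rather than as part of the project, is to count only occurrences that look like function calls. The `count_calls` helper and the sample string are hypothetical.

```python
import re


def count_calls(content, name):
    """Count occurrences of `name(` that are not part of a longer identifier."""
    # (?<![\w$]) rejects matches preceded by an identifier character, so
    # "shell_exec(" is not counted as a call to "exec".
    pattern = r'(?<![\w$])' + re.escape(name) + r'\s*\('
    # PHP function names are case-insensitive.
    return len(re.findall(pattern, content, flags=re.IGNORECASE))


# Hypothetical PHP snippet, invented for this example.
sample = '<?php $evaluate = 1; eval($_POST["x"]); shell_exec("id"); ?>'

print("eval: substring =", sample.count("eval"),
      "| calls =", count_calls(sample, "eval"))
print("exec: substring =", sample.count("exec"),
      "| calls =", count_calls(sample, "exec"))
```

Neither approach sees through obfuscation such as string concatenation or variable functions, so call counts are best treated as one signal among several.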

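5.2 `train_model.py` - Model Training Module

Purpose:

Loads the dataset, extracts features, and trains a WebShell detection model with a Random Forest classifier.

Training steps:

Load the WebShell and normal script samples
Extract features from every file
Train a RandomForestClassifier
Save the model to model/webshell_detector.pkl

Run with: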

python train_model.py

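5.3 `app.py` - Web Detection Service

Purpose:

Provides an HTTP API that accepts an uploaded PHP file and returns whether it is classified as a WebShell.

API:

URL: /detect
Method: POST
Parameter: file (the uploaded PHP script)
Response: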

{
  "is_webshell": true,
  "features": {
    "length": 1234,
    "num_eval": 2,
    ...
  }
}

Start the service:

python app.py

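6. Usage

6.1 Preparing the Data

Place the collected sample files in the following directories:

Malicious WebShell samples: data/webshell/
Normal web scripts: data/normal/

✅ Samples can be collected from GitHub, PHP projects, CTF challenge archives, and similar sources.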

6.2 Training the Model

python train_model.py

Output:

✅ Model training complete and saved.

Generated file: model/webshell_detector.pkl

6.3 Starting the Web Service

python app.py

The service listens on http://127.0.0.1:5000 by default.

6.4 Example API Call

Test with curl:

curl -F "file=@test.php" http://127.0.0.1:5000/detect

Example response:

{
  "is_webshell": true,
  "features": {
    "length": 250,
    "num_eval": 1,
    "num_base64": 2,
    "num_exec": 0,
    "num_shell": 0,
    "num_str_rot13": 0,
    "num_preg_replace": 0,
    "num_assert": 0,
    "num_system": 0,
    "num_passthru": 0
  }
}

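7. Conclusion

This system is a lightweight WebShell detection solution, suitable as a graduation project, a network security lab exercise, or an initial static detection module for an enterprise. It can be extended step by step into a more sophisticated dynamic detection system.

The complete code is listed below: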

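📌 1. `extract_features.py` - Feature Extraction Script

# extract_features.py

def extract_features_from_file(filepath):
    """Return (feature_vector, feature_dict) for one script file."""
    # Undecodable bytes are dropped so that binary or oddly encoded samples
    # do not abort feature extraction.
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        content = f.read()

    # Plain substring counts: "exec" also matches "shell_exec", and "eval"
    # also matches words such as "evaluate".
    features = {
        "length": len(content),
        "num_eval": content.count("eval"),
        "num_base64": content.count("base64"),
        "num_exec": content.count("exec"),
        "num_shell": content.count("shell"),
        "num_str_rot13": content.count("str_rot13"),
        "num_preg_replace": content.count("preg_replace"),
        "num_assert": content.count("assert"),
        "num_system": content.count("system"),
        "num_passthru": content.count("passthru"),
    }
    # Dicts preserve insertion order (Python 3.7+), so the vector layout is
    # the same at training time and at prediction time.
    return list(features.values()), features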

📌 2. `train_model.py` - Model Training Script

# train_model.py
import os
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from extract_features import extract_features_from_file

def list_files(directory):
    """Return the paths of the regular files directly inside `directory`."""
    paths = (os.path.join(directory, name) for name in sorted(os.listdir(directory)))
    return [p for p in paths if os.path.isfile(p)]

def load_dataset(webshell_dir, normal_dir):
    """Build the feature matrix X and label vector y from both sample folders."""
    X = []
    y = []
    for filepath in list_files(webshell_dir):
        feats, _ = extract_features_from_file(filepath)
        X.append(feats)
        y.append(1)  # WebShell = 1

    for filepath in list_files(normal_dir):
        feats, _ = extract_features_from_file(filepath)
        X.append(feats)
        y.append(0)  # Normal = 0

    return np.array(X), np.array(y)

if __name__ == '__main__':
    X, y = load_dataset('data/webshell', 'data/normal')
    # The model is fitted on every sample; no held-out set is evaluated here.
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X, y)
    os.makedirs('model', exist_ok=True)
    joblib.dump(clf, 'model/webshell_detector.pkl')
    print("✅ Model training complete and saved.")

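📌 3. `app.py` - Flask Web Detection Service

# app.py
import os
import tempfile

import joblib
from flask import Flask, request, jsonify
from extract_features import extract_features_from_file

app = Flask(__name__)
# Load the trained model once at startup (run train_model.py first).
model = joblib.load('model/webshell_detector.pkl')

@app.route('/detect', methods=['POST'])
def detect():
    if 'file' not in request.files:
        return jsonify({'error': 'No file uploaded'}), 400

    file = request.files['file']
    # Save to a server-generated temporary path instead of trusting the
    # client-supplied filename, which could contain path components.
    fd, filepath = tempfile.mkstemp(suffix='.upload')
    os.close(fd)
    try:
        file.save(filepath)
        features, feature_dict = extract_features_from_file(filepath)
        prediction = model.predict([features])[0]
    finally:
        # Always delete the uploaded file, even if prediction fails.
        os.remove(filepath)

    return jsonify({
        'is_webshell': bool(prediction),
        'features': feature_dict
    })

if __name__ == '__main__':
    # debug=True is meant for local development only.
    app.run(debug=True)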

📌 4. `requirements.txt`

flask
scikit-learn
joblib

✅ Usage

Prepare the dataset:

data/webshell/        # PHP/ASP/JSP WebShell samples
data/normal/          # Normal web scripts, e.g. from WordPress or Laravel

Install dependencies:

pip install -r requirements.txt

Train the model:

python train_model.py

Start the service:

python app.py

Detect a WebShell (with curl or Postman):

curl -F "file=@example.php" http://127.0.0.1:5000/detect
