php_webshell_det
arXiv:2502.19257v2 [cs.CR] 27 Feb 2025

Abstract: A webshell is a type of backdoor, and web applications are widely exposed to webshell injection attacks. Therefore, it is important to study webshell detection techniques. In this study, we propose a webshell detection method. We first convert PHP source code to opcodes and then extract Opcode Double-Tuples (ODTs). Next, we combine the CodeBERT and FastText models for feature representation and classification. To address the difficulty that deep learning methods have in detecting long webshell files, we introduce a sliding window attention mechanism. This approach effectively captures malicious behavior within long files. Experimental results show that our method reaches high accuracy in webshell detection, addressing the weakness of traditional methods, which struggle with new webshell variants and anti-detection techniques.

I. INTRODUCTION

Webshell injection plays a vital role in the attack chain, enabling the attacker to remotely control devices, acquire sensitive data, and further expand attack activities. Therefore, detecting and removing webshells is an effective way to defend against attacks and ensure web security.

Traditional webshell detection methods [1], [2] based on pattern matching usually rely on recognizing known features, including source code features, traffic features, dynamic function calls and other relevant features. However, as attack techniques evolve, the variability and obfuscation of webshells have become more prevalent. Attackers often use obfuscation, dynamic loading, and encryption and decryption techniques to evade detection, making traditional detection methods inadequate for recognizing new types of webshells.

In this context, webshell detection methods using deep learning [3], [4], [5], including those based on source code or opcode, have become a research hotspot and have shown promising results. However, current deep learning-based webshell detection methods still face challenges [6]. For datasets, publicly available datasets are outdated and do not contain the latest samples, so models trained on them may perform poorly in real-world environments. For data processing, a good data processing method is often more important than the detection model. Opcode-based detection methods typically extract only a single sequence of opcode instructions (called Opcode Single-Tuples, OSTs) without effectively capturing low-level code features. Source code-based methods have difficulty processing webshells that use anti-detection techniques. In addition, detecting long files (such as complex dynamic encryption and decryption scripts or large files) is quite challenging. Methods such as sample slicing [3] or TextRank [5] are often used to reduce data size, which may result in some loss of code information or disruption of contextual relationships.

This study focuses on the PHP language because PHP is used by 75.1% of all the websites whose server-side programming language is known [7]. To address these challenges, the contributions of this study include (1) collating a new high-quality webshell dataset, (2) proposing a PHP code data processing method that extracts Opcode Double-Tuples (ODTs), which include opcode instructions and operands, instead of Opcode Single-Tuples (OSTs), and (3) introducing a sliding window attention mechanism to solve the long text problem.

II. METHODOLOGY

The detection method consists of two steps. First, the PHP source code in the dataset is processed into ODTs. Second, using a sliding window attention mechanism, we combine the CodeBERT model [8] and the FastText model [9] for feature representation and binary classification of the ODTs. Our dataset and processing code are publicly available: https://github.com/w-32768/PHP-Webshell-Detection-via-Opcode-Analysis

Fig. 1. Overview of the detection method.

A. Data processing

The dataset consists of PHP source code files containing 5001 webshell samples and 5936 benign PHP files. First, we convert the PHP source code to opcodes. The opcode, generated by the Zend Engine in PHP, is a low-level abstraction of source code. As anti-detection techniques are mostly
used at the source code level, opcode-based detection has a natural advantage.

After obtaining the opcodes, a series of data processing steps are performed. We use expert knowledge to establish fine-grained processing rules, extracting high-value instructions for detection while excluding those of low relevance, thus reducing opcode length without compromising contextual semantics. Operands may be encoded with URL or Base64 encoding, making it difficult to determine their semantics. Therefore, we decode them: the original string content is restored based on string feature recognition. After this extraction, we have the set of opcode instructions and operands, called Opcode Double-Tuples. Experimental comparisons show that, with the same detection model trained on our dataset, ODTs achieve a 4.6% accuracy improvement compared to OSTs, which supports the value of including operands in the representation.

B. Feature Representation and Binary Classification

After data processing, this study explores using the CodeBERT model and various embedding models for feature representation and binary classification of ODTs. The steps are as follows:

1) Feature Representation:

• CodeBERT Model: The CodeBERT model is a widely used pre-trained language model optimized for code understanding tasks and pre-trained on PHP code. We input the ODTs into the CodeBERT model to generate high-dimensional feature vector representations that capture the semantic and syntactic information of the opcodes.
• Embedding Models: To enhance opcode feature representation, we compared four embedding models: Word2Vec, FastText, GloVe, and Doc2Vec. Experimental comparisons show that FastText performs best in the opcode classification task; therefore, we chose FastText as the embedding model.
• Feature Fusion: We fuse the feature vectors generated by CodeBERT with the embedding vectors from FastText to form the final feature representation. The specific fusion formula is as follows:

$$E = \lambda E_{\text{CodeBERT}} + (1 - \lambda) E_{\text{FastText}} \tag{1}$$

$E_{\text{CodeBERT}}$ and $E_{\text{FastText}}$ represent the feature vectors generated by CodeBERT and FastText, respectively. $\lambda$ is the weight coefficient, and its optimal value is determined through experimentation.

2) Sliding Window Attention Mechanism:

We introduce a sliding window attention mechanism to address the high computational complexity of global self-attention mechanisms for long opcode sequences. The opcode sequence is divided into multiple windows of size $W$ with a stride of $S_r$ ($S_r < W$). Specifically, self-attention is calculated independently within each window. The global feature representation is obtained by averaging the last hidden states from the CodeBERT encoder across all windows. This mechanism reduces memory requirements and allows longer sequences to be processed. Furthermore, the overlap between adjacent windows allows information exchange, making it possible to detect malicious behaviors.

The sliding window attention mechanism reduces computational complexity and preserves the contextual information of the opcode sequence. Thus, the problem of incomplete information caused by other methods is avoided.

3) Binary Classification:

After obtaining the global feature representation of the ODTs, we input it into a binary classifier. The classifier consists of fully connected layers and activation functions, trained by minimizing the binary cross-entropy loss function. It distinguishes between benign PHP code and malicious webshells.

4) Model Training and Evaluation:

We fine-tuned the CodeBERT model using the AdamW optimizer. Experimental results show that our proposed optimal model achieves an accuracy of 99.2% and an F1 score of 99.1% on the test set. Comparative experiments with accessible state-of-the-art webshell detection methods, including webshellPub [2] (Acc: 77.3%, F1: 68.5%), PHP Malware Finder [1] (Acc: 83.4%, F1: 78.9%), and MSDetector [3] (Acc: 97.1%, F1: 97.3%), demonstrate the superiority of our method.

III. CONCLUSION

This study presents a PHP webshell data processing method that extracts ODTs, addressing the limitations of single-tuple detection. Additionally, we introduce a sliding window attention mechanism that effectively mitigates the challenges of long text detection. This study offers a new perspective on the field of malicious code detection. In the future, we aim to continue exploring multi-language webshell detection tasks to improve detection performance and generalization capabilities.

ACKNOWLEDGMENT

This work was supported by “the Fundamental Research Funds for the Central Universities” (Grant Number: 3282024050).

REFERENCES

[1] NBS System, “PHP malware finder,” 2022. [Online]. Available: https://github.com/nbs-system/php-malware-finder
[2] ShellPub, “PHP webshell detection,” 2024. [Online]. Available: https://n.shellpub.com/en
[3] B. Cheng, Y. Guo, Y. Ren, G. Yang, and G. Xu, “MSDetector: a static PHP webshell detection system based on deep learning,” in Theoretical Aspects of Software Engineering, vol. 13299, 2022, pp. 155–172.
[4] A. Hannousse, M. Nait-Hamoud, and S. Yahiouche, “A deep learner model for multi-language webshell detection,” International Journal of Information Security, vol. 22, no. 1, pp. 47–61, 2023.
[5] T. An, X. Shui, and H. Gao, “Deep learning based webshell detection coping with long text and lexical ambiguity,” in Information And Communications Security, 2022, pp. 438–457.
[6] M. Ma, L. Han, and C. Zhou, “Research and application of artificial intelligence based webshell detection model: a literature review,” ArXiv, vol. 2405.00066, 2024.
[7] W3Techs, “Usage statistics and market share of PHP for websites,” 2025. [Online]. Available: https://w3techs.com/technologies/details/pl-php
[8] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong et al., “CodeBERT: a pre-trained model for programming and natural languages,” ArXiv, vol. 2002.08155, 2020.
[9] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” ArXiv, vol. 1607.01759, 2016.
Poster: Long PHP webshell files detection based on sliding window attention
Zhiqiang Wang1✉, Haoyu Wang1✉, Lu Hao2
1Beijing Electronic Science & Technology Institute, 2Beijing Municipal Public Security Bureau, Beijing, China
Email: [email protected], [email protected]
Fig. 3. Source code.
Fig. 4. Raw opcode.
Fig. 5. Opcode Double-Tuples*.
Fig. 6. Opcode Single-Tuples.
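Since the figure images are not reproduced here, the following minimal Python sketch illustrates the difference between the representations named in Figs. 4 to 6. It is not the authors' processing code. The opcode rows are hand-written for illustration, the `(instruction, operand)` row shape, the `LOW_RELEVANCE` filter list, and the decoding heuristic in `decode_operand` are all assumptions, and the paper's expert rules are not specified in enough detail to reproduce.

```python
import base64
import binascii
import re
from urllib.parse import unquote

# Hand-written, simplified opcode rows: (instruction, operand or None).
# Real Zend opcode dumps carry more columns; this shape is an assumption.
RAW_OPCODES = [
    ("ASSIGN", "c3lzdGVt"),        # Base64 of "system"
    ("FETCH_R", "_POST"),
    ("FETCH_DIM_R", "cmd"),
    ("NOP", None),                 # treated as low relevance in this sketch
    ("INIT_DYNAMIC_CALL", None),
    ("SEND_VAR", "%2Fbin%2Fsh"),   # URL-encoded "/bin/sh"
    ("DO_FCALL", None),
]

# Assumed stand-in for the paper's expert filtering rules.
LOW_RELEVANCE = {"NOP", "EXT_STMT"}

_B64_RE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")


def decode_operand(operand):
    """Heuristically undo URL or Base64 encoding; return the input otherwise."""
    if "%" in operand:
        return unquote(operand)
    if len(operand) >= 8 and len(operand) % 4 == 0 and _B64_RE.match(operand):
        try:
            decoded = base64.b64decode(operand, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return operand
        if decoded.isprintable():
            return decoded
    return operand


def to_ost(rows):
    """Opcode Single-Tuples: the instruction sequence only."""
    return [op for op, _ in rows if op not in LOW_RELEVANCE]


def to_odt(rows):
    """Opcode Double-Tuples: instruction plus decoded operand."""
    return [
        (op, decode_operand(arg) if arg is not None else "")
        for op, arg in rows
        if op not in LOW_RELEVANCE
    ]


if __name__ == "__main__":
    print(to_ost(RAW_OPCODES))
    print(to_odt(RAW_OPCODES))
```

In this toy input the OST sequence only shows that a dynamic call happens, while the ODT sequence also exposes the decoded strings `system` and `/bin/sh`, which is the kind of operand information the paper argues OSTs discard.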
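As a supplementary illustration of Section II-B, the sketch below shows one way to split a long sequence into overlapping windows of size W with stride S_r < W, encode each window independently, average the per-window vectors, and apply the fusion in Eq. (1). It is a simplified sketch, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for the CodeBERT encoder, the handling of the final partial window is an assumption, and the fusion assumes both vectors already share one dimension, which the paper does not detail.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_windows(tokens: Sequence, window: int, stride: int) -> List[list]:
    """Split tokens into overlapping windows of size `window`, step `stride`."""
    if not 0 < stride < window:
        raise ValueError("stride must satisfy 0 < stride < window")
    if len(tokens) <= window:
        return [list(tokens)]
    starts = list(range(0, len(tokens) - window + 1, stride))
    # Assumption: add a last window flush with the end so no token is dropped.
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [list(tokens[s:s + window]) for s in starts]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def encode_long_sequence(
    tokens: Sequence,
    encoder: Callable[[list], Vector],
    window: int,
    stride: int,
) -> Vector:
    """Encode each window on its own, then average across windows."""
    return mean_pool([encoder(w) for w in split_windows(tokens, window, stride)])


def fuse(e_codebert: Vector, e_fasttext: Vector, lam: float) -> Vector:
    """Eq. (1): E = lam * E_CodeBERT + (1 - lam) * E_FastText."""
    if len(e_codebert) != len(e_fasttext):
        raise ValueError("both vectors must have the same dimension")
    return [lam * a + (1.0 - lam) * b for a, b in zip(e_codebert, e_fasttext)]


def toy_encoder(window_tokens: list) -> Vector:
    """Hypothetical stand-in for a per-window encoder (not CodeBERT)."""
    return [sum(window_tokens) / len(window_tokens), float(len(window_tokens))]


if __name__ == "__main__":
    tokens = list(range(10))                       # stand-in for token ids
    pooled = encode_long_sequence(tokens, toy_encoder, window=4, stride=3)
    print(pooled)                                  # [4.5, 4.0]
    print(fuse(pooled, [0.5, 2.0], lam=0.5))       # [2.5, 3.0]
```

Because each window is encoded separately, the cost grows with the number of windows rather than with the square of the full sequence length, which is the saving the paper attributes to the mechanism.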